Author: Anil Gurbuz
Last Updated: 19 Apr 2019
The bootstrap is a resampling technique that repeatedly draws samples with replacement from the existing data to re-generate datasets, which lets us quantify the uncertainty of any statistic of choice. Implementations are available in several machine learning libraries, but here I will implement the method from scratch to go over the process step by step, and then use it to quantify the variability of the test error of my kNN implementation.
Finally, I will investigate the effect of the hyperparameters used in bootstrapping on predictive performance.
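Before turning to the actual datasets, here is a minimal, self-contained illustration of the idea on toy data (the sample x, the seed, and the number of resamples are arbitrary choices for this example, not part of the analysis below):
# Toy example: bootstrap estimate of the uncertainty of a sample mean
set.seed(1)                                  # arbitrary seed, for reproducibility only
x = rnorm(30, mean = 5, sd = 2)              # small toy sample standing in for real data
boot.means = replicate(1000, mean(sample(x, length(x), replace = TRUE)))
sd(boot.means)                               # bootstrap standard error of the mean
quantile(boot.means, c(0.025, 0.975))        # 95% percentile interval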
# Load libraries and set global options
library(reshape2)
library(ggplot2)
library(corrplot)
options(warn=-1)
options(repr.plot.width=6, repr.plot.height=4)
# Load Datasets
train = read.csv('train.csv')
test = read.csv('test.csv')
# Extract labels and predictors
train.data = train[,-5]
train.label = train[,5]
test.data = test[,-5]
test.label = test[,5]
train.len = nrow(train.data)
# Define a function that generates sample indices using the bootstrap technique
boot <- function(original.size=100, sample.size=original.size, times=100){
  # each row of the returned matrix holds the indices of one bootstrap sample
  indx = matrix(nrow = times, ncol = sample.size)
  for (i in 1:times){
    indx[i,] = sample(original.size, sample.size, replace = TRUE)
  }
  return(indx)
}
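As a quick usage example (the numbers are arbitrary), three bootstrap samples of size 5 drawn from a dataset with 10 rows look like this; indices can repeat within a row because sampling is with replacement:
# each row of the output is one bootstrap resample of indices
boot(10, 5, 3)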
# Implement kNN (regression: the prediction is the mean of the K nearest labels)
knn <- function(train.data, train.target, test.data, K=3){
  # Count the number of training samples
  train.len = nrow(train.data)
  # Count the number of test samples
  test.len = nrow(test.data)
  # Pre-allocate a data frame to store the predictions
  prediction = data.frame(pred = rep(NA_real_, test.len))
  # Calculate pairwise Euclidean distances between test and training points
  dist <- as.matrix(dist(rbind(test.data, train.data), method = 'euclidean'))[1:test.len, (test.len+1):(test.len+train.len)]
  # for each test sample...
  for (i in 1:test.len){
    ### ...find its K nearest neighbours among the training samples...
    nn <- as.data.frame(sort(dist[i,], index.return = TRUE))[1:K, 2]
    ### ...and calculate the prediction as the mean of the nearest samples' labels
    prediction[i,] <- mean(train.target[nn])
  }
  # return the predictions
  return(prediction)
}
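A quick sanity check of the function (this reuses the objects loaded above; the choice of the first five test points and K = 5 is arbitrary):
# predict the labels of the first 5 test points using 5 neighbours
knn(train.data, train.label, test.data[1:5,], K = 5)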
# fix the parameters
K = 20 # Maximum K for KNN
L = 100 # Number of datasets
N = 25 # Size of datasets
# generate bootstrap indices
boot.index = boot(nrow(train.data), N, L)
# pre-allocate a data frame to track the test error of every (K, L) combination
# (the K and L columns are filled in inside the loop below)
miss = data.frame('K'=rep(0, K*L), 'L'=rep(0, K*L), 'test'=rep(0, K*L))
# THIS MAY TAKE A FEW MINUTES TO COMPLETE
## for every value of k:
for (k in 1:K){
  ### for every bootstrapped dataset:
  for (l in 1:L){
    #### calculate the iteration index i
    i = (k-1)*L + l
    #### retrieve the sample indices selected by the bootstrap
    indx = boot.index[l,]
    #### save the values of k and l
    miss[i,"K"] = k
    miss[i,'L'] = l
    #### calculate and record the test error (mean squared error)
    miss[i,'test'] = sum((knn(train.data[indx,], train.label[indx], test.data, K=k) - test.label)^2)/nrow(test.data)
  }
}
# plot error values for each K value
miss.m = melt(miss, id=c("K","L"))
names(miss.m) = c('K', "L", "Type", "Error")
ggplot(data = miss.m, aes(factor(K), Error, fill=Type)) +
  geom_boxplot(outlier.shape = NA) +
  labs(x = 'K') +
  scale_fill_discrete(guide = guide_legend(title=NULL)) +
  ggtitle('Error vs. K (Box Plot)') + theme_minimal()
For every K value, we bootstrapped 100 training sets with 25 points each, so there are 100 fitted models per K. Each K value therefore yields a distribution of 100 test-set errors, which lets us see both the average error of the models and how much the errors vary from model to model.
Models with small values of K end up with error values that are close to each other; they also have smaller average error values.
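As a quick numeric check of this observation (assuming the miss data frame from the chunk above is still in memory), the mean and standard deviation of the test error for each K can be tabulated:
# summarise the test error across the 100 bootstrap models for each K
aggregate(test ~ K, data = miss, FUN = function(e) c(mean = mean(e), sd = sd(e)))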
This is consistent with the earlier error analysis based on cross-validation: there, too, increasing K led to a higher standard deviation and a higher mean error across models.
# a data frame to collect the error values of every case
miss.all = data.frame(matrix(ncol=2, nrow=0))
colnames(miss.all) = c('L','test')
# THIS MAY TAKE A FEW MINUTES TO COMPLETE
# For each value of the number of bootstrap sets...
for (l in seq(10, 200, by=10)){
  ## generate bootstrap indices
  boot.index = boot(train.len, 25, l)
  ## data frame to store the error values of the current step of the loop
  miss = data.frame('L'=l, 'test'=rep(0,l))
  ### For each bootstrapped set...
  for (each.set in 1:l){
    #### retrieve the sample indices selected by the bootstrap
    indx = boot.index[each.set,]
    #### calculate and record the test error (mean squared error)
    miss[each.set,'test'] = sum((knn(train.data[indx,], train.label[indx], test.data, K=10) - test.label)^2)/nrow(test.data)
  }
  ## Concatenate the error values for the different numbers of bootstrap sets
  miss.all = rbind(miss.all, miss)
}
# plot error values for different numbers of bootstrap sets
miss.m = melt(miss.all, id='L')
ggplot(data = miss.m, aes(x=factor(L), value, fill=variable)) +
  geom_boxplot(outlier.shape = NA) +
  labs(x = 'Number of Subsets in Bootstrapping') +
  scale_fill_discrete(guide = guide_legend(title=NULL)) +
  ggtitle('Avg. Error vs Number of Subsets') + theme_minimal()
Above, we created different numbers of bootstrap sets. Each set contains 25 data points, and K = 10 is used as the kNN parameter to obtain the error of each set. One model is fitted per set, so increasing the number of sets also means increasing the number of models.
Even though there are more models when the number of sets is higher, the uncertainty of the errors does not change meaningfully.
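The same can be read off numerically (assuming miss.all from the loop above is still in memory):
# standard deviation of the test error for each number of bootstrap sets
aggregate(test ~ L, data = miss.all, FUN = sd)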
To conclude, the final precision is determined by the size of the original sample, which means that increasing the number of bootstrap sets does not help to reduce the uncertainty.