Implementation of Bootstrap & Error Analysis


Author: Anil Gurbuz

Last Updated: 19 Apr 2019


Bootstrap is a resampling technique that repeatedly draws samples with replacement from the existing data to generate new datasets, which lets us quantify the uncertainty of any statistic of choice. Implementations are available in several machine learning libraries, but I will implement the method from scratch to go over the process step by step and use it to quantify the variability of the error produced by my kNN implementation.

Finally, I will investigate the effect of the bootstrap hyperparameters on predictive performance.
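To make the idea concrete before the full implementation, here is a minimal sketch (using a made-up vector x, not part of the notebook's data): resample with replacement many times, recompute the statistic of interest on each resample, and use the spread of those values as the uncertainty estimate.

# Minimal bootstrap sketch on made-up data: uncertainty of a sample mean
set.seed(1)
x = rnorm(50, mean = 5, sd = 2)                                            # made-up sample
boot.means = replicate(1000, mean(sample(x, length(x), replace = TRUE)))   # statistic on each resample
sd(boot.means)                                                             # bootstrap estimate of the standard error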

In [1]:
# Load libraries and set global options
library(reshape2)
library(ggplot2)
library(corrplot)
options(warn=-1)
options(repr.plot.width=6, repr.plot.height=4)

# Load Datasets
train = read.csv('train.csv')
test = read.csv('test.csv')

# Extract labels and predictors
train.data = train[,-5]
train.label = train[,5]
test.data = test[,-5]
test.label = test[,5]

train.len = nrow(train.data)
In [2]:
# define a function that generates sample indices based on the bootstrap technique
boot <- function(original.size=100, sample.size=original.size, times=100){
    # each row holds the indices of one bootstrap sample
    indx = matrix(nrow = times, ncol = sample.size)
    
    for (i in 1:times){
        indx[i,] = sample(original.size, sample.size, replace = TRUE)
    }
    
    return(indx)
}

# Implement kNN (regression: the prediction is the mean of the K nearest labels)
knn <- function(train.data, train.target, test.data, K=3){
    
    # Count number of training samples
    train.len = nrow(train.data)
    
    # Count number of test samples
    test.len = nrow(test.data)
    
    # Create a data frame to store predictions (one row per test sample)
    prediction = data.frame(pred = numeric(test.len))
    
    # Calculate pairwise Euclidean distances between test and training points
    dist <- as.matrix(dist(rbind(test.data, train.data), method = 'euclidean'))[1:test.len, (test.len+1):(test.len+train.len)]
    
    # for each test sample...
    for (i in 1:test.len){
        
        ### ...find its K nearest neighbours among the training samples...
        nn <- as.data.frame(sort(dist[i,], index.return = TRUE))[1:K, 2]
        
        ### ...and calculate the prediction as the mean of the nearest samples' labels
        prediction[i,] <- mean(train.target[nn])
    }
    
    # return the predictions
    return(prediction)
}
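As a quick usage sketch (toy data made up for illustration, not part of the original notebook), the two helpers can be combined like this:

# Toy usage of boot() and knn() on made-up data
toy.x = data.frame(a = runif(20), b = runif(20))            # 20 training points, 2 predictors
toy.y = toy.x$a + toy.x$b                                   # simple numeric target
idx = boot(nrow(toy.x), sample.size = 10, times = 3)        # 3 bootstrap samples of size 10
knn(toy.x[idx[1,],], toy.y[idx[1,]], toy.x[1:5,], K = 3)    # predictions for 5 query points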
In [3]:
# fix the parameters
K = 20               # Maximum K for KNN 
L = 100              # Number of datasets
N = 25               # Size of datasets

# generate bootstrap indices
boot.index = boot(nrow(train.data), N, L)

# a dataframe to track the error values in each case
miss = data.frame("K"=1:K, 'L'=1:L, 'test'=rep(0,L*K))


# THIS MAY TAKE A FEW MINUTES TO COMPLETE
## for every value of k:
for (k in 1:K){
    
    ### for every dataset
    for (l in 1:L){
        
        #### calculate iteration index i
        i = (k-1)*L+l
        
        #### save sample indices that were selected by bootstrap
        indx = boot.index[l,]
        
        #### save the value of k and l
        miss[i,"K"]=k
        miss[i,'L']=l
        
        #### calculate and record test error values
        miss[i,'test']=sum((knn(train.data[indx,], train.label[indx], test.data, K=k) - test.label)^2)/nrow(test.data)       
    }
}

# plot error values for each K value
miss.m = melt(miss, id=c("K","L"))
names(miss.m) = c('K',"L","Type","Error")
ggplot(data = miss.m , aes (factor(K), Error, fill=Type)) +
geom_boxplot(outlier.shape = NA) +
labs(x= 'K') +
scale_fill_discrete(guide = guide_legend(title=NULL)) +
ggtitle('Error vs. K (Box Plot)') + theme_minimal()

Relationship Between Error, Uncertainty of Error and Values of K

According to the graph above, increasing K results in both higher error and higher uncertainty.

For every K value, we bootstrapped 100 training sets with 25 points each, so there are 100 fitted models per K value. Each K value therefore corresponds to a distribution of 100 test-set errors, which lets us see both the average error of the models and how much the models' errors deviate from one another.

Models with small values of K end up with error values that are close to each other, and they also have smaller average errors.

Comparing with the error analysis based on cross-validation, we end up with similar results: using cross-validation we also observed that increasing K produces a higher standard deviation and a higher mean error across models.
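To back the boxplot with numbers, the per-K mean and standard deviation of the test error can be summarised directly from the miss data frame computed above (a quick illustrative check, not part of the original notebook):

# Mean and standard deviation of the test error for each K
err.by.K = aggregate(test ~ K, data = miss, FUN = function(e) c(mean = mean(e), sd = sd(e)))
head(err.by.K)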

Error Analysis with respect to Number of Subsets in Bootstrap

In [4]:
# a dataframe to track all of the error values in each case
miss.all=data.frame(matrix(ncol=2, nrow=0))
colnames(miss.all)=c('L','test')

# THIS MAY TAKE A FEW MINUTES TO COMPLETE
# For each value of the number of bootstrap subsets (10, 20, ..., 200)...
for (l in seq(10,200, by=10)){
    
    ## generate bootstrap indices
    boot.index = boot(train.len, 25, l)
    
    ## data frame just to store error values of current step of for loop
    miss = data.frame('L'=l, 'test'=rep(0,l))
    
    ### For each bootstrapped set...
    for (each.set in 1:l){
        
        #### save sample indices that were selected by bootstrap
        indx=boot.index[each.set,]
        
        #### calculate and record error values
        miss[each.set,'test']=sum((knn(train.data[indx,], train.label[indx], test.data, K=10) - test.label)^2)/nrow(test.data) 
        
    }
    
    ## Concatenate error values of different values of number of subsets
    miss.all = rbind(miss.all, miss)   
}

# plot error values for different numbers of subsets
miss.m  = melt(miss.all, id='L')
ggplot(data = miss.m , aes (x=factor(L), value, fill=variable)) +
geom_boxplot(outlier.shape = NA) +
labs(x= 'Number of Subsets in Bootstrapping') +
scale_fill_discrete(guide = guide_legend(title=NULL)) +
ggtitle('Avg. Error vs Number of Subsets') + theme_minimal()

Relationship Between Error, Uncertainty of Error and Number of Subsets in Bootstrapping

According to the graph above, there is no clear relationship between the number of subsets in bootstrapping and the resulting mean or variance of the errors.

Above, we created a varying number of subsets using the bootstrap method. Each subset contains 25 data points, and K = 10 is used as the kNN parameter to compute the error for each subset. One model is fitted and one error value is computed per subset, so increasing the number of subsets also means increasing the number of models.

Even though there are more models when the number of subsets is higher, the uncertainty of the errors does not show a meaningful change.

To conclude, the final precision is determined by the initial sample size, which means that increasing the number of subsets does not help to decrease the uncertainty.
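As a numerical complement to the boxplot (an illustrative check using the miss.all data frame from the cell above), the spread of the errors per number of subsets can be summarised as follows; given the observation above, the standard deviations should stay roughly constant as the number of subsets grows:

# Standard deviation of the test error for each number of subsets
aggregate(test ~ L, data = miss.all, FUN = sd)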