Regularisation and Bias-Variance Analysis


Author: Anil Gurbuz

Last Updated: 20 Apr 2019


Ridge is a regularisation method that lets us control the bias-variance trade-off by trading off model complexity against goodness of fit. This is achieved by adding a penalty term to the objective function that penalises large coefficient values.

Objective function of a simple linear model; $$ \sum_{i=1}^{N} (y_i - \hat{y_i})^2$$

Objective function for Ridge;

$$ \sum_{i=1}^{N} (y_i - \hat{y_i})^2 + \lambda \lVert w \rVert^2$$

where $\lambda$ is the hyperparameter that controls the amount of penalty incurred by large values of the linear coefficients $w$.

Optimising the objective function with the complexity penalty results in shrinkage of the model coefficients and hence a simpler model.
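
As a quick illustration (a minimal sketch on synthetic data, separate from the dataset used below), the closed-form ridge solution $w = (\Phi^\top \Phi + \lambda I)^{-1} \Phi^\top y$ shows the coefficients shrinking towards zero as $\lambda$ grows:

# Minimal shrinkage demo on synthetic data (assumed setup, not the dataset used below)
set.seed(1)
Phi <- cbind(1, matrix(rnorm(100*2), ncol=2))  # design matrix with intercept column
y <- Phi %*% c(1, 3, -2) + rnorm(100, sd=0.5)  # synthetic targets with known coefficients
for (lam in c(0, 1, 10, 100)){
    w <- solve(t(Phi) %*% Phi + lam*diag(ncol(Phi)), t(Phi) %*% y)
    cat('lambda =', lam, ' w =', round(w, 3), '\n')
}

The same shrinkage is achieved iteratively by the SGD implementation below through the $\lambda w$ term in its gradient.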

In [1]:
# Load libraries
library(ggplot2)
library(reshape2)
options(warn=-1)
options(repr.plot.width=6, repr.plot.height=4)

# Load Datasets
train = read.csv('train.csv')
test = read.csv('test.csv')

# Separate predictors and labels
train.data = train[,-5]
train.label = train[,5]
test.data = test[,-5]
test.label = test[,5]

# Normalise data
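# (ridge penalises all coefficients equally, so predictors should be on comparable scales)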
train.data=scale(train.data)
train.label=scale(train.label)
test.data=scale(test.data)
test.label=scale(test.label)

train.len = nrow(train)
test.len = nrow(test)

# Basic training configuration
lambda = 0.001  # Regularisation coefficient
eta = 0.01      # Learning rate
epsilon = mean(train.label)*0.1  # a threshold on the cost to terminate early (note: labels are standardised, so this is close to zero)
max.epoch=20
In [2]:
# auxiliary function to calculate the predictions of the model
predict_func <- function(Phi,w){
    return (Phi%*%w)   
}

# auxiliary function to calculate the regularised cost (half MSE plus ridge penalty)
obj_func <- function(Phi, w, label, lambda){
    return (0.5*(mean((label - predict_func(Phi,w))^2)) + 0.5*lambda*(w%*%w))
}
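
The stochastic gradient update used in the next cell follows from differentiating the per-sample cost $E_i(w) = \frac{1}{2}\big(t_i - w^\top \phi_i\big)^2 + \frac{\lambda}{2} w^\top w$ with respect to $w$:

$$\nabla_w E_i(w) = -\big(t_i - w^\top \phi_i\big)\,\phi_i + \lambda w \qquad\Rightarrow\qquad w^{(\tau+1)} = w^{(\tau)} - \eta \, \nabla_w E_i\big(w^{(\tau)}\big)$$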
In [3]:
sgd_model <- function(train.data, train.label, train.len, eta, lambda, max.epoch, epsilon){
    Phi <- as.matrix(cbind('X0'=1, train.data)) # add a column of 1s for the intercept
    T <- train.label # rename just for convenience
    tau.max <- max.epoch*train.len  # maximum number of iterations

    W <- matrix(,nrow=tau.max, ncol=ncol(Phi)) # used to store the estimated coefficients
    W[1,] <- runif(ncol(Phi)) # random initial weights

    # used to trace the value of the objective function in each iteration
    objective <- data.frame('tau'=1:tau.max)
    objective[1,'Train_Error'] <- obj_func(Phi, W[1,], train.label, lambda)

    tau <- 1 # iteration counter
    terminate <- FALSE

    while(!terminate){
        # check termination criteria:
        terminate <- tau >= tau.max | obj_func(Phi, W[tau,], T, lambda) <= epsilon

        # shuffle data:
        train.index <- sample(1:train.len, train.len, replace = FALSE)
        Phi <- Phi[train.index,]
        T <- T[train.index]

        # for each datapoint:
        for (i in 1:train.len){
            # check termination criteria:
            if (tau >= tau.max | obj_func(Phi, W[tau,], T, lambda) <= epsilon) {terminate <- TRUE; break}

            t_pred <- predict_func(Phi[i,], W[tau,])
            # for each coefficient:
            for (j in 1:ncol(W)){
                # update the coefficient with the stochastic gradient of the regularised cost:
                W[(tau+1),j] <- W[tau,j] - eta * (-(T[i]-t_pred)*Phi[i,j] + lambda*W[tau,j])
            }

            # update the counter:
            tau <- tau + 1
            objective[tau, 'Train_Error'] <- obj_func(as.matrix(cbind(1, train.data)), W[tau,], train.label, lambda)
        }
    }
    W <- W[1:tau,]
    W <- as.data.frame(W)
    colnames(W) <- c('x0','x1','x2','x3','x4')
    # return the parameter values from every iteration and the corresponding objective values
    return(list(W, objective))
}
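
As a quick usage check (a sketch, assuming the cells above have been run), the model can be fitted once on the full training set and its final coefficients and cost trace inspected:

# Fit a single model on the full training set
sgd <- sgd_model(train.data, train.label, train.len, eta, lambda, max.epoch, epsilon)
sgd_W <- sgd[[1]]   # coefficient trace, one row per iteration
tail(sgd_W, 1)      # final coefficient estimates
head(sgd[[2]])      # objective value over the first few iterations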
In [4]:
## Initialization
N = 100                 # size of each bootstrap sample
L = 50                  # number of bootstrap samples
lambdas = seq(0,5,0.2)  # candidate regularisation coefficients
indexes = matrix(nrow=L , ncol = N)

# Draw L bootstrap samples of size N (with replacement) from the training set
for (l in 1:L){
    indexes[l,] = sample(1:train.len, N, replace = TRUE)
}


all.predictions = matrix(nrow=L*(length(lambdas)), ncol=nrow(test.data)+2) # To store all of the predictions of each model
Phi.test <- as.matrix(cbind('X0'=1, test.data)) # test data appropriate for multiplication with model parameters
counter = 0 # To calculate the row index to store the results in all.predictions data frame

# For each value of lambda ...
for (lambda in lambdas){
    counter = counter +1
    
    # For each sampled set
    for (l in 1:L){
        
        # calculate the row index to store the results in all.predictions data frame
        row=(counter-1)*L +l
        
        # Index values of currently processed sampled set
        indx=indexes[l,]
        
        # Run the model
        sgd = sgd_model(train.data[indx,], train.label[indx,], length(indx), eta, lambda, max.epoch, mean(train.label[indx,])*0.1)   
        
        # Store parameters of the model
        sgd_W=as.matrix(as.data.frame(sgd[1]))
        
        # make predictions on test set using current model parameters
        prediction = sgd_W[nrow(sgd_W),] %*% t(Phi.test)
        
        # Store the predictions for different lambda and sampled sets
        all.predictions[row,1] = lambda
        all.predictions[row,2] = l
        all.predictions[row, -(1:2) ] = prediction
        
    
    }
}


# Rename the columns of all.predictions
colnames(all.predictions)=c('lambda', 'l',  paste('y',1:(nrow(test.data)), sep=''))
all.predictions=as.data.frame(all.predictions)

# Calculate average prediction among models for every point in test data
y.bar=aggregate(all.predictions,list(all.predictions$lambda),mean)
y.bar <- as.matrix(y.bar[,-c(1:3)])

# Matrices to store bias2, variance and error values for different values of lambda
error <- matrix(0,nrow=length(lambdas))
bias2 <- matrix(0,nrow=length(lambdas))
variance <- matrix(0,nrow=length(lambdas))

# For each distinct value of lambda
for (each in 1:length(lambdas)){
    lambda <- lambdas[each]
    # predictions of all L models for this lambda (rows: models, columns: test points)
    preds <- as.matrix(all.predictions[all.predictions$lambda==lambda, -c(1,2)])

    # bias2: squared difference between the average prediction and the true labels
    bias2[each] <- mean((y.bar[each,] - test.label)^2)

    # variance: average squared deviation of each model's prediction from the average prediction
    variance[each] <- mean(colMeans(sweep(preds, 2, y.bar[each,])^2))

    # average error: mean squared difference between the predictions and the true labels
    error[each] <- mean(colMeans(sweep(preds, 2, as.vector(test.label))^2))
}

# calculate the sum of bias2 and variance
bias_variance=bias2+variance

# Plot Lambda vs Error - Bias2 - Variance - Bias2+variance Graph
dat = as.data.frame(cbind(lambdas,error,bias2,variance,bias_variance))
colnames(dat) = c('Lambda','Avg. Error','Bias2','Variance','Bias2+Variance')
dat.m = melt(dat, id='Lambda')
ggplot(dat.m, aes(log(Lambda),value, color=variable)) + 
geom_line() +
labs(y='Values')+
theme_minimal()
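
As a quick sanity check (a sketch, not part of the original notebook), the empirical decomposition here should be exact: y.bar is the per-point mean over the L models, so the cross term vanishes and the average error equals bias2 + variance up to floating-point error.

# The empirical decomposition is exact: error = bias2 + variance
max(abs(error - (bias2 + variance)))  # should be near zero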
In [5]:
cat('Minimum Average error occurs when Lambda =',dat[which.min(dat[,2]),'Lambda'])
cat('\nMinimum Bias2+Variance occurs when Lambda =',dat[which.min(dat[,5]),'Lambda'])
Minimum Average error occurs when Lambda = 5
Minimum Bias2+Variance occurs when Lambda = 5
According to the graph above, the best value for lambda is 5.

Lambda is the coefficient that determines the weight of the complexity penalty added on top of the error term. Therefore, if lambda is high, the model tends to end up in a less complex state.

Bias and variance are properties of the model that are directly affected by model complexity, and hence by lambda too. As the model gets more complex, its bias decreases and its variance increases. Bias and variance are the two non-constant parts of the error, so the error is effectively a function of both.
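
For squared loss this can be made precise; over repeated training sets, the expected error at a test point decomposes as

$$\mathbb{E}\big[(t - \hat{y})^2\big] = \underbrace{\big(\mathbb{E}[\hat{y}] - \mathbb{E}[t]\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat{y} - \mathbb{E}[\hat{y}])^2\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$$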

Therefore:

- Increasing lambda -> decreasing model complexity -> increasing bias and decreasing variance
- Decreasing lambda -> increasing model complexity -> decreasing bias and increasing variance
- Error = f(bias, variance)

All in all, a good model is one that balances bias2 and variance so that bias2+variance is minimised. The graph above clearly shows this relationship between bias, variance, error and lambda. Lambda = 5 is where the minimum bias2+variance and the minimum error occur, so the best lambda value is 5.