Author: Anil Gurbuz
Last Updated: 17 Apr 2019
Gradient descent is one of the most widely used optimisation algorithms in machine learning. The algorithm updates the model weights using the derivative of the cost function, producing new weights that lead to a lower cost value.
The difference between batch gradient descent (BGD) and stochastic gradient descent (SGD), in terms of implementation, is the frequency of the weight updates. SGD updates the weights using the gradient of a single training example, whereas BGD uses the average gradient over a batch of training examples. This immediately implies that, for the same number of processed examples, SGD makes batch-size times more weight updates than BGD.
In this notebook, I will implement both algorithms from scratch, observe the difference in their convergence behaviour, and finally explain the reasons behind that difference.
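To make the comparison concrete, the objective and the two update rules implemented below can be written as follows (this is my summary of the code, not a formal derivation; Φ is the design matrix with a leading bias column, t the labels, η the learning rate and λ the regularisation coefficient):

$$E(\mathbf{w}) = \frac{1}{2N}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^\top\boldsymbol{\phi}_n\right)^2 + \frac{\lambda}{2}\,\mathbf{w}^\top\mathbf{w}$$

BGD performs one update per pass over all $N$ training points, using the average gradient:

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\left(-\frac{1}{N}\sum_{n=1}^{N}\left(t_n - \mathbf{w}^{(\tau)\top}\boldsymbol{\phi}_n\right)\boldsymbol{\phi}_n + \lambda\,\mathbf{w}^{(\tau)}\right)$$

SGD performs one update per training point $n$, using only that point's gradient:

$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\left(-\left(t_n - \mathbf{w}^{(\tau)\top}\boldsymbol{\phi}_n\right)\boldsymbol{\phi}_n + \lambda\,\mathbf{w}^{(\tau)}\right)$$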
# Load libraries
library(ggplot2)
library(reshape2)
options(warn=-1)
options(repr.plot.width=6, repr.plot.height=4)
# Load Datasets
train = read.csv('train.csv')
test = read.csv('test.csv')
# Separate predictors and labels
train.data = train[,-5]
train.label = train[,5]
test.data = test[,-5]
test.label = test[,5]
# Normalise data
train.data=scale(train.data)
train.label=scale(train.label)
test.data=scale(test.data)
test.label=scale(test.label)
train.len = nrow(train)
test.len = nrow(test)
# Basic training configuration
lambda = 0.001 # Regularisation coefficient
eta = 0.01 # Learning rate
epsilon = mean(train.label)*0.1 # a threshold on the cost (to terminate the process)
max.epoch=20
# auxiliary function to calculate the predictions of the model
predict_func <- function(Phi,w){
return (Phi%*%w)
}
# auxiliary function to calculate the cost function (regularised mean squared error)
obj_func <- function(Phi, w, label, lambda){
return (0.5*(mean((label - predict_func(Phi,w))^2)) + 0.5*lambda*(w%*%w))
}
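As a quick sanity check of obj_func (my own addition, not part of the original pipeline): with the standardised labels, the cost at an all-zero weight vector should be roughly 0.5, i.e. half the unit label variance, since the regularisation term vanishes.
# Sanity check (sketch): cost at an all-zero weight vector should be ~0.5
Phi.check <- as.matrix(cbind('X0'=1, train.data)) # design matrix with bias column
obj_func(Phi.check, rep(0, ncol(Phi.check)), train.label, lambda)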
sgd_model <- function(train.data, train.label, train.len, eta, lambda, max.epoch, epsilon){
Phi <- as.matrix(cbind('X0'=1, train.data)) # add a column of 1
T <- train.label # rename just for convenience
tau.max <- max.epoch*train.len # maximum number of iterations
W <- matrix(,nrow=tau.max, ncol=ncol(Phi)) # used to store the estimated coefficients
W[1,] <- runif(ncol(Phi)) # random initial weights
# to be used to trace the value of the objective function at each iteration
objective <- data.frame('tau'=1:tau.max)
objective[1,'Train_Error']= obj_func(Phi, W[1,], train.label, lambda)
tau <- 1 # iteration counter
terminate <- FALSE
while(!terminate){
# check termination criteria:
terminate <- tau >= tau.max | obj_func(Phi, W[tau,],T,lambda)<=epsilon
# shuffle data:
train.index <- sample(1:train.len, train.len, replace = FALSE)
Phi <- Phi[train.index,]
T <- T[train.index]
# for each datapoint:
for (i in 1:train.len){
# check termination criteria:
if (tau >= tau.max | obj_func(Phi, W[tau,],T,lambda)<=epsilon) {terminate<-TRUE;break}
t_pred = predict_func(Phi[i,], W[tau,])
# for each coefficient:
for (j in 1: ncol(W)){
# update the coefficient:
W[(tau+1),j] <- W[tau,j] - eta * (-(T[i]-t_pred) * Phi[i,j] +lambda*W[tau,j])
}
# update the counter:
tau <- tau + 1
objective[tau, 'Train_Error'] <- obj_func(as.matrix(cbind(1, train.data)), W[tau,],train.label,lambda)
}
}
W <- W[1:tau,]
W=as.data.frame(W)
colnames(W) = c('x0','x1','x2','x3','x4')
# Return all the values of parameters in each iteration and corresponding objective function value
return(list(W,objective))
}
bgd_model <- function(train.data, train.label, train.len, eta, lambda, max.epoch, epsilon){
Phi <- as.matrix(cbind('X0'=1, train.data)) # add a column of 1
T <- train.label # rename just for convenience
tau.max <- max.epoch # maximum number of iterations
W <- matrix(,nrow=tau.max, ncol=ncol(Phi)) # used to store the estimated coefficients
W[1,] <- runif(ncol(Phi)) # random initial weights
tau <- 1 # iteration counter
objective <- data.frame('tau'=1:tau.max) # to be used to trace the training error in each iteration
objective[tau,'Train_Error']= obj_func(Phi, W[tau,], train.label, lambda)
# for each iteration...
for (tau in 1:(tau.max-1)){
# check termination criteria
if(objective[tau,'Train_Error']<=epsilon) {break}
# Calculate prediction of current parameters
prediction = predict_func(Phi,W[tau,])
# Update parameters
W[tau+1,] = W[tau,] - eta*(-((t(T - prediction))%*%Phi)/nrow(Phi) + lambda*W[tau,])
# Record value of objective function for each update of parameters
objective[tau+1,'Train_Error']= obj_func(Phi, W[tau+1,], train.label, lambda)
}
W=as.data.frame(W)
colnames(W) = c('x0','x1','x2','x3','x4')
# Return all the values of parameters in each iteration and corresponding objective function value
return(list(W,objective))
}
# Run the algorithms
bgd=bgd_model(train.data, train.label, train.len, eta, lambda, max.epoch, epsilon)
sgd=sgd_model(train.data, train.label, train.len, eta, lambda, max.epoch, epsilon)
# Error and parameter values of the model for every iteration of BGD
bgd_W = as.data.frame(bgd[1])
bgd_error = as.data.frame(bgd[2])
# Error and parameter values of the model for every iteration of SGD
sgd_W = as.data.frame(sgd[1])
sgd_error = as.data.frame(sgd[2])
# Rename the columns
colnames(bgd_error)=c('tau','BGD_train_error')
colnames(sgd_error)=c('tau','SGD_train_error')
# 1 iteration of BGD corresponds to one full pass over all training points in SGD, so rescale tau accordingly
bgd_error['tau']=bgd_error['tau']*nrow(train.data)
# Plotting the graph
bgd_error.m=melt(bgd_error, id='tau')
sgd_error.m=melt(sgd_error,id='tau')
ggplot() + geom_line(data = bgd_error.m, aes(x=tau, y = value, color=variable)) +
geom_line(data = sgd_error.m, aes(x=tau, y = value, color =variable)) +
labs(y='Train Error')+
theme_minimal()
Because SGD makes a weight update for every training example it processes, whereas BGD makes a single update only after iterating through the entire set, SGD performs many more updates than BGD even though both process the same number of points. Weight updates are how the algorithm learns, so SGD converges much faster than BGD in terms of the number of processed points (or iterations).
The fluctuations in SGD's curve are caused by noise: because SGD uses one point at a time, each update follows a noisy single-example estimate of the true gradient, so its path is noisier than BGD's. BGD suppresses this noise by averaging the gradient over the entire training set.
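One rough way to see this numerically (a sketch reusing the objects created above; grad_one is a hypothetical helper I introduce here, not part of the original notebook) is to compare the per-example gradients, which drive SGD's updates, with their average, which drives BGD's update, at the final BGD weights.
# Sketch: spread of per-example gradients vs the batch-averaged gradient
# (grad_one is a hypothetical helper, not part of the original notebook)
Phi.full <- as.matrix(cbind('X0'=1, train.data))
w.final <- as.numeric(bgd_W[nrow(bgd_W),]) # last set of BGD weights

# gradient of the regularised squared error for a single training point i
grad_one <- function(i){
  resid <- train.label[i] - sum(Phi.full[i,]*w.final)
  -resid*Phi.full[i,] + lambda*w.final
}

per.point <- t(sapply(1:train.len, grad_one)) # one gradient per point: what SGD follows
batch <- colMeans(per.point)                  # their average: what BGD follows
apply(per.point, 2, sd)                       # spread of per-point gradients around that average
abs(batch)                                    # the single, averaged direction BGD follows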