Implementation of Bayesian Classifier and Logistic Regression


Author: Anil Gurbuz

Last Updated: 2 Jun 2019


Bayesian and frequentist approaches are widely discussed topics in statistics. In this notebook I will implement a Bayesian classifier together with logistic regression and discuss the relationship between training-set size and error for both models. Even though the two models serve the same purpose, their approaches to the problem are quite different.

Logistic regression takes the maximum likelihood approach to estimating the coefficients of a linear equation: it assumes that the labels come from a Bernoulli distribution and finds the coefficients that minimise the negative log-likelihood of the observed target values.
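As a minimal sketch (using placeholder names, not the helper functions defined below), the quantity being minimised is the Bernoulli negative log-likelihood:

# Bernoulli negative log-likelihood for 0/1 coded labels t, weights w (bias first)
# and design matrix X -- illustrative placeholders only
neg.log.likelihood <- function(w, X, t){
    p <- 1/(1 + exp(-as.vector(cbind(1, X) %*% w))) # p(C1|x) for every row
    -sum(t*log(p) + (1 - t)*log(1 - p))
}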

The Bayesian classifier, on the other hand, fits a distribution to the training data of each class -- a 2-D Gaussian in this example -- and combines these class-conditional densities with the marginal class probabilities (the class priors) to calculate the posterior probability of each class.
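Concretely, the decision rule compares the unnormalised posteriors, i.e. class prior times class-conditional Gaussian density. A rough sketch with placeholder estimates mu0, mu1, a shared covariance sigma and class priors p0, p1 (the full implementation is the bayesian() function below):

posterior0 <- p0 * dmvnorm(x, mean = mu0, sigma = sigma) # p(C0) * N(x | mu0, sigma)
posterior1 <- p1 * dmvnorm(x, mean = mu1, sigma = sigma) # p(C1) * N(x | mu1, sigma)
label <- ifelse(posterior0 > posterior1, c0, c1)          # pick the class with the larger posterior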

In [1]:
# Load libraries
library(mvtnorm) # generates multivariate Gaussian samples and calculates their densities
library(ggplot2)
library(reshape2)
options(warn=-1)
options(repr.plot.width=6, repr.plot.height=4)

# Load datasets
train = read.csv('train.csv')
test = read.csv('test.csv')

# create training and testing datasets:
train.data = train[,-3]
train.label = train[,3]
test.data = test[,-3]
test.label = test[,3]

# Normalise data
train.data= as.data.frame(scale(train.data))
test.data= as.data.frame(scale(test.data))
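A quick check that the normalisation behaved as expected: after scale(), each feature column should have mean roughly 0 and standard deviation roughly 1.

colMeans(train.data)     # approximately 0 for both features
apply(train.data, 2, sd) # approximately 1 for both features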
In [2]:
# auxiliary function that predicts class labels
predict <- function(w, X, c0, c1){
    sig <- sigmoid(w, X)
    return(ifelse(sig>0.5, c1,c0))
}
    
# auxiliary function that calculates the cost: the total probability mass
# assigned to the wrong class (T is the 0/1 coded label vector, 1 meaning class c1)
cost <- function (w, X, T, c0){
    sig <- sigmoid(w, X) # predicted p(C1|x) for every point
    return(sum(ifelse(T==0, sig, 1-sig)))
}

# Sigmoid function (= p(C1|x)); cbind(1, x) prepends a column of 1s for the bias term
sigmoid <- function(w, x){
    return(1.0/(1.0+exp(-w%*%t(cbind(1,x)))))    
}
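A quick sanity check of these helpers on made-up values (the toy matrix and weight vector below are illustrative only):

toy.X <- matrix(c(0.5, -1.2, 1.0, 0.3), ncol = 2) # two points, two features
toy.w <- c(0, 1, -1)                              # bias, w1, w2
sigmoid(toy.w, toy.X)                             # p(C1|x) for each point
predict(toy.w, toy.X, 1, -1)                      # thresholds at 0.5, returning labels 1 / -1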
In [3]:
bayesian <- function(train.data, train.label, test.data, test.label, c0, c1){

    p0.hat <- sum(train.label==c0)/nrow(train.data) # number of class-0 samples divided by the total number of training points
    p1.hat <- sum(train.label==c1)/nrow(train.data) # or simply 1 - p0.hat
    
    

    # Class means:
    mu0.hat <- colMeans((train.data[train.label==c0,]))
    mu1.hat <- colMeans((train.data[train.label==c1,]))
    

    # class covariance matrices:
    sigma0.hat <- var(train.data[train.label==c0,])
    sigma1.hat <- var(train.data[train.label==c1,])
    


    # shared covariance matrix:
    ## if either class covariance cannot be estimated (NA entries, e.g. the class
    ## has too few points in a small training sample), drop its term from the sum
    if(any(is.na(sigma0.hat))){
        sigma.hat <- p1.hat * sigma1.hat
    }else if(any(is.na(sigma1.hat))){
        sigma.hat <- p0.hat * sigma0.hat
    }else{
        sigma.hat <- p0.hat * sigma0.hat + p1.hat * sigma1.hat
    }
    

    # calculate unnormalised posteriors on the training set (illustrative only;
    # the test predictions below recompute these densities on test.data):
    posterior0 <- p0.hat*dmvnorm(x=train.data, mean=mu0.hat, sigma=sigma.hat)
    posterior1 <- p1.hat*dmvnorm(x=train.data, mean=mu1.hat, sigma=sigma.hat)



    # calculate predictions: assign each test point to the class with the larger unnormalised posterior
    test.predict <- ifelse(p0.hat*dmvnorm(x=test.data, mean=mu0.hat, sigma=sigma.hat) > p1.hat*dmvnorm(x=test.data, mean=mu1.hat, sigma=sigma.hat), c0, c1)


    # calculate error percentage:
    error.percentage = sum(test.label!=test.predict)/nrow(test.data)

    # Return misclassification rate
    return(error.percentage)
}
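An optional cross-check, shown only as a sketch and assuming the MASS package is installed: with a single shared covariance matrix the classifier above is essentially linear discriminant analysis, so MASS::lda should give very similar test predictions. stats::predict is called explicitly because this notebook defines its own predict() helper.

library(MASS)
lda.fit <- lda(x = train.data, grouping = train.label)
lda.pred <- stats::predict(lda.fit, test.data)$class
mean(lda.pred != test.label) # misclassification rate to compare with bayesian()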
In [4]:
logistic <- function(train.data, train.label, train.len, c0, epsilon, eta, tau.max){

    # Initializations
    tau <- 1 # iteration counter
    terminate <- FALSE

    ## A few name/type conversions to make the rest of the code easier to follow
    X <- as.matrix(train.data) # rename just for convenience
    T <- ifelse(train.label==c0,0,1) # recode labels as 0/1 (class c0 -> 0, class c1 -> 1)

    W <- matrix(,nrow=tau.max, ncol=(ncol(X)+1)) # to store the estimated coefficients at each iteration
    W[1,] <- runif(ncol(W)) # random initial weights

    # project data using the sigmoid function (just for convenience)
    Y <- sigmoid(W[1,],X)

    costs <- data.frame('tau'=1:tau.max)  # to be used to trace the cost in each iteration
    costs[1, 'cost'] <- cost(W[1,],X,T, c0)


    while(!terminate){
        # check termination criteria:
        terminate <- tau >= tau.max | cost(W[tau,],X,T, c0)<=epsilon

        # shuffle data:
        train.index <- sample(1:train.len, train.len, replace = FALSE)
        X <- X[train.index,]
        T <- T[train.index]

        # for each datapoint:
        for (i in 1:train.len){
            # check termination criteria:
            if (tau >= tau.max | cost(W[tau,],X,T, c0) <=epsilon) {terminate<-TRUE;break}

            Y <- sigmoid(W[tau,],X)

            # Update the weights
            W[(tau+1),] <- W[tau,] - eta * (Y[i]-T[i]) * cbind(1, t(X[i,]))

            # record the cost:
            costs[(tau+1), 'cost'] <- cost(W[tau,],X,T, c0)

            # update the counter:
            tau <- tau + 1

            # decrease learning rate:
            eta = eta * 0.999
        }
    }

    w <- W[tau,]
    
    # Return final parameters 
    return(w)
}
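As another optional sanity check (a sketch only, not used in the experiment below): R's built-in glm() fits the same model by iteratively reweighted least squares, so its coefficients give a reference point for the SGD estimates returned by logistic(). The response is recoded to 0/1 with class c0 = 1 mapped to 0, matching the convention inside logistic().

glm.fit <- glm(ifelse(train.label == 1, 0, 1) ~ ., data = train.data, family = binomial)
coef(glm.fit) # compare with the weight vector returned by logistic()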
In [5]:
# Initialize the parameters
tau.max <- 1000 # maximum number of iterations
eta <- 0.01 # learning rate
epsilon <- 0.01 # a threshold on the cost (to terminate the process)

# Assign class labels
c0=1
c1=-1

# Store misclassification rates and the corresponding training set lengths
Error = as.data.frame(matrix(,ncol=3,nrow = length(seq(5,500,5))))
colnames(Error)=c('Train.len','Bayesian','Logistic Regression')

# For each training set length ...
for (train.len in seq(5,500,5)){
    
    # Calculate parameters of logistic regression
    logistic.w = logistic(train.data[1:train.len,], train.label[1:train.len], train.len, c0, epsilon, eta, tau.max)  
    
    # Make classifications of testing points using parameters calculated in the previous step
    logistic.prediction = predict(logistic.w, test.data, c0, c1)
    
    # Calculate misclassification rate for logistic regression
    logistic.error = sum(logistic.prediction != test.label)/length(test.label)
    
    # Run the Bayesian model and get its misclassification rate
    bayesian.error = bayesian(train.data[1:train.len,], train.label[1:train.len], test.data, test.label, c0, c1)
    
    # Store misclassification rates in Error data frame
    Error[train.len/5,'Train.len']=train.len
    Error[train.len/5,'Bayesian']=bayesian.error
    Error[train.len/5,'Logistic Regression']=logistic.error
    
    
    
}

# Plot Misclassification rate vs training data length for Bayesian Classifier and Logistic Regression
Error.m=melt(Error, id='Train.len')
ggplot(Error.m, aes(Train.len, value, color=variable)) +
labs(y="Misclassification Rate") + ylim(c(0,0.3)) +
ggtitle('Misclassification Ratio vs Number of Training Points') +
geom_line() +  theme_minimal()
Behavior of the Bayesian Classifier as the number of training data points increases

The Bayesian classifier starts with a relatively high misclassification rate and quickly converges to a stable level, reached with around 60 training data points. Beyond that point, adding more training data does not noticeably change its misclassification rate.

Behavior of Logistic Regression as the number of training data points increases

Logistic regression starts with a relatively low misclassification rate, but increasing the size of the training set does not really improve its performance. Even though the training set grows substantially, its error fluctuates around the same level, slightly above the Bayesian classifier's.

The Bayesian classifier is best suited to small training sets, while logistic regression can be the better option when the training set is large enough for it to converge.

Even though the Bayesian classifier has to learn D*(D+5)/2 + 1 parameters (8 for D = 2 here: two 2-dimensional class means, the three unique entries of the shared 2x2 covariance matrix, and one class prior), which might suggest it needs more training data to estimate them, the results show that the Bayesian classifier reaches its final state with far less training data than logistic regression.