Author: Anil Gurbuz
Date: Sep 2019
3. Regression Model Development
This document presents the data analysis performed on several substances and their chemical properties. The aim is to build models to predict the critical temperature of these substances.
The first part is exploratory data analysis. It aims to understand the data and relationships between variables.
The second part explains the development of the models. It covers the decisions made to increase model performance and the evaluation of the different options. There are 2 main models. The first is a regression model developed with stepwise feature selection using BIC, Adjusted R^2 and C_p; Ridge (L2) and Lasso (L1) regularised versions of this regression model are also developed, and the best performing regression model out of these 5 is kept as a benchmark model.
Additionally, an XGBoost model is created after grid-searching for optimum hyperparameter values.
Also, for both the XGBoost and regression models, the importance of the predictors is assessed in order to understand the factors affecting the critical temperature of a superconductor.
Finally, the resulting models are compared.
# Load Required Libraries
library(ggplot2)
library(reshape2)
library(stats)
library(scales)
library(grid)
library(gridExtra)
library(RColorBrewer)
library(dplyr)
library(leaps)
library(glmnet)
library(fmsb)
library(xgboost)
library(tidyverse)
library(caret)
# Load the datasets
data=read.csv("../input/super-conductors//train.csv")
# Get dataset dimensions
nrow = dim(data)[1]
ncol = dim(data)[2]
# Divide dataset into test and train
set.seed(0)
test.index = sample(1:nrow, round(nrow/10), replace = FALSE)
train = data[-test.index, ]
test = data[test.index,]
# Divide dataframe
no_elements = data[,1]
atomic_mass = data[,2:11]
fie = data[,12:21]
atomic_radius = data[,22:31]
density = data[,32:41]
electron_aff = data[,42:51]
fusion_heat = data[,52:61]
thermal_cond = data[,62:71]
valance = data[,72:81]
critical_temp = data[,82]
Except for number_of_elements and critical temperature, all of the variables are derived using mean, weighted mean, geometric mean, weighted geometric mean, standard deviation, weighted standard deviation, entropy, weighted entropy, range and weighted range calculations.
Most of the original features are continuous variables by nature, for example atomic mass, fie (fie likely refers to First Ionisation Energy, judging by the context), atomic radius, density, electron affinity, fusion heat, thermal conductivity and critical temperature. Only the number of valence electrons is a discrete variable, though due to the applied transformations such as mean, weighted mean etc., the derived versions of the valence variable are mostly continuous, except for range_Valence. Also, the number of elements included in the material is a discrete variable.
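As an illustration of these derivations, the sketch below computes a few of the named summary statistics from hypothetical element-level property values; the vectors vals (property values per element) and props (element proportions) are made up, and the formulas follow the standard definitions rather than necessarily the exact ones used to construct this dataset.
# Illustrative sketch only: hypothetical element-level property values and
# element proportions for a 3-element material.
vals  <- c(66.0, 58.9, 7.0)
props <- c(4, 1, 2) / 7
mean(vals)                              # mean
sum(props * vals)                       # weighted mean
prod(vals)^(1 / length(vals))           # geometric mean
max(vals) - min(vals)                   # range
w <- vals / sum(vals)
-sum(w * log(w))                        # entropy
sqrt(mean((vals - mean(vals))^2))       # standard deviation (population form)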
There are 81 features in total in the data; not all of them will be summarised here.
number_of_elements: Mostly between 3 and 5.
mean_atomic_mass: Concentrated between 70 and 100. Maximum 209.
wtd_mean_atomic_mass: Concentrated between 50 and 90. Maximum 209.
gmean_atomic_mass: Concentrated between 60 and 80. Maximum 209.
wtd_gmean_atomic_mass: Mostly between 35 and 75. Maximum 209.
entropy_atomic_mass: Mostly between 1 and 2.
wtd_entropy_atomic_mass: Mostly between 0.8 and 2.
range_atomic_mass: Looks normal.
wtd_range_atomic_mass: Mostly between 0 and 40. Maximum 205 can be an outlier.
std_atomic_mass: Looks right skewed.
wtd_std_atomic_mass: Looks right skewed.
critical_temp: Looks skewed.
# Plot a histogram of each variable related with Atomic mass.
# Additionally, number of elements and critical temperature.
options(repr.plot.width=12, repr.plot.height=12)
par(mfrow = c(4,3))
for (i in 2:11){
hist(data[,i], xlab = NULL, main= colnames(data)[i])
}
hist(data[,1], xlab = NULL, main= colnames(data)[1])
hist(data[,82], xlab = NULL, main= colnames(data)[82])
Skewed distributions:
# Plot a histogram of the response variable (critical temperature).
options(repr.plot.width=8, repr.plot.height=6)
hist(data[,82], xlab = NULL, main= colnames(data)[82])
# Draw Histograms for different transformations
options(repr.plot.width=6, repr.plot.height=12)
par(mfrow=c(3,1))
hist(log(data[,82]), xlab = NULL, main= 'Log - Critical Temperature')
hist(sqrt((data[,82])), xlab = NULL, main= 'Square-root - Critical Temperature')
hist(((data[,82]))^(1/3), xlab = NULL, main= 'Cube-root - Critical Temperature')
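As an optional numeric companion to the histograms above (not part of the original output), the sample skewness of the raw and transformed response can be compared directly; the helper below uses a simple moment-based definition of skewness.
# Moment-based sample skewness of the response under each transformation
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(raw  = skew(data$critical_temp),
  log  = skew(log(data$critical_temp)),
  sqrt = skew(sqrt(data$critical_temp)),
  cbrt = skew(data$critical_temp^(1/3)))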
Not all of the distribution graphs are included in this notebook, to avoid clutter, but all findings are presented here.
Due to the high number of columns, the dataframe will be divided according to features, and the correlations between different features will be investigated one by one (except for number of elements) instead of looking at all columns together. The aim is to understand correlations between features rather than between individual columns. The functions used to derive the different columns (i.e. entropy, weighted mean etc.) can easily create high correlation values, so this effect should be separated out when drawing conclusions about the correlations between features.
Additionally, correlation values higher than 0.5 or lower than -0.5 will be treated as significant. Correlation values will be filtered first and then each remaining pair will be evaluated.
# Choose pairs with an absolute correlation value higher than 0.5, order them, then print
df = cor(no_elements,data[,-82])
df = melt(df)
colnames(df)[3] ="Correlation"
df[,1] = "number_of_elements"
df = df[(df$Correlation > 0.5 | df$Correlation < -0.5),]
# Order rows according to correlation value
df = arrange(df, -Correlation)
# Add index column
df["index"] = 0
# Reorder columns
df = df[,c(4,1,2,3)]
# Assign values of index column
df["index"] = 1: length(df$Correlation)
cat("Number of columns have absolute correlation value higher than 0.5: ",length(df$Correlation))
cat("\nHighest positive correlation value is: ", df$Correlation[1])
cat("\nHighest negative correlation value is: ", df$Correlation[length(df$Correlation)])
According to the above results:
# Define colors
colorRange <- c('#69091e', '#e37f65', 'white', '#aed2e6', '#042f60')
## colorRamp() returns a function which takes as an argument a number
## on [0,1] and returns a color in the gradient in colorRange
myColorRampFunc <- colorRamp(colorRange)
# Function to modify lower panel of pairs
panel.cor <- function(w, z, ...) {
correlation <- cor(w, z)
## because the func needs [0,1] and cor gives [-1,1], we need to shift and scale it
col <- rgb(myColorRampFunc((1 + correlation) / 2 ) / 255 )
## square it to avoid visual bias due to "area vs diameter"
radius <- sqrt(abs(correlation))
radians <- seq(0, 2*pi, len = 50) # 50 is arbitrary
x <- radius * cos(radians)
y <- radius * sin(radians)
## make them full loops
x <- c(x, tail(x,n=1))
y <- c(y, tail(y,n=1))
## Divide the view into parts
par(new=TRUE)
plot(0, type='n', xlim=c(-1,1), ylim=c(-1,1), axes=FALSE, asp=1)
polygon(x, y, border=col, col=col)
}
options(repr.plot.width=10, repr.plot.height=10)
pairs(c(atomic_mass[,1:4],fie[,1:4]),
lower.panel=panel.cor, main="Atomic Mass - First Ionisation Energy")
Not all of the correlation graphs are included in this notebook, to avoid clutter, but all results are presented here.
The strongest correlations, lying between 0.7 and 0.9, are observed between:
Other significant correlations can be noted as follows:
Density and atomic mass have a strong positive linear correlation. This makes sense considering the chemical formula for density, which is mass / volume.
Central tendency measures are likely to exhibit the same correlations with the same features.
Correlation between variability measures:
Except for thermal conductivity, almost all of the features exhibit a strong or significant positive correlation with each other's variability measures. The number of elements can be the reason behind these correlations, due to how the measures are derived.
The number of elements is highly correlated with almost all of the variability measures (std, range and entropy) of all the properties.
The correlation matrices between variability measures such as entropy, standard deviation and range follow similar patterns. The only difference is the strength of the correlations; these matrices mostly look like scaled versions of each other. This suggests that different variability measures reveal the same relationships between features, with different levels of strength.
Additionally, it is observed that the correlations between variability measures are significantly strong. This situation may lead to a collinearity problem among the predictors in a linear regression model, though most of these correlations are due to transformations of the variables (structural collinearity). Hence, this should not affect the predictive performance of the models.
According to the above results, one can claim that there is a complex correlation network between the predictors. This challenges one of linear regression's initial assumptions: linear regression models assume that all the predictors are independent of each other, whereas in our dataset there are non-negligible correlations between predictors. This situation makes it hard to guess the best predictors only by looking at their relationship with the response, because even though a variable has a strong correlation with the response, it does not necessarily mean that this variable should be included in our model. There can be other predictor combinations that give better prediction performance.
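If desired, this collinearity can be quantified with variance inflation factors. The optional check below was not run in the original analysis; it uses VIF() from the already-loaded fmsb package, which returns 1 / (1 - R^2) of a regression of one predictor on all of the others.
# Optional collinearity check: VIF of one variability measure regressed on
# all other predictors (response excluded). Values far above 10 indicate
# strong, here largely structural, collinearity.
VIF(lm(entropy_atomic_mass ~ . - critical_temp, data = data))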
In this stage of the analysis, a similar approach will be used to the one applied when investigating the correlations between predictors.
First, correlation values between the response variable and each predictor will be computed; then, if a significant correlation is observed, further investigation will be conducted by plotting the two variables on a scatter plot.
# Draw boxplot
boxplot(critical_temp~number_of_elements,data=data,
main="Critical Temperature - Number of elements",
xlab="Number of Elements", ylab="Critical Temperature",
col="orange",border="brown")
According to the above graph and correlation value:
# Choose pairs with an absolute correlation value higher than 0.5, order them, then print
df = cor(critical_temp,density) # First 4 columns are measure of central tendency.
df = melt(df)
colnames(df)[3] ="Correlation"
df = df[(df$Correlation > 0.5 | df$Correlation < -0.5),]
# Order rows according to correlation value
df = arrange(df, -Correlation)
# Add index column
df["index"] = 0
# Reorder columns
df = df[,c(4,1,2,3)]
# Assign values of index column
df["index"] = 1: length(df$Correlation)
df
cat("Number of columns have absolute correlation value higher than 0.5: "
,length(df$Correlation))
if(df$Correlation[1]>0)
{cat("\nHighest positive correlation value is: ", df$Correlation[1])}
if(df$Correlation[length(df$Correlation)]<0)
{cat("\nHighest negative correlation value is: ", df$Correlation[length(df$Correlation)])}
# Draw scatter plots with a regression line for variables highly correlated with the response
for (variable in df[,"Var2"]){
p = ggplot(data, aes_string(x = variable, y = "critical_temp")) +
geom_point(alpha=0.3) +
geom_smooth(method="auto",se=TRUE, fullrange=FALSE,level=0.95, color='red')+
facet_wrap( ~ "critical_temp", ncol=2) + theme_minimal()
print(p)
}
According to the above plots and correlation calculations:
Atomic Mass - Critical Temperature: The strongest correlation value is 0.6 and there are 2 features of atomic mass that exhibit a correlation stronger than 0.5.
First Ionisation Energy - Critical Temperature: The strongest correlation value is 0.55 and there are 4 features of fie that exhibit a correlation stronger than 0.5.
Atomic Radius - Critical Temperature: The strongest correlation value is 0.6 and there are 5 features of atomic radius that exhibit a correlation stronger than 0.5.
Density - Critical Temperature: The strongest correlation value is 0.55 and there are 2 features of density that exhibit a correlation stronger than 0.5.
Electron Affinity - Critical Temperature: No correlation stronger than 0.5 is observed.
Fusion Heat - Critical Temperature: The strongest correlation value is 0.55 and there are 2 features of fusion heat that exhibit a correlation stronger than 0.5.
Thermal Conductivity - Critical Temperature: The strongest correlation value is 0.7 and there are 3 features of thermal conductivity that exhibit a correlation stronger than 0.5.
Valence - Critical Temperature: The strongest correlation value is 0.6 and there are 6 features of valence that exhibit a correlation stronger than 0.5.
By considering the above results, one can say that the ranking of the top 4 correlations between critical temperature and the other features is as follows:
# Draw scatter plot and add regression line
options(repr.plot.width=14, repr.plot.height=8)
ggplot(data, aes(x=wtd_entropy_atomic_mass, y=critical_temp)) +
geom_point(aes(color=factor(number_of_elements))) +
geom_smooth(method="auto",se=TRUE, fullrange=FALSE,level=0.95, color='red')
The slope of the regression line differs for different numbers of elements, suggesting that there is collinearity between the two predictors.
# Draw scatter plot and add regression line
ggplot(data, aes(x=wtd_entropy_atomic_mass, y=critical_temp)) +
geom_point(aes(colour=(mean_Valence))) +
geom_smooth(method="auto",se=TRUE, fullrange=FALSE,level=0.95, color='red') +
scale_colour_gradient(low = "white", high = "black")
According to the above results for mean_Valence and number of elements:
number_of_elements <= 3 and number_of_elements > 3 form 2 groups that behave differently from each other. Therefore, these 2 groups could be created and added to the model as interactions with other predictors, to explain the variability in the response variable more accurately.
mean_Valence <= 4 and mean_Valence > 5 form 2 groups that behave differently from each other. Therefore, these 2 groups could also be created and added to the model as interactions with other predictors (a minimal sketch of this idea follows below).
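The sketch below is illustrative only and is not used in the final models; the grouping threshold follows the observation above and the interaction predictor (wtd_entropy_atomic_mass) is chosen purely for illustration.
# Illustrative only: create the number-of-elements grouping and add it as an
# interaction with one of the correlated predictors.
train$elem_group <- factor(ifelse(train$number_of_elements <= 3, "le3", "gt3"))
fit_int <- lm(critical_temp ~ elem_group * wtd_entropy_atomic_mass, data = train)
summary(fit_int)
# Drop the helper column so that the later full-model fits are unaffected.
train$elem_group <- NULL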
In this part, all 81 predictors will be included in the initial set of predictors; then, using subset selection and parameter shrinkage methods, the best possible models will be identified and compared.
Stepwise parameter selection will be performed using the BIC, adjusted R^2 and C_p measures.
Best subset selection is too costly to perform due to the high number of predictors: it would require training and comparing 2^81 models. Hence, only the forward and backward selection approaches will be considered.
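The rough counts below illustrate the gap: best subset selection would have to consider every subset of the 81 predictors, whereas forward stepwise selection fits at most one model per remaining candidate predictor at each of the 81 steps.
# Rough count of candidate models for each strategy
2^81            # best subset selection: ~2.4e24 subsets
1 + sum(81:1)   # forward stepwise selection: 3,322 model fits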
Name: Model.Accuracy
Input parameters: the predicted values, the target values, the residual degrees of freedom (df) and the number of predictors (p)
Return Value: A list containing: $R^2$, the residual standard error (RSE) and the F-statistic
Description: Calculate the TSS and RSS as: $TSS = \sum_{i=1}^n (y_i - \bar y)^2$, $RSS = \sum_{i=1}^n (\hat y_i - y_i)^2$, then $R^2 = 1 - RSS/TSS$, $RSE = \sqrt{RSS/df}$ and $F = \frac{(TSS-RSS)/p}{RSS/df}$
Name: RMSE
Input parameters: the predicted values and the target values
Return Value: The RMSE value calculated from the predicted and target values
Description: Calculate the RMSE value: $RMSE = \sqrt {\sum_{i=1}^n (\hat y_i - y_i)^2 / N}$
# Define function to calculate F statistics, R^2 and Residual standard error
Model.Accuracy <- function(predicted, target, df, p) {
rss <- 0
tss <- 0
target.mean <- mean(target)
for (i in 1:length(predicted)) {
rss <- rss + (predicted[i]-target[i])^2
tss <- tss + (target[i]-target.mean)^2
}
rsquared <- 1 - rss/tss
rse <- sqrt(rss/df)
f.stat <- ((tss-rss)/p) / (rss/df)
return(list(rsquared=rsquared,rse=rse,f.stat=f.stat))
}
# Define function to calculate Root-Mean-Squared-Error
RMSE <- function(predicted, target) {
se <- 0
for (i in 1:length(predicted)) {
se <- se + (predicted[i]-target[i])^2
}
return (sqrt(se/length(predicted)))
}
# Create a multiple linear regression model that includes all of 81 predictors
fit1 <- lm(critical_temp ~ . , data=train)
The R-squared value indicates that the model explains 74% of the variability in the response variable (i.e. critical temperature) of the training data. It is also important to note that there are 81 predictors in this model and probably most of them do not contribute much to this R^2 value, since R^2 can only increase as more predictors are added.
Also, the p-value and F-statistic of the model suggest rejecting the null hypothesis that the model explains nothing. A p-value < 2.2e-16 means that, if the model really explained nothing, a fit this good (or better) to the training data would be observed only about once in 2.2e+16 samples.
According to the summary above, some predictors' p-values are not significant, so they will probably be removed by the subset selection procedure applied later.
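The headline figures discussed above can be read directly off the fitted model object; since the full summary() output is not reproduced in this export, the lines below show where they live.
# Extract the headline figures from the full model
s1 <- summary(fit1)
s1$r.squared      # proportion of training-set variance explained
s1$adj.r.squared  # penalised for the number of predictors
s1$fstatistic     # F statistic with its degrees of freedom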
# Set the size of the plot
options(repr.plot.width=12, repr.plot.height=12)
# Check residuals using plot function
par(mfrow=c(2,2), cex = 1.5, cex.main = 1.5)
plot(fit1)
mtext("Residual Plots of Regression Model-1", line = -1, outer = TRUE, cex=1.5, font=2)
Residuals vs Fitted: One important assumption of linear regression is that the error terms have constant variance. The plot shows that the residuals follow a "V" shape, representing an increase in the variance of the residuals with increasing fitted values. Moreover, from the EDA we know that the critical temperature variable is skewed, which can easily create this issue in the model.
Q-Q Plot: Indicates that the residuals are mostly normally distributed, though for higher residual values they deviate significantly from the diagonal line. This also supports that the variance of the residuals is not constant.
Scale-Location: This graph also suggests that larger fitted values come with larger standardised residuals, i.e. heteroscedasticity (a formal check is sketched after this list).
Residuals vs Leverage: No influential points; the Cook's distance lines are not even visible.
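The heteroscedasticity noted above can also be checked formally; the sketch below uses the Breusch-Pagan test from the lmtest package, which is not loaded elsewhere in this notebook.
# Optional formal test for non-constant residual variance (requires lmtest)
library(lmtest)
bptest(fit1)   # a very small p-value is consistent with the pattern seen in the plots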
# Model with log transformation of response
fit2 = lm(log(critical_temp)~., data=train)
# Set the size of the plot
options(repr.plot.width=12, repr.plot.height=12)
# Check residuals using plot function
par(mfrow=c(2,2), cex = 1.5, cex.main = 1.5)
plot(fit2)
mtext("Residual Plots of Regression Model-1 with Log-Transformation", line = -1, outer = TRUE, cex=1.5, font=2)
# Calculate model accuracy measures for fit2
Model.Accuracy(exp(fit2$fitted.values), train[,82], df = 19048, p= 89)
According to the above results:
fit2 has a 3% lower R^2 value, and in terms of RSE and F-statistic the first model performs better.
Although fit2 slightly decreased the variability of the residual variance, the residuals still do not have constant variance; this time the variance changes due to an increased range of negative residuals.
Also, the Q-Q plots show that fit1's residuals are more normally distributed than fit2's.
# Model with square-root transformation of response variable
fit3 = lm(sqrt(critical_temp)~., data=train)
# Set the size of the plot
options(repr.plot.width=12, repr.plot.height=12)
# Check residuals using plot function
par(mfrow=c(2,2), cex = 1.5, cex.main = 1.5)
plot(fit3)
mtext("Residual Plots of Regression Model-1 with Root-Transformation", line = -1, outer = TRUE, cex=1.5, font=2)
# Calculate model accuracy measures for fit3
Model.Accuracy((fit3$fitted.values)^2, train[,82], df = 19048, p = 89)
According to the above results:
fit3 improved the R^2 value compared to fit1 and fit2 by 2% and 5% respectively. Also, in terms of RSE and F-statistic, fit3 outperforms the other 2 models.
fit3 does not reduce much of the variability of the residual variance, though its Scale-Location graph still looks better than fit2's.
In terms of the Q-Q plot, fit1 and fit3 look similar and both provide more normally distributed residuals than fit2.
All in all, the square-root transformation of the response variable seems like the best option. We also know from the EDA that the square-root transformed response is less skewed than the log-transformed and untransformed versions.
# Perform forward selection using Cp, adj_R^2, BIC.
regfit.fwd <- regsubsets(sqrt(critical_temp) ~ .,
data = train, method = "forward", nvmax = 81)
reg.summary.fwd <- summary(regfit.fwd)
# Draw results of regsubsets
par(mfrow = c(2, 2), cex = 1.5, cex.main = 1.5)
plot(reg.summary.fwd$cp, xlab = "Number of variables", ylab = "C_p", type = "l")
points(which.min(reg.summary.fwd$cp), reg.summary.fwd$cp[which.min(reg.summary.fwd$cp)],
col = "red", cex = 2, pch = 20)
plot(reg.summary.fwd$bic, xlab = "Number of variables", ylab = "BIC", type = "l")
points(which.min(reg.summary.fwd$bic), reg.summary.fwd$bic[which.min(reg.summary.fwd$bic)],
col = "red", cex = 2, pch = 20)
plot(reg.summary.fwd$adjr2, xlab = "Number of variables", ylab = "Adjusted R^2", type = "l")
points(which.max(reg.summary.fwd$adjr2), reg.summary.fwd$adjr2[which.max(reg.summary.fwd$adjr2)],
col = "red", cex = 2, pch = 20)
plot(reg.summary.fwd$rss, xlab = "Number of variables", ylab = "RSS", type = "l")
mtext("Forward Selection - Criteria Scores vs Number of Included Features & Optimum Point", line = -1, outer = TRUE, cex=1.5, font=2)
cat("Number of predictors for Minimum C_p:",which.min(reg.summary.fwd$cp))
cat("\nNumber of predictors for Minimum BIC:",which.min(reg.summary.fwd$bic))
cat("\nNumber of predictors for Maximum adj_R2:",which.max(reg.summary.fwd$adjr2))
Adjusted R^2 and C_p suggest keeping all of the predictors.
RSS always decreases as the number of predictors increases, so it does not suggest a specific number of variables; however, towards the right end of the graph the amount of decrease becomes insignificant, which can be interpreted as a sign that some of the predictors could be eliminated.
BIC has its lowest value when there are 76 variables. Therefore, one model will be created according to BIC's suggestion and compared with the full model.
3.1.1 Test Error of BIC - Cp - Adj_R^2 Models
# Make predictions on test set and print RMSE
bic_coef.fwd = coef(regfit.fwd, which.min(reg.summary.fwd$bic))
pred_names = names(bic_coef.fwd)
test.mat = model.matrix(critical_temp~., data = test)
pred = test.mat[,pred_names]%*%bic_coef.fwd
pred = pred^2
cat("RMSE of Step Forward BIC Model: ",RMSE(pred,test[,82]))
# Make predictions on test set and print RMSE
cp_coef.fwd = coef(regfit.fwd, which.min(reg.summary.fwd$cp))
pred_names = names(cp_coef.fwd)
test.mat = model.matrix(critical_temp~., data = test)
pred = test.mat[,pred_names]%*%cp_coef.fwd
pred = pred^2
cat("\nRMSE of Step Forward Cp Model: ",RMSE(pred,test[,82]))
# Make predictions on test set and print RMSE
adjr2_coef.fwd = coef(regfit.fwd, which.max(reg.summary.fwd$adjr2))
pred_names = names(adjr2_coef.fwd)
test.mat = model.matrix(critical_temp~., data = test)
pred = test.mat[,pred_names]%*%adjr2_coef.fwd
pred = pred^2
cat("\nRMSE of Step Forward Adjusted R^2 Model: ",RMSE(pred,test[,82]))
# Draw the stepwise selection process
par(cex.axis=.5, cex.lab=.5)
plot(regfit.fwd, main="C_p Stepwise Predictor Selection", scale="Cp",
col=gray(seq(0, 0.9, length = 100)),cex.axis=0.1)
# Draw the stepwise selection process
par(cex.axis=.5, cex.lab=.5)
plot(regfit.fwd, main="Adjusted R^2 Stepwise Predictor Selection", scale="adjr2",
col=gray(seq(0, 0.9, length = 100)),cex.axis=0.1)
# Draw the stepwise selection process
par(cex.axis=.5, cex.lab=.5)
plot(regfit.fwd, main="BIC Stepwise Predictor Selection", scale="bic",
col=gray(seq(0, 0.9, length = 100)))
According to the stepwise predictor selection procedures visualised above, the top 5 predictors that contribute most to prediction performance are the same for BIC, adjusted R^2 and C_p, and are as follows:
Apart from individual features, by looking at the general patterns of the graphs we can rank the chemical properties (such as atomic mass) according to their contribution to predictive performance. In the graphs, the features of a given chemical property are located together, which allows us to rank the properties by visual inspection.
According to BIC, adjusted R^2 and C_p:
1. Electron Affinity: Most of the lines nearly touch the x axis around the electron affinity measures, which makes electron affinity the property of the substance with the most significant impact on the prediction of critical temperature.
2. Thermal Conductivity: According to the result of the forward selection procedure, thermal conductivity should be the second most important property for predicting critical temperature.
The other 3 important predictor properties are atomic radius, atomic mass and valence; a quick tally of the selected features per property is sketched below.
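As a quick sanity check on this visual ranking (not part of the original output), the coefficients kept by the BIC model can be tallied by property; the name stems below are assumed to match this dataset's column naming.
# Count how many features of each chemical property the BIC model keeps
selected <- names(coef(regfit.fwd, which.min(reg.summary.fwd$bic)))[-1]  # drop intercept
stems <- c("atomic_mass", "fie", "atomic_radius", "Density",             # assumed name stems
           "ElectronAffinity", "FusionHeat", "ThermalConductivity", "Valence")
sapply(stems, function(s) sum(grepl(s, selected)))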
# Create test and training matrix
train.mat <- model.matrix(sqrt(critical_temp) ~ ., data = train)[,-1]
test.mat <- model.matrix(sqrt(critical_temp) ~ ., data = test)[,-1]
# Create grid for hyper parameter tuning
grid <- 10^seq(4, -2.5, length = 100)
# Find the best lambda & train the model
set.seed(1)# the purpose of fixing the seed of the random number generator is to make the result repeatable.
fit.ridge <- glmnet(train.mat, sqrt(train$critical_temp), alpha = 0, lambda = grid, thresh = 1e-12)
cv.ridge <- cv.glmnet(train.mat, sqrt(train$critical_temp), alpha = 0, lambda = grid, thresh = 1e-12)
bestlam.ridge <- cv.ridge$lambda.min
cat("Best performing lambda value for Ridge according to applied cross-validation",bestlam.ridge)
# Make predictions on test set
pred.ridge <- (predict(fit.ridge, s = bestlam.ridge, newx = test.mat))^2
cat("\nMSE of the model using Ridge parameter shrinkage",RMSE(pred.ridge,test[,82]))
# Find the best lambda & train the model
set.seed(1)# the purpose of fixing the seed of the random number generator is to make the result repeatable.
fit.lasso <- glmnet(train.mat, log(train$critical_temp), alpha = 1, lambda = grid, thresh = 1e-12)
cv.lasso <- cv.glmnet(train.mat, log(train$critical_temp), alpha = 1, lambda = grid, thresh = 1e-12)
bestlam.lasso <- cv.lasso$lambda.min
cat("Best performing lambda value for Lasso according to applied cross-validation",bestlam.lasso)
cat("\nNumber of predictors that Lasso excluded from the full model: ",
sum(predict(fit.lasso, s = bestlam.lasso, type = "coefficients")==0))
# Make predictions on test set
pred.lasso <- (predict(fit.lasso, s = bestlam.lasso, newx = test.mat))^2
cat("\nMSE of the model using Lasso parameter shrinkage",RMSE(pred.lasso,test[,82]))
Stepwise Selection BIC: BIC reached its lowest value with 76 predictors. The RMSE on the test set is 16.87109 for forward selection with BIC.
Stepwise Selection Cp and Adjusted R^2: Both suggested using the model with all 81 predictors. Hence they suggest the same model and the RMSE values are consequently identical. The RMSE on the test set is 16.83296 for forward selection with Adjusted R^2 and Cp.
Parameter Shrinkage using Ridge: Ridge did not assign a zero coefficient to any predictor, which means its model includes all 81 predictors. The RMSE on the test set is 17.01915 for the model with Ridge parameter shrinkage.
Parameter Shrinkage using Lasso: As expected, Lasso excluded some of the predictors and ended up with 60 predictors with non-zero coefficients. The RMSE on the test set is 39.09433 for the model with Lasso parameter shrinkage.
Starting with all 81 predictors and applying the subset selection and parameter shrinkage methods above, these models are obtained. The best predictive performance is observed when all the features are included, as suggested by Adjusted R^2 and C_p. Therefore, the final regression model will be the one with all the predictors.
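For convenience, the test RMSE values reported above are collected into one small table (numbers as reported; nothing is recomputed here).
# Test-set RMSE of the regression-based models, as reported above
data.frame(model     = c("Forward + BIC", "Forward + Cp / Adj. R^2", "Ridge", "Lasso"),
           test_RMSE = c(16.87109, 16.83296, 17.01915, 39.09433))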
# Final regression model is the one with all predictors
final_reg = fit3
# Load the datasets
data=read.csv("../input//super-conductors/train.csv")
nrow = dim(data)[1]
ncol = dim(data)[2]
# Divide dataset into test and train
set.seed(0)
test.index = sample(1:nrow, round(nrow/3), replace = FALSE)
train = data[-test.index, ]
test = data[test.index,]
# Separate labels
train_data = as.matrix(train[,-82])
train_label = train[,82]
test_data = as.matrix(test[,-82])
test_label = test[,82]
# Convert into Dmatrix for XGBoost
dtrain <- xgb.DMatrix(data = train_data, label= train_label)
dtest <- xgb.DMatrix(data = test_data, label= test_label)
# xgboost fitting with arbitrary parameters
params = list(
objective = "reg:squarederror", # Optimise on Squared Error in regression
eta = 0.1, # learning rate
max.depth = 3, # Max depth of each decision tree
eval_metric = "rmse" # Root-Mean-Squared-Error
)
# cross-validate xgboost to get the accurate measure of error
xgb_cv = xgb.cv(params = params,
data = dtrain,
nrounds = 200,
nfold = 5, # number of folds in K-fold
prediction = TRUE, # return the prediction using the final model
early_stopping_rounds = 3, # stop if there is no improvement for 3 consecutive rounds
verbose=FALSE
)
# Exclude std. columns
xgb_cv = xgb_cv$evaluation_log %>% select(-contains("std")) %>% melt(id=c("iter"))
# Size the plot
options(repr.plot.width=10, repr.plot.height=8)
# Plot Train & Test Error
ggplot(xgb_cv, aes(x=iter, y=value,col=variable)) + geom_line() +theme_minimal(base_size = 18) +
xlab("Round No") + ylab("RMSE") + ggtitle(" Cross-Validation - Rounds") +
scale_color_discrete(name = "Labels", labels = c( "Train", "Test"))
# set up the cross-validated hyper-parameter search grid
grid = expand.grid(
nrounds = 100,
eta = c(1, 0.1, 0.01),
max_depth = c(2, 4, 6),
gamma = 0,
colsample_bytree=1,
min_child_weight = 2,
subsample = 1
)
# Set train control parameters
cont_params = trainControl(
method = "cv",
number = 5, # Number of Folds in CV
verboseIter = FALSE,
returnData = FALSE,
returnResamp = "all", # save losses across all models
)
# train XGBoost models for each parameter combination in the grid
xgb_tuning = train(
x = train_data,
y = train_label,
trControl = cont_params,
tuneGrid = grid,
method = "xgbTree"
)
# Size the plot
options(repr.plot.width=6, repr.plot.height=6)
# Plot of the RMSE against hyper-parameters max_depth and eta
ggplot(xgb_tuning$results, aes(x = as.factor(eta), y = max_depth, size = RMSE, color = RMSE)) +
geom_point() +
theme_minimal(base_size = 18) +
xlab("eta") +
scale_size_continuous(guide = "none") + ggtitle("Grid-Search - Max-depth & eta")
cat("Lowest RMSE giving eta: ",xgb_tuning$bestTune$eta)
cat("\nLowest RMSE giving max_depth: ",xgb_tuning$bestTune$max_depth)
xgb_model <- xgb.train(objective = "reg:squarederror", # Optimise on squared error for regression
                       eta = 0.1,            # learning rate
                       max.depth = 6,        # Max depth of each decision tree
                       eval_metric = "rmse", # Root-Mean-Squared-Error
                       data = dtrain,
                       nrounds = 250,        # Number of boosting rounds (trees)
                       verbose = 0
                       )
# get the importance of features
importance_mat = xgb.importance(colnames(dtrain), model = xgb_model)
# Size the plot
options(repr.plot.width=10, repr.plot.height=10)
# Draw the bar plot of feature importance
xgb.ggplot.importance(importance_mat, top_n = 10) + theme_minimal(base_size = 18) +
ggtitle("Top 10 Feature Importance") + ylab("Importance: Weights in the Model")
These results are consistent with the ones found using the regression models. In particular, thermal conductivity is the most important chemical property according to all of the created models.
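As a rough cross-check (not produced in the original run), the regression side can be ranked in a similar way; caret::varImp() on an lm object scores each predictor by the absolute value of its t-statistic.
# Rank the regression predictors by |t| for comparison with the XGBoost importances
reg_imp <- varImp(final_reg)
head(reg_imp[order(-reg_imp$Overall), , drop = FALSE], 10)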
# Make predictions on test set and print RMSE
regression_pred=(predict(final_reg, as.data.frame(test_data)))^2
cat("RMSE of Final Regression Model: ",RMSE(regression_pred,test_label),
"; which is ",RMSE(regression_pred,test_label)*100/mean(test_label),
"% of the mean critical temperature")
xgb_pred = (predict(xgb_model, dtest))
cat("\nRMSE of Tuned XGBoost Model: ",RMSE(xgb_pred,test_label),
"; which is ",RMSE(xgb_pred,test_label)*100/mean(test_label),
"% of the mean critical temperature")
RMSE: Looking at the RMSE, the XGBoost model deviates from the actual critical temperature by 9.7 on average on the test set, whereas the RMSE is 16.9 for the regression. These values can also be evaluated relative to the average critical temperature: on this basis, an average regression prediction deviates by 48% of the mean critical temperature, while XGBoost deviates by 27%, suggesting that XGBoost has better predictive performance than the regression model.
All in all, looking at the above results, the model created using XGBoost outperforms all of the other models developed in this notebook.