Logistic Regression
Problem
Using the logistic regression model of vehicle transmission in the data set mtcars, estimate the probability of a vehicle being fitted with a manual transmission if it has a 120hp engine and weighs 2,800 lbs.
In [1]:
head(mtcars)
Solution
We apply the function glm to a formula that describes the transmission type (am) by the horsepower (hp) and weight (wt). This creates a generalized linear model (GLM) in the binomial family.
In [3]:
set.seed(123)
glm.fit <- glm(formula = am ~ hp + wt, data = mtcars, family = binomial)
We then wrap the test parameters inside a data frame newdata. Note that wt in mtcars is recorded in units of 1,000 lbs, so 2,800 lbs corresponds to wt = 2.8.
In [4]:
newdata = data.frame(hp=120, wt=2.8)
newdata
Now we apply the function predict to the generalized linear model glm.fit along with newdata. We select the "response" prediction type in order to obtain the predicted probability.
In [5]:
predict(glm.fit, newdata, type="response")
Answer
For an automobile with a 120hp engine weighing 2,800 lbs, the probability of it being fitted with a manual transmission is about 64%.
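The same probability can be reproduced by hand from the fitted coefficients of glm.fit; this is just a quick cross-check of the logistic equation, not part of the original solution.
In [ ]:
# Manual cross-check: p = 1 / (1 + exp(-(b0 + b_hp*120 + b_wt*2.8)))
coefs <- coef(glm.fit)
eta <- coefs["(Intercept)"] + coefs["hp"] * 120 + coefs["wt"] * 2.8
plogis(eta)   # should agree with the predict() result above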
Note
Further details of the function predict for generalized linear models can be found in the R documentation.
help(predict.glm)
Significance Test for Logistic Regression
We want to know whether there is any significant relationship between the dependent variable am and the independent variables hp and wt.
Problem
At .05 significance level, decide if any of the independent variables in the logistic regression model of vehicle transmission in data set mtcars is statistically insignificant.
Solution
We apply the function glm to a formula that describes the transmission type (am) by the horsepower (hp) and weight (wt). This creates a generalized linear model (GLM) in the binomial family. We have already completed this step in the previous section.
We then print out the summary of the generalized linear model and check for the p-values of the hp and wt variables.
In [7]:
summary(glm.fit)
Answer
As the p-values of the hp and wt variables are both less than 0.05, neither independent variable is insignificant; both hp and wt are statistically significant in the logistic regression model.
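The conclusion can also be checked programmatically by pulling the p-values out of the coefficient table; this is a small sketch assuming glm.fit from the previous section.
In [ ]:
# Extract the Pr(>|z|) column of the coefficient table
pvals <- coef(summary(glm.fit))[, "Pr(>|z|)"]
pvals
pvals[c("hp", "wt")] < 0.05   # both should be TRUE at the .05 level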
Note
Further detail of the function summary for the generalized linear model can be found in the R documentation.
help(summary.glm)
Random Forest Regression
A normal decision tree model builds a single tree, whereas the random forest algorithm builds a number of decision trees during the process. Each tree casts a vote, and the votes are combined to decide the final class of a case or an object; this is called an ensemble process.
Random Forest uses Gini Index based impurity measures for building its decision trees. The Gini Index is also used for building Classification and Regression Trees (CART). An earlier blog post explains how the CART decision tree works and walks through an example of the Gini Index calculation.
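As a quick illustration of the Gini Index idea (this snippet is an added sketch, not from the blog post referenced above), the impurity of a node is one minus the sum of squared class proportions; a pure node has impurity 0.
In [ ]:
# Gini impurity of a set of class labels: 1 - sum(p_k^2)
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}
gini(mtcars$am)                   # impurity of the root node for am
gini(mtcars$am[mtcars$wt < 3])    # impurity of one child after an example split on wt
gini(mtcars$am[mtcars$wt >= 3])   # impurity of the other child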
Random Forest using R
The Random Forest algorithm is implemented in the randomForest package of R, and the function of the same name allows us to use Random Forest in R.
In [10]:
#install.packages("randomForest")
# Load library
library(randomForest)
# Help on randomForest package and function
#library(help=randomForest)
#help(randomForest)
Some of the commonly used parameters of the randomForest function are listed below (an illustrative call follows the list):
- x: the Random Forest formula
- data: input data frame
- ntree: number of decision trees to be grown
- replace: takes TRUE or FALSE and indicates whether the sample is drawn with or without replacement
- sampsize: sample size to be drawn from the input data for growing each decision tree
- importance: whether the importance of the independent variables in the random forest should be assessed
- proximity: whether to calculate proximity measures between rows of the data frame
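To see how these parameters fit together, here is an illustrative call on mtcars; the parameter values are arbitrary choices for demonstration only and rf.demo is not used later.
In [ ]:
# Illustrative call only; the parameter values here are arbitrary, not tuned
mtcars.demo <- transform(mtcars, am = factor(am))   # make am a class label
rf.demo <- randomForest(am ~ ., data = mtcars.demo,
                        ntree = 100,                  # grow 100 trees
                        replace = TRUE,               # bootstrap samples drawn with replacement
                        sampsize = nrow(mtcars.demo), # sample size per tree
                        importance = TRUE,            # assess variable importance
                        proximity = TRUE)             # proximity between rows
rf.demo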
Random Forest can be used for classification and regression problems. Based on the type of the target/response variable, the relevant decision trees will be built. If the target variable is a factor, the trees will be built for a classification problem; if the target variable is numeric, the trees will be built for a regression problem.
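A quick way to see this distinction (an added illustration, with ntree kept small just to keep it fast):
In [ ]:
# A numeric target gives a regression forest; a factor target gives a classification forest
randomForest(am ~ ., data = mtcars, ntree = 50)$type                              # "regression"
randomForest(am ~ ., data = transform(mtcars, am = factor(am)), ntree = 50)$type  # "classification"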
Classification
We take am as the target variable and the other columns as predictor variables.
In [37]:
table(mtcars$am)/nrow(mtcars)
table(mtcars$am)
The class distribution is pretty much balanced.
Next, we will split the data sample into development and validation samples.
In [12]:
sample.idx <- sample(2, nrow(mtcars), replace = TRUE, prob = c(0.75, 0.25))
dat.train <- mtcars[sample.idx == 1, ]
dat.test <- mtcars[sample.idx == 2, ]
str(sample.idx)
table(dat.train$am)/nrow(dat.train)
nrow(dat.train)
table(dat.test$am)/nrow(dat.test)
nrow(dat.test)
In [13]:
#Check the type of target variable
class(dat.train$am)
class(dat.test$am)
dat.train$am = as.factor(dat.train$am)
dat.test$am = as.factor(dat.test$am)
class(dat.train$am)
class(dat.test$am)
In [25]:
# define the training data columns, excluding the target column am
#cols = names(mtcars)
#which(names(mtcars)== 'am')
#cols = cols[-which(names(mtcars)== 'am')]
#cols
set.seed(123)
# Grow a random forest with 20 trees and assess variable importance
model.rf <- randomForest(am ~ ., data = dat.train, ntree = 20, importance = TRUE)
# Predict the transmission type for the validation sample
model.rf.pred <- predict(model.rf, dat.test)
rf_cost.pred <- model.rf.pred
rf_cost.pred
dat.test$predicted.response <- rf_cost.pred
head(dat.test)
In [27]:
model.rf
plot(model.rf)
In [18]:
## Look at variable importance:
important.feature <- round(importance(model.rf), 2)
fcts <- important.feature[sort.list(important.feature[, 1], decreasing = TRUE), ]
fcts
# Feature Importance Plot
varImpPlot(model.rf,
           sort = TRUE,
           main = "Feature Importance",
           n.var = 5)
Model Evaluation
Confusion Matrix
The confusionMatrix function from the caret package can be used to create a confusion matrix from the actual response variable and the predicted values.
In [19]:
str(dat.test)
In [ ]:
#install.packages("e1071")
#install.packages("caret")
library(e1071)
library(caret)
# Create Confusion Matrix
confusionMatrix(data=dat.test$predicted.response,
reference=dat.test$am,
positive='1')
In statistics, a receiver operating characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate is also known as sensitivity, recall, or probability of detection in machine learning. The false positive rate is also known as the fall-out or probability of false alarm and can be calculated as (1 - specificity). The ROC curve is thus the sensitivity as a function of fall-out. In general, if the probability distributions for both detection and false alarm are known, the ROC curve can be generated by plotting the cumulative distribution function (the area under the probability distribution from −∞ to the discrimination threshold) of the detection probability on the y-axis versus the cumulative distribution function of the false-alarm probability on the x-axis.
In [29]:
head(dat.train)
head(dat.train[,-9])
In [33]:
#load ROCR library
install.packages('ROCR')
library('ROCR')
In [36]:
# Predicted class probabilities on the training data (column 2 = P(am = 1))
OOB.votes <- predict(model.rf, dat.train[, -9], type = "prob")
OOB.pred <- OOB.votes[, 2]
pred.obj <- prediction(OOB.pred, dat.train$am)
# Recall vs. precision curve
RP.perf <- performance(pred.obj, "rec", "prec")
plot(RP.perf)
# ROC curve: true positive rate vs. false positive rate
ROC.perf <- performance(pred.obj, "tpr", "fpr")
plot(ROC.perf)
# Precision, recall and FPR as functions of the cutoff (alpha) values
plot(RP.perf@alpha.values[[1]], RP.perf@x.values[[1]])
lines(RP.perf@alpha.values[[1]], RP.perf@y.values[[1]])
lines(ROC.perf@alpha.values[[1]], ROC.perf@x.values[[1]])
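As a small follow-up that is not in the original notebook, the area under the ROC curve can be pulled from the same ROCR prediction object.
In [ ]:
# AUC from the ROCR prediction object
auc.perf <- performance(pred.obj, "auc")
auc.perf@y.values[[1]]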