Logistic Regression

Problem

By use of the logistic regression equation of vehicle transmission in the data set mtcars, estimate the probability of a vehicle being fitted with a manual transmission if it has a 120hp engine and weights 2800 lbs.

In [1]:

head(mtcars)

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21	6	160	110	3.9	2.62	16.46	0	1	4	4
Mazda RX4 Wag	21	6	160	110	3.9	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1
Hornet Sportabout	18.7	8	360	175	3.15	3.44	17.02	0	0	3	2
Valiant	18.1	6	225	105	2.76	3.46	20.22	1	0	3	1

Solution

We apply the function glm to a formula that describes the transmission type (am) by the horsepower (hp) and weight (wt). This creates a generalized linear model (GLM) in the binomial family.

In [3]:

set.seed(123)
glm.fit <- span=""> glm(formula=am ~ hp + wt, data=mtcars, family=binomial)

We then wrap the test parameters inside a data frame newdata.

In [4]:

newdata = data.frame(hp=120, wt=2.8)
newdata

	hp	wt
1	120	2.8

Now we apply the function predict to the generalized linear model glm.fit along with newdata. We will have to select response prediction type in order to obtain the predicted probability.

In [5]:

predict(glm.fit, newdata, type="response")

1: 0.641812528409381

Answer

For an automobile with 120hp engine and 2800 lbs weight, the probability of it being fitted with a manual transmission is about 64%.

Note

Further detail of the function predict for generalized linear model can be found in the R documentation.

help(predict.glm)

Significance Test for Logistic Regression

We want to know whether there is any significant relationship between the dependent variable am and the independent variables hp and wt.

Problem

At .05 significance level, decide if any of the independent variables in the logistic regression model of vehicle transmission in data set mtcars is statistically insignificant.

Solution

We apply the function glm to a formula that describes the transmission type (am) by the horsepower (hp) and weight (wt). This creates a generalized linear model (GLM) in the binomial family. We have finished this step in previous section.

We then print out the summary of the generalized linear model and check for the p-values of the hp and wt variables.

In [7]:

summary(glm.fit)

Call:
glm(formula = am ~ hp + wt, family = binomial, data = mtcars)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2537  -0.1568  -0.0168   0.1543   1.3449  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) 18.86630    7.44356   2.535  0.01126 * 
hp           0.03626    0.01773   2.044  0.04091 * 
wt          -8.08348    3.06868  -2.634  0.00843 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 10.059  on 29  degrees of freedom
AIC: 16.059

Number of Fisher Scoring iterations: 8

Answer

As the p-values of the hp and wt variables are both less than 0.05, either feature - hp or wt - is significant in the logistic regression model.

Note

Further detail of the function summary for the generalized linear model can be found in the R documentation.

help(summary.glm)

Random Forest Regression

In a normal decision tree, one decision tree is built and in a random forest algorithm number of decision trees are built during the process. A vote from each of the decision trees is considered in deciding the final class of a case or an object, this is called ensemble process.

Random Forest uses Gini Index based impurity measures for building decision tree. Gini Index is also used for building Classification and Regression Tree (CART). In earlier the following blog post, they explained working of CART Decision Tree and a worked out example of Gini Index calculation .

Random Forest using R

Random Forest algorithm is built in randomForest package of R and same name function allows us to use the Random Forest in R.

In [10]:

#install.packages("randomForest")
# Load library
library(randomForest)
# Help on ramdonForest package and function
#library(help=randomForest)
#help(randomForest)

Some of the commonly used parameters of randomForest functions are

x : Random Forest Formula
data: Input data frame
ntree: Number of decision trees to be grown
replace: Takes True and False and indicates whether to take sample with/without replacement
sampsize: Sample size to be drawn from the input data for growing decision tree
importance: Whether independent variable importance in random forest be assessed
proximity: Whether to calculate proximity measures between rows of a data frame

Random Forest can be used for Classification and Regression problems. Based on type of target /response variable, the relevant decision trees will be built. If the target variable type is factor, the decision tree will be built for a classifcation problem. If the target variable type is numeric, the decsion tree will be built for a regression problem.

Classfication

Taking am as target variable, and others as predictor variables.

In [37]:

table(mtcars$am)/nrow(mtcars)
table(mtcars$am)

      0       1 
0.59375 0.40625

 0  1 
19 13

The class distribution is pretty much balanced.

Next, we will split the data sample into development and validation samples.

In [12]:

sample.idx <- span=""> sample(2, nrow(mtcars), replace = T, prob = c(0.75,0.25))
#sample.idx <- mtcars="" nrow="" sample="" span="">
dat.train <- span=""> mtcars[sample.idx==1,]
dat.test <- span=""> mtcars[sample.idx==2,]
 
str(sample.idx)

table(dat.train$am)/nrow(dat.train)

nrow(dat.train)
 
table(dat.test$am)/nrow(dat.test)
nrow(dat.test)

 int [1:32] 1 2 1 2 2 1 1 2 1 1 ...

        0         1 
0.5714286 0.4285714

        0         1 
0.6363636 0.3636364

In [13]:

#Check the type of target variable
class(dat.train$am)
class(dat.test$am)

dat.train$am = as.factor(dat.train$am)
dat.test$am = as.factor(dat.test$am)

class(dat.train$am)
class(dat.test$am)

'numeric'

'factor'

In [25]:

# define the training data columns, excluding the PatientMemberID column    
#cols = names(mtcars)
#which(names(mtcars)== 'am')
#cols = cols[-which(names(mtcars)== 'am')]
#cols
set.seed(123)
model.rf <- span=""> randomForest(am ~ .,data=dat.train, ntree=20, importance=TRUE)
model.rf.pred <- span=""> predict(model.rf, dat.test)
rf_cost.pred = model.rf.pred
rf_cost.pred
dat.test$predicted.response <- span=""> rf_cost.pred
head(dat.test)

Mazda RX4 Wag: 1
Hornet 4 Drive: 0
Hornet Sportabout: 0
Merc 240D: 0
Merc 280C: 0
Lincoln Continental: 0
Toyota Corolla: 1
Toyota Corona: 1
Camaro Z28: 0
Maserati Bora: 0
Volvo 142E: 1

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb	predicted.response
Mazda RX4 Wag	21	6	160	110	3.9	2.875	17.02	0	1	4	4	1
Hornet 4 Drive	21.4	6	258	110	3.08	3.215	19.44	1	0	3	1	0
Hornet Sportabout	18.7	8	360	175	3.15	3.44	17.02	0	0	3	2	0
Merc 240D	24.4	4	146.7	62	3.69	3.19	20	1	0	4	2	0
Merc 280C	17.8	6	167.6	123	3.92	3.44	18.9	1	0	4	4	0
Lincoln Continental	10.4	8	460	215	3	5.424	17.82	0	0	3	4	0

In [27]:

model.rf
plot(model.rf)

Call:
 randomForest(formula = am ~ ., data = dat.train, ntree = 20,      importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 20
No. of variables tried at each split: 3

        OOB estimate of  error rate: 14.29%
Confusion matrix:
   0 1 class.error
0 10 2   0.1666667
1  1 8   0.1111111

In [18]:

## Look at variable importance:
important.feature <- span=""> round(importance(model.rf), 2)
fcts <- span=""> important.feature[sort.list(important.feature[,1],decreasing = TRUE),]
fcts


# Feature Importance Plot
varImpPlot(model.rf,
           sort = T,
           main="Feature Importance",
           n.var=5)

	0	1	MeanDecreaseAccuracy	MeanDecreaseGini
wt	2.89	3.04	3.19	3.42
drat	2.71	2.48	2.96	3.13
mpg	1.48	2.10	2.16	1.36
disp	1.03	1.22	1.75	0.46
gear	1.03	1.03	1.03	0.42
cyl	0.00	0.00	0.00	0.13
hp	0.00	-0.16	0.00	0.53
qsec	0	0	0	0
vs	0.00	0.00	0.00	0.08
carb	0.00	0.00	0.00	0.25

Model Evaluation

Confusion Matrix

confusionMatrix function from caret package can be used for creating confusion matrix based on actual response variable and predicted value.

In [19]:

str(dat.test)

'data.frame': 11 obs. of  12 variables:
 $ mpg               : num  21 21.4 18.7 24.4 17.8 10.4 33.9 21.5 13.3 15 ...
 $ cyl               : num  6 6 8 4 6 8 4 4 8 8 ...
 $ disp              : num  160 258 360 147 168 ...
 $ hp                : num  110 110 175 62 123 215 65 97 245 335 ...
 $ drat              : num  3.9 3.08 3.15 3.69 3.92 3 4.22 3.7 3.73 3.54 ...
 $ wt                : num  2.88 3.21 3.44 3.19 3.44 ...
 $ qsec              : num  17 19.4 17 20 18.9 ...
 $ vs                : num  0 1 0 1 1 0 1 1 0 0 ...
 $ am                : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 2 ...
 $ gear              : num  4 3 3 4 4 3 4 3 3 5 ...
 $ carb              : num  4 1 2 2 4 4 1 1 4 8 ...
 $ predicted.response: Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 2 1 1 ...

In [ ]:

#install.packages("e1071")
#install.packages("caret")
library(e1071)
library(caret)
# Create Confusion Matrix
confusionMatrix(data=dat.test$predicted.response,
                reference=dat.test$am,
                positive='1')
plot()

In statistics, a receiver operating characteristic (ROC), or ROC curve, is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity, recall or probability of detection[1] in machine learning. The false-positive rate is also known as the fall-out or probability of false alarm[1] and can be calculated as (1 - specificity). The ROC curve is thus the sensitivity as a function of fall-out. In general, if the probability distributions for both detection and false alarm are known, the ROC curve can be generated by plotting the cumulative distribution function (area under the probability distribution from {\displaystyle -\infty } -\infty to the discrimination threshold) of the detection probability in the y-axis versus the cumulative distribution function of the false-alarm probability in x-axis.

In [29]:

head(dat.train)
head(dat.train[,-9])

	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
Mazda RX4	21	6	160	110	3.9	2.62	16.46	0	1	4	4
Datsun 710	22.8	4	108	93	3.85	2.32	18.61	1	1	4	1
Valiant	18.1	6	225	105	2.76	3.46	20.22	1	0	3	1
Duster 360	14.3	8	360	245	3.21	3.57	15.84	0	0	3	4
Merc 230	22.8	4	140.8	95	3.92	3.15	22.9	1	0	4	2
Merc 280	19.2	6	167.6	123	3.92	3.44	18.3	1	0	4	4

	mpg	cyl	disp	hp	drat	wt	qsec	vs	gear	carb
Mazda RX4	21	6	160	110	3.9	2.62	16.46	0	4	4
Datsun 710	22.8	4	108	93	3.85	2.32	18.61	1	4	1
Valiant	18.1	6	225	105	2.76	3.46	20.22	1	3	1
Duster 360	14.3	8	360	245	3.21	3.57	15.84	0	3	4
Merc 230	22.8	4	140.8	95	3.92	3.15	22.9	1	4	2
Merc 280	19.2	6	167.6	123	3.92	3.44	18.3	1	4	4

In [33]:

#load ROCR library
install.packages('ROCR')
library('ROCR')

Installing package into '/home/nbcommon/R'
(as 'lib' is unspecified)

The downloaded source packages are in
 '/tmp/RtmpZnuB2d/downloaded_packages'

Loading required package: gplots
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009

Attaching package: 'gplots'

The following object is masked from 'package:stats':

    lowess

In [36]:

OOB.votes <- span=""> predict (model.rf,dat.train[,-9],type="prob")
OOB.pred <- span=""> OOB.votes[,2]
pred.obj <- span=""> prediction (OOB.pred,dat.train$am)
RP.perf <- span=""> performance(pred.obj, "rec","prec")
plot (RP.perf)
ROC.perf <- span=""> performance(pred.obj, "fpr","tpr");
plot (ROC.perf)
plot  (RP.perf@alpha.values[[1]],RP.perf@x.values[[1]])
lines (RP.perf@alpha.values[[1]],RP.perf@y.values[[1]])
lines (ROC.perf@alpha.values[[1]],ROC.perf@x.values[[1]])

Yiyu Jia's technical Blog

Friday, May 1, 2015

R sample code for Classification

Logistic Regression

Problem

Solution

Answer

Note

Significance Test for Logistic Regression

Problem

Solution

Answer

Note

Random Forest Regression

Random Forest using R

Classfication

Model Evaluation

No comments:

Post a Comment