Friday, May 1, 2015

R sample code for Classification


Logistic Regression

Problem

Using the logistic regression equation of vehicle transmission in the data set mtcars, estimate the probability of a vehicle being fitted with a manual transmission if it has a 120hp engine and weighs 2800 lbs.
In [1]:
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Solution

We apply the function glm to a formula that describes the transmission type (am) by the horsepower (hp) and weight (wt). This creates a generalized linear model (GLM) in the binomial family.
In [3]:
set.seed(123)
glm.fit <- glm(formula=am ~ hp + wt, data=mtcars, family=binomial)
We then wrap the test parameters inside a data frame newdata.
In [4]:
newdata = data.frame(hp=120, wt=2.8)
newdata
   hp  wt
1 120 2.8
Now we apply the function predict to the generalized linear model glm.fit along with newdata. We have to select the response prediction type in order to obtain the predicted probability.
In [5]:
predict(glm.fit, newdata, type="response")
1: 0.641812528409381

Answer

For an automobile with a 120hp engine and a weight of 2800 lbs, the probability of it being fitted with a manual transmission is about 64%.
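As a quick sanity check, the same probability can be computed directly from the fitted coefficients with the logistic function; a minimal sketch using the glm.fit object created above:
# extract the fitted coefficients and apply the logistic function by hand
b <- coef(glm.fit)
eta <- b["(Intercept)"] + b["hp"]*120 + b["wt"]*2.8
1/(1 + exp(-eta))  # should agree with predict(..., type="response"), about 0.64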

Note

Further detail of the function predict for generalized linear models can be found in the R documentation.
help(predict.glm)

Significance Test for Logistic Regression

We want to know whether there is any significant relationship between the dependent variable am and the independent variables hp and wt.

Problem

At .05 significance level, decide if any of the independent variables in the logistic regression model of vehicle transmission in data set mtcars is statistically insignificant.

Solution

We apply the function glm to a formula that describes the transmission type (am) by the horsepower (hp) and weight (wt). This creates a generalized linear model (GLM) in the binomial family. We have already completed this step in the previous section.
We then print out the summary of the generalized linear model and check for the p-values of the hp and wt variables.
In [7]:
summary(glm.fit)
Call:
glm(formula = am ~ hp + wt, family = binomial, data = mtcars)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2537  -0.1568  -0.0168   0.1543   1.3449  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) 18.86630    7.44356   2.535  0.01126 * 
hp           0.03626    0.01773   2.044  0.04091 * 
wt          -8.08348    3.06868  -2.634  0.00843 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 10.059  on 29  degrees of freedom
AIC: 16.059

Number of Fisher Scoring iterations: 8

Answer

As the p-values of the hp and wt variables are both less than 0.05, neither of them is statistically insignificant: both hp and wt are significant in the logistic regression model.
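If we want the p-values on their own rather than reading them off the printed summary, they can be extracted from the coefficient table; a small sketch using the glm.fit object from above:
# coefficient table of the model summary; the last column holds the p-values
pvals <- coef(summary(glm.fit))[, "Pr(>|z|)"]
pvals
pvals < 0.05  # TRUE for every term at the .05 significance level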

Note

Further detail of the function summary for generalized linear models can be found in the R documentation.
help(summary.glm)

Random Forest Regression

A normal decision tree model builds a single tree, whereas the random forest algorithm builds a number of decision trees during training. Each tree casts a vote, and the votes are combined to decide the final class of a case or object; this is called an ensemble process.
Random Forest uses Gini-index-based impurity measures for building its decision trees. The Gini index is also used for building Classification and Regression Trees (CART); an earlier blog post explained how a CART decision tree works and walked through an example of the Gini index calculation.
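For reference, the Gini impurity of a node is 1 minus the sum of the squared class proportions. A minimal sketch of the calculation for the am variable in mtcars (the gini helper below is purely illustrative, not part of randomForest):
# Gini impurity of a node: 1 - sum of squared class proportions
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}
gini(mtcars$am)  # impurity of the root node before any split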

Random Forest using R

The Random Forest algorithm is implemented in the randomForest package of R, and the function of the same name allows us to fit a random forest in R.
In [10]:
#install.packages("randomForest")
# Load library
library(randomForest)
# Help on ramdonForest package and function
#library(help=randomForest)
#help(randomForest)
Some of the commonly used parameters of the randomForest function are
  • formula: the model formula describing the target and the predictor variables
  • data: input data frame
  • ntree: number of decision trees to be grown
  • replace: takes TRUE or FALSE and indicates whether to sample cases with or without replacement
  • sampsize: sample size to be drawn from the input data for growing each decision tree
  • importance: whether the importance of the independent variables should be assessed
  • proximity: whether to calculate proximity measures between rows of the data frame
Random Forest can be used for classification and regression problems. Based on the type of the target/response variable, the relevant decision trees are built: if the target variable is a factor, the trees are built for a classification problem; if the target variable is numeric, the trees are built for a regression problem.
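A minimal sketch of this behaviour on mtcars (the model names rf.reg and rf.cls are illustrative, and the exact forests depend on the random seed):
library(randomForest)
set.seed(123)
# numeric target -> regression forest
rf.reg <- randomForest(am ~ hp + wt, data=mtcars, ntree=50)
rf.reg$type  # "regression"
# factor target -> classification forest
rf.cls <- randomForest(factor(am) ~ hp + wt, data=mtcars, ntree=50)
rf.cls$type  # "classification"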

Classification

We take am as the target variable and the remaining columns as predictor variables.
In [37]:
table(mtcars$am)/nrow(mtcars)
table(mtcars$am)
      0       1 
0.59375 0.40625 
 0  1 
19 13 
The class distribution is pretty much balanced.
Next, we will split the data sample into development and validation samples.
In [12]:
sample.idx <- sample(2, nrow(mtcars), replace = TRUE, prob = c(0.75, 0.25))
dat.train <- mtcars[sample.idx == 1, ]
dat.test <- mtcars[sample.idx == 2, ]
 
str(sample.idx)

table(dat.train$am)/nrow(dat.train)

nrow(dat.train)
 
table(dat.test$am)/nrow(dat.test)
nrow(dat.test)
 int [1:32] 1 2 1 2 2 1 1 2 1 1 ...
        0         1 
0.5714286 0.4285714 
21
        0         1 
0.6363636 0.3636364 
11
In [13]:
#Check the type of target variable
class(dat.train$am)
class(dat.test$am)

dat.train$am = as.factor(dat.train$am)
dat.test$am = as.factor(dat.test$am)

class(dat.train$am)
class(dat.test$am)
'numeric'
'numeric'
'factor'
'factor'
In [25]:
# define the predictor columns, excluding the target column am
#cols = names(mtcars)
#which(names(mtcars)== 'am')
#cols = cols[-which(names(mtcars)== 'am')]
#cols
set.seed(123)
model.rf <- randomForest(am ~ ., data=dat.train, ntree=20, importance=TRUE)
model.rf.pred <- predict(model.rf, dat.test)
rf_cost.pred = model.rf.pred
rf_cost.pred
dat.test$predicted.response <- rf_cost.pred
head(dat.test)
      Mazda RX4 Wag      Hornet 4 Drive   Hornet Sportabout           Merc 240D 
                  1                   0                   0                   0 
          Merc 280C Lincoln Continental      Toyota Corolla       Toyota Corona 
                  0                   0                   1                   1 
         Camaro Z28       Maserati Bora          Volvo 142E 
                  0                   0                   1 
Levels: 0 1
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb predicted.response
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4                  1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1                  0
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2                  0
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2                  0
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4                  0
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4                  0
In [27]:
model.rf
plot(model.rf)
Call:
 randomForest(formula = am ~ ., data = dat.train, ntree = 20,      importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 20
No. of variables tried at each split: 3

        OOB estimate of  error rate: 14.29%
Confusion matrix:
   0 1 class.error
0 10 2   0.1666667
1  1 8   0.1111111
In [18]:
## Look at variable importance:
important.feature <- round(importance(model.rf), 2)
fcts <- important.feature[sort.list(important.feature[,1], decreasing = TRUE),]
fcts


# Feature Importance Plot
varImpPlot(model.rf,
           sort = T,
           main="Feature Importance",
           n.var=5)
         0     1 MeanDecreaseAccuracy MeanDecreaseGini
wt    2.89  3.04                 3.19             3.42
drat  2.71  2.48                 2.96             3.13
mpg   1.48  2.10                 2.16             1.36
disp  1.03  1.22                 1.75             0.46
gear  1.03  1.03                 1.03             0.42
cyl   0.00  0.00                 0.00             0.13
hp    0.00 -0.16                 0.00             0.53
qsec  0.00  0.00                 0.00             0.00
vs    0.00  0.00                 0.00             0.08
carb  0.00  0.00                 0.00             0.25

Model Evaluation

Confusion Matrix
The confusionMatrix function from the caret package can be used to create a confusion matrix from the actual response variable and the predicted values.
In [19]:
str(dat.test)
'data.frame': 11 obs. of  12 variables:
 $ mpg               : num  21 21.4 18.7 24.4 17.8 10.4 33.9 21.5 13.3 15 ...
 $ cyl               : num  6 6 8 4 6 8 4 4 8 8 ...
 $ disp              : num  160 258 360 147 168 ...
 $ hp                : num  110 110 175 62 123 215 65 97 245 335 ...
 $ drat              : num  3.9 3.08 3.15 3.69 3.92 3 4.22 3.7 3.73 3.54 ...
 $ wt                : num  2.88 3.21 3.44 3.19 3.44 ...
 $ qsec              : num  17 19.4 17 20 18.9 ...
 $ vs                : num  0 1 0 1 1 0 1 1 0 0 ...
 $ am                : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 2 ...
 $ gear              : num  4 3 3 4 4 3 4 3 3 5 ...
 $ carb              : num  4 1 2 2 4 4 1 1 4 8 ...
 $ predicted.response: Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 2 1 1 ...
In [ ]:
#install.packages("e1071")
#install.packages("caret")
library(e1071)
library(caret)
# Create Confusion Matrix
confusionMatrix(data=dat.test$predicted.response,
                reference=dat.test$am,
                positive='1')
In statistics, a receiver operating characteristic (ROC) curve is a graphical plot that illustrates the performance of a binary classifier system as its discrimination threshold is varied. The curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true-positive rate is also known as sensitivity, recall or probability of detection in machine learning. The false-positive rate is also known as the fall-out or probability of false alarm and can be calculated as (1 - specificity). The ROC curve is thus the sensitivity as a function of the fall-out. In general, if the probability distributions for both detection and false alarm are known, the ROC curve can be generated by plotting the cumulative distribution function (the area under the probability density from -∞ to the discrimination threshold) of the detection probability on the y-axis versus the cumulative distribution function of the false-alarm probability on the x-axis.
In [29]:
head(dat.train)
head(dat.train[,-9])
           mpg cyl  disp  hp drat   wt  qsec vs am gear carb
Mazda RX4  21.0   6 160.0 110 3.90 2.62 16.46  0  1    4    4
Datsun 710 22.8   4 108.0  93 3.85 2.32 18.61  1  1    4    1
Valiant    18.1   6 225.0 105 2.76 3.46 20.22  1  0    3    1
Duster 360 14.3   8 360.0 245 3.21 3.57 15.84  0  0    3    4
Merc 230   22.8   4 140.8  95 3.92 3.15 22.90  1  0    4    2
Merc 280   19.2   6 167.6 123 3.92 3.44 18.30  1  0    4    4

           mpg cyl  disp  hp drat   wt  qsec vs gear carb
Mazda RX4  21.0   6 160.0 110 3.90 2.62 16.46  0    4    4
Datsun 710 22.8   4 108.0  93 3.85 2.32 18.61  1    4    1
Valiant    18.1   6 225.0 105 2.76 3.46 20.22  1    3    1
Duster 360 14.3   8 360.0 245 3.21 3.57 15.84  0    3    4
Merc 230   22.8   4 140.8  95 3.92 3.15 22.90  1    4    2
Merc 280   19.2   6 167.6 123 3.92 3.44 18.30  1    4    4
In [33]:
#load ROCR library
install.packages('ROCR')
library('ROCR')
Installing package into '/home/nbcommon/R'
(as 'lib' is unspecified)
The downloaded source packages are in
 '/tmp/RtmpZnuB2d/downloaded_packages'
Loading required package: gplots
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009

Attaching package: 'gplots'

The following object is masked from 'package:stats':

    lowess

In [36]:
# predicted class probabilities on the training data (column 9 is the target am)
OOB.votes <- predict(model.rf, dat.train[,-9], type="prob")
OOB.pred <- OOB.votes[,2]
pred.obj <- prediction(OOB.pred, dat.train$am)

# precision-recall curve
RP.perf <- performance(pred.obj, "prec", "rec")
plot(RP.perf)

# ROC curve: true positive rate against false positive rate
ROC.perf <- performance(pred.obj, "tpr", "fpr")
plot(ROC.perf)

# recall, precision and false positive rate as functions of the cutoff (alpha) values
plot(RP.perf@alpha.values[[1]], RP.perf@x.values[[1]])
lines(RP.perf@alpha.values[[1]], RP.perf@y.values[[1]])
lines(ROC.perf@alpha.values[[1]], ROC.perf@x.values[[1]])
lines (ROC.perf@alpha.values[[1]],ROC.perf@x.values[[1]])
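As a follow-up, the area under the ROC curve can be obtained from the same prediction object; a small sketch using the pred.obj created above:
auc.perf <- performance(pred.obj, "auc")
auc.perf@y.values[[1]]  # area under the ROC curve; values closer to 1 indicate better separation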