Friday, May 1, 2015

R sample code for data manipulating


Data frame

A data frame is used for storing data tables. It is a list of vectors of equal length. For example, the following variable df is a data frame containing three vectors n, s, b.
In [1]:
n = c(2, 3, 5) 
s = c("aa", "bb", "cc") 
b = c(TRUE, FALSE, TRUE) 
df = data.frame(n, s, b)       # df is a data frame

class(n)
# display the data frame
df

#check the data types for each column
str(df)

# check the class of the object df
class(df)
'numeric'
nsb
12aaTRUE
23bbFALSE
35ccTRUE
'data.frame': 3 obs. of  3 variables:
 $ n: num  2 3 5
 $ s: Factor w/ 3 levels "aa","bb","cc": 1 2 3
 $ b: logi  TRUE FALSE TRUE
'data.frame'

Data science process

We use an example dataset to show how to work on a data science problem. There are many built-in data frames in R. In this tutorial, we use a built-in data frame called mtcars as if we are doing a IoT porject.
In [2]:
# to check all available built-in datasetsin utils package
data()
In [3]:
# display data frame 
head(mtcars)
mpgcyldisphpdratwtqsecvsamgearcarb
Mazda RX42161601103.92.6216.460144
Mazda RX4 Wag2161601103.92.87517.020144
Datsun 71022.84108933.852.3218.611141
Hornet 4 Drive21.462581103.083.21519.441031
Hornet Sportabout18.783601753.153.4417.020032
Valiant18.162251052.763.4620.221031
The top line of the table, called the header, contains the column names. Each horizontal line afterward denotes a data row, which begins with the name of the row, and then followed by the actual data. Each data member of a row is called a cell.
To retrieve data in a cell, we would enter its row and column coordinates in the single square bracket "[]" operator, seperated by a comma.
In [4]:
# retrieve the value of the data cell at the 2nd row and the 3rd column
mtcars[2,3]

# reference a column
mtcars$cyl

mtcars[,2]
160
  1. 6
  2.  
  3. 6
  4.  
  5. 4
  6.  
  7. 6
  8.  
  9. 8
  10.  
  11. 6
  12.  
  13. 8
  14.  
  15. 4
  16.  
  17. 4
  18.  
  19. 6
  20.  
  21. 6
  22.  
  23. 8
  24.  
  25. 8
  26.  
  27. 8
  28.  
  29. 8
  30.  
  31. 8
  32.  
  33. 8
  34.  
  35. 4
  36.  
  37. 4
  38.  
  39. 4
  40.  
  41. 4
  42.  
  43. 8
  44.  
  45. 8
  46.  
  47. 8
  48.  
  49. 8
  50.  
  51. 4
  52.  
  53. 4
  54.  
  55. 4
  56.  
  57. 8
  58.  
  59. 6
  60.  
  61. 8
  62.  
  63. 4
  1. 6
  2.  
  3. 6
  4.  
  5. 4
  6.  
  7. 6
  8.  
  9. 8
  10.  
  11. 6
  12.  
  13. 8
  14.  
  15. 4
  16.  
  17. 4
  18.  
  19. 6
  20.  
  21. 6
  22.  
  23. 8
  24.  
  25. 8
  26.  
  27. 8
  28.  
  29. 8
  30.  
  31. 8
  32.  
  33. 8
  34.  
  35. 4
  36.  
  37. 4
  38.  
  39. 4
  40.  
  41. 4
  42.  
  43. 8
  44.  
  45. 8
  46.  
  47. 8
  48.  
  49. 8
  50.  
  51. 4
  52.  
  53. 4
  54.  
  55. 4
  56.  
  57. 8
  58.  
  59. 6
  60.  
  61. 8
  62.  
  63. 4

Data exploration

In this example, we assume the data size is not very huge and the entire dataset has been read into the R workspace.

Step 1: Understand the data

1.1 Qick scan

We want to take the first quick scan of the data and get to know following information.
In [49]:
# check basic info such as the data types, the # rows, the # columns, names, etc
str(mtcars)
nrow(mtcars)
names(mtcars)
row.names(mtcars)
'data.frame': 32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
32
  1. 'mpg'
  2.  
  3. 'cyl'
  4.  
  5. 'disp'
  6.  
  7. 'hp'
  8.  
  9. 'drat'
  10.  
  11. 'wt'
  12.  
  13. 'qsec'
  14.  
  15. 'vs'
  16.  
  17. 'am'
  18.  
  19. 'gear'
  20.  
  21. 'carb'
  1. 'Mazda RX4'
  2.  
  3. 'Mazda RX4 Wag'
  4.  
  5. 'Datsun 710'
  6.  
  7. 'Hornet 4 Drive'
  8.  
  9. 'Hornet Sportabout'
  10.  
  11. 'Valiant'
  12.  
  13. 'Duster 360'
  14.  
  15. 'Merc 240D'
  16.  
  17. 'Merc 230'
  18. 'Merc 280'
  19.  
  20. 'Merc 280C'
  21.  
  22. 'Merc 450SE'
  23.  
  24. 'Merc 450SL'
  25.  
  26. 'Merc 450SLC'
  27.  
  28. 'Cadillac Fleetwood'
  29.  
  30. 'Lincoln Continental'
  31.  
  32. 'Chrysler Imperial'
  33.  
  34. 'Fiat 128'
  35. 'Honda Civic'
  36.  
  37. 'Toyota Corolla'
  38.  
  39. 'Toyota Corona'
  40.  
  41. 'Dodge Challenger'
  42.  
  43. 'AMC Javelin'
  44.  
  45. 'Camaro Z28'
  46.  
  47. 'Pontiac Firebird'
  48.  
  49. 'Fiat X1-9'
  50. 'Porsche 914-2'
  51.  
  52. 'Lotus Europa'
  53.  
  54. 'Ford Pantera L'
  55.  
  56. 'Ferrari Dino'
  57.  
  58. 'Maserati Bora'
  59.  
  60. 'Volvo 142E'
In [7]:
# get the basic summary statistics of the data: min, max, quantiles, etc.
summary(mtcars)
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  

1.2 Gather more information

In real world project, we should communicate with the data owners to understand each data column from following aspects: description, data type, value range (can the value be NULL/missing?), the relationship among columns, etc. In this example, we can use help() function to find out relivant information.
In [8]:
help(mtcars)
mtcars {datasets}R Documentation

Motor Trend Car Road Tests

Description

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Usage

mtcars

Format

A data frame with 32 observations on 11 variables.
[, 1]mpgMiles/(US) gallon
[, 2]cylNumber of cylinders
[, 3]dispDisplacement (cu.in.)
[, 4]hpGross horsepower
[, 5]dratRear axle ratio
[, 6]wtWeight (lb/1000)
[, 7]qsec1/4 mile time
[, 8]vsV/S
[, 9]amTransmission (0 = automatic, 1 = manual)
[,10]gearNumber of forward gears
[,11]carbNumber of carburetors

Source

Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics37, 391–411.

Examples

require(graphics)
pairs(mtcars, main = "mtcars data")
coplot(mpg ~ disp | as.factor(cyl), data = mtcars,
       panel = panel.smooth, rows = 1)

[Package datasets version 3.1.1 ]
By looking at the data description, we can infer following data types:
  • Continuous variables: e.g. mpg, hp
  • Categorial variables: e.g. cyl, am, gear, carb
For other variables, I cannot tell their type right away. In order to find out the data types, we need to talk to the domain experts or to furthther explore the data.

Step 2: Visualize the data

Plots and figures, usually combined with stats, help the answer some questions about data visually and intuitively. We use different plotting functions for continuous variables and categorical variables. The plotting figure types can be about a single variable or multiple variables. In this example, we introduce the commonly used one-variable and two-variable plotting types.

One-variable plotting for continuous varables

In this task, we mainly want to understand the variable distribution. The distribution includes two key aspects: central tendency and spread of data
For example, we can have following questions about the variable mpg.
  1. How many miles a car runs per gallon in general? What the value range of mpg variable?
  2. What the variability, or spread of mpg?
In [9]:
mtcars$mpg
hist(mtcars$mpg)
  1. 21
  2.  
  3. 21
  4.  
  5. 22.8
  6.  
  7. 21.4
  8.  
  9. 18.7
  10.  
  11. 18.1
  12.  
  13. 14.3
  14.  
  15. 24.4
  16.  
  17. 22.8
  18.  
  19. 19.2
  20.  
  21. 17.8
  22.  
  23. 16.4
  24.  
  25. 17.3
  26.  
  27. 15.2
  28.  
  29. 10.4
  30.  
  31. 10.4
  32.  
  33. 14.7
  34.  
  35. 32.4
  36.  
  37. 30.4
  38.  
  39. 33.9
  40.  
  41. 21.5
  42.  
  43. 15.5
  44.  
  45. 15.2
  46. 13.3
  47.  
  48. 19.2
  49.  
  50. 27.3
  51.  
  52. 26
  53.  
  54. 30.4
  55.  
  56. 15.8
  57.  
  58. 19.7
  59.  
  60. 15
  61.  
  62. 21.4
CDF distribution plot. Cumulative Distribution plot can display more insightful info about the distribution.
In [52]:
p <- span=""> ecdf(mtcars$mpg)
plot(p)
In [10]:
boxplot(mtcars$mpg)
In [12]:
# check central tendency
mean(mtcars$mpg)
median(mtcars$mpg)

# check spread of data
# sd: the standard deviation
# IQR: inter quantile range Q3-Q1
# mad: the median absolute deviation. 
sd(mtcars$mpg)
IQR(mtcars$mpg)
mad(mtcars$mpg)

# overall
summary(mtcars$mpg)
20.090625
19.2
6.0269480520891
7.375
5.41149
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  10.40   15.42   19.20   20.09   22.80   33.90 

One-variable plotting for categorical varables

In this task, we mainly want to understand the variable frequency at each level. I will take the variable cyl as an example.
In [13]:
# cyl is a categorial variable
table(mtcars$cyl)
#barplot(mtcars$cyl)
barplot(table(mtcars$cyl))
 4  6  8 
11  7 14 
Insights we can obain by intepreting the bar chart: 8 cylinder cars are more common.

Two-variable plotting

One continuous varable and one categorical variable
The univariate data on miles per gallon is interesting, but of course we expect there to be some relationship with the size of the engine. The engine size is stored in various ways: with the cylinder size, or the horsepower or even the displacement. Let's view it two ways. First, cylinder size is a discrete variable with just a few values, a scatterplot will produce an interesting graph.
In [15]:
# avoid to always having the data frame reference
attach(mtcars)
The following objects are masked from mtcars (pos = 3):

    am, carb, cyl, disp, drat, gear, hp, mpg, qsec, vs, wt

In [17]:
plot(cyl,mpg)
We see a decreasing trend as the number of cylinders increases, and lots of variation between the different cars. We might be tempted to fit a regression line. To do so is easy with the command simple.lm which is a convenient front end to the lm command. (You need to have loaded the Simple package prior to this.)
Two continuous variables
hp and mpg
Lets investigate the relationship between the continuous variables horsepower and miles per gallon. The same commands as above will work, but the scatterplot will look different as horsepower is essentially a continuous variable.
In [18]:
# scatter plot
plot(hp,mpg)
In [19]:
# correlation function
# This is the Pearson correlation coefficient, R. Squaring it gives R2.
cor(hp,mpg)

R_square <- span=""> cor(hp, mpg) ^2
R_square
-0.776168371826586
0.602437341423934
The usual interpretation is that 60% of the variation is explained by the linear relationship for the relationship between the horse power and the miles per gallon.

Two-variable plotting for categorical varables

In [20]:
plot(hp,mpg,pch=cyl)
In [23]:
plot(hp,mpg,pch=cyl,col=cyl,main= "hp vs. mpg scatter plot", xlab = "horse power", ylab = "miles per gallon")
legend(250,30,pch=c(4,6,8), legend=c("4 cyl","6 cyl","8 cyl"),col=c(4,6,8))
In [26]:
# scatterplot matrices
pairs(mtcars[,c(1:5)])
In [28]:
# calculate the person correlation coefficient between mgp and each of other 4 variables: cyl, disp, hp, drat
head(mtcars)

for(i in 2:5){
print(cor(mtcars[,i],mtcars$mpg))
}
mpgcyldisphpdratwtqsecvsamgearcarb
Mazda RX42161601103.92.6216.460144
Mazda RX4 Wag2161601103.92.87517.020144
Datsun 71022.84108933.852.3218.611141
Hornet 4 Drive21.462581103.083.21519.441031
Hornet Sportabout18.783601753.153.4417.020032
Valiant18.162251052.763.4620.221031
[1] -0.852162
[1] -0.8475514
[1] -0.7761684
[1] 0.6811719

Use functions in ggplot2 package to render the above figures

ggplot2 is based on the grammar of graphics, the idea that you can build every graph from the same few components: a data set, a set of geoms—visual marks that represent data points, and a coordinate system.
In [29]:
# load ggplot2 package
library(ggplot2)
Attaching package: 'ggplot2'

The following object is masked from 'mtcars':

    mpg

The following object is masked from 'mtcars':

    mpg

In [30]:
# quick plot
qplot(x = hp, y = mpg, color = cyl, data = mtcars, geom = "point", main="Scatter plot using qplot")
In [55]:
ggplot(mtcars, aes(hp, mpg)) +
 geom_point(aes(color = cyl)) +
 geom_smooth(method = "auto") +
 coord_cartesian() +
 scale_color_gradient() +
 theme_bw()
In [ ]:
search()
attach(mtcars)
search()
In [46]:
# side by side plots for multiple figures

par(mfrow=c(2,2))

plot(hp,mpg)
plot(hp,mpg,pch=cyl)

plot(hp,mpg,pch=cyl,col=cyl)
legend(250,30,pch=c(4,6,8), legend=c("4 cyl","6 cyl","8 cyl"))  

plot(hp,mpg,pch=cyl,col=cyl,main= "hp vs. mpg scatter plot", xlab = "horse power", ylab = "miles per gallon")
legend(250,30,pch=c(4,6,8), legend=c("4 cyl","6 cyl","8 cyl"),col=c(4,6,8))

Linear Regression

For illustration purposes, use mpg, miles per gallon, as the response variable and 4 variables - cyl, disp, hp, drat - as predictors.
In [47]:
# fit a model to predict mtchars$mpg using 4 variables: cyl, disp, hp, drat
lm.fit <- span=""> lm(mpg ~ cyl + disp + hp + drat, data = mtcars)

# check model performance
summary(lm.fit)
Call:
lm(formula = mpg ~ cyl + disp + hp + drat, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5660 -1.8161 -0.6469  1.4094  6.5749 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) 23.98524    7.98905   3.002  0.00571 **
cyl         -0.81402    0.84368  -0.965  0.34318   
disp        -0.01390    0.01089  -1.276  0.21287   
hp          -0.02317    0.01576  -1.470  0.15299   
drat         2.15405    1.59866   1.347  0.18905   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.012 on 27 degrees of freedom
Multiple R-squared:  0.7825, Adjusted R-squared:  0.7503 
F-statistic: 24.29 on 4 and 27 DF,  p-value: 1.314e-08
Intepretation:
  • The coefficient for variable cyl is -0.81402. It in practical terms means that, if cyl increases 1 unit, your m.p.g. drops by 0.81 on average if other variables keep unchanged.
  • The coefficient for variable drat is 2.15405. It in practical terms means that, if drat increases 1 unit, your m.p.g. increases by 2.15 on average if other variables keep unchanged.

Testing the regression assumptions

In order to make statistical inferences about the regression line, we need to ensure that the assumptions behind the statistical model are appropriate. In this case, we want to check that the residuals have no trends, and are normallu distributed. We can do so graphically once we get our hands on the residuals. These are available through the resid method for the result of an lm usage.
  • The error (or disturbance) of an observed value is the deviation of the observed value from the (unobservable) true value of a quantity of interest (for example, a population mean)
  • the residual of an observed value is the difference between the observed value and the estimated value of the quantity of interest (for example, a sample mean).
In [48]:
lm.resids = resid(lm.fit)     # the residuals as a vector 
 par(mfrow=c(2,2)) 
 plot(lm.resids)               # look for change in spread    
 hist(lm.resids)               # is data bell shaped?
 qqnorm(lm.resids)             # is data on straight line?
In [ ]:

No comments:

Post a Comment