Data frame¶
A data frame is used for storing data tables. It is a list of vectors of equal length. For example, the following variable df is a data frame containing three vectors n, s, b.
In [1]:
n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
df = data.frame(n, s, b) # df is a data frame
# check the class of the vector n
class(n)
# display the data frame
df
# check the data types for each column
str(df)
# check the class of the object df
class(df)
Data science process
We use an example dataset to show how to work on a data science problem. There are many built-in data frames in R. In this tutorial, we use a built-in data frame called mtcars as if we were working on an IoT project.
In [2]:
# check all available built-in datasets in the utils package
data()
In [3]:
# display data frame
head(mtcars)
The top line of the table, called the header, contains the column names. Each horizontal line afterward denotes a data row, which begins with the name of the row, followed by the actual data. Each data member of a row is called a cell.
To retrieve data in a cell, we enter its row and column coordinates in the single square bracket "[]" operator, separated by a comma.
In [4]:
# retrieve the value of the data cell at the 2nd row and the 3rd column
mtcars[2,3]
# reference a column
mtcars$cyl
mtcars[,2]
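Rows and columns can also be referenced by name rather than by position. A minimal sketch (the row and column names used below come from the built-in mtcars data and can be checked with row.names(mtcars) and names(mtcars)):
In [ ]:
# retrieve a cell by row name and column name
mtcars["Mazda RX4", "disp"]
# retrieve an entire row by name
mtcars["Mazda RX4", ]
# retrieve several columns by name
mtcars[, c("mpg", "cyl")]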
Data exploration
In this example, we assume the data size is not very huge and the entire dataset has been read into the R workspace.
Step 1: Understand the data
1.1 Quick scan
We want to take a first quick scan of the data and get to know the following information.
In [49]:
# check basic info such as the data types, the # rows, the # columns, names, etc
str(mtcars)
nrow(mtcars)
names(mtcars)
row.names(mtcars)
In [7]:
# get the basic summary statistics of the data: min, max, quantiles, etc.
summary(mtcars)
1.2 Gather more information
In a real-world project, we should communicate with the data owners to understand each data column from the following aspects: description, data type, value range (can the value be NULL/missing?), relationships among columns, etc. In this example, we can use the help() function to find out relevant information.
In [8]:
help(mtcars)
By looking at the data description, we can infer the following data types:
- Continuous variables: e.g. mpg, hp
- Categorical variables: e.g. cyl, am, gear, carb
For the other variables, I cannot tell their types right away. In order to find out the data types, we need to talk to the domain experts or to further explore the data.
Step 2: Visualize the data
Plots and figures, usually combined with statistics, help answer some questions about the data visually and intuitively. We use different plotting functions for continuous variables and for categorical variables. The plots can involve a single variable or multiple variables. In this example, we introduce the commonly used one-variable and two-variable plot types.
One-variable plotting for continuous variables
In this task, we mainly want to understand the variable's distribution. The distribution includes two key aspects: central tendency and spread of the data.
For example, we can ask the following questions about the variable mpg.
- How many miles does a car run per gallon in general? What is the value range of the mpg variable?
- What is the variability, or spread, of mpg?
In [9]:
mtcars$mpg
hist(mtcars$mpg)
CDF plot. A cumulative distribution function (CDF) plot can display more insightful information about the distribution.
In [52]:
p <- ecdf(mtcars$mpg)
plot(p)
In [10]:
boxplot(mtcars$mpg)
In [12]:
# check central tendency
mean(mtcars$mpg)
median(mtcars$mpg)
# check spread of data
# sd: the standard deviation
# IQR: interquartile range, Q3-Q1
# mad: the median absolute deviation.
sd(mtcars$mpg)
IQR(mtcars$mpg)
mad(mtcars$mpg)
# overall
summary(mtcars$mpg)
One-variable plotting for categorical variables
In this task, we mainly want to understand the variable frequency at each level. I will take the variable cyl as an example.
In [13]:
# cyl is a categorical variable
table(mtcars$cyl)
# barplot(mtcars$cyl)  # not meaningful here: barplot expects a table of counts, not raw values
barplot(table(mtcars$cyl))
Insights we can obtain by interpreting the bar chart: 8-cylinder cars are the most common in this dataset.
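To put a number on that observation, a relative-frequency table complements the bar chart; a minimal sketch using base R:
In [ ]:
# proportion of cars at each cyl level
prop.table(table(mtcars$cyl))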
Two-variable plotting
One continuous varable and one categorical variable
The univariate data on miles per gallon is interesting, but of course we expect there to be some relationship with the size of the engine. The engine size is stored in various ways: the number of cylinders, the horsepower, or even the displacement. Let's view it two ways. First, since the cylinder count is a discrete variable with just a few values, a scatterplot will produce an interesting graph.
In [15]:
# attach the data frame so that columns can be referenced without the mtcars$ prefix
attach(mtcars)
In [17]:
plot(cyl,mpg)
We see a decreasing trend as the number of cylinders increases, and lots of variation between the different cars. We might be tempted to fit a regression line. To do so is easy with the command simple.lm, which is a convenient front end to the lm command. (You need to have loaded the Simple package prior to this.)
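If the Simple package is not available, the same idea works with base R: fit the line with lm and overlay it with abline. A minimal sketch (treating cyl as numeric for the fit):
In [ ]:
# fit mpg on cyl and overlay the fitted regression line on the scatterplot
fit.cyl <- lm(mpg ~ cyl, data = mtcars)
plot(cyl, mpg)
abline(fit.cyl)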
Two continuous variables
hp and mpg
Let's investigate the relationship between the continuous variables horsepower and miles per gallon. The same commands as above will work, but the scatterplot will look different because horsepower is essentially a continuous variable.
In [18]:
# scatter plot
plot(hp,mpg)
In [19]:
# correlation function
# This is the Pearson correlation coefficient, R. Squaring it gives R^2.
cor(hp,mpg)
R_square <- cor(hp, mpg)^2
R_square
The usual interpretation is that about 60% of the variation in miles per gallon is explained by its linear relationship with horsepower.
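As a cross-check, the same R^2 can be read off a simple regression of mpg on hp; a minimal sketch:
In [ ]:
# R-squared reported by a simple linear regression of mpg on hp
summary(lm(mpg ~ hp, data = mtcars))$r.squared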
Two-variable plotting for categorical variables
In [20]:
plot(hp,mpg,pch=cyl)
In [23]:
plot(hp,mpg,pch=cyl,col=cyl,main= "hp vs. mpg scatter plot", xlab = "horse power", ylab = "miles per gallon")
legend(250,30,pch=c(4,6,8), legend=c("4 cyl","6 cyl","8 cyl"),col=c(4,6,8))
In [26]:
# scatterplot matrices
pairs(mtcars[,c(1:5)])
In [28]:
# calculate the Pearson correlation coefficient between mpg and each of the other 4 variables: cyl, disp, hp, drat
head(mtcars)
for(i in 2:5){
print(cor(mtcars[,i],mtcars$mpg))
}
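The same correlations can be computed without an explicit loop; sapply also keeps the column names attached to the results. A minimal sketch:
In [ ]:
# Pearson correlation of mpg with columns 2-5 (cyl, disp, hp, drat), labeled by column name
sapply(mtcars[, 2:5], function(x) cor(x, mtcars$mpg))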
Use functions in the ggplot2 package to render the above figures
ggplot2 is based on the grammar of graphics, the idea that you can build every graph from the same few components: a data set, a set of geoms (visual marks that represent data points), and a coordinate system.
In [29]:
# load ggplot2 package
library(ggplot2)
In [30]:
# quick plot
qplot(x = hp, y = mpg, color = cyl, data = mtcars, geom = "point", main="Scatter plot using qplot")
In [55]:
ggplot(mtcars, aes(hp, mpg)) +
geom_point(aes(color = cyl)) +
geom_smooth(method = "auto") +
coord_cartesian() +
scale_color_gradient() +
theme_bw()
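Because cyl is stored as a numeric column, the color scale above is a continuous gradient. Treating cyl as a factor gives one discrete color per cylinder count; a minimal sketch of that alternative:
In [ ]:
# map cyl to a discrete color scale by converting it to a factor
ggplot(mtcars, aes(hp, mpg, color = factor(cyl))) +
  geom_point() +
  labs(color = "cyl") +
  theme_bw()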
In [ ]:
# compare the search path before and after attaching mtcars
search()
attach(mtcars)
search()
In [46]:
# side by side plots for multiple figures
par(mfrow=c(2,2))
plot(hp,mpg)
plot(hp,mpg,pch=cyl)
plot(hp,mpg,pch=cyl,col=cyl)
legend(250,30,pch=c(4,6,8), legend=c("4 cyl","6 cyl","8 cyl"))
plot(hp,mpg,pch=cyl,col=cyl,main= "hp vs. mpg scatter plot", xlab = "horse power", ylab = "miles per gallon")
legend(250,30,pch=c(4,6,8), legend=c("4 cyl","6 cyl","8 cyl"),col=c(4,6,8))
Linear Regression
For illustration purposes, use mpg, miles per gallon, as the response variable and 4 variables - cyl, disp, hp, drat - as predictors.
In [47]:
# fit a model to predict mtcars$mpg using 4 variables: cyl, disp, hp, drat
lm.fit <- lm(mpg ~ cyl + disp + hp + drat, data = mtcars)
# check model performance
summary(lm.fit)
Interpretation:
- The coefficient for the variable cyl is -0.81402. In practical terms, this means that if cyl increases by 1 unit, mpg drops by about 0.81 on average, with the other variables held constant (see the short prediction sketch below).
- The coefficient for the variable drat is 2.15405. In practical terms, this means that if drat increases by 1 unit, mpg increases by about 2.15 on average, with the other variables held constant.
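One way to see the cyl coefficient in action is to predict mpg for two hypothetical cars that differ only in cyl; the other predictor values below are made up purely for illustration.
In [ ]:
# two hypothetical cars identical except for cyl; the predictions differ by the cyl coefficient
new.cars <- data.frame(cyl = c(4, 5), disp = 200, hp = 120, drat = 3.9)
predict(lm.fit, newdata = new.cars)
diff(predict(lm.fit, newdata = new.cars))  # roughly -0.81, the cyl coefficient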
Testing the regression assumptions
In order to make statistical inferences about the regression line, we need to ensure that the assumptions behind the statistical model are appropriate. In this case, we want to check that the residuals show no trends and are normally distributed. We can do so graphically once we get our hands on the residuals; these are available through the resid method applied to the result of an lm fit.
- The error (or disturbance) of an observed value is the deviation of the observed value from the (unobservable) true value of a quantity of interest (for example, a population mean).
- The residual of an observed value is the difference between the observed value and the estimated value of the quantity of interest (for example, a sample mean).
In [48]:
lm.resids = resid(lm.fit) # the residuals as a vector
par(mfrow=c(2,2))
plot(lm.resids) # look for change in spread
hist(lm.resids) # is data bell shaped?
qqnorm(lm.resids) # is data on straight line?
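A reference line makes the normal Q-Q plot easier to judge; a minimal sketch that redraws the Q-Q plot with the line added:
In [ ]:
# add a reference line to the normal Q-Q plot of the residuals
qqnorm(lm.resids)
qqline(lm.resids)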