Hierarchical Cluster Analysis
We use the mtcars dataset as an example: we compute a distance matrix with dist(), pass it to hclust(), and plot a dendrogram that displays the hierarchical relationships among the vehicles.
In [1]:
head(mtcars)
In [2]:
d <- dist(as.matrix(mtcars))  # compute the distance matrix
hc <- hclust(d)               # apply hierarchical clustering
plot(hc)
Careful inspection of the dendrogram shows that the 1974 Pontiac Firebird and Camaro Z28 are classified as close relatives, as expected.
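A natural follow-up, not part of the original notebook, is to cut the dendrogram into a fixed number of groups with cutree() and inspect the memberships directly. The sketch below reuses the hc object from the cell above; the choice of five groups is arbitrary and only for illustration.
groups <- cutree(hc, k = 5)                    # assign each car to one of 5 groups
groups[c("Pontiac Firebird", "Camaro Z28")]    # check whether the two share a group
plot(hc)
rect.hclust(hc, k = 5, border = "red")         # outline the 5 groups on the dendrogram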
K-Means
In [4]:
dat <- mtcars[, c('hp', 'drat')]
head(dat)
plot(dat)
In [5]:
set.seed(123)
kmeans.fit <- kmeans(dat, 3, nstart = 100)
kmeans.fit
In [6]:
plot(dat, col = kmeans.fit$cluster)
points(kmeans.fit$centers, col = 1:3, pch = 8)  # mark the 3 cluster centers
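Note that hp and drat live on very different scales (hp roughly 50-335, drat roughly 2.7-4.9), so the Euclidean distances used by kmeans are dominated by hp. A minimal variant, assuming we want both variables to contribute equally, standardizes the columns first; this is a sketch and not part of the original analysis.
dat.scaled <- scale(dat)                       # z-score hp and drat
set.seed(123)
kmeans.fit.scaled <- kmeans(dat.scaled, 3, nstart = 100)
plot(dat, col = kmeans.fit.scaled$cluster)     # original units, coloured by clusters found on scaled data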
Elbow Plot
In [5]:
# Check for the optimal number of clusters given the data
mydata <- dat
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))  # total within-groups sum of squares for k = 1
wss
for (i in 2:8) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
wss
plot(1:8, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares",
main="Assessing the Optimal Number of Clusters with the Elbow Method",
pch=20, cex=2)
With the elbow method, the criterion value (the within-groups sum of squares) decreases as the number of clusters increases. Simplistically, an optimal number of clusters is identified where a "kink" (the elbow) appears in the line plot. As you can see, deciding where that kink occurs is not a very objective procedure and relies heavily on heuristics.
From the example above, however, adding a sixth cluster produces only a small further reduction in the within-cluster sum of squares, so we can say with reasonable confidence that about five clusters is an appropriate choice.
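A less subjective alternative to eyeballing the kink is the average silhouette width, which peaks for a well-separated clustering. The sketch below assumes the cluster package (shipped with standard R installations) and simply reuses mydata from above; it is an addition for illustration, not part of the original post.
library(cluster)
set.seed(123)
avg.sil <- numeric(7)
for (k in 2:8) {
  km <- kmeans(mydata, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, dist(mydata))
  avg.sil[k - 1] <- mean(sil[, "sil_width"])   # average silhouette width for k clusters
}
plot(2:8, avg.sil, type = "b", xlab = "Number of Clusters",
     ylab = "Average silhouette width", pch = 20, cex = 2)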
Learnings
- When performing clustering, some important concepts must be addressed. One of them is how to deal with data that contains more than two variables. In such cases, one option is to perform Principal Component Analysis (PCA), plot the first two principal components, and optionally run K-means in that reduced space (see the sketch after this list).
- From the results above we can see a relatively well-defined set of groups of car models that are fairly distinct with respect to two features: hp and drat. It is natural to think about the next steps from this sort of output; one could start to devise strategies to understand why certain car models take the values they do and what to do about it.
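As a minimal sketch of the PCA idea mentioned above (an illustration, not the original analysis): run PCA on the standardized mtcars data, keep the first two principal components, and apply K-means in that reduced space. The choice of three clusters is an assumption made only for demonstration.
pca <- prcomp(mtcars, scale. = TRUE)     # PCA on standardized variables
scores <- pca$x[, 1:2]                   # first two principal components
set.seed(123)
km.pca <- kmeans(scores, centers = 3, nstart = 100)
plot(scores, col = km.pca$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2")
points(km.pca$centers, col = 1:3, pch = 8)   # mark the cluster centers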