Friday, May 1, 2015

R sample code for cluster analysis


Hierarchical Cluster Analysis

We use the mtcars dataset as an example: compute the distance matrix, apply hclust, and plot a dendrogram that displays the hierarchical relationships among the vehicles.
In [1]:
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
In [2]:
d <- dist(as.matrix(mtcars))   # compute the distance matrix
hc <- hclust(d)                # apply hierarchical clustering
plot(hc)                       # plot the dendrogram
Careful inspection of the dendrogram shows that the 1974 Pontiac Firebird and the Camaro Z28 are classified as close relatives, as expected.
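
As a quick aside (not part of the original post's code), cutree() can slice the dendrogram into a fixed number of groups so that memberships can be inspected programmatically; the choice of k = 5 below is purely illustrative.

groups <- cutree(hc, k=5)                      # cut the tree into 5 groups (illustrative k)
groups[c("Pontiac Firebird", "Camaro Z28")]    # check which groups the two cars fall into
plot(hc)
rect.hclust(hc, k=5, border="red")             # outline the 5 clusters on the dendrogram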

K-Means

In [4]:
dat <- mtcars[, c('hp', 'drat')]
head(dat)
plot(dat)
                   hp drat
Mazda RX4         110 3.90
Mazda RX4 Wag     110 3.90
Datsun 710         93 3.85
Hornet 4 Drive    110 3.08
Hornet Sportabout 175 3.15
Valiant           105 2.76
In [5]:
set.seed(123)
kmeans.fit <- kmeans(dat, 3, nstart=100)
kmeans.fit
K-means clustering with 3 clusters of sizes 17, 10, 5

Cluster means:
         hp     drat
1  93.52941 3.897647
2 178.50000 3.090000
3 263.80000 3.586000

Clustering vector:
          Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
                  1                   1                   1                   1 
  Hornet Sportabout             Valiant          Duster 360           Merc 240D 
                  2                   1                   3                   1 
           Merc 230            Merc 280           Merc 280C          Merc 450SE 
                  1                   1                   1                   2 
         Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
                  2                   2                   2                   2 
  Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
                  3                   1                   1                   1 
      Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
                  1                   2                   2                   3 
   Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
                  2                   1                   1                   1 
     Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
                  3                   2                   3                   1 

Within cluster sum of squares by cluster:
[1] 8373.865 3702.932 6919.493
 (between_SS / total_SS =  87.0 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      
In [6]:
plot(dat, col = kmeans.fit$cluster)
points(kmeans.fit$centers, col = 1:3, pch = 8)   # mark the 3 cluster centers
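
A caveat worth flagging: hp and drat live on very different scales (hp spans roughly 52 to 335, drat roughly 2.8 to 4.9), so the Euclidean distances above are dominated by hp. A minimal sketch of a common variant, assuming we standardize the two columns first (this is not part of the original analysis):

dat.scaled <- scale(dat)                        # center and scale hp and drat
set.seed(123)
fit.scaled <- kmeans(dat.scaled, 3, nstart=100)
plot(dat, col = fit.scaled$cluster)             # original units, colored by the rescaled clustering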

Elbow Plot

In [5]:
# Check for the optimal number of clusters given the data

mydata <- dat
wss <- (nrow(mydata)-1)*sum(apply(mydata, 2, var))   # total SS, i.e. WSS for k = 1
wss
for (i in 2:8) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
wss
plot(1:8, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares",
     main="Assessing the Optimal Number of Clusters with the Elbow Method",
     pch=20, cex=2)
145735.737321875
  1. 145735.737321875
  2. 52988.2599554285
  3. 32049.7886928571
  4. 12043.1237533333
  5. 11092.83402
  6. 4269.45733194444
  7. 2773.82481253968
  8. 8921.22099083333
With the elbow method, the criterion value (the within-groups sum of squares) decreases with each successive increase in the number of clusters, but with diminishing returns. Simplistically, an optimal number of clusters is identified once a “kink” in the line plot is observed. As you can see, identifying the point at which a “kink” exists is not a very objective approach and leans heavily on visual judgment.
But from the example above, we can say that after 6 clusters the observed decrease in within-cluster dissimilarity is no longer substantial. Consequently, we can say with some reasonable confidence that the optimal number of clusters to use is 6.
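
Note also that the loop above calls kmeans() with a single random start per k, so the curve can be non-monotone, which is exactly what happens at k = 8 (its WSS is higher than at k = 7). A minimal sketch of a more stable version, assuming nstart = 100 and a fixed seed:

set.seed(123)
wss2 <- sapply(1:8, function(k) kmeans(mydata, centers=k, nstart=100)$tot.withinss)
plot(1:8, wss2, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares", pch=20, cex=2)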

Learnings

  • When performing clustering, some important concepts must be tackled. One of them is how to deal with data that contains more than two variables. In such cases, one option is to perform Principal Component Analysis (PCA), plot the first two principal components, and then apply K-Means to them (see the sketch after this list).
  • From the results above we can see a relatively well-defined set of groups of car models that are distinct with respect to two features: hp and drat. It is only natural to think about the next steps from this sort of output: one could start to devise strategies to understand why certain car models show the values they do, and what to do about them.
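
As a rough illustration of the PCA idea in the first bullet (the seed and k = 3 are arbitrary choices for the sketch, not conclusions of the post):

pca <- prcomp(mtcars, scale.=TRUE)     # PCA on all 11 standardized variables
scores <- pca$x[, 1:2]                 # scores on the first two principal components
set.seed(123)
fit.pca <- kmeans(scores, 3, nstart=100)
plot(scores, col = fit.pca$cluster, pch=20)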
