Hierarchical Cluster Analysis
We use the mtcars dataset as an example: we compute a distance matrix with dist(), pass it to hclust(), and plot a dendrogram that displays the hierarchical relationships among the vehicles.
In [1]:
head(mtcars)
In [2]:
d <- dist(as.matrix(mtcars))  # compute the distance matrix
hc <- hclust(d)               # apply hierarchical clustering
plot(hc)
Careful inspection of the dendrogram shows that the 1974 Pontiac Firebird and Camaro Z28 are classified as close relatives, as expected.
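A natural follow-up, not part of the original notebook, is to cut the dendrogram into a fixed number of groups with cutree() and inspect the memberships directly. The sketch below reuses the hc object from the cell above; the choice of five groups is arbitrary and only for illustration.
groups <- cutree(hc, k = 5)                    # assign each car to one of 5 groups
groups[c("Pontiac Firebird", "Camaro Z28")]    # check whether the two share a group
plot(hc)
rect.hclust(hc, k = 5, border = "red")         # outline the 5 groups on the dendrogram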
K-Means
In [4]:
dat <- mtcars[, c('hp', 'drat')]
head(dat)
plot(dat)
In [5]:
set.seed(123)
kmeans.fit <- kmeans(dat, 3, nstart = 100)
kmeans.fit
In [6]:
plot(dat, col = kmeans.fit$cluster)
points(kmeans.fit$centers, col = 1:3, pch = 8)  # mark the 3 cluster centers
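Note that hp and drat live on very different scales (hp roughly 50-335, drat roughly 2.7-4.9), so the Euclidean distances used by kmeans are dominated by hp. A minimal variant, assuming we want both variables to contribute equally, standardizes the columns first; this is a sketch and not part of the original analysis.
dat.scaled <- scale(dat)                       # z-score hp and drat
set.seed(123)
kmeans.fit.scaled <- kmeans(dat.scaled, 3, nstart = 100)
plot(dat, col = kmeans.fit.scaled$cluster)     # original units, coloured by clusters found on scaled data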
Elbow Plot
In [5]:
# Check for the optimal number of clusters given the data
mydata <- dat
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))  # total within-groups sum of squares for k = 1
wss
for (i in 2:8) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
wss
plot(1:8, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares",
main="Assessing the Optimal Number of Clusters with the Elbow Method",
pch=20, cex=2)
With the elbow method, the criterion value (the within-groups sum of squares) decreases as the number of clusters increases. Simplistically, an optimal number of clusters is identified where a "kink" (the elbow) appears in the line plot. As you can see, deciding where that kink occurs is not a very objective procedure and relies heavily on heuristics.
From the example above, however, adding a sixth cluster produces only a small further reduction in the within-cluster sum of squares, so we can say with reasonable confidence that about five clusters is an appropriate choice.
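A less subjective alternative to eyeballing the kink is the average silhouette width, which peaks for a well-separated clustering. The sketch below assumes the cluster package (shipped with standard R installations) and simply reuses mydata from above; it is an addition for illustration, not part of the original post.
library(cluster)
set.seed(123)
avg.sil <- numeric(7)
for (k in 2:8) {
  km <- kmeans(mydata, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, dist(mydata))
  avg.sil[k - 1] <- mean(sil[, "sil_width"])   # average silhouette width for k clusters
}
plot(2:8, avg.sil, type = "b", xlab = "Number of Clusters",
     ylab = "Average silhouette width", pch = 20, cex = 2)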
Learnings
- When performing clustering, some important concepts must be addressed. One of them is how to deal with data that contains more than two variables. In such cases, one option is to perform Principal Component Analysis (PCA), plot the first two principal components, and optionally run K-means in that reduced space (see the sketch after this list).
- From the results above we can see a relatively well-defined set of groups of car models that are fairly distinct with respect to two features: hp and drat. It is natural to think about the next steps from this sort of output; one could start to devise strategies to understand why certain car models take the values they do and what to do about it.
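As a minimal sketch of the PCA idea mentioned above (an illustration, not the original analysis): run PCA on the standardized mtcars data, keep the first two principal components, and apply K-means in that reduced space. The choice of three clusters is an assumption made only for demonstration.
pca <- prcomp(mtcars, scale. = TRUE)     # PCA on standardized variables
scores <- pca$x[, 1:2]                   # first two principal components
set.seed(123)
km.pca <- kmeans(scores, centers = 3, nstart = 100)
plot(scores, col = km.pca$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2")
points(km.pca$centers, col = 1:3, pch = 8)   # mark the cluster centers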