Monday, January 6, 2014

Clustering algorithm types: partitional clustering and hierarchical clustering

1. Flat or Partitional clustering:
(K-means, Gaussian mixture models, etc.)
Partitions are independent of each other

2. Hierarchical clustering:
(e.g., agglomerative clustering, divisive clustering)
- Partitions can be visualized using a tree structure (a dendrogram)
- Does not need the number of clusters as input
- Possible to view partitions at different levels of granularity
(i.e., can refine/coarsen clusters) using different K
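
To make the "different levels of granularity" point concrete, here is a minimal sketch in base R. The toy data and the choice of average linkage are assumptions made for illustration; they are not from the notes above.

```r
# Toy data: three well-separated groups of 10 points each (made up)
set.seed(1)
x <- rbind(
  matrix(rnorm(20, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2)
)

# Agglomerative clustering on Euclidean distances; no K is supplied here
hc <- hclust(dist(x), method = "average")

# plot(hc) would draw the dendrogram

# The same tree can be cut at several levels of granularity
coarse <- cutree(hc, k = 2)
fine   <- cutree(hc, k = 3)

# Cross-tabulate: each fine cluster sits inside exactly one coarse cluster
table(coarse, fine)
```

The tree is built once; K only enters when the tree is cut, which is why the partitions at different K are nested rather than independent.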

K-means variants:
- Hartigan’s k-means algorithm
- Lloyd’s k-means algorithm
- Forgy’s k-means algorithm
- MacQueen’s k-means algorithm
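
As a rough sketch, these variants can be selected through the `algorithm` argument of base R's `kmeans()`. The data below is made up, and the `nstart` and `iter.max` values are arbitrary choices. R's documentation treats "Lloyd" and "Forgy" as two names for the same algorithm.

```r
# Toy data: two well-separated groups of 20 points each (made up)
set.seed(42)
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(40, mean = 4, sd = 0.5), ncol = 2)
)

algorithms <- c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")

# Fit k-means with K = 2 once per variant, using several random starts
fits <- lapply(algorithms, function(a) {
  kmeans(x, centers = 2, nstart = 10, iter.max = 50, algorithm = a)
})
names(fits) <- algorithms

# Total within-cluster sum of squares for each variant
sapply(fits, function(f) f$tot.withinss)
```

On easy data like this the variants should agree; differences tend to show up on harder data or with a single random start.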

a good article about cluster analysis in R.
Read up on Gower's distance measures (available in the ecodist
package), which can combine numeric and categorical data.
What do you mean by representing the categorical fields by 1:k?


That guarantees your results are worthless unless your categories
have an inherent order (e.g. tiny, small, medium, big, giant).
Otherwise it should be four (k-1) indicator/dummy variables.

Then you can use Euclidean distance.
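
One possible way to build such indicator variables in base R is `model.matrix()`. The data frame and the column names `colour` and `weight` below are hypothetical, chosen only to show a factor with k = 5 unordered levels becoming k - 1 = 4 dummy columns.

```r
# Hypothetical data: an unordered categorical field with k = 5 levels
df <- data.frame(
  colour = factor(c("red", "green", "blue", "black", "white", "red")),
  weight = c(1.2, 0.8, 1.5, 1.1, 0.9, 1.3)
)

# model.matrix() expands the factor into k - 1 = 4 indicator columns
# (treatment contrasts; the first level is the baseline).
# The first column is the intercept, which is dropped here.
X <- model.matrix(~ colour + weight, data = df)[, -1]

colnames(X)

# Euclidean distance (the default for dist) on dummies plus numeric column
d <- dist(X)
```

Note that the numeric column is left unscaled here, so how much it counts relative to the dummies depends on its units.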

David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352
Do also note that a generalized Gower's distance (+ weighting of
variables) is available from the ('recommended' hence always
installed) package 'cluster':

  ## notably  daisy(*,  metric="gower")

Note that daisy() is more sophisticated than most users know: its 'type = *' specification allows asymmetric behavior, notably for binary variables (such as your dummies above), which may be quite important in "rare event" and similar cases.
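
A small sketch of what that might look like, assuming the 'cluster' package is installed. The data frame and its column names (`income`, `region`, `fraud`) are invented for illustration.

```r
library(cluster)  # 'recommended' package, normally shipped with R

# Hypothetical mixed data: one numeric, one factor, one rare binary flag
df <- data.frame(
  income = c(20, 35, 50, 80, 120),
  region = factor(c("north", "south", "south", "east", "north")),
  fraud  = c(0, 0, 1, 0, 1)
)

# Gower dissimilarity; 'fraud' is declared an asymmetric binary variable,
# so two rows that are both 0 are not counted as agreeing on it
d <- daisy(df, metric = "gower", type = list(asymm = "fraud"))

round(as.matrix(d), 3)
```

Declaring the flag under `symm` instead would treat a shared 0 as a match, which can make most rows look similar when the event is rare.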

The first step is calculating a distance matrix. For a data set with n observations, the distance matrix will have n rows and n columns; the (i,j)th element of the distance matrix will be the distance between observation i and observation j. There are two functions that can be used to calculate distance matrices in R: the dist function, which is included in every version of R, and the daisy function, which is part of the cluster package.
The daisy function in the cluster package will automatically perform standardization, but it doesn't give you complete control. If you have a particular method of standardization in mind, you can use the scale function.
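
As an illustration of standardizing by hand before calling dist, here is a minimal base R sketch. The data and the column names `height_cm` and `income` are made up, and dividing by the range is just one example of a custom standardization.

```r
# Hypothetical data with variables on very different scales
x <- cbind(
  height_cm = c(150, 160, 170, 180),
  income    = c(20000, 52000, 31000, 75000)
)

# n x n distance matrix on the raw data: income dominates the distances
d_raw <- as.matrix(dist(x))

# Standardize each column to mean 0 and sd 1 before computing distances
d_std <- as.matrix(dist(scale(x)))

# A custom standardization: divide each column by its range
ranges  <- apply(x, 2, function(v) diff(range(v)))
d_range <- as.matrix(dist(scale(x, center = FALSE, scale = ranges)))

dim(d_std)
```

Wrapping the result in as.matrix() shows the full n x n layout; dist itself stores only the lower triangle.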
