Monday, January 6, 2014

Clustering algorithms types: partitional clustering and hierarchical clustering

1. Flat or Partitional clustering:
(K-means, Gaussian mixture models, etc.)
Partitions are independent of each other

2. Hierarchical clustering:
(e.g., agglomerative clustering, divisive clustering)
- Partitions can be visualized using a tree structure (a dendr
ogram)
- Does not need the number of clusters as input
- Possible to view partitions at different levels of granularities
(i.e., can refine/coarsen clusters) using different K


 K-means variants:
-Hartigan’s k-means algorithm
-Lloyd’s k-means algorithm
-Forgy’s k-means algorithm
-McQueen’s k-means algorithm


a good article about cluster analysis in R.
============================================
Read up on Gower's Distance measures (available in the ecodist
package) which can combine numeric and categorical data
=======
What do you mean by representing the categorical fields by 1:k?

becomes


That guarantees your results are worthless unless your categories
have an inherent order (e.g. tiny, small, medium, big, giant).
Otherwise it should be four (k-1) indicator/dummy variables (e.g.):


Then you can use Euclidean distance.

-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352
===============
Do also note that a generalized Gower's distance (+ weighting of
variables) is available from the ('recommended' hence always
installed) package 'cluster' :

  require("cluster")
  ?daisy
  ## notably  daisy(*,  metric="gower")

Note that daisy() is more sophisticated than most users know, using the 'type = *' specification allowing, notably for binary variables (as your a. dummies above) allowing asymmetric behavior which maybe quite important in "rare event" and similar cases.

Martin
===============================
The first step is calculating a distance matrix. For a data set with n observations, the distance matrix will have n rows and n columns; the (i,j)th element of the distance matrix will be the difference between observation i and observation j. There are two functions that can be used to calculate distance matrices in R; the dist function, which is included in every version of R, and the daisy function, which is part of the cluster library.
==================================
The daisy function in the cluster library will automatically perform standardization, but it doesn't give you complete control. If you have a particular method of standardization in mind, you can use the scale function.
source: http://www.stat.berkeley.edu/classes/s133/Cluster2a.html
===================================

No comments:

Post a Comment