**1. Flat or Partitional clustering:**
(K-means, Gaussian mixture models, etc.)

Partitions are independent of each other

**2. Hierarchical clustering:**
(e.g., agglomerative clustering, divisive clustering)

- Partitions can be visualized using a tree structure (a dendrogram)

- Does not need the number of clusters as input

- Possible to view partitions at different levels of granularity (i.e., can refine/coarsen clusters) using different K
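The points above can be sketched in a few lines of R. This is only an illustration: the built-in `USArrests` data, complete linkage, and the cut levels 2 and 4 are all arbitrary choices, not something taken from the notes.

```r
# Agglomerative clustering on a built-in data set (an illustrative choice)
x  <- scale(USArrests)                      # standardize so no variable dominates
hc <- hclust(dist(x), method = "complete")  # no K is needed to build the tree

# plot(hc) would draw the dendrogram

# Cut the same tree at two levels of granularity
coarse <- cutree(hc, k = 2)
fine   <- cutree(hc, k = 4)

# Cross-tabulate: each fine cluster sits inside exactly one coarse cluster
table(coarse, fine)
```

Because both cuts come from the same tree, moving from K = 2 to K = 4 only splits existing clusters; it never reshuffles observations between them.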

K-means variants:

- Hartigan’s k-means algorithm
- Lloyd’s k-means algorithm
- Forgy’s k-means algorithm
- MacQueen’s k-means algorithm
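In base R these variants are selected through the `algorithm` argument of `kmeans()`. The sketch below assumes the built-in `iris` data, three clusters, and arbitrary `nstart` and `iter.max` settings, purely for illustration. Note that the R documentation describes "Lloyd" and "Forgy" as alternative names for one algorithm.

```r
set.seed(42)                       # k-means starts from random centers
x <- scale(iris[, 1:4])            # numeric columns only, standardized

algos <- c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen")
fits <- lapply(algos, function(a)
  kmeans(x, centers = 3, nstart = 10, iter.max = 50, algorithm = a))
names(fits) <- algos

# Total within-cluster sum of squares for each variant
sapply(fits, function(f) f$tot.withinss)
```

Comparing `tot.withinss` across variants is one simple way to check whether they reach solutions of similar quality on a given data set.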

a good article about cluster analysis in R.

============================================

Read up on Gower's distance measures (available in the ecodist package), which can combine numeric and categorical data.

=======

What do you mean by representing the categorical fields by 1:k?

becomes

That guarantees your results are worthless unless your categories have an inherent order (e.g. tiny, small, medium, big, giant).

Otherwise it should be four (k-1) indicator/dummy variables (e.g.):

Then you can use Euclidean distance.
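A minimal sketch of the dummy-variable approach, assuming a made-up data frame: the `colour` factor (k = 5 unordered levels) and the `weight` column are hypothetical and are not the poster's data. `model.matrix()` with the default treatment contrasts produces the k-1 indicator columns.

```r
# Hypothetical data: a nominal factor with k = 5 levels plus one numeric column
d <- data.frame(
  colour = factor(c("red", "blue", "green", "black", "white", "red")),
  weight = c(1.2, 0.7, 3.1, 2.4, 0.9, 1.5)
)

# Integer codes 1:k impose an arbitrary ordering (alphabetical here)
as.integer(d$colour)

# Treatment contrasts: k - 1 = 4 indicator columns (first level is the baseline)
dummies <- model.matrix(~ colour, data = d)[, -1]

# Combine with the standardized numeric column and use Euclidean distance
x  <- cbind(dummies, weight = as.numeric(scale(d$weight)))
dm <- dist(x, method = "euclidean")
```

Whether and how to rescale the numeric column relative to the 0/1 indicators is a judgment call; the z-score used here is just one option.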

-------------------------------------

David L Carlson

Associate Professor of Anthropology

Texas A&M University

College Station, TX 77840-4352

===============

Do also note that a generalized Gower's distance (+ weighting of variables) is available from the ('recommended' hence always installed) package 'cluster':

require("cluster")

?daisy

## notably daisy(*, metric="gower")

Note that daisy() is more sophisticated than most users know: its 'type = *' specification allows, notably for binary variables (such as your dummies above), asymmetric behavior, which may be quite important in "rare event" and similar cases.
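A rough sketch of what such a call might look like. The data frame, the column names (`income`, `region`, `fraud`), and the weights are all invented for illustration; only `daisy()` and its `metric`, `type`, and `weights` arguments come from the 'cluster' package.

```r
library(cluster)   # a 'recommended' package shipped with R

# Hypothetical mixed data: numeric, nominal factor, and a 0/1 rare-event flag
d <- data.frame(
  income = c(30, 45, 52, 80, 38),
  region = factor(c("north", "south", "south", "east", "north")),
  fraud  = c(0, 0, 1, 0, 0)
)

# Gower dissimilarity; 'fraud' is treated as an asymmetric binary variable,
# so two rows that are both 0 do not count as a match on that variable
g <- daisy(d, metric = "gower", type = list(asymm = "fraud"))

# Optional per-variable weights (values here are arbitrary)
gw <- daisy(d, metric = "gower", type = list(asymm = "fraud"),
            weights = c(1, 2, 1))

summary(g)
```

The result is a `dissimilarity` object that can be passed to functions such as `pam()` or `hclust()` in place of a `dist` object.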

Martin

===============================

The first step is calculating a distance matrix. For a data set with n observations, the distance matrix will have n rows and n columns; the (i,j)th element of the distance matrix will be the distance (dissimilarity) between observation i and observation j. There are two functions that can be used to calculate distance matrices in R: the dist function, which is included in every version of R, and the daisy function, which is part of the cluster package.
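As a small illustration of the n-by-n structure, here is a sketch using `dist()` on three made-up points (the values and row names are chosen only so the distances are easy to verify by hand):

```r
# Small numeric example (values are made up)
x <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))

d <- dist(x)          # Euclidean by default; stores only the lower triangle
m <- as.matrix(d)     # full n x n symmetric matrix with a zero diagonal

dim(m)                # 3 3
m["a", "b"]           # sqrt(3^2 + 4^2) = 5
```

`dist()` keeps only the lower triangle because the matrix is symmetric; `as.matrix()` expands it when the full square form is needed.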

==================================

The daisy function in the cluster library will automatically perform standardization, but it doesn't give you complete control. If you have a particular method of standardization in mind, you can use the scale function.
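For example, explicit standardization with `scale()` before `dist()` might look like the sketch below. The `mtcars` columns are an arbitrary illustrative choice, and the range rescaling is just one possible custom method.

```r
x <- as.matrix(mtcars[, c("mpg", "hp", "wt")])   # built-in data, illustrative columns

# z-scores: subtract column means, divide by column standard deviations
z <- scale(x)

# A custom alternative: rescale each column to the [0, 1] range
rng <- apply(x, 2, range)
r01 <- scale(x, center = rng[1, ], scale = rng[2, ] - rng[1, ])

# Distance matrices under each standardization
d_z   <- dist(z)
d_r01 <- dist(r01)
```

The two standardizations generally yield different distance matrices, so the choice can change the resulting clusters.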

source: http://www.stat.berkeley.edu/classes/s133/Cluster2a.html

===================================