This is an age-old question that does not have (and, I think, cannot have) a definite answer, because you first need to define what you mean by a cluster. A famous saying in this regard is that "a cluster is in the eye of the beholder". It is easy to construct examples where one person sees a single cluster and another sees several.
That said, the MDL (minimum description length) principle gives you (IMHO)
the most principled way to devise a clustering cost function; by optimizing
that function you can find the cluster assignments and the number of clusters
simultaneously. For multinomial data, see:
P. Kontkanen, P. Myllymäki, W. Buntine, J. Rissanen, H. Tirri, An MDL Framework
for Data Clustering. In Advances in Minimum Description Length: Theory and
Applications, edited by P. Grünwald, I.J. Myung and M. Pitt. The MIT Press,
2005.
The intuitively appealing idea behind MDL clustering is that by clustering you
create a model of the data. The assumption is that a good model is one
that lets you compress the data well.
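In its simplest (two-part) form, the compression idea can be written as picking the clustering that minimizes the total code length. This is only a sketch of the general principle, not necessarily the formulation used in the paper cited above, and the symbols M, D and L are notation assumed here for illustration:

```latex
\hat{M} = \arg\min_{M} \bigl[ L(M) + L(D \mid M) \bigr]
```

Here L(M) is the number of bits needed to describe the model (number of clusters, assignments, parameters) and L(D | M) is the number of bits needed to describe the data once the model is known. Adding clusters shrinks L(D | M) but grows L(M), so the minimum balances the two.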
However, MDL might not be easy to apply if you are looking for a practical way to detect the number of clusters. BIC (the Bayesian information criterion) and the F-ratio have
proven to work OK in practice.
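To make the BIC route concrete, here is a minimal sketch of selecting the number of clusters by BIC on toy 1-D data. It rests on several assumptions of mine, not anything stated above: clusters are modeled as Gaussians with one shared variance, points are hard-assigned by plain k-means, and BIC is taken in the "lower is better" form. The helper names `kmeans_1d` and `bic` are hypothetical.

```python
import math
import random

def assign(xs, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: (x - centers[c]) ** 2) for x in xs]

def kmeans_1d(xs, k, iters=100):
    """Lloyd's algorithm on 1-D data, initialized at evenly spaced quantiles."""
    s = sorted(xs)
    centers = [s[int((i + 0.5) * len(s) / k)] for i in range(k)]
    for _ in range(iters):
        labels = assign(xs, centers)
        new_centers = []
        for c in range(k):
            members = [x for x, l in zip(xs, labels) if l == c]
            # Keep the old center if a cluster happens to lose all its points.
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(xs, centers)

def bic(xs, centers, labels):
    """BIC = -2 * log-likelihood + (number of parameters) * log(n).

    Assumed model: hard assignments, cluster weights n_c / n, and Gaussian
    clusters sharing one variance estimated as SSE / n.
    """
    n, k = len(xs), len(centers)
    sse = sum((x - centers[l]) ** 2 for x, l in zip(xs, labels))
    var = max(sse / n, 1e-12)  # guard against a zero variance
    sizes = [labels.count(c) for c in range(k)]
    log_lik = sum(m * math.log(m / n) for m in sizes if m > 0)
    log_lik += -0.5 * n * (math.log(2 * math.pi * var) + 1.0)
    n_params = 2 * k  # k means + (k - 1) weights + 1 shared variance
    return -2.0 * log_lik + n_params * math.log(n)

# Toy data: three well-separated groups of 50 points each.
rng = random.Random(0)
data = [rng.gauss(mu, 1.0) for mu in (0.0, 10.0, 20.0) for _ in range(50)]

scores = {k: bic(data, *kmeans_1d(data, k)) for k in range(1, 7)}
best_k = min(scores, key=scores.get)
print(best_k, {k: round(v, 1) for k, v in scores.items()})
```

The penalty term `n_params * math.log(n)` is what stops the score from improving indefinitely as k grows; without it, the likelihood alone would always favor more clusters.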