Click here to Skip to main content
Click here to Skip to main content

K-Means Clustering Algorithm

, 8 Aug 2012
Rate this:
Please Sign up or sign in to vote.
This post is dedicated to K-Means Clustering Algorithm, used for unsupervised learning.

This post is dedicated to K-Means Clustering Algorithm, used for unsupervised learning. Here is a short introduction into the unsupervised learning subject. This video shows the algorithm at work.

So, how does it work?

1. First of all we have the number K (the number of clusters) and the data set X1,...,Xm as inputs, K < m and Xj ∈ ℜn for ∀j=1..m. Considering that clustering is used as a task in data mining (e.g. including text), it is a nice exercise indeed to formalise a problem in a way where data is represented by vectors Smile | :)

2. At this step we randomly initialise K cluster centroids μ1,...,μK, μj ∈ ℜn for ∀j=1..K. The recommended way is to randomly initialise the centroids from the data set X1,...,Xm and make sure that μi≠μj for ∀ i≠j.

3. At this step we execute the loop:

Repeat {
  //cluster assignment step
  For i=1..m {
    Find the closest centroid for X<sub>i</sub>, i.e.
    min{||X<sub>i</sub>-μ<sub>j</sub>||}, ∀j=1..K, e.g. it is μ<sub>t</sub>;
    Assign c<sub>i</sub>=t;
    //in this case c<sub>i</sub> is the index of 
    //the closest centroid for X<sub>i</sub>
  }

  //move centroids step, if a particular cluster 
  //centroid has no points assigned to it, 
  //then eliminate it
  For j=1..K {
    old_μ<sub>j</sub> = μ<sub>j</sub>;
    μ<sub>j</sub> = average (mean) of all X<sub>i</sub> where c<sub>i</sub>=j, ∀i=1..m;
  }
}

This loop ends when all the centroids stop "moving", i.e. ||old_μj - μj|| < ε, ∀j=1..K, where ε is an error we are happy with (e.g. 0.01 or 0.001).

This is pretty much the algorithm. However, in this form, there is a risk to get stuck in a local minima. By local minima I mean the local minima of the cost function:
J(X1,...,Xm,c1,...cm1,...,μK) = (1 ⁄ m)⋅∑||Xi – μci||2, sum is for i=1..m

In order to address this, the algorithm (steps 2, 3) are executed several times (e.g. 100). Each time the cost function (J(...)) is computed and the clustering with the lowest J(...) is picked up. E.g.

For p=1..100 {
  Perform step 2;
  Perform step 3;
  Calculate cost function J(...);
}
Pick up clustering with the lowest cost J(...);

In order to make it more interactive, I and my son spent couple of hours implementing a JavaScript version of this algorithm using <canvas> tag from HTML5 (my son studies HTML at school, so it was an interesting exercise). Here is the link to the page (excuse me and my son for our non OOP approach to JavaScript Smile | :) ). If you want to have a look at the sources, feel free to download that page (we didn't provide comments, but hopefully this article explains everything). We tested it with Google Chrome and Internet Explorer 9 (if you have problems with IE9, please consult the following link).

Note: While I was writing this article, a friend of mine suggested this nice JavaScript library for plotting www.jqplot.com.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

rtybase
Software Developer (Senior) Snappli Ltd.
United Kingdom United Kingdom
My name is Ruslan Ciurca. Currently I am a Software Developer at snappli.com.

Comments and Discussions

 
GeneralNice PinmemberMajid_grok10-Jun-13 2:36 
Generalsimple, straight forward : ) PinmemberMokhtar Ashour22-Oct-12 4:37 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.

| Advertise | Privacy | Mobile
Web01 | 2.8.140721.1 | Last Updated 8 Aug 2012
Article Copyright 2012 by rtybase
Everything else Copyright © CodeProject, 1999-2014
Terms of Service
Layout: fixed | fluid