#### Stats

151.4K views
12.8K downloads
70 bookmarked
Posted 14 Aug 2012

# Text Documents Clustering using K-Means Algorithm

Samir Kunwar, 26 Jan 2013
Text documents clustering using K-Means clustering algorithm.

## Introduction

Clustering can be considered the most important unsupervised learning problem; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”. A cluster is therefore a collection of objects which are coherent internally, but clearly dissimilar to the objects belonging to other clusters.

In the classic illustrative example of points in a plane, we can easily identify the clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance (in this case, geometrical distance). This is called distance-based clustering, and it is the kind of clustering I deal with here. Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if it defines a concept common to all of those objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.

### Classification

Clustering algorithms may be classified as listed below:

1. Flat clustering (creates a set of clusters without any explicit structure relating clusters to each other; also called exclusive clustering)
2. Hierarchical clustering (creates a hierarchy of clusters)
3. Hard clustering (assigns each document/object to exactly one cluster)
4. Soft clustering (distributes each document/object over all clusters)

##### Algorithms

1. Agglomerative (hierarchical clustering)
2. K-Means (flat clustering, hard clustering)
3. EM Algorithm (flat clustering, soft clustering)

Hierarchical Agglomerative Clustering (HAC) and the K-Means algorithm have been applied to text clustering in a straightforward way, typically using normalized, TF-IDF-weighted vectors and cosine similarity. Here, I illustrate the K-Means algorithm using a set of points in an n-dimensional vector space for text clustering.

### K-Means Algorithm

The K-Means clustering algorithm is known to be efficient in clustering large data sets. Developed by MacQueen, it is one of the simplest and best-known unsupervised learning algorithms that solve the well-known clustering problem. The K-Means algorithm aims to partition a set of objects, based on their attributes/features, into k clusters, where k is a predefined or user-defined constant. The main idea is to define k centroids, one for each cluster. The centroid of a cluster is formed in such a way that it is closely related (in terms of a similarity function; similarity can be measured using different methods such as cosine similarity, Euclidean distance, or extended Jaccard) to all objects in that cluster.

### Basic K-Means Algorithm

1. Choose k, the number of clusters to be determined.
2. Choose k objects randomly as the initial cluster centers.
3. Repeat:
3.1. Assign each object to its closest cluster center.
3.2. Compute the new clusters, i.e., calculate the mean points.
4. Until:
4.1. Cluster centers do not change (i.e., centroids no longer change location), OR
4.2. No object changes its cluster (other stopping criteria may be defined as well).

### Residual Sum of Squares

RSS is a measure of how well the centroids represent the members of their clusters: it is the squared distance of each vector from its centroid, summed over all vectors.

The algorithm then moves the cluster centers around in space in order to minimize RSS.
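Concretely, RSS can be computed by summing the squared Euclidean distance between every vector and its cluster's centroid. A minimal sketch over the article's `float[]` vector spaces (the `ComputeRss` helper and its parameter shapes are my own, not part of the article's code):

```csharp
using System.Collections.Generic;

public static class RssExample
{
    // Residual sum of squares: for every vector, the squared Euclidean
    // distance to its cluster's centroid, summed over all vectors.
    // clusters[k] holds the vectors assigned to cluster k;
    // centroids[k] is the centroid of cluster k.
    public static float ComputeRss(List<float[][]> clusters, List<float[]> centroids)
    {
        float rss = 0f;
        for (int k = 0; k < clusters.Count; k++)
        {
            foreach (var vector in clusters[k])
            {
                for (int j = 0; j < vector.Length; j++)
                {
                    float d = vector[j] - centroids[k][j];
                    rss += d * d;
                }
            }
        }
        return rss;
    }
}
```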

### Termination Condition

We can apply one of the following termination conditions.

• A fixed number of iterations I has been completed. This condition limits the runtime of the clustering algorithm, but in some cases the quality of the clustering will be poor because of an insufficient number of iterations.
• Assignment of documents to clusters does not change between iterations. Except for cases with a bad local minimum, this produces a good clustering, but runtimes may be unacceptably long.
• Centroids do not change between iterations. This is, for practical purposes, equivalent to the previous condition.
• RSS falls below a threshold. This criterion ensures that the clustering is of a desired quality after termination. In practice, we need to combine it with a bound on the number of iterations to guarantee termination.
• The decrease in RSS falls below a threshold t. For small t, this indicates that we are close to convergence. Again, we need to combine it with a bound on the number of iterations to prevent very long runtimes.

### Bad choice of initial seed

In the K-Means algorithm there is unfortunately no guarantee that a global minimum of the objective function will be reached. This is a particular problem if a document set contains many outliers: documents that are far from any other documents and therefore do not fit well into any cluster. Frequently, if an outlier is chosen as an initial seed, no other vector is assigned to it during subsequent iterations. Thus, we end up with a singleton cluster (a cluster with only one document) even though there is probably a clustering with a lower RSS.

Effective heuristics for seed selection include:

1. Excluding outliers from the seed set
2. Trying out multiple starting points and choosing the clustering with the lowest cost
3. Obtaining seeds from another method such as hierarchical clustering.
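The second heuristic can be sketched as a thin wrapper that reruns the clustering from several different random seeds and keeps the cheapest result. Here `runWithSeed` is a hypothetical stand-in (not from the article) for a full K-Means run that returns its final cost (RSS):

```csharp
using System;

public static class SeedSelection
{
    // Heuristic 2: run the clustering several times from different random
    // seeds and keep the run with the lowest cost (RSS).
    public static (int bestSeed, float bestCost) MultipleRestarts(
        Func<int, float> runWithSeed, int restarts)
    {
        int bestSeed = 0;
        float bestCost = float.MaxValue;
        for (int seed = 0; seed < restarts; seed++)
        {
            float cost = runWithSeed(seed);
            if (cost < bestCost)
            {
                bestCost = cost;
                bestSeed = seed;
            }
        }
        return (bestSeed, bestCost);
    }
}
```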

## Diving into the code

##### Document Representation

Each document is represented as a vector using the vector space model. The vector space model, also called the term vector model, is an algebraic model for representing text documents (or any objects, in general) as vectors of identifiers, for example TF-IDF weights.

Here I have defined the `DocumentVector` class, whose instance holds the document and its corresponding representation in vector space.

```
public class DocumentVector
{
    // Content represents the document (or any other object) to be clustered
    public string Content { get; set; }
    // VectorSpace represents the TF-IDF weight of each term in the document
    public float[] VectorSpace { get; set; }
}
```

An instance of `DocumentCollection` represents all the documents to be clustered.

```
class DocumentCollection
{
    public List<String> DocumentList { get; set; }
}
```

##### TF-IDF

TF-IDF stands for term frequency–inverse document frequency. It is a numerical statistic that reflects how important a word is to a document in a collection or corpus, and it is the most common weighting method used to describe documents in the vector space model, particularly in IR problems.

The number of times a term occurs in a document is called its term frequency. We can calculate the term frequency for a word as the ratio of number of times the word occurs in the document to the total number of words in the document.

The inverse document frequency is a measure of whether the term is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.

The tf*idf of term t in document d is calculated as:

```
// Calculates the TF-IDF weight for term t in document d.
// Here r is a Regex used to tokenize documents (e.g., splitting on whitespace),
// and documentCollection holds the raw text of every document in the corpus.
private static float FindTFIDF(string document, string term)
{
    float tf = FindTermFrequency(document, term);
    float idf = FindInverseDocumentFrequency(term);
    return tf * idf;
}

private static float FindTermFrequency(string document, string term)
{
    int count = r.Split(document).Where(s => s.ToUpper() == term.ToUpper()).Count();
    // ratio of the number of occurrences of term t in document d
    // to the total number of terms in the document
    return (float)count / (float)r.Split(document).Count();
}

private static float FindInverseDocumentFrequency(string term)
{
    // number of documents in the whole collection that contain the term
    int count = documentCollection.ToArray().Where(s => r.Split(
        s.ToUpper()).ToArray().Contains(term.ToUpper())).Count();
    /*
     * log of the ratio of the total number of documents in the collection
     * to the number of documents containing the term; use
     * Math.Log((float)documentCollection.Count() / (1 + count))
     * to guard against a zero denominator when no document contains the term
     */
    return (float)Math.Log((float)documentCollection.Count() / (float)count);
}
```

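`FindTFIDF` above computes a single weight; to fill `DocumentVector.VectorSpace` for every document, one typically builds the corpus vocabulary first and then computes the weight of each vocabulary term per document. A simplified, self-contained sketch (whitespace tokenization; `TfIdfSketch` and `BuildVectors` are my own names, not from the article's source):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TfIdfSketch
{
    // Builds one TF-IDF vector per document over the corpus vocabulary.
    public static List<float[]> BuildVectors(List<string> docs)
    {
        var tokenized = docs.Select(d => d.ToUpper()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)).ToList();
        // distinct terms of the whole corpus, in a fixed order
        var vocab = tokenized.SelectMany(t => t).Distinct().OrderBy(t => t).ToList();

        var vectors = new List<float[]>();
        foreach (var tokens in tokenized)
        {
            var vec = new float[vocab.Count];
            for (int j = 0; j < vocab.Count; j++)
            {
                // term frequency: occurrences / total terms in this document
                float tf = (float)tokens.Count(t => t == vocab[j]) / tokens.Length;
                // document frequency is >= 1 because the vocabulary is built
                // from the corpus itself
                int df = tokenized.Count(t => t.Contains(vocab[j]));
                float idf = (float)Math.Log((float)docs.Count / df);
                vec[j] = tf * idf;
            }
            vectors.Add(vec);
        }
        return vectors;
    }
}
```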
##### Finding Similarity Score

I have used cosine similarity to identify the similarity score of a document. The method `FindCosineSimilarity` takes two arguments, `vecA` and `vecB`, which are the vector representations of documents A and B, and returns a similarity score between 0 and 1, where 1 indicates that the documents are completely similar and 0 that they are completely dissimilar.

```
public static float FindCosineSimilarity(float[] vecA, float[] vecB)
{
    var dotProduct = DotProduct(vecA, vecB);
    var magnitudeOfA = Magnitude(vecA);
    var magnitudeOfB = Magnitude(vecB);
    float result = dotProduct / (magnitudeOfA * magnitudeOfB);
    // 0 divided by 0 yields NaN, so return 0 in that case
    if (float.IsNaN(result))
        return 0;
    else
        return (float)result;
}
```

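`DotProduct` and `Magnitude` are referenced above but not shown in the article; a plausible reconstruction of them could look like this:

```csharp
using System;

public static class SimilarityHelpers
{
    // Dot product of two equal-length vectors.
    public static float DotProduct(float[] vecA, float[] vecB)
    {
        float sum = 0f;
        for (int i = 0; i < vecA.Length; i++)
            sum += vecA[i] * vecB[i];
        return sum;
    }

    // Euclidean length of a vector: sqrt of its dot product with itself.
    public static float Magnitude(float[] vec)
    {
        return (float)Math.Sqrt(DotProduct(vec, vec));
    }
}
```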
##### K-Means Algorithm Implementation

To implement the K-Means algorithm I have defined a class `Centroid`, to which documents are assigned during the clustering process.

```
public class Centroid
{
    public List<DocumentVector> GroupedDocument { get; set; }
}
```

##### Preparing Document Cluster
```
public static List<Centroid> PrepareDocumentCluster(int k,
    List<DocumentVector> documentCollection, ref int _counter)
{
    globalCounter = 0;
    // prepare k initial centroids and assign one randomly chosen document to each
    List<Centroid> centroidCollection = new List<Centroid>();
    Centroid c;

    /*
     * Avoid repetition of random numbers: if the same number were generated
     * more than once, the same document would seed more than one cluster,
     * so use a HashSet to keep the seeds unique
     */
    HashSet<int> uniqRand = new HashSet<int>();
    GenerateRandomNumber(ref uniqRand, k, documentCollection.Count);

    foreach (int pos in uniqRand)
    {
        c = new Centroid();
        c.GroupedDocument = new List<DocumentVector>();
        c.GroupedDocument.Add(documentCollection[pos]);
        centroidCollection.Add(c);
    }

    Boolean stoppingCriteria;
    List<Centroid> resultSet;
    List<Centroid> prevClusterCenter;

    InitializeClusterCentroid(out resultSet, centroidCollection.Count);

    do
    {
        prevClusterCenter = centroidCollection;

        foreach (DocumentVector obj in documentCollection)
        {
            int index = FindClosestClusterCenter(centroidCollection, obj);
            resultSet[index].GroupedDocument.Add(obj);
        }
        InitializeClusterCentroid(out centroidCollection, centroidCollection.Count());
        centroidCollection = CalculateMeanPoints(resultSet);
        stoppingCriteria = CheckStoppingCriteria(prevClusterCenter, centroidCollection);
        if (!stoppingCriteria)
        {
            globalCounter++;
            // initialize the result set for the next iteration
            InitializeClusterCentroid(out resultSet, centroidCollection.Count);
        }

    } while (!stoppingCriteria);

    _counter = globalCounter;
    return resultSet;
}
```

##### Initializing cluster center

The cluster centers are initialized for the next iteration; here the `count` parameter holds the user-defined number of initial cluster centers.

```
private static void InitializeClusterCentroid(out List<Centroid> centroid, int count)
{
    Centroid c;
    centroid = new List<Centroid>();
    for (int i = 0; i < count; i++)
    {
        c = new Centroid();
        c.GroupedDocument = new List<DocumentVector>();
        centroid.Add(c);
    }
}
```

##### Finding closest cluster center

This function returns the index of the closest cluster center for each document; I have used cosine similarity to measure the closeness of a document. The array `similarityMeasure` holds the similarity score of the document `obj` with each cluster center; the index with the maximum score is taken as the closest cluster center for the given document.

```
private static int FindClosestClusterCenter(List<Centroid> clusterCenter, DocumentVector obj)
{
    float[] similarityMeasure = new float[clusterCenter.Count()];

    for (int i = 0; i < clusterCenter.Count(); i++)
    {
        similarityMeasure[i] =
            SimilarityMatrics.FindCosineSimilarity(
                clusterCenter[i].GroupedDocument[0].VectorSpace, obj.VectorSpace);
    }

    int index = 0;
    float maxValue = similarityMeasure[0];
    for (int i = 0; i < similarityMeasure.Count(); i++)
    {
        // on equal scores, keep the lowest-index cluster center
        if (similarityMeasure[i] > maxValue)
        {
            maxValue = similarityMeasure[i];
            index = i;
        }
    }
    return index;
}
```

##### Identifying the new position of the cluster center

After each document has been assigned to its closest cluster center, we recalculate the mean of each cluster, which gives the new position of the cluster center (centroid).

```
private static List<Centroid> CalculateMeanPoints(List<Centroid> _clusterCenter)
{
    for (int i = 0; i < _clusterCenter.Count(); i++)
    {
        if (_clusterCenter[i].GroupedDocument.Count() > 0)
        {
            for (int j = 0; j < _clusterCenter[i].GroupedDocument[0].VectorSpace.Count(); j++)
            {
                float total = 0;

                foreach (DocumentVector vSpace in _clusterCenter[i].GroupedDocument)
                {
                    total += vSpace.VectorSpace[j];
                }
                // assign the newly calculated mean to each cluster center;
                // this repositions the centroid
                _clusterCenter[i].GroupedDocument[0].VectorSpace[j] =
                    total / _clusterCenter[i].GroupedDocument.Count();
            }
        }
    }
    return _clusterCenter;
}
```
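`CheckStoppingCriteria` and `GenerateRandomNumber` are called in `PrepareDocumentCluster` but not shown in the article. A plausible sketch of the stopping check, which reports convergence when no centroid position changed between iterations (using minimal stand-ins for the article's classes so the sketch is self-contained):

```csharp
using System;
using System.Collections.Generic;

// Minimal stand-ins for the article's classes, for a self-contained sketch.
public class DocumentVector { public float[] VectorSpace { get; set; } }
public class Centroid { public List<DocumentVector> GroupedDocument { get; set; } }

public static class StoppingCriteria
{
    // Returns true when every centroid vector is (numerically) unchanged
    // from the previous iteration. Like the article's code, it treats the
    // first document of each cluster as the centroid vector.
    public static bool CheckStoppingCriteria(List<Centroid> prev, List<Centroid> current)
    {
        for (int i = 0; i < current.Count; i++)
        {
            if (prev[i].GroupedDocument.Count == 0 || current[i].GroupedDocument.Count == 0)
                return false;
            float[] a = prev[i].GroupedDocument[0].VectorSpace;
            float[] b = current[i].GroupedDocument[0].VectorSpace;
            for (int j = 0; j < a.Length; j++)
                if (Math.Abs(a[j] - b[j]) > 1e-6f)
                    return false;
        }
        return true;
    }
}
```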

## License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

## About the Author

Software Developer, United States

Interested in Information Retrieval and Artificial Intelligence research and programming. Programming is my passion; I enjoy programming for the same reasons that I enjoy reading great books. What I love about writing code is the nitpicking and problem solving.



Article Copyright 2012 by Samir Kunwar