## Introduction

This WPF application finds clusters within a list of page summaries or search results based on tags that have been applied to each summary. It shows the tags in each cluster, and allows you to sort the list of results based on membership in a cluster.

Non Negative Matrix Factorization (NNMF) might sound scary, but it's a relatively simple technique that extracts "features" from a document/term matrix. These features represent a cluster of terms within the documents, with each term belonging to each feature to some degree, and each document belonging to each feature to some degree.

NNMF clustering can be applied to any term document matrix, but in this application, it is applied to the del.icio.us RSS feed which includes categories (tags), and to a custom Web Service that returns search results and tags. This application could be easily extended to cluster any number of tagged document feeds from the internet.

The NNMF algorithm can work on the plain text of a document as well as on tags, but tags are used here on the theory that they represent the salient points of each document.

The application is a smart client because it does some intensive client side calculations on data that is obtained from the internet. Along with a richer client side experience, the decision to build a smart client over a traditional web application might be made based on the requirements for computational complexity. If the user is using a computer with an idle CPU, then why not farm out some processing locally and leave the web-servers free to deliver more responsive pages and data!

## Background

Vector space model techniques have been used in information retrieval and indexing (Google reportedly uses Latent Semantic Indexing, a related technique, as a component in its ranking algorithm).

If we have a bunch of documents, we can count the words in each of those documents and store the result in a two dimensional array (or matrix) like this:

This is called a document/term matrix, and simply lists the documents and the count of the terms that they contain.

From this matrix, we can then think of each row as a single dimensional array (or vector), and compute the similarity of it against any other vector in the same matrix by using the inner product of the two vectors. This inner product (or dot product) gives the cosine of the angles between the two vectors, which is a measure of their similarity. There is a classic use of this technique in information retrieval, whereby a vector is constructed from the query and then the dot product of each document against the query is calculated, which results in an ordered list of documents, sorted by relevance to the query.

NNMF goes beyond single row comparisons, and tries to find two matrices (**features** and **weights**) from the document/term matrix, that when multiplied together will reconstruct it. These features and weights cannot contain negative values, hence the Non Negative part in the name.

The **features** matrix contains a row for each feature and a column for each term, with the value giving the degree to which the term belongs in the feature.

The **weights** matrix contains a row for each document and a column for each feature, with the value giving the degree to which each document belongs in the feature.

## Using the Application

The sample application immediately loads the del.icio.us RSS feed and clusters the links. Clicking a cluster on the left sorts the document list by its relevance in that cluster. Clicking the document opens it in the browser.

Entering a search term fires off a new search from another Web Service, the results of which are then clustered and displayed.

## Using the Code

Fairly basic (non-optimized) `Matrix`

and `Vector`

classes are included and used by the algorithm.

The most important part of the NNMF algorithm is defined in `NNMFMatrix`

.

public void Factorise(int featureCount, int numberOfIterations,
out Matrix weights, out Matrix features)
{
Random random = new Random();
weights = new Matrix(featureCount, ShapeY);
features = new Matrix(ShapeX, featureCount);
for(int i = 0; i < featureCount; i++) {
for(int j = 0; j < ShapeY; j++)
weights.Set(i, j, (float)random.NextDouble());
}
for(int i = 0; i < ShapeX; i++) {
for(int j = 0; j < featureCount; j++)
features.Set(i, j, (float)random.NextDouble());
}
float lastCost = 0;
for(int i = 0; i < numberOfIterations; i++) {
Matrix wh = weights.Multiply(features);
float cost = _DifferenceCost(wh);
if(cost == 0)
break;
lastCost = cost;
Matrix transposedWeights = weights.Transpose();
Matrix hn = transposedWeights.Multiply(this);
Matrix hd = transposedWeights.Multiply(weights).Multiply(features);
features = features.FastMultiply(hn).FastDivide(hd);
Matrix transposedFeatures = features.Transpose();
Matrix wn = Multiply(transposedFeatures);
Matrix wd = weights.Multiply(features).Multiply(transposedFeatures);
weights = weights.FastMultiply(wn).FastDivide(wd);
}
}

This function initializes the factorials to random values, then iterates `numberOfIterations`

times, each time trying to improve its score against the `_DifferenceCost`

function. More information on this algorithm can be found here. Because the initial values are completely random, every time the clustering algorithm runs, it will find slightly different clusters. In most cases, although the ordering of the clusters may be different, the document membership will remain the same. But, there is actually no guarantee that the algorithm will find an optimal solution, no matter how many times it iterates. Work in NNMF is ongoing to try to solve these problems.

The difference cost simply finds an approximate distance between the factored matrices and the original matrix:

private float _DifferenceCost(Matrix m)
{
if(ShapeX != m.ShapeX || ShapeY != m.ShapeY)
return float.MaxValue;
double ret = 0;
for(int i = 0; i < ShapeX; i++) {
for(int j = 0; j < ShapeY; j++)
ret += Math.Pow(_data[i, j] - m.Get(i, j), 2);
}
return (float)ret;
}

The rest of the application relates to WPF and color coding the clusters and results. The `Cluster`

and `Result`

classes represent the UI logic for these artifacts. Getting the sorted list of results relative to the membership within a feature is accomplished with:

SortedDictionary<float, List<DownloadResult>> sorter =
new SortedDictionary<float, List<DownloadResult>>();
for(int i = 0; i < _weightMatrix.ShapeY; i++) {
float weight = _weightMatrix[featureIndex, i];
DownloadResult result = _resultList[i];
List<DownloadResult> list;
if(!sorter.TryGetValue(weight, out list))
sorter.Add(weight, list = new List<DownloadResult>());
list.Add(result);
}
List<DownloadResult> sortedList = new List<DownloadResult>();
foreach(KeyValuePair<float, List<DownloadResult>> item in sorter) {
item.Value.Reverse();
foreach(DownloadResult result in item.Value)
sortedList.Add(result);
}
sortedList.Reverse();
_ShowResults(sortedList);

## About the Tagged Search Feed

When executing a search, the application uses a Web Service provided by Dog Blue Software, which uses Microsoft Live Search to get the results, then adds tags from a representational sampling of the nouns and adjectives (only).

## Conclusion

NNMF of a term/document matrix can yield some interesting results. The sets of related tags in each cluster could be used to make suggestions as part of a search process (does Google Suggest come to mind?). With their degrees of membership in the features, NNMF clusters seem ideally suited for use with fuzzy logic.

When looking at a large data-set, clustering in this way can help order documents relative to common themes - which seems a useful outcome.

## References

Parts of the code for this article were ported from the Python version in Programming Collective Intelligence by Toby Segaran (O'Reilly).

## History

**March 22, 2009**: First version.
**May 18, 2009**: Bug fixes, service definition update.