
Search Result Clustering with Non Negative Matrix Factorization (Hybrid Smart Client)

18 May 2009
A WPF application that uses Non Negative Matrix Factorization to cluster tagged results.

Introduction

This WPF application finds clusters within a list of page summaries or search results based on tags that have been applied to each summary. It shows the tags in each cluster, and allows you to sort the list of results based on membership in a cluster.

Non Negative Matrix Factorization (NNMF) might sound scary, but it's a relatively simple technique that extracts "features" from a document/term matrix. These features represent a cluster of terms within the documents, with each term belonging to each feature to some degree, and each document belonging to each feature to some degree.

NNMF clustering can be applied to any document/term matrix, but in this application it is applied to the del.icio.us RSS feed, which includes categories (tags), and to a custom Web Service that returns search results and tags. The application could easily be extended to cluster any number of tagged document feeds from the Internet.

The NNMF algorithm can work on the plain text of a document as well as on tags, but tags are used here on the theory that they represent the salient points of each document.

The application is a smart client because it performs some intensive client-side calculations on data obtained from the internet. Along with a richer client-side experience, the decision to build a smart client over a traditional web application might be driven by the requirements for computational complexity. If the user's computer has an idle CPU, why not do some of the processing locally and leave the web servers free to deliver more responsive pages and data!

Background

Vector space model techniques have been used in information retrieval and indexing (Google reportedly uses Latent Semantic Indexing, a related technique, as a component in its ranking algorithm).

If we have a bunch of documents, we can count the words in each of those documents and store the result in a two-dimensional array (or matrix) like this:
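
For example (the counts here are invented purely for illustration; rows are documents, columns are terms):

              search   matrix   cluster   wpf
Document 1         4        0         2     1
Document 2         0        3         3     0
Document 3         1        2         0     2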

This is called a document/term matrix, and simply lists the documents and the count of the terms that they contain.

From this matrix, we can then think of each row as a one-dimensional array (or vector), and compare it against any other row in the same matrix. Dividing the inner (or dot) product of two vectors by the product of their lengths gives the cosine of the angle between them, which is a measure of their similarity. There is a classic use of this technique in information retrieval: a vector is constructed from the query, the similarity of each document vector to the query vector is calculated, and the result is a list of documents sorted by relevance to the query.
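
As a minimal sketch (using plain float arrays here, since the article's Vector class API isn't shown), the cosine similarity of two rows of the matrix could be computed like this:

static double CosineSimilarity(float[] a, float[] b)
{
    // Cosine similarity: the dot product of the two vectors divided
    // by the product of their lengths. 1 means the documents use their
    // terms in identical proportions; 0 means they share no terms.
    double dot = 0, lengthSquaredA = 0, lengthSquaredB = 0;
    for(int i = 0; i < a.Length; i++) {
        dot += a[i] * b[i];
        lengthSquaredA += a[i] * a[i];
        lengthSquaredB += b[i] * b[i];
    }
    if(lengthSquaredA == 0 || lengthSquaredB == 0)
        return 0; // an all-zero vector has no direction to compare
    return dot / (Math.Sqrt(lengthSquaredA) * Math.Sqrt(lengthSquaredB));
}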

NNMF goes beyond single-row comparisons, and tries to find two matrices (features and weights) that, when multiplied together, approximately reconstruct the document/term matrix. These features and weights cannot contain negative values, hence the Non Negative part of the name.

The features matrix contains a row for each feature and a column for each term, with the value giving the degree to which the term belongs in the feature.

The weights matrix contains a row for each document and a column for each feature, with the value giving the degree to which each document belongs in the feature.
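
Putting the two together: if there are m documents, n terms, and k features, the factorization approximates

    document/term (m × n) ≈ weights (m × k) × features (k × n)

so multiplying the weights and features matrices back together gives back (approximately) the original term counts.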

Using the Application

The sample application immediately loads the del.icio.us RSS feed and clusters the links. Clicking a cluster on the left sorts the document list by relevance to that cluster. Clicking a document opens it in the browser.

Entering a search term fires off a new search from another Web Service, the results of which are then clustered and displayed.

Using the Code

Fairly basic (non-optimized) Matrix and Vector classes are included and used by the algorithm.

The most important part of the NNMF algorithm is the Factorise method, defined in NNMFMatrix:

public void Factorise(int featureCount, int numberOfIterations, 
                      out Matrix weights, out Matrix features)
{
    // Initialise both factor matrices with random non-negative values:
    // weights has a row per document and a column per feature,
    // features has a row per feature and a column per term
    Random random = new Random();
    weights = new Matrix(featureCount, ShapeY);
    features = new Matrix(ShapeX, featureCount);
    for(int i = 0; i < featureCount; i++) {
        for(int j = 0; j < ShapeY; j++)
            weights.Set(i, j, (float)random.NextDouble());
    }
    for(int i = 0; i < ShapeX; i++) {
        for(int j = 0; j < featureCount; j++)
            features.Set(i, j, (float)random.NextDouble());
    }

    for(int i = 0; i < numberOfIterations; i++) {
        // Measure how closely weights * features reconstructs this matrix
        Matrix wh = weights.Multiply(features);
        float cost = _DifferenceCost(wh);
        if(cost == 0)
            break;

        // Multiplicative update of the features matrix
        // (element-wise: features *= (W'V) / (W'WH))
        Matrix transposedWeights = weights.Transpose();
        Matrix hn = transposedWeights.Multiply(this);
        Matrix hd = transposedWeights.Multiply(weights).Multiply(features);
        features = features.FastMultiply(hn).FastDivide(hd);

        // Multiplicative update of the weights matrix
        // (element-wise: weights *= (VH') / (WHH'))
        Matrix transposedFeatures = features.Transpose();
        Matrix wn = Multiply(transposedFeatures);
        Matrix wd = weights.Multiply(features).Multiply(transposedFeatures);
        weights = weights.FastMultiply(wn).FastDivide(wd);
    }
}
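
A minimal usage sketch (docTermMatrix here is a hypothetical NNMFMatrix built from the document/term counts; its construction is not shown above):

// Extract 5 features over 50 iterations of refinement
Matrix weights, features;
docTermMatrix.Factorise(5, 50, out weights, out features);

// weights[featureIndex, documentIndex] now gives the degree to which
// each document belongs to each feature (this is how the sorting code
// further below reads it), and features scores each term per feature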

This function initializes the factor matrices to random values, then iterates numberOfIterations times, each time updating the two matrices to reduce their score against the _DifferenceCost function. (The multiplicative update rules used here come from the Python version in Programming Collective Intelligence; see the References below.) Because the initial values are completely random, each run of the clustering algorithm finds slightly different clusters. In most cases, although the ordering of the clusters may differ, the document membership remains the same. However, there is no guarantee that the algorithm will find an optimal solution, no matter how many times it iterates; research into NNMF is ongoing to address these problems.

The difference cost is simply the sum of squared differences between the factored matrices' product and the original matrix (the squared Euclidean distance):

private float _DifferenceCost(Matrix m)
{
    // Matrices of different shapes are treated as maximally distant
    if(ShapeX != m.ShapeX || ShapeY != m.ShapeY)
        return float.MaxValue;

    // Sum the squared differences between corresponding cells
    double ret = 0;
    for(int i = 0; i < ShapeX; i++) {
        for(int j = 0; j < ShapeY; j++)
            ret += Math.Pow(_data[i, j] - m.Get(i, j), 2);
    }
    return (float)ret;
}

The rest of the application relates to WPF and to color coding the clusters and results. The Cluster and Result classes hold the UI logic for these artifacts. Getting the list of results sorted by their membership in a feature is accomplished with:

// Group the results by their weight in the selected feature;
// SortedDictionary keeps its keys in ascending order
SortedDictionary<float, List<DownloadResult>> sorter = 
                 new SortedDictionary<float, List<DownloadResult>>();
for(int i = 0; i < _weightMatrix.ShapeY; i++) {
    float weight = _weightMatrix[featureIndex, i];
    DownloadResult result = _resultList[i];
    List<DownloadResult> list;
    if(!sorter.TryGetValue(weight, out list))
        sorter.Add(weight, list = new List<DownloadResult>());
    list.Add(result);
}

// Flatten the groups into a single list in descending order of weight.
// Each group is reversed first so that the final Reverse leaves equally
// weighted results in their original order.
List<DownloadResult> sortedList = new List<DownloadResult>();
foreach(KeyValuePair<float, List<DownloadResult>> item in sorter) {
    item.Value.Reverse();
    foreach(DownloadResult result in item.Value)
        sortedList.Add(result);
}
sortedList.Reverse();
_ShowResults(sortedList);

About the Tagged Search Feed

When executing a search, the application uses a Web Service provided by Dog Blue Software, which uses Microsoft Live Search to get the results, then adds tags drawn from a representative sample of the nouns and adjectives (only).

Conclusion

NNMF of a document/term matrix can yield some interesting results. The sets of related tags in each cluster could be used to make suggestions as part of a search process (Google Suggest comes to mind). And because documents and terms belong to each feature to some degree, NNMF clusters seem ideally suited for use with fuzzy logic.

When looking at a large data-set, clustering in this way can help order documents relative to common themes - which seems a useful outcome.

References

Parts of the code for this article were ported from the Python version in Programming Collective Intelligence by Toby Segaran (O'Reilly).

History

  • March 22, 2009: First version.
  • May 18, 2009: Bug fixes, service definition update.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Jack_Dermody
Architect Dog Blue Software
Australia
I graduated from The University of Sydney in 1998. Since then, I have worked on large websites with C#/ASP.NET, built Windows software in C++, .NET, and WPF, and started my own company, Ice Blue Digital, which provides software consultancy and product design services.
 
I am currently working on Ask Bluey Visual Search - a new, free, innovative way to search.

