Document Clustering with Non Negative Matrix Factorization

Posted 22 Mar 2009; last updated 20 Feb 2017

A WPF application that uses Non Negative Matrix Factorization to cluster documents.

Application screenshot


This WPF application finds clusters within a list of papers accepted to AAAI 2014 (an artificial intelligence conference). It shows the keywords in each cluster and lets you sort the list of results by membership in a cluster.

Non Negative Matrix Factorization (NNMF) is a relatively simple technique that factorises a single matrix to reveal a predetermined number of clusters.

NNMF clustering can be applied to any matrix whose entries are all non-negative. In this application, it is applied to the document/term matrix created from the list of documents and their metadata.


Vector space model techniques have been used in information retrieval and indexing for many decades.

If we have a collection of documents, we can count the words in each of those documents and store the result in a two-dimensional array (or matrix), with one row per document and one column per term.

This is called a document/term matrix: it simply lists the documents and the count of the terms that they contain. The document/term matrix can also be normalised so that each value falls between 0 and 1.

From this matrix, we can treat each row as a one-dimensional array (or vector) and compute its similarity against any other vector in the same matrix using the inner product of the two vectors. There is a classic use of this technique in information retrieval: a vector is constructed from the query, and the dot product of each document against that query vector is calculated. The result is a list of documents sorted by relevance to the query.
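As a concrete sketch of this idea (plain Python for illustration, not the article's code; the documents and terms are made up):

```python
# a tiny document/term matrix: one row per document, one column per term
# terms:         "neural" "network" "planning" "search"
doc_term = [
    [2, 3, 0, 0],   # document 0: about neural networks
    [1, 2, 0, 1],   # document 1: mostly neural networks
    [0, 0, 3, 2],   # document 2: about planning and search
]

def dot(v1, v2):
    """Inner product of two term vectors: higher means more similar."""
    return sum(a * b for a, b in zip(v1, v2))

# compare document 0 against every document in the matrix
similarities = [dot(doc_term[0], row) for row in doc_term]  # → [13, 8, 0]
```

The first document scores highest against itself, its near neighbour scores next, and the unrelated document scores zero, which is exactly the relevance ordering a query vector would produce.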

NNMF goes beyond single row comparisons and tries to find two matrices (features and weights) that, when multiplied together, reconstruct the document/term matrix. Neither matrix may contain negative values, hence the Non Negative part.

The features matrix contains a row for each feature and a column for each term, with the value giving the degree to which the term belongs in the feature.

The weights matrix contains a row for each document and a column for each feature, with the value giving the degree to which each document belongs in the feature.
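Putting the two definitions together: if V is the m×n document/term matrix, NNMF finds an m×k weights matrix W and a k×n features matrix H, where k is the chosen number of clusters, such that W multiplied by H approximates V. A minimal pure-Python illustration of the shapes involved, with made-up values:

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# 3 documents x 4 terms, reconstructed from 2 features
weights = [[1.0, 0.0],    # document 0 belongs to feature 0
           [0.5, 0.0],    # document 1 partly belongs to feature 0
           [0.0, 1.0]]    # document 2 belongs to feature 1
features = [[2.0, 3.0, 0.0, 0.0],   # feature 0: "neural network" terms
            [0.0, 0.0, 3.0, 2.0]]   # feature 1: "planning/search" terms

reconstruction = matmul(weights, features)  # 3 x 4, all non-negative
```

Reading the weights matrix row by row gives each document's degree of membership in each feature, which is what the application uses for clustering.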

Linear Algebra

The application uses the open source Bright Wire machine learning library to create and normalise the document/term matrix, and to convert that matrix into the dense vectors used by the NNMF class.

Using the Application

The sample application immediately downloads the document data set and clusters the documents. Clicking a cluster sorts the document list by relevance to that cluster. Clicking a document searches for it on Google.

Using the Code

The dataset is downloaded from the UCI machine learning repository. The CSV is then parsed into a data table, features are extracted and normalised, and dense vectors are created from the sparse feature sets.

A lookup table is created to associate the dense vectors with the AAAIDocuments from which they were created.

// download the document list
var docList = new List<AAAIDocument>();
using (var client = new WebClient()) {
    var data = client.DownloadData(uri);

    // parse the CSV file
    var dataTable = new StreamReader(new MemoryStream(data)).ParseCSV(',');

    // create strongly typed documents from the data table
    dataTable.ForEach(row => docList.Add(new AAAIDocument {
        Abstract = row.GetField<string>(5),
        Keyword = row.GetField<string>(3).Split(KEYWORD_SPLIT, StringSplitOptions.RemoveEmptyEntries).Select(str => str.ToLower()).ToArray(),
        Topic = row.GetField<string>(4).Split(TOPIC_SPLIT, StringSplitOptions.RemoveEmptyEntries),
        Group = row.GetField<string>(2).Split(TOPIC_SPLIT, StringSplitOptions.RemoveEmptyEntries),
        Title = row.GetField<string>(0)
    }));
}

// create a document lookup table
var docTable = docList.ToDictionary(d => d.Title, d => d);

// extract features from the document's metadata
var stringTable = new StringTableBuilder();
var classificationSet = new SparseVectorClassificationSet {
    Classification = docList.Select(d => d.AsClassification(stringTable)).ToArray()
};

// normalise the document/term associations
var encodings = classificationSet.Vectorise(true);

// convert the sparse feature vectors into dense vectors
var documentClusterList = new List<DocumentCluster>();
using (var lap = Provider.CreateLinearAlgebra()) {
    var lookupTable = encodings
        .Select(d => Tuple.Create(d, lap.Create(d.Data).AsIndexable()))
        .ToDictionary(d => d.Item2, d => docTable[d.Item1.Classification]);
    var vectorList = lookupTable.Select(d => d.Key).ToList();

These dense vectors are then clustered by NNMF.

public IReadOnlyList<IReadOnlyList<IIndexableVector>> Cluster(
    int numIterations,
    Action<float> callback,
    float errorThreshold = 0.001f
) {
    for (int i = 0; i < numIterations; i++) {
        using (var wh = _weights.Multiply(_features)) {
            var cost = _DifferenceCost(_dataMatrix, wh);
            callback?.Invoke(cost);
            if (cost <= errorThreshold)
                break;

            // update the features matrix
            using (var wT = _weights.Transpose())
            using (var hn = wT.Multiply(_dataMatrix))
            using (var wTw = wT.Multiply(_weights))
            using (var hd = wTw.Multiply(_features))
            using (var fhn = _features.PointwiseMultiply(hn)) {
                _features = fhn.PointwiseDivide(hd);
            }

            // update the weights matrix
            using (var fT = _features.Transpose())
            using (var wn = _dataMatrix.Multiply(fT))
            using (var wf = _weights.Multiply(_features))
            using (var wd = wf.Multiply(fT))
            using (var wwn = _weights.PointwiseMultiply(wn)) {
                _weights = wwn.PointwiseDivide(wd);
            }
        }
    }

    // each row of the weights matrix gives that document's cluster membership
    return _weights.AsIndexable().Rows
        .Select((c, i) => Tuple.Create(i, c.MaximumIndex()))
        .GroupBy(d => d.Item2)
        .Select(g => g.Select(d => _data[d.Item1]).ToArray())
        .ToList();
}

This function iterates numIterations times, each iteration slightly improving the matrix factorisation and so reducing its score against the _DifferenceCost function.
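The same multiplicative update rules (features updated by the ratio of WᵀV to WᵀWH, weights by the ratio of VHᵀ to WHHᵀ) can be sketched in plain Python. This is an illustrative reimplementation of the algorithm, not the Bright Wire code, and the sample matrix is made up:

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def nnmf(v, num_clusters, iterations=200, eps=1e-9):
    """Factor the non-negative matrix v into weights x features."""
    rows, cols = len(v), len(v[0])
    w = [[random.random() for _ in range(num_clusters)] for _ in range(rows)]
    h = [[random.random() for _ in range(cols)] for _ in range(num_clusters)]
    for _ in range(iterations):
        # features update: h <- h * (wT v) / (wT w h)
        wt = transpose(w)
        hn, hd = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * hn[i][j] / (hd[i][j] + eps)
              for j in range(cols)] for i in range(num_clusters)]
        # weights update: w <- w * (v hT) / (w h hT)
        ht = transpose(h)
        wn, wd = matmul(v, ht), matmul(matmul(w, h), ht)
        w = [[w[i][j] * wn[i][j] / (wd[i][j] + eps)
              for j in range(num_clusters)] for i in range(rows)]
    return w, h

random.seed(0)  # fixed seed so runs are repeatable
v = [[2, 3, 0, 0], [1, 2, 0, 1], [0, 0, 3, 2], [0, 1, 2, 2]]
w, h = nnmf(v, num_clusters=2)

# each document is assigned to the cluster with the highest weight
clusters = [max(range(2), key=lambda j: row[j]) for row in w]
```

Because the updates only ever multiply and divide non-negative values, both matrices stay non-negative throughout, which is what distinguishes NNMF from other factorisations.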

Because the initial values are completely random, the clustering algorithm will find slightly different clusters every time it runs. However, the sets of documents tend to remain somewhat fixed, i.e. the same documents will tend to occur together between runs.

The difference cost simply measures the distance between the product of the factored matrices and the original matrix:

float _DifferenceCost(IMatrix m1, IMatrix m2)
{
    return m1.AsIndexable().Rows
        .Zip(m2.AsIndexable().Rows, (r1, r2) => _costFunction.Compute(r1, r2))
        .Average();
}
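For illustration, a Euclidean version of this cost in plain Python (the article's actual _costFunction is pluggable; this sketch just sums squared differences element by element):

```python
def difference_cost(m1, m2):
    """Total squared difference between two equal-shaped matrices."""
    return sum((a - b) ** 2
               for row1, row2 in zip(m1, m2)
               for a, b in zip(row1, row2))

cost = difference_cost([[1.0, 2.0], [0.0, 1.0]],
                       [[1.5, 2.0], [0.0, 0.0]])  # → 1.25
```

As the factorisation improves, this cost falls towards zero, which is what the errorThreshold check in Cluster tests for.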


NNMF of a document/term matrix can yield some interesting results. Finding the sets of terms most strongly correlated with each cluster, and the degree of membership between each document and each cluster, can be useful.


  • March 22, 2009: First version.
  • May 18, 2009: Bug fixes, service definition update.
  • February 21, 2017: Major revision; application migrated to .NET 4.6.1.


This article, along with any associated source code and files, is licensed under The MIT License


About the Author

Founder Ice Blue Digital
Australia
I am the founder of Ice Blue Digital - a Sydney based software company in the natural language processing and machine learning space.

Comments and Discussions

Question: Parallel (Kangerm00se, 24-Feb-12 1:40)
Thanks Jack, this is much easier than trying to understand my mathematics book on this topic. I'm using NNMF on webcam images for feature detection, so speed is important. Therefore I looked at how I could improve the code. Parallelisation can do a lot in this case.

For example:
public Matrix Multiply(Matrix m)
{
    if (ShapeX != m.ShapeY)
        return null;

    Matrix ret = new Matrix(m.ShapeX, ShapeY);
    Parallel.For(0, m.ShapeX, i => {
        for (int j = 0; j < ShapeY; j++) {
            float val = 0;
            for (int k = 0; k < ShapeX; k++)
                val += _data[k, j] * m._data[i, k];
            ret._data[i, j] = val;
        }
    });
    return ret;
}

Instead of the sequential outer loop:

for (int i = 0; i < m.ShapeX; i++)

The increase in processing speed is really great with just these small changes. Notice that it only makes sense to run the outer loop in parallel.

Article Copyright 2009 by Jack_Dermody