13,090,426 members (37,343 online)
alternative version

#### Stats

28K views
21 bookmarked
Posted 22 Mar 2009

# Document Clustering with Non Negative Matrix Factorization

, 20 Feb 2017
 Rate this:
A WPF application that uses Non Negative Matrix Factorization to cluster documents.

## Introduction

This WPF application finds clusters within a list of accepted papers to AAAI 2014 (an artificial intelligence conference). It shows the keywords in each cluster, and allows you to sort the list of results based on membership in a cluster.

Non Negative Matrix Factorization (NNMF) is a relatively simple technique that uses matrix factorisation to create a predetermined number of clusters within some single matrix.

NNMF clustering can be applied to any uniformly positive matrix, but in this application, it is applied to the document/term matrix that has been created from the list of documents and their meta data.

## Background

Vector space model techniques have been used in information retrieval and indexing for many decades.

If we have a collection of documents, we can count the words in each of those documents and store the result in a two dimensional array (or matrix) like this:

This is called a document/term matrix, and simply lists the documents and the count of the terms that they contain. The above document/term matrix can also be normalised so that each value falls between 0 and 1.

From this matrix, we can then think of each row as a single dimensional array (or vector), and compute the similarity of it against any other vector in the same matrix by using the inner product of the two vectors. There is a classic use of this technique in information retrieval, in which a vector is constructed from the query and then the dot product of each document against the query is calculated, which results in an ordered list of documents, sorted by relevance to the query.

NNMF goes beyond single row comparisons, and tries to find two matrices (features and weights) from the document/term matrix, that when multiplied together will reconstruct it. These features and weights cannot contain negative values, hence the Non Negative part.

The features matrix contains a row for each feature and a column for each term, with the value giving the degree to which the term belongs in the feature.

The weights matrix contains a row for each document and a column for each feature, with the value giving the degree to which each document belongs in the feature.

## Linear Algebra

The application uses the open source Bright Wire machine learning library to create and normalise the term document matrix and to convert that matrix into dense vectors that are used by the `NNMF` class.

## Using the Application

The sample application immediately downloads the document data set and clusters the links. Clicking on a cluster sorts the document list by its relevance in that cluster. Clicking each document searches for that document on Google.

## Using the Code

The dataset is downloaded from the UCI machine learning repository. Then the CSV is parsed into a data table, features are extracted and normalised and then dense vectors are created from the sparse feature sets.

A lookup table is created to associate the dense vectors with the `AAAIDocument`s from which they were created.

```// download the document list
var docList = new List<AAAIDocument>();
using (var client = new WebClient()) {

// parse the file CSV
var dataTable = new StreamReader(new MemoryStream(data)).ParseCSV(',');

// create strongly typed documents from the data table
Abstract = row.GetField<string>(5),
Keyword = row.GetField<string>(3).Split(KEYWORD_SPLIT, StringSplitOptions.RemoveEmptyEntries).Select(str => str.ToLower()).ToArray(),
Topic = row.GetField<string>(4).Split(TOPIC_SPLIT, StringSplitOptions.RemoveEmptyEntries),
Group = row.GetField<string>(2).Split(TOPIC_SPLIT, StringSplitOptions.RemoveEmptyEntries),
Title = row.GetField<string>(0)
}));
}

// create a document lookup table
var docTable = docList.ToDictionary(d => d.Title, d => d);

// extract features from the document's metadata
var stringTable = new StringTableBuilder();
var classificationSet = new SparseVectorClassificationSet {
Classification = docList.Select(d => d.AsClassification(stringTable)).ToArray()
};

// normalise the document/term associations
var encodings = classificationSet.Vectorise(true);

// convert the sparse feature vectors into dense vectors
var documentClusterList = new List<DocumentCluster>();
using (var lap = Provider.CreateLinearAlgebra()) {
var lookupTable = encodings
.Select(d => Tuple.Create(d, lap.Create(d.Data).AsIndexable()))
.ToDictionary(d => d.Item2, d => docTable[d.Item1.Classification])
;
var vectorList = lookupTable.Select(d => d.Key).ToList();```

These dense vectors are then clustered by `NNMF`.

```public IReadOnlyList<IReadOnlyList<IIndexableVector>> Cluster(
int numIterations,
Action<float> callback,
float errorThreshold = 0.001f
) {
for (int i = 0; i < numIterations; i++) {
using (var wh = _weights.Multiply(_features)) {
var cost = _DifferenceCost(_dataMatrix, wh);
callback(cost);
if (cost <= errorThreshold)
break;

using (var wT = _weights.Transpose())
using (var hn = wT.Multiply(_dataMatrix))
using (var wTw = wT.Multiply(_weights))
using (var hd = wTw.Multiply(_features))
using (var fhn = _features.PointwiseMultiply(hn)) {
_features.Dispose();
_features = fhn.PointwiseDivide(hd);
}

using (var fT = _features.Transpose())
using (var wn = _dataMatrix.Multiply(fT))
using (var wf = _weights.Multiply(_features))
using (var wd = wf.Multiply(fT))
using (var wwn = _weights.PointwiseMultiply(wn)) {
_weights.Dispose();
_weights = wwn.PointwiseDivide(wd);
}
}
}

// weights gives cluster membership
return _weights.AsIndexable().Rows
.Select((c, i) => Tuple.Create(i, c.MaximumIndex()))
.GroupBy(d => d.Item2)
.Select(g => g.Select(d => _data[d.Item1]).ToArray())
.ToList()
;
}```

This function iterates `numIterations` times, each time trying to improve its score against the `_DifferenceCost` function by slightly improving the matrix factorisation.

Because the initial values are completely random, every time the clustering algorithm runs, it will find slightly different clusters. However sets of documents tend to remain somewhat fixed - i,e. the same documents will tend to occur together between runs.

The difference cost simply finds the distance between the factored matrices and the original matrix:

```float _DifferenceCost(IMatrix m1, IMatrix m2)
{
return m1.AsIndexable().Rows
.Zip(m2.AsIndexable().Rows, (r1, r2) => _costFunction.Compute(r1, r2))
.Average()
;```

## Conclusion

NNMF of a term/document matrix can yield some interesting results. Finding the the strongest correlated sets of terms with each cluster and having degrees of membership between each document and cluster can be useful.

## History

• March 22, 2009: First version.
• May 18, 2009: Bug fixes, service definition update.
• February 21, 2017: Major revision and application migrated to .net 4.6.1.

## Share

 Founder Ice Blue Digital Australia
I am the founder of Ice Blue Digital - a Sydney based software company in the natural language processing and machine learning space.

## You may also be interested in...

 First Prev Next
 Seems it doesnt work Member 1058612521-Sep-15 2:44 Member 10586125 21-Sep-15 2:44
 Re: Seems it doesnt work Jack_Dermody22-Feb-17 8:33 Jack_Dermody 22-Feb-17 8:33
 Re: Seems it doesnt work Alexey KK23-Feb-17 1:56 Alexey KK 23-Feb-17 1:56
 Re: Seems it doesnt work Jack_Dermody23-Feb-17 9:05 Jack_Dermody 23-Feb-17 9:05
 Re: Seems it doesnt work Alexey KK23-Feb-17 9:32 Alexey KK 23-Feb-17 9:32
 Re: Seems it doesnt work Jack_Dermody23-Feb-17 10:43 Jack_Dermody 23-Feb-17 10:43
 Re: Seems it doesnt work Alexey KK23-Feb-17 10:58 Alexey KK 23-Feb-17 10:58
 Re: Seems it doesnt work Jack_Dermody23-Feb-17 11:15 Jack_Dermody 23-Feb-17 11:15
 Re: Seems it doesnt work Alexey KK23-Feb-17 11:30 Alexey KK 23-Feb-17 11:30
 Parallel Kangerm00se24-Feb-12 0:40 Kangerm00se 24-Feb-12 0:40
 Re: Parallel Member 1058612521-Sep-15 2:08 Member 10586125 21-Sep-15 2:08
 Feature Count Dima_Tsarev19-Nov-10 7:31 Dima_Tsarev 19-Nov-10 7:31
 Re: Feature Count Jack_Dermody19-Nov-10 10:41 Jack_Dermody 19-Nov-10 10:41
 Question about how to use this jhonu1-Feb-10 22:55 jhonu 1-Feb-10 22:55
 Re: Question about how to use this Jack_Dermody3-Feb-10 14:48 Jack_Dermody 3-Feb-10 14:48
 Last Visit: 31-Dec-99 18:00     Last Update: 19-Aug-17 16:57 Refresh 1