WordCloud - A Squarified Treemap of Word Frequency






Aug 10, 2007
A squarified treemap of word frequency
- Download demo project - 61.8 KB
- Download source - 71.9 KB
- Download Microsoft's Data Visualization Components - 2.3 MB


Introduction
WordCloud is a visual depiction of word frequency: how many times each word is used within a given body of text. It does this by reading in plain text, filtering out "stop words", counting how many times each remaining word is used, and displaying the results in a Squarified Treemap. (In the images above, the larger a node and the more saturated its color, the more frequently the word is used.)
Background
I was really impressed, and inspired, by Chirag Mehta's cool web-based tag cloud generator of US Presidential Speeches. So I took a shot at doing a simplified version using .NET.
At best, I'm a hobbyist with the technologies used in this example, so I'm deferring to the various articles I read that led to creating WordCloud.
The Squarified Treemap
- Display is handled by Microsoft's TreemapGenerator, part of the Data Visualization Components suite. While a true treemap utilizes both hierarchical and proportional attributes, WordCloud only uses proportional attributes to show word count.
- Wikipedia's Treemapping overview is a good place to start for understanding the origins.
- Jonathan Hodgson's Squarified Treemaps Code Project article is an excellent in-depth look at this subject.
- WordCloud performs the same basic function as a tag cloud.
- Newsmap - a very impressive Flash-based Squarified Treemap of Google News.
- The "most popular" treemap from the Internet tagging site del.icio.us.
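Since WordCloud only uses the proportional side of a treemap, the core idea is that each rectangle's area tracks a word's count. The following is a minimal sketch of that idea, assuming a plain one-row "slice" layout; it is not the squarified layout, and it is not TreemapGenerator's code. The names `ProportionalLayout`, `Rect`, and `Slice` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of TreemapGenerator): lays word counts out
// as side-by-side strips whose areas are proportional to the counts.
public static class ProportionalLayout
{
    public struct Rect
    {
        public double X, Y, Width, Height;
    }

    public static List<Rect> Slice(IList<int> counts, double width, double height)
    {
        double total = 0;
        foreach (int c in counts)
            total += c;

        List<Rect> result = new List<Rect>();
        if (total <= 0)
            return result; // nothing to lay out

        double x = 0;
        foreach (int c in counts)
        {
            Rect r = new Rect();
            r.X = x;
            r.Y = 0;
            // Every strip has the same height, so width share == area share.
            r.Width = width * c / total;
            r.Height = height;
            result.Add(r);
            x += r.Width;
        }
        return result;
    }
}
```

A squarified layout starts from the same proportions but arranges the rectangles in rows so that they stay as close to square as possible, which keeps the labels readable.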
Stemming
- WordCloud uses the Porter stemming algorithm to reduce words with a common origin to a shared stem, so that their occurrences are counted together.
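To illustrate why stemming matters for the counts, here is a small sketch that groups word variants under one key. The `ToyStem` function is a deliberately crude stand-in that strips a few suffixes; it is not the Porter algorithm, and all names here are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class StemGrouping
{
    // Toy stand-in for a stemmer: strips a few common suffixes.
    // This is NOT the Porter algorithm, only an illustration.
    public static string ToyStem(string word)
    {
        string[] suffixes = { "ing", "ed", "s" };
        foreach (string suffix in suffixes)
        {
            // Keep at least three characters of the word after stripping.
            if (word.Length > suffix.Length + 2 &&
                word.EndsWith(suffix, StringComparison.Ordinal))
            {
                return word.Substring(0, word.Length - suffix.Length);
            }
        }
        return word;
    }

    // Counts words by stem, so "walk", "walked" and "walking" share a total.
    public static Dictionary<string, int> CountByStem(IEnumerable<string> words)
    {
        Dictionary<string, int> counts = new Dictionary<string, int>();
        foreach (string w in words)
        {
            string stem = ToyStem(w.ToLowerInvariant());
            int current;
            counts.TryGetValue(stem, out current); // current is 0 if absent
            counts[stem] = current + 1;
        }
        return counts;
    }
}
```

Without the stemming step, each variant would get its own, smaller node in the treemap.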
Stop words
- A stop-word list is used to filter out common words before the remaining words are counted.
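A minimal sketch of stop-word filtering, assuming the stop words are available as a plain sequence of strings. The class and method names are hypothetical, and `HashSet<T>` requires .NET 3.5 or later, so this is not a drop-in for a .NET 2.0 project.

```csharp
using System;
using System.Collections.Generic;

public static class StopWordFilter
{
    // Returns the input words with every stop word removed.
    // Comparison ignores case, so "The" matches the stop word "the".
    public static List<string> RemoveStopWords(
        IEnumerable<string> words, IEnumerable<string> stopWords)
    {
        HashSet<string> stopSet =
            new HashSet<string>(stopWords, StringComparer.OrdinalIgnoreCase);

        List<string> kept = new List<string>();
        foreach (string w in words)
        {
            if (!stopSet.Contains(w))
                kept.Add(w);
        }
        return kept;
    }
}
```

A set gives constant-time lookups on average, whereas `ArrayList.Contains` scans the list for every word; with a few hundred stop words and a few thousand input words either approach is workable.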
The Code
To build WordCloud, you'll need to grab the latest version of Microsoft's Data Visualization Components and update WordCloud's project references to include TreemapGenerator. You'll find this reference in \VisualizationComponents\Treemap\Latest\Components\TreemapGenerator.dll. NOTE: WordCloud needs .NET Framework 2.0 or greater to build and run.
TreemapPanel.cs
TreemapPanel handles node rendering. Nodes are preprocessed into an ArrayList collection and then added to the TreemapGenerator. Object data is stored within each node in the form of NodeInfo.
// Treemap drawing engine in TreemapPanel.cs
protected TreemapGenerator m_TreemapGenerator;
...
public void PopulateTreeMap(Hashtable wordsHash, Hashtable stemmedWordsHash)
{
    AssertValid();
    ArrayList nodes = new ArrayList();
    ArrayList aKeys = new ArrayList(stemmedWordsHash.Keys);
    aKeys.Sort();
    foreach (string key in aKeys)
    {
        //build each node element: the count is keyed by stem, and
        //wordsHash maps that stem back to the first word seen for it
        int count = (int)stemmedWordsHash[key];
        string name = (string)wordsHash[key];
        //show count in node?
        if (m_bShowWordCount)
            name += String.Format(" ({0})", count);
        NodeInfo nodeinfo = new NodeInfo(name, count);
        nodes.Add(nodeinfo);
    }
    m_nodes = nodes;
    RepopulateTreeMap();
}
...
private void RepopulateTreeMap()
{
    if (m_nodes.Count == 0)
        return;
    Nodes TreemapGeneratorNodes;
    //reset treemap
    m_TreemapGenerator.Clear();
    TreemapGeneratorNodes = m_TreemapGenerator.Nodes;
    foreach (NodeInfo n in m_nodes)
    {
        //does this node have enough to display?
        if (n.Count >= m_nDisplayCount)
        {
            //Create node with basic default size and color
            Node oWordNode = new Node(n.Name, n.Count * 50.0f, 0F);
            //set object data
            oWordNode.Tag = n;
            //add category to tree
            TreemapGeneratorNodes.Add(oWordNode);
            //used later for determining node color; the two checks are
            //independent so a single node can update both bounds
            if (n.Count > m_nLargestCount)
                m_nLargestCount = n.Count;
            if (n.Count < m_nSmallestCount)
                m_nSmallestCount = n.Count;
        }
    }
}
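The NodeInfo class itself is not listed in the article. Judging only from how it is used above (a two-argument constructor plus Name and Count), a minimal sketch might look like the following; the field names and the read-only design are assumptions, and the class in the download may differ.

```csharp
// Hypothetical reconstruction based only on how NodeInfo is used in the
// listing above; the real class in the source download may differ.
public class NodeInfo
{
    private readonly string m_name;
    private readonly int m_count;

    public NodeInfo(string name, int count)
    {
        m_name = name;
        m_count = count;
    }

    // Display text for the node, e.g. "dream" or "dream (9)".
    public string Name
    {
        get { return m_name; }
    }

    // Number of times the word (or its stem) occurred.
    public int Count
    {
        get { return m_count; }
    }
}
```

Keeping the object immutable means a node's Tag can be read safely during painting without worrying about it changing mid-draw.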
Drawing Nodes
The treemap uses custom drawing for nodes, which is called from OnPaint.
// We want to do owner drawing, so handle the DrawItem event.
m_TreemapGenerator.DrawItem +=
    new TreemapGenerator.TreemapDrawItemEventHandler(DrawItem);
...
protected override void OnPaint(PaintEventArgs e)
{
    AssertValid();
    // Save the Graphics object so it can be accessed by DrawItem().
    m_Graphics = e.Graphics;
    // Tell the TreemapGenerator to draw the treemap using owner-
    // implemented code. This causes the DrawItem event to get fired for
    // each node in the treemap.
    m_TreemapGenerator.Draw(this.ClientRectangle);
    // All DrawItem events have been fired. Make sure the Graphics object
    // doesn't get used again.
    m_Graphics = null;
}
Node rendering is handled in DrawItem(). Within this method we extract the NodeInfo object, get name and count, set color and text size based on count, and then draw the node. Final node result: the greater the count, the larger the text and more saturated the color.
private void DrawItem(Object sender, TreemapDrawItemEventArgs e)
{
    AssertValid();
    Node oNode = e.Node;
    float fontSize = m_FontSize;
    int count = 0;
    // Retrieve the NodeInfo object from the node's tag.
    if (oNode.Tag is NodeInfo)
    {
        //get word count
        NodeInfo oInfo = (NodeInfo)oNode.Tag;
        count = oInfo.Count;
        //if we're using text scaling, increment font size
        if (m_bUseTextScaling)
            fontSize += oInfo.Count;
    }
    else
    {
        //should never get here
        Debug.WriteLine("DrawItem: Skipping node - bad");
        return;
    }
    //set color alpha based on frequency
    Color newStartColor = GetColor(count, m_startColor);
    Color newEndColor = GetColor(count, m_endColor);
    //set gradient colors and gamma
    LinearGradientBrush nodeBrush = new LinearGradientBrush(e.Bounds,
        newStartColor, newEndColor, LinearGradientMode.Vertical);
    nodeBrush.GammaCorrection = true;
    m_Graphics.FillRectangle(nodeBrush, e.Bounds);
    // Create font and align in the center
    Font newfont = new Font(m_FontName, fontSize, m_FontStyle);
    StringFormat sf = new StringFormat();
    sf.Alignment = StringAlignment.Center;
    sf.LineAlignment = StringAlignment.Center;
    //draw the text
    SolidBrush textBrush = new SolidBrush(m_FontColor);
    m_Graphics.DrawString(e.Node.Text, newfont, textBrush, e.Bounds, sf);
    // Draw a black border around each node
    Pen blackPen = new Pen(Color.Black, 2);
    m_Graphics.DrawRectangle(blackPen, e.Bounds);
    //clean up every GDI+ object created above
    nodeBrush.Dispose();
    newfont.Dispose();
    sf.Dispose();
    textBrush.Dispose();
    blackPen.Dispose();
}
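GetColor() is called above but not listed. Going only by the comment "set color alpha based on frequency", one plausible approach is to scale the alpha channel linearly between the smallest and largest counts. The sketch below computes just that alpha value; the name `AlphaForCount`, the `minAlpha` floor, and the linear mapping are all assumptions rather than the article's actual implementation.

```csharp
using System;

public static class FrequencyColor
{
    // Maps a count in [smallest, largest] linearly onto an alpha value in
    // [minAlpha, 255]. Hypothetical helper; the article's GetColor() is
    // not shown and may work differently.
    public static int AlphaForCount(int count, int smallest, int largest, int minAlpha)
    {
        // With no spread in the counts, draw everything fully opaque.
        if (largest <= smallest)
            return 255;

        // Clamp so out-of-range counts cannot produce an invalid alpha.
        int clamped = Math.Max(smallest, Math.Min(largest, count));
        double t = (double)(clamped - smallest) / (largest - smallest);
        return minAlpha + (int)Math.Round(t * (255 - minAlpha));
    }
}
```

In a drawing routine, the result could be combined with a base color through `Color.FromArgb(alpha, baseColor)` from System.Drawing, so that frequent words appear more saturated against the background.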
"Massaging" The Text
A worker thread method, DoWordProcessing(), in the main form processes the word collection document. Stemming is also performed in this method for word suffix stripping.
private void DoWordProcessing(object obj)
{
    //unpack array
    object[] objArray = (object[])obj;
    IProgressCallback callback = (IProgressCallback)objArray[0];
    StringBuilder sbRawText = (StringBuilder)objArray[1];
    ArrayList arrStopWords = (ArrayList)objArray[2];
    try
    {
        //Build a hash of words and their frequency
        Hashtable wordsHash = new Hashtable();
        Hashtable stemmedWordsHash = new Hashtable();
        PorterStemmer ps = new PorterStemmer();
        //construct our document from the input text
        Document doc = new Document(sbRawText.ToString());
        callback.Begin(0, doc.Words.Count);
        for (int i = 0; i < doc.Words.Count; ++i)
        {
            //cancel button clicked?
            if (callback.IsAborting)
            {
                //the finally block below calls callback.End()
                return;
            }
            //update progress dialog
            callback.SetText(String.Format("Reading word: {0}", i));
            callback.StepTo(i);
            //Don't do numbers
            if (!IsNumeric(doc.Words[i]))
            {
                // normalize each word to lowercase
                string key = doc.Words[i].ToLower();
                //check stop words list
                if (!arrStopWords.Contains(key))
                {
                    //set our stemming term
                    ps.stemTerm(key);
                    //get the stem word
                    string stemmedKey = ps.getTerm();
                    //either add to hash or increment frequency
                    if (!stemmedWordsHash.Contains(stemmedKey))
                    {
                        //add new word; the first form seen becomes the label
                        stemmedWordsHash.Add(stemmedKey, 1);
                        wordsHash.Add(stemmedKey, key);
                    }
                    else
                    {
                        //increment word count
                        stemmedWordsHash[stemmedKey] =
                            (int)stemmedWordsHash[stemmedKey] + 1;
                    }
                }
            }
        }
        //now let the treemap load the information
        this.TreePanel.PopulateTreeMap(wordsHash, stemmedWordsHash);
    }
    catch (System.Threading.ThreadAbortException)
    {
        // noop
    }
    catch (System.Threading.ThreadInterruptedException)
    {
        // noop
    }
    finally
    {
        if (callback != null)
        {
            callback.End();
        }
    }
}
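The IsNumeric() helper used above to skip numbers is not listed. A minimal sketch, assuming that "numeric" means anything that parses as a plain decimal number (optionally with a sign, thousands separators, or a decimal point) under the invariant culture; the article's own helper may apply a different rule.

```csharp
using System.Globalization;

public static class WordTests
{
    // Returns true when the whole token parses as a decimal number,
    // e.g. "1963", "1,588" or "-3.5". Hypothetical stand-in for the
    // article's unlisted IsNumeric() helper.
    public static bool IsNumeric(string word)
    {
        decimal ignored;
        return decimal.TryParse(
            word,
            NumberStyles.Number,
            CultureInfo.InvariantCulture,
            out ignored);
    }
}
```

Using `decimal` rather than `double` here avoids treating tokens such as "NaN" or "Infinity" as numbers, since `decimal` has no such values.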
The Demo Application
Controls

Description of the toolbar buttons (in order from left to right):
- Open Text File: Open a text file document to visualize
- Input Text: Paste text into this dialog from another document to visualize (128k max, but can be changed to your liking)
- Stop Words: A dialog allowing you to modify the set of stop words**
- Font: A dialog allowing you to set the display font
- Node Color: A dialog allowing you to set the gradient colors for node display
- Scale Text: Toggle for scaling text relative to count
- Show Count: Toggle for showing/hiding word count in nodes**
- Minimum word count slider: Dynamically controls how many nodes to display based on word frequency
- Save as image: Save the treemap as a GIF image
**NOTE: Document text is not retained in memory; it's only parsed, added to the treemap as nodes, and then discarded. So the Show Count and Stop Words settings only take effect if they are set before opening or inputting text; changing them afterwards does not update the counts or stop words for nodes already displayed.
Input Data
I've tried various document sizes, ranging from 400 to 6000 words - mostly presidential speeches and the like. In the project, I've included two text files: mlk.txt and kennedy.txt. These are Martin Luther King's "I Have a Dream" address at the March on Washington, August 28, 1963, and former United States President John F. Kennedy's 1961 State of the Union Address - 1,588 and 5,184 words respectively.
Another issue to be aware of is stop words. I've added a default set of stop words, which is user configurable and greatly affects word parsing. The 430 stop words provided are fairly standard and cover a wide range of common words without getting too aggressive.
Conclusion
While crude, un-optimized, not web-based, and entry level at best when compared to other tag/word cloud generators, the example could perhaps be a starting point for someone interested in the idea. It also may serve as a basic example using Microsoft's TreemapGenerator from the Data Visualization Components suite.
Attribution
Tony Capone's Google Groups posting for the TreemapGenerator code
Matthew Adams's Progress Dialog
Leif Azzopardi's port of the Porter stemming algorithm