Automating Semantic Mapping of a Document With Natural Language Processing

Marc Clifton

Aug 13, 2014

CPOL

Using AlchemyAPI, we create visualizations of keyword and sentence relationships so the user can extract meaningful concepts quickly and efficiently.

Source Code and Running the Program

The source code for this article is hosted on GitHub: https://github.com/cliftonm/nlpvisualizer

To run the program, you will need to obtain an API key on AlchemyAPI's registration page. The free account permits you to perform 1000 queries per day. You can put the key directly into the source code or, as I have done, create the file "alchemyapikey.txt" in the bin\debug folder and copy your key into the first line of that file.
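As a minimal sketch of the key-file approach, assuming the key sits on the first line of a text file, loading it could look something like this. The class and method names are hypothetical and are not taken from the repository.

```csharp
using System;
using System.IO;
using System.Linq;

public static class ApiKeyLoader
{
    // Hypothetical helper: reads the API key from the first line of the given file.
    // The default file name matches the one described above; a relative path is
    // resolved against the current working directory.
    public static string LoadApiKey(string path = "alchemyapikey.txt")
    {
        if (!File.Exists(path))
        {
            throw new FileNotFoundException("API key file not found.", path);
        }

        // Only the first line is used; anything after it is ignored.
        string key = File.ReadLines(path).FirstOrDefault();

        if (string.IsNullOrWhiteSpace(key))
        {
            throw new InvalidOperationException("The API key file is empty.");
        }

        return key.Trim();
    }
}
```

Keeping the key in a file rather than in the source makes it harder to commit the key to a public repository by accident.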

Using the Program

Basic Operation

Enter a URL in the URL textbox and click Process.
Once the keywords are displayed, you can click on the keyword list to display sentences containing that keyword and update the selected visualizer for that keyword.
If there are multiple sentences, double-click on a sentence in the RichTextBox to narrow the scope of the visualization down to that one sentence.
Navigate previous and next sentences with the "Prev. Sentence" and "Next Sentence" buttons.

Visualization

Right-click and drag to move the entire visualization surface.
Use the mouse wheel to zoom in and out. (This is a bit problematic because the mouse wheel event is delivered to whichever control on the form has focus.) Zooming is only available in the Keyword Directed Graph visualization. Alternatively, or if you don't have a mouse wheel, left-click and drag up/right or down/left to zoom in/out.
In the "Neighboring Sentence Keywords" visualization, double-click on a keyword to select that keyword to navigate to in the text and visualization.
In the "Keyword Directed Graph" visualization, double-click on a node (blue circle) to select that keyword to navigate to in the text and visualization.

Introduction

Natural Language Processing (NLP) aims to enable computers to derive meaning from human or natural language input. In my article reviewing three NLP services, we saw that these services extract entities, keywords, topics, events, themes and concepts. Other than themes and concepts, the results are essentially keywords or phrases. The extracted "strings" often have an associated relevance or strength, count or frequency, and/or sentiment value. I used the features of one NLP provider, AlchemyAPI, in another article to provide some filtering capabilities of RSS feeds, enabling the user to create filters based on the extracted strings and additional values.

Meaning in Concepts

Still, I found myself rather dissatisfied with the results. My first issue is with concept extraction. When analyzing a short publication on The Threefold Social Order and Education Reform¹, AlchemyAPI's "concept" extraction is very high level:

sociology
education
soul
meaning of life
religion
human
school
life

So are Semantria's "themes":

economic organization
social organism
human nature
economic system
bourgeois world view

However, given this sentence in the document:

"Rather, the spiritual-cultural organ of the social organism should, following the dictates of its own independent administration, bring those who are suitably gifted to a certain level of cultivation, and the state and economic life should organize themselves in accordance with the results of work in the spiritual-cultural sphere."

None of the NLP services that I reviewed earlier determine that this sentence is dealing with the concept of "Meritocracy."²

Meaning in Relationships

My second concern is that meaning is closely related to relationships between keywords or concepts. This article discusses two approaches for extracting relational meaning from keywords within a single document, creating a kind of semantic mind map or concept map. The two approaches use two different kinds of visualizations -- one is a simple "keywords in adjacent sentences" visualization, and the other is a force directed graph³ (FDG) of the relationships between keywords in the sentences in which the selected and related keywords occur. How to read the FDG will be explained in more detail later. The FDG code was originally written and posted by Bradley Smith in his July 2, 2010 blog entry⁴ -- I have made some minor modifications to that code to improve processing and to render text nodes.

Visualizations

There are two visualizations: adjacent sentence keywords and keyword associations. For these examples, I am using the Wikipedia page on Founding Fathers of the United States⁵.

Adjacent Sentence Keyword Visualization

In the sentence containing the keyword "national affairs":

The previous sentence contains the keyword "well-educated men" and the next sentence contains the keywords "American Revolution" and "Continental Army":

This visualization turns out to be quite useful. While it should not be interpreted as showing any causal relationship, it can be interpreted as showing a conceptual relationship. In the above keyword relationships, for example, the three sentences together are:

Almost all of them were well-educated men of means who were leaders in their communities.
Many were also prominent in national affairs.
Virtually every one had taken part in the American Revolution; at least 29 had served in the Continental Army, most of them in positions of command.

One can quickly determine that the concept here is that "these men (in this case, delegates of the Federal Convention in Philadelphia, determined by inspecting prior sentences) were well educated, prominent in national affairs, and almost all had taken part in the American Revolution or served in the Continental Army and most were in positions of command." When working with a complex document, keyword adjacency allows you to quickly create a concept from the surrounding text, which may have been missed in the overall complexity of the text.

Also note that double-clicking on a keyword in the visualization shows all the sentences containing that keyword as well as updating the visualization. For example, when double-clicking on "well-educated men", the program reveals:

Keyword Directed Graph

The second visualization is a directed graph of keyword associations. To explain this, let's start with something basic using the sentence "Many were also prominent in national affairs":

What this graph shows is that this sentence has only one keyword, which is "national affairs." Because this keyword does not appear in any other sentences, there are no further links.

Now let's look at this sentence, a little bit earlier in the text:

"As a sanctuary for Baptists, Rhode Island's absence at the Convention in part explains the absence of Baptist affiliation among those who did attend."

This sentence can also be found by clicking on the keyword "Baptist affiliation."

Here we have a more complex graph. Starting with the sentence "As a sanctuary..." we see that it has two keywords:

Baptist affiliation
Rhode Island

"Baptist affiliation" is not contained in any other sentences and therefore does not have any child nodes. However, "Rhode Island" is contained in one or more sentences, having two other keywords:

delegates
Convention delegates

The keyword "delegates" is used in one or more sentences containing keywords:

United States Declaration
Constitutional Convention
large group
Founding Fathers
United States

These graphs can become complex, as illustrated by the starting text:

"The Founding Fathers of the United States of America were political leaders and statesmen who participated in the American Revolution by signing the United States Declaration of Independence, taking part in the American Revolutionary War, and establishing the United States Constitution."

There are two constants, not exposed at the moment in the UI, that limit the depth and breadth of the directed graph:

const int FDG_DEPTH_LIMIT = 3;
const int FDG_NODE_KEYWORD_LIMIT = 5;

The keyword association directed graph is a very interesting way of mapping out the relationship between concepts that occur within sentences. One can quickly discover additional paths for investigating concepts based on how keywords are associated with each other, which I've found helps to build a broader picture of what the text is discussing. So, for example, while adjacent keywords usually stay within a closely knit thought process, the keyword association graph allows one to explore more loosely coupled concepts around the central theme.

Double-clicking on a keyword's node (the blue circle) in the visualization shows all the sentences containing that keyword as well as updating the visualization.

Relevance Weighting

Keyword font size reflects the relevance (as determined by AlchemyAPI) of the keyword. So, for example, because the keyword "United States" has the highest relevance (0.92971), it is displayed in a large font. The relevance scale is from 0 to 1. The font size is computed by subtracting the minimum relevance found in the document from the keyword's relevance, multiplying the result by 16 (the FONT_WEIGHT_MULTIPLIER constant), and adding that value to the base font size of 8:

font = new Font(
  FontFamily.GenericSansSerif, 
  (float)(8.0 + (Program.app.keywordRelevanceMap[keyword] - Program.app.minRelevance) * FONT_WEIGHT_MULTIPLIER));
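To make the arithmetic concrete, here is a small standalone sketch of the same formula. The class and method names are hypothetical, and the minimum relevance of 0.25 used in the comment is invented for illustration; only the 0.92971 relevance, the base size of 8 and the multiplier of 16 come from the text above.

```csharp
using System;

public static class RelevanceFont
{
    public const double BaseFontSize = 8.0;
    public const double FontWeightMultiplier = 16.0;

    // Same formula as above: base size plus the relevance offset
    // (relevance minus the document's minimum relevance) times the multiplier.
    public static float SizeFor(double relevance, double minRelevance)
    {
        return (float)(BaseFontSize + (relevance - minRelevance) * FontWeightMultiplier);
    }
}

// Example (minRelevance = 0.25 is an assumed value):
//   RelevanceFont.SizeFor(0.92971, 0.25) = 8 + 0.67971 * 16 = 18.87536
```

With those assumed numbers the most relevant keyword is drawn at roughly 18.9 points, while the least relevant keyword in any document always gets the base size of 8.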

The Code

While there's nothing complex about the code, I'll discuss the basic processes here.

Document Analysis

The program analyzes web pages (as opposed to document text that you input yourself) from the URL that you enter on the main form. You may discover that you will get a "content exceeded" error message for some pages, as there is a size limit to content that AlchemyAPI processes.

The processing has three parts:

Obtaining the scraped content using AlchemyAPI's URLGetText method.
Obtaining the keywords from that content using AlchemyAPI's TextGetRankedKeywords method.
Performing a keyword-sentence relationship lookup map pre-process.

/// <summary>
/// Analyze the document, extracting the text and the keywords, and create the keyword-sentence maps.
/// </summary>
protected async void Process(object sender, EventArgs args)
{
  btnProcess.Enabled = false;
  ClearAllGrids();
  string url = tbUrl.Text;
  sbStatus.Text = "Acquiring page content...";
  try
  {
    pageText = await Task.Run(() => GetUrlText(url));
    pageSentences = ParseOutSentences(pageText);

    sbStatus.Text = "Acquiring keywords from AlchemyAPI...";
    dsKeywords = GetKeywords(url, pageText);

    sbStatus.Text = "Processing results...";
    dvKeywords = new DataView(dsKeywords.Tables["keyword"]);
    CreateSortedKeywordList(dvKeywords);
    CreateSentenceKeywordMaps(dvKeywords);
    CreateKeywordRelevanceMap(dvKeywords); // Must be done before assigning the data source.
    sbStatus.Text = "Ready";
    dgvKeywords.DataSource = dvKeywords;
    lblAlchemyKeywords.Text = String.Format("Keywords: {0}", dvKeywords.Count);
    btnProcess.Enabled = true;
  }
  catch (Exception ex)
  {
    MessageBox.Show(ex.Message, "Processing Error", MessageBoxButtons.OK);
  }
  finally
  {
    sbStatus.Text = "Ready";
    btnProcess.Enabled = true;
  }
}

Several "mappings" are created between keywords, sentence indices, and relevance values to facilitate visualization of selected keywords:

/// <summary>
/// Create the sentence-keyword map (list of keywords in each sentence.)
/// Create the keyword-sentence map (list of sentence indices for each keyword.)
/// </summary>
/// <param name="dvKeywords"></param>
protected void CreateSentenceKeywordMaps(DataView dvKeywords)
{
  sentenceKeywordMap.Clear();
  keywordSentenceMap.Clear();

  // For each sentence, get all the keywords in that sentence.
  pageSentences.ForEachWithIndex((s, idx) =>
  {
    List<string> keywordsInSentence = new List<string>();
    sentenceKeywordMap[idx] = keywordsInSentence;
    string sl = s.ToLower();

    // For each of the returned keywords in the view...
    dvKeywords.ForEach(row =>
    {
      string keyword = row[0].ToString();

      if (sl.Contains(keyword.ToLower()))
      {
        // Add keyword to sentence-keyword map.
        keywordsInSentence.Add(keyword);

        // Add sentence to keyword-sentence map.
        List<int> sentences;
        
        if (!keywordSentenceMap.TryGetValue(keyword, out sentences))
        {
          // No entry for this keyword yet, so create the sentence indices list.
          sentences = new List<int>();
          keywordSentenceMap[keyword] = sentences;
        }

        sentences.AddIfUnique(idx);
      }
    });
  });
}
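ForEachWithIndex and AddIfUnique are not .NET framework methods, so the code above relies on extension methods defined elsewhere in the project. A minimal sketch of what such helpers might look like follows; these are my assumptions about their behavior based on how they are called, not the repository's actual implementations.

```csharp
using System;
using System.Collections.Generic;

public static class MapExtensions
{
    // Invokes the action for each item, passing the item and its zero-based index.
    public static void ForEachWithIndex<T>(this IEnumerable<T> items, Action<T, int> action)
    {
        int idx = 0;

        foreach (T item in items)
        {
            action(item, idx++);
        }
    }

    // Adds the item only if the list does not already contain it.
    public static void AddIfUnique<T>(this IList<T> list, T item)
    {
        if (!list.Contains(item))
        {
            list.Add(item);
        }
    }
}
```

AddIfUnique matters here because a keyword that occurs twice in the same sentence should still yield a single sentence index in the keyword-sentence map.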

RichTextBox Display

When a keyword is selected, the sentences containing that keyword are displayed with that keyword highlighted.

/// <summary>
/// When a keyword is selected from the grid or the visualizer, we update the RichTextBox
/// to display the sentences containing the keyword and also the keyword relationship visualization.
/// </summary>
public void ShowKeywordSelection(string keyword)
{
  textboxEventsEnabled = false;
  ShowSentences(keyword);
  textboxEventsEnabled = true;
  rtbSentences.SelectionStart = 0;
  surface.NewKeyword(keyword);
  UpdateKeywordVisualization();
}

This is accomplished by parsing the sentence for the selected keyword and building the text in the RichTextBox as each keyword occurrence is encountered:

/// <summary>
/// Build the text, checking for keyword occurrence and if found, colorizing the keyword.
/// </summary>
protected void ShowSentences(string keyword)
{
  rtbSentences.Clear();
  displayedSentenceIndices.Clear();

  pageSentences.ForEachWithIndex((sentence, sidx) =>
  {
    string s = sentence.ToLower();
    string k = keyword.ToLower();
    int idx = s.IndexOf(k);
    bool found = idx >= 0;
    int start = 0; // Offset into the master (original-case) sentence.

    while (idx >= 0)
    {
      // Remember the index of this sentence, but we don't want duplicates.
      if (!displayedSentenceIndices.Contains(sidx))
      {
        displayedSentenceIndices.Add(sidx);
      }

      // Use master sentence to preserve casing, both for the text before
      // the keyword and for the keyword itself.
      rtbSentences.AppendText(sentence.Substring(start, idx));
      rtbSentences.AppendText(sentence.Substring(start + idx, keyword.Length), Color.Red);

      // Get remainder.
      s = s.Substring(idx + keyword.Length);
      start += idx + keyword.Length; // for master sentence.
      idx = s.IndexOf(k);
    }

    if (found)
    {
      // Append the remainder from the master sentence (s is lowercased).
      rtbSentences.AppendText(sentence.Substring(start));
      rtbSentences.AppendText("\n\n");
    }
  });
}
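The splitting logic can also be expressed without a RichTextBox, which makes it easier to test. The following is a sketch under my own assumptions (the KeywordHighlighter class is hypothetical and not part of the project): it breaks a sentence into segments and flags the ones that match the keyword, ignoring case, so that a caller could append each segment in the appropriate color.

```csharp
using System;
using System.Collections.Generic;

public static class KeywordHighlighter
{
    // Splits the sentence into (text, isKeyword) segments using a
    // case-insensitive search. Segment text keeps the sentence's original casing.
    public static List<(string Text, bool IsKeyword)> Split(string sentence, string keyword)
    {
        var segments = new List<(string Text, bool IsKeyword)>();

        if (string.IsNullOrEmpty(keyword))
        {
            segments.Add((sentence, false));
            return segments;
        }

        int start = 0;
        int idx = sentence.IndexOf(keyword, start, StringComparison.OrdinalIgnoreCase);

        while (idx >= 0)
        {
            if (idx > start)
            {
                // Plain text between the previous match and this one.
                segments.Add((sentence.Substring(start, idx - start), false));
            }

            segments.Add((sentence.Substring(idx, keyword.Length), true));
            start = idx + keyword.Length;
            idx = sentence.IndexOf(keyword, start, StringComparison.OrdinalIgnoreCase);
        }

        if (start < sentence.Length)
        {
            // Whatever follows the last match.
            segments.Add((sentence.Substring(start), false));
        }

        return segments;
    }
}
```

Searching the original string with a case-insensitive comparison avoids maintaining a separate lowercased copy and a running offset.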

Adjacent Sentence Keyword Visualization

The code for generating the adjacent sentence keyword visualization first draws the previous keywords, then the next keywords, and then the current keyword, so that the current keyword appears above the connecting lines:

protected void DrawNeighboringSentenceKeywords(Graphics gr)
{
  try
  {
    // Get location of keyword in the center of the drawing surface.
    Point ctr = new Point(Size.Width / 2, Size.Height / 2);

    keywordLocationMap.Clear();
    DrawPreviousKeywords(gr, ctr);
    DrawNextKeywords(gr, ctr);
    DrawKeyword(gr, keyword); // Last, so that text appears above lines.
  }
  catch (Exception ex)
  {
    System.Diagnostics.Debug.WriteLine(ex.Message);
  }
}

The previous and next keywords are predetermined when the user clicks on a keyword in the keyword list or filters the sentences containing that keyword to a single sentence:

protected void UpdateKeywordVisualization()
{
  List<SentenceInfo> prevKeywords = GetPreviousSentencesKeywords();
  List<SentenceInfo> nextKeywords = GetNextSentencesKeywords();
  surface.PreviousKeywords(prevKeywords);
  surface.NextKeywords(nextKeywords);

  if (directedGraph)
  {
    UpdateDirectedGraph();
  }

  surface.Invalidate(true);
}

Ultimately, given the sentence index, this is a simple lookup and conversion into a list of SentenceInfo instances.

protected List<SentenceInfo> GetKeywordsInSentence(int idx)
{
  List<SentenceInfo> ret = new List<SentenceInfo>();
  sentenceKeywordMap[idx].ForEach(k => ret.Add(new SentenceInfo() 
      { Keyword = k, Index = idx, Relevance = keywordRelevanceMap[k] }));

  return ret;
}
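GetPreviousSentencesKeywords and GetNextSentencesKeywords are not shown in this article. As a rough sketch of the underlying idea only, and assuming a sentence-keyword map shaped like the one built earlier, the neighbors of a single sentence could be looked up as follows. The NeighborKeywords class is hypothetical, and it returns plain keyword strings rather than SentenceInfo instances to stay self-contained.

```csharp
using System.Collections.Generic;

public static class NeighborKeywords
{
    // Keywords of the sentence at idx, or an empty list if there is no such sentence
    // (for example, before the first or after the last sentence).
    public static List<string> At(Dictionary<int, List<string>> sentenceKeywordMap, int idx)
    {
        List<string> keywords;

        return sentenceKeywordMap.TryGetValue(idx, out keywords)
            ? keywords
            : new List<string>();
    }

    // Keywords of the sentences immediately before and after the sentence at idx.
    public static (List<string> Previous, List<string> Next) Around(
        Dictionary<int, List<string>> sentenceKeywordMap, int idx)
    {
        return (At(sentenceKeywordMap, idx - 1), At(sentenceKeywordMap, idx + 1));
    }
}
```

For the "national affairs" example earlier, the previous list would hold "well-educated men" and the next list would hold "American Revolution" and "Continental Army".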

If the selected keyword does not appear in the current sentence, the visualization will render the center with empty brackets "[ ]":

Keyword Directed Graph

As discussed earlier, this is a recursive search of keywords as determined by their associative occurrences in sentences. The algorithm is limited in depth and breadth by two constants:

const int FDG_DEPTH_LIMIT = 3;
const int FDG_NODE_KEYWORD_LIMIT = 5;

Also, duplicate keywords are omitted during the traversal. The algorithm begins with keywords in the current sentence and recurses, for each keyword, to other sentences containing that keyword. In those sentences, the associated keywords determine the next level of recursion:

protected void UpdateDirectedGraph()
{
  mDiagram.Clear();
  parsedKeywords.Clear();

  string ctrSentence = FirstThreeWords(pageSentences[displayedSentenceIndices[0]]);
  Node node = new TextNode(surface, ctrSentence);
  ((TextNode)node).Brush = surface.greenBrush;
  mDiagram.AddNode(node);

  // Get the keywords of all sentences for the current sentence or sentences containing the selected keyword.
  List<SentenceInfo> keywords = GetSentencesKeywords();
  keywords = keywords.RemoveDuplicates((si1, si2) => si1.Keyword.ToLower() == si2.Keyword.ToLower()).ToList();
  parsedKeywords.AddRange(keywords.Select(si => si.Keyword.ToLower()));
  AddKeywordsToGraphNode(node, keywords, 0);
  mDiagram.Arrange();
}

protected void AddKeywordsToGraphNode(Node node, List<SentenceInfo> keywords, int depth)
{
  if (depth < FDG_DEPTH_LIMIT)
  {
    int idx = 0;

    foreach(SentenceInfo si in keywords)
    {
      // Limit # of keywords we display.
      if (idx++ < FDG_NODE_KEYWORD_LIMIT)
      {
        Node child = new TextNode(surface, si.Keyword);
        node.AddChild(child);

        // Get all sentences indices containing this keyword:
        List<int> containingSentences = keywordSentenceMap[si.Keyword];

        // Now get the related keywords for each of those sentences. 
        List<SentenceInfo> relatedKeywords = new List<SentenceInfo>();

        containingSentences.ForEach(cs =>
        {
          // Get the unique and previously not processed keywords in the sentence.
          List<SentenceInfo> si3 = GetKeywordsInSentence(cs).Where(sik => !parsedKeywords.Contains(sik.Keyword.ToLower())).ToList();
          si3 = si3.RemoveDuplicates((si1, si2) => si1.Keyword.ToLower() == si2.Keyword.ToLower()).ToList();
          relatedKeywords.AddRange(si3);
          parsedKeywords.AddRange(si3.Select(sik=>sik.Keyword.ToLower()));
        });

        if (relatedKeywords.Count > 0)
        {
          AddKeywordsToGraphNode(child, relatedKeywords, depth + 1);
        }
      }
      else
      {
        break;
      }
    }
  }
}
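The traversal can be easier to follow when separated from the drawing code. Below is a simplified sketch of the same idea, under my own assumptions: GraphNode and KeywordGraphBuilder are hypothetical stand-ins for the diagram classes, keywords are plain strings, and a case-insensitive HashSet replaces the parsedKeywords list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class GraphNode
{
    public string Label { get; }
    public List<GraphNode> Children { get; } = new List<GraphNode>();
    public GraphNode(string label) { Label = label; }
}

public static class KeywordGraphBuilder
{
    const int DepthLimit = 3;        // Mirrors FDG_DEPTH_LIMIT.
    const int NodeKeywordLimit = 5;  // Mirrors FDG_NODE_KEYWORD_LIMIT.

    public static GraphNode Build(
        string rootLabel,
        IEnumerable<string> startKeywords,
        Dictionary<string, List<int>> keywordSentenceMap,
        Dictionary<int, List<string>> sentenceKeywordMap)
    {
        var root = new GraphNode(rootLabel);
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        // HashSet.Add returns false for duplicates, so this also de-duplicates.
        List<string> start = startKeywords.Where(k => seen.Add(k)).ToList();
        AddChildren(root, start, 0, seen, keywordSentenceMap, sentenceKeywordMap);

        return root;
    }

    static void AddChildren(
        GraphNode node, List<string> keywords, int depth, HashSet<string> seen,
        Dictionary<string, List<int>> keywordSentenceMap,
        Dictionary<int, List<string>> sentenceKeywordMap)
    {
        if (depth >= DepthLimit) return;

        // Breadth limit: only the first few keywords become child nodes.
        foreach (string keyword in keywords.Take(NodeKeywordLimit))
        {
            var child = new GraphNode(keyword);
            node.Children.Add(child);

            // Collect keywords, not yet seen, from every sentence containing this keyword.
            var related = new List<string>();
            List<int> sentences;

            if (keywordSentenceMap.TryGetValue(keyword, out sentences))
            {
                foreach (int s in sentences)
                {
                    related.AddRange(sentenceKeywordMap[s].Where(k => seen.Add(k)));
                }
            }

            if (related.Count > 0)
            {
                AddChildren(child, related, depth + 1, seen, keywordSentenceMap, sentenceKeywordMap);
            }
        }
    }
}
```

As in the code above, a keyword is marked as seen the moment it is discovered, so each keyword appears at most once in the graph even when it falls outside the breadth limit.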

I refer you to Brad Smith's excellent blog⁴ on force directed graphs for further reading on the algorithm that generates the graph.

Going Deeper

As a research tool, it is also useful to create relationships between documents. This requires building a database of documents and extracted keywords/concepts so that a program such as the one presented here can correlate keywords/concepts between documents, enabling the user to investigate a concept beyond the scope of one single document. I may at some point add this capability.

Conclusion

In practice, I find that this program is a very effective tool for focusing on specific points in an article or blog. Navigating a document one sentence at a time is useful in and of itself because it helps reduce the clutter of the entire document. The adjacent sentence keyword visualization helps in exploring related keywords within the same "thought", facilitating the quick construction of a primary concept. Using the keyword association directed graph, the primary concept can be expanded to include other peripheral concepts. It is quite enjoyable and instructive to work with a document in this way.

References

1. http://wn.rsarchive.org/Books/GA024/English/AP1985/GA024_c04.html

2. http://en.wikipedia.org/wiki/Meritocracy

3. http://en.wikipedia.org/wiki/Force-directed_graph_drawing

4. http://www.brad-smith.info/blog/archives/129

5. http://en.wikipedia.org/wiki/Founding_Fathers_of_the_United_States