
The Semantic Web and Natural Language Processing

16 Jul 2014 CPOL
Using AlchemyAPI, process and filter RSS feeds in the Higher Order Programming Environment

To Run This Application...

  1. You will need to register with AlchemyAPI to obtain an API key and this key must be placed in the file "alchemyapikey.txt" in the bin\Debug (or bin\Release) folder.
  2. Download the code from https://github.com/cliftonm/HOPE
  3. Check out the branch "semantic-feed-reader".  Bug fixes related to this article will be made on this branch.
  4. When you launch HOPE, load the applet called "NewFeedReaderTabbed"
  5. The various display forms may disappear behind the HOPE application main window -- move/resize the main window to get it out of the way. Also, the forms initially display on top of each other. Arrange them as you wish and then save the applet--the size and positions of the main window and display forms are persisted.
  6. If you're interested in other APIs for, say, C++, Android, Java, Ruby, etc., visit the page.

Introduction

To state the obvious, there is a vast amount of information "in the cloud", and it grows every millisecond.  Some of it is fairly static, like a Wikipedia page, news article, or blog, and some of it is very dynamic, like stock tickers, weather, and tweets.  Again stating the obvious, from a usability standpoint, integrated means of digesting that information so that the user is presented only with things that have meaning to that user simply do not exist, or if they do, they're limited to "here, Google, filter my news items by these categories."  But if, for example, I want to be alerted when someone blogs about Visual Studio 14 (or whatever version of VS is in CTP when you read this article), well, good luck with that.

We can look at the Semantic Web:

By encouraging the inclusion of semantic content in web pages, the Semantic Web aims at converting the current web, dominated by unstructured and semi-structured documents into a "web of data". (http://en.wikipedia.org/wiki/Semantic_Web)

But adoption of this movement is glacially slow and probably will not deliver enough semantic information about the content to be actually useful.

That leaves us with "Natural Language Processing", or NLP:

"enabling computers to derive meaning from human or natural language input" (http://en.wikipedia.org/wiki/Natural_language_processing).

Using NLP, we can extract the actual semantic meaning of the content.  This article explores integrating one NLP service (AlchemyAPI) with web page scraping (a feature of AlchemyAPI) to extract and persist that semantic meaning.  Given a basic set of functionality, many features can then be further developed (such as tracking / reporting on trends) from the semantic meaning once it has been derived from content.  These additional features may be explored in future articles.  Specifically, what will be presented here is:

  1. Using the SyndicationFeed class to acquire feed items
  2. Extracting the semantic meaning using AlchemyAPI's NLP
  3. Persisting feed item links and each item's associated meaning
  4. Providing a simple UI presentation for exploring feed items and their associated semantics

RSS is a niche tool, and one needs other tools for non-feed content.  Rather than develop a monolithic application glued to RSS, I'm also going to demonstrate a D-Tier approach (distributed, dimensional, dynamic) for building this application, using the Higher Order Programming Environment (HOPE) as the platform of choice.  This can be leveraged to include other means of acquiring content, other NLP processors (such as OpenCalais or Semantria), and components for working with the semantic meaning in other unique and interesting ways.  If you are unfamiliar with my previous articles on HOPE, please read the introductory article.  Throughout, I will be interweaving discussions of HOPE development with the primary topic of this article.

AlchemyAPI

AlchemyAPI is one of several NLP services.  For my particular purposes, it is attractive for the following reasons:

  • Fast response -- of the services I've looked at, they have the fastest response times
  • A free option -- NLP providers can be expensive!  While OpenCalais is free, AlchemyAPI provides a richer analysis and is free for 1000 transactions per day.
  • Built-in web page scraper -- I certainly don't want to write one, so this feature is crucial.  AlchemyAPI's web page scraper looks quite good, while some of the other services tie in only with expensive options.  OpenCalais is associated with SemanticProxy; however, the demo faults (out of memory) and I have not tried the programmatic interface.
  • Painless API -- the .NET API provided by AlchemyAPI is painless to use and the XML format can be directly parsed into a .NET DataSet object.  In my review of Semantria and OpenCalais, this was definitely not the case -- I encountered bugs in the .NET OpenCalais API and the complexity of Semantria's API was frustrating, though Semantria has been very helpful in guiding me through the issues.  I will be posting a complete review of all three of these NLP services in a separate article.
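To illustrate that last point, here is a minimal sketch of the XML-to-DataSet path.  The XML below is a hand-made stand-in shaped like an AlchemyAPI entity response, not actual API output:

```csharp
// Minimal sketch: DataSet.ReadXml infers a table per repeating element
// ("entity") with a column per simple child element. The XML here is a
// hand-made stand-in, not a real AlchemyAPI response.
using System;
using System.Data;
using System.IO;

static class AlchemyXmlDemo
{
  public static DataSet Parse(string xml)
  {
    var ds = new DataSet();
    ds.ReadXml(new StringReader(xml));
    return ds;
  }

  public static void Main()
  {
    string xml =
      "<results>" +
      "  <entities>" +
      "    <entity><text>CERN</text><type>Organization</type><relevance>0.9</relevance></entity>" +
      "  </entities>" +
      "</results>";
    DataSet ds = Parse(xml);
    // The inferred "entity" table exposes text/type/relevance as columns.
    Console.WriteLine(ds.Tables["entity"].Rows[0]["text"]);
  }
}
```

This is exactly why no hand-written response parser is needed: the flat XML maps directly onto tables and columns.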

Based on the aforementioned criteria, the choice was rather clear.

Why Higher Order Programming Environment?

Why am I writing this in the HOPE framework?  Several reasons:

  • I want to continue promoting and extending the capabilities of this framework
  • I want to avoid a monolithic application.  NLP can be applied to many things beyond RSS feeds and I want a platform that allows me to plug and play, and I mean really "play" with different configurations, NLP providers, etc., for extracting semantic meaning.  HOPE is designed for precisely this kind of Lego-building.
  • Visualizing NLP results is uncharted territory.  While I only use boring data table lists, there is a rich field of visualization to explore with regard to NLP results.  Again, HOPE is an excellent framework for plugging in different visualizers and playing with them.
  • In my opinion, writing synchronous, single-threaded monolithic applications is a dead end, and HOPE represents a very interesting alternative for creating distributed, dynamic, and dimensional applications that promote non-deterministic UIs and behaviors: it is the user, not the developer, who determines the behavior and visualization of the applet.
  • It's fun, and it's easy.

Still interested?  Then let's begin with feed readers, move on to visualizers, and then parsing feed content with NLP.

The Feed Reader Receptor

(a receptor with feed items ready to be processed.)

In HOPE, behaviors are written as autonomous receptors.  We can start with a very simple receptor that acquires the feed items and emits them.

The RSSFeedItem Semantic Structure

We need to define the protocol for a feed item, which is done in XML:

<SemanticTypeStruct DeclType="RSSFeedItem">
  <Attributes>
    <NativeType Name="FeedName" ImplementingType="string"/>
    <NativeType Name="Title" ImplementingType="string"/>
    <SemanticElement Name="URL"/> <!-- the link -->
    <NativeType Name="Description" ImplementingType="string"/>
    <NativeType Name="Authors" ImplementingType="string"/>
    <NativeType Name="Categories" ImplementingType="string"/>
    <NativeType Name="PubDate" ImplementingType="DateTime"/>
  </Attributes>
</SemanticTypeStruct>

If you're new to HOPE, one of the foundational concepts is that all data is itself semantic, which has pros and cons in this first cut and is always an interesting decision point: should the types always be semantic elements, or can they be native types?  I'll leave that question for another discussion.
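For a concrete sense of the difference, consider how the two kinds of attributes are accessed on a dynamic signal.  This sketch uses ExpandoObject as a stand-in for HOPE's generated signal types; the values are illustrative:

```csharp
// Illustrative sketch (not HOPE's actual signal implementation): a native
// type maps to a plain property, while a semantic element wraps its value
// in a child object with a Value property.
using System;
using System.Dynamic;

static class SignalDemo
{
  public static dynamic MakeSignal()
  {
    dynamic signal = new ExpandoObject();
    signal.Title = "Some article";            // native type: accessed directly
    signal.URL = new ExpandoObject();
    signal.URL.Value = "http://example.com/"; // semantic element: one level deeper
    return signal;
  }
}

class Program
{
  static void Main()
  {
    dynamic s = SignalDemo.MakeSignal();
    Console.WriteLine(s.Title);     // native type, direct access
    Console.WriteLine(s.URL.Value); // semantic element, via .Value
  }
}
```

This mirrors the `signal.Title` versus `signal.URL.Value` assignments you'll see in the receptor code below.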

Receptor Implementation

The three things of interest to note here:

  • There is a configuration UI so the user can specify the feed name and URL.
  • Note how user-configurable properties are decorated with the UserConfigurableProperty attribute, so the serializer knows what to persist when the applet is saved / loaded.
  • The feed is loaded asynchronously, and when the task completes, the feed items are emitted.
public class FeedReader : BaseReceptor
{
  public override string Name { get { return "Feed Reader"; } }
  public override bool IsEdgeReceptor { get { return true; } }
  public override string ConfigurationUI { get { return "FeedReaderConfig.xml"; } }

  [UserConfigurableProperty("Feed URL:")]
  public string FeedUrl { get; set; }

  [UserConfigurableProperty("Feed Name:")]
  public string FeedName {get;set;}

  protected SyndicationFeed feed;

  public FeedReader(IReceptorSystem rsys)
    : base(rsys)
  {
    AddEmitProtocol("RSSFeedItem");
  }

  /// <summary>
  /// If specified, immediately acquire the feed and start emitting feed items.
  /// </summary>
  public override void EndSystemInit()
  {
    base.EndSystemInit();
    AcquireFeed();
  }

  /// <summary>
  /// When the user configuration fields have been updated, re-acquire the feed.
  /// </summary>
  public override void UserConfigurationUpdated()
  {
    base.UserConfigurationUpdated();
    AcquireFeed();
  }

  /// <summary>
  /// Acquire the feed and emit the feed items. 
  /// </summary>
  protected async void AcquireFeed()
  {
    if (!String.IsNullOrEmpty(FeedUrl))
    {
      try
      {
        SyndicationFeed feed = await GetFeedAsync(FeedUrl);
        EmitFeedItems(feed);
      }
      catch (Exception ex)
      {
        EmitException("Feed Reader Receptor", ex);
      }
    }
  }


  /// <summary>
  /// Acquire the feed asynchronously.
  /// </summary>
  protected async Task<SyndicationFeed> GetFeedAsync(string feedUrl)
  {
    SyndicationFeed feed = await Task.Run(() =>
    {
      XmlReader xr = XmlReader.Create(feedUrl);
      SyndicationFeed sfeed = SyndicationFeed.Load(xr);
      xr.Close();

      return sfeed;
    });

    return feed;
  }

  /// <summary>
  /// Emits only new feed items for display.
  /// </summary>
  protected void EmitFeedItems(SyndicationFeed feed)
  {
    feed.Items.ForEach(item =>
    {
      CreateCarrier("RSSFeedItem", signal =>
        {
          signal.FeedName = FeedName;
          signal.Title = item.Title.Text;
          signal.URL.Value = item.Links[0].Uri.ToString();
          signal.Description = item.Summary.Text;
          signal.Authors = String.Join(", ", item.Authors.Select(a => a.Name).ToArray());
          signal.Categories = String.Join(", ", item.Categories.Select(c => c.Name).ToArray());
          signal.PubDate = item.PublishDate.LocalDateTime;
        });
    });
  }
}

Feed Reader User Configuration

A very simple UI is used to configure the feed (note that this configuration is persisted when the HOPE applet is saved.)  Because the UI is defined in XML, it can be easily customized for other appearances -- this customizability is a particular strength of HOPE.  The parser used is a derivative of MycroXaml which I wrote about 10 years ago.

The salient point here is the explicit binding of control properties to the receptor instance's properties.

<?xml version="1.0" encoding="utf-8" ?>
<MycroXaml Name="Form"
  xmlns:wf="System.Windows.Forms, System.Windows.Forms, Version=1.0.5000.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
  xmlns:r="Clifton.Receptor, Clifton.Receptor"
  xmlns:def="def"
  xmlns:ref="ref">
  <wf:Form Text="Feed Reader Configuration" Size="480, 190" StartPosition="CenterScreen" ShowInTaskbar="false" MinimizeBox="false" MaximizeBox="false">
    <wf:Controls>
      <wf:Label Text="Feed Name:" Location="20, 23" Size="70, 15"/>
      <wf:TextBox def:Name="tbFeedName" Location="92, 20" Size="150, 20"/>
      <wf:Label Text="Feed URL:" Location="20, 48" Size="70, 15"/>
      <wf:TextBox def:Name="tbFeedUrl" Location="92, 45" Size="250, 20"/>
      <wf:CheckBox def:Name="ckEnabled" Text="Enabled?" Location="20, 120" Size="80, 25"/>
      <wf:Button Text="Save" Location="360, 10" Size="80, 25" Click="OnReceptorConfigOK"/>
      <wf:Button Text="Cancel" Location="360, 40" Size="80, 25" Click="OnReceptorConfigCancel"/>
    </wf:Controls>
    <r:PropertyControlMap def:Name="ControlMap">
      <r:Entries>
        <r:PropertyControlEntry PropertyName="FeedUrl" ControlName="tbFeedUrl" ControlPropertyName="Text"/>
        <r:PropertyControlEntry PropertyName="FeedName" ControlName="tbFeedName" ControlPropertyName="Text"/>
      </r:Entries>
    </r:PropertyControlMap>
  </wf:Form>
</MycroXaml>
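The general idea behind those PropertyControlMap entries is to copy values between a named control property and a named receptor property via reflection.  The following is an illustrative sketch of that idea only, not the actual Clifton.Receptor code; the "Fake" classes are stand-ins:

```csharp
// Illustrative sketch of property-to-control binding via reflection.
// Not the actual Clifton.Receptor implementation; FakeTextBox and
// FakeReceptor are stand-ins for wf:TextBox and the FeedReader receptor.
using System;

static class PropertyBinder
{
  public static void Copy(object source, string sourceProp, object target, string targetProp)
  {
    // Read the named property from the source and write it to the target.
    object val = source.GetType().GetProperty(sourceProp).GetValue(source, null);
    target.GetType().GetProperty(targetProp).SetValue(target, val, null);
  }
}

class FakeTextBox { public string Text { get; set; } }
class FakeReceptor { public string FeedUrl { get; set; } }

class Program
{
  static void Main()
  {
    var tb = new FakeTextBox { Text = "http://www.codeproject.com/rss.aspx" };
    var r = new FakeReceptor();
    // Roughly what "Save" does for each PropertyControlEntry:
    PropertyBinder.Copy(tb, "Text", r, "FeedUrl");
    Console.WriteLine(r.FeedUrl);
  }
}
```

Because the map is declared in XML, the same mechanism works for any control/property pair without code changes.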

Receptor and Carrier

Once the asynchronous function returns, we note that there are several carriers (one for each item listed in the feed) waiting to be processed.  We can inspect their signals by hovering the mouse over one of the carriers (the yellow triangle), which displays the signal in the property grid:

The Feed Item Viewer

Next, we need a way to view feeds.  Rather than write a specific feed reader viewer, I'm instead going to implement a general-purpose "carrier viewer" that displays the carrier signals in a DataGridView control.  As a general-purpose receptor, this will be useful for other applications as well.  The only thing we'll need to configure is the protocol (the semantic structure) that the viewer should listen for.

Configuring the Feed Item Viewer

As with the feed reader, we have a small XML file (not shown) that lets us specify the protocol we want to monitor.  In our case, it's "RSSFeedItem."

The Code

The code is again quite simple, with the added behavior of removing the old protocol if the user changes it.

public class CarrierListViewer : BaseReceptor
{
  public override string Name { get { return "Carrier List Viewer"; } }
  public override bool IsEdgeReceptor { get { return true; } }
  public override string ConfigurationUI { get { return "CarrierListViewerConfig.xml"; } }

  [UserConfigurableProperty("Protocol Name:")]
  public string ProtocolName { get; set; }

  protected string oldProtocol;
  protected DataView dvSignals;
  protected DataGridView dgvSignals;
  protected Form form;

  public CarrierListViewer(IReceptorSystem rsys)
    : base(rsys)
  {
  }

  public override void Initialize()
  {
    base.Initialize();
    InitializeUI();
  }

  public override void EndSystemInit()
  {
    base.EndSystemInit();
    CreateViewerTable();
    ListenForProtocol();
  }

  /// <summary>
  /// Instantiate the UI.
  /// </summary>
  protected void InitializeUI()
  {
    // Setup the UI:
    MycroParser mp = new MycroParser();
    form = mp.Load<Form>("CarrierListViewer.xml", this);
    dgvSignals = (DataGridView)mp.ObjectCollection["dgvRecords"];
    form.Show();
  }

  /// <summary>
  /// When the user configuration fields have been updated, reset the protocol we are listening for.
  /// </summary>
  public override void UserConfigurationUpdated()
  {
    base.UserConfigurationUpdated();
    CreateViewerTable();
    ListenForProtocol();
  }

  /// <summary>
  /// Create the table and column definitions for the protocol.
  /// </summary>
  protected void CreateViewerTable()
  {
    if (!String.IsNullOrEmpty(ProtocolName))
    {
      DataTable dt = new DataTable();
      ISemanticTypeStruct st = rsys.SemanticTypeSystem.GetSemanticTypeStruct(ProtocolName);
      st.AllTypes.ForEach(t =>
      {
        dt.Columns.Add(new DataColumn(t.Name));
      });

      dvSignals = new DataView(dt);
      dgvSignals.DataSource = dvSignals;
    }
  }

  /// <summary>
  /// Remove the old protocol (if it exists) and start listening to the new.
  /// </summary>
  protected void ListenForProtocol()
  {
    if (!String.IsNullOrEmpty(oldProtocol))
    {
      RemoveReceiveProtocol(oldProtocol);
    }

    oldProtocol = ProtocolName;
    AddReceiveProtocol(ProtocolName, (Action<dynamic>)((signal) => ShowSignal(signal)));
  }

  /// <summary>
  /// Add a record to the existing view showing the signal's content.
  /// </summary>
  /// <param name="signal"></param>
  protected void ShowSignal(dynamic signal)
  {
    try
    {
      DataTable dt = dvSignals.Table;
      DataRow row = dt.NewRow();
      ISemanticTypeStruct st = rsys.SemanticTypeSystem.GetSemanticTypeStruct(ProtocolName);

      st.AllTypes.ForEach(t =>
        {
          object val = t.GetValue(rsys.SemanticTypeSystem, signal);
          row[t.Name] = val;
        });

      dt.Rows.Add(row);
    }
    catch (Exception ex)
    {
      EmitException("Carrier List Viewer Receptor", ex);
    }
  }
}

Displaying Feed Items

We can now drop the Carrier List Viewer onto the surface, double-click on it to configure the protocol, and we immediately note that it is now wired up as a receiver of what the Feed Reader receptor emits:

A small XML file declares the UI (again, easily configured to some other presentation or third party control):

<MycroXaml Name="Form"
  xmlns:wf="System.Windows.Forms, System.Windows.Forms, Version=1.0.5000.0, Culture=neutral, PublicKeyToken=b77a5c561934e089"
  xmlns:def="def"
  xmlns:ref="ref">
  <wf:Form Text="List Viewer" Size="500, 300" StartPosition="CenterScreen" ShowInTaskbar="false" MinimizeBox="false" MaximizeBox="false">
    <wf:Controls>
      <wf:DataGridView def:Name="dgvRecords" Dock="Fill"
        AllowUserToAddRows="false"
        AllowUserToDeleteRows="false"
        ReadOnly="true"
        SelectionMode="FullRowSelect"
        RowHeadersVisible="False"/>
    </wf:Controls>
  </wf:Form>
</MycroXaml>

And here's a result from the Code Project article feed:

Configuring Feed Readers (Introducing Membranes)

Let's pause here for a bit and see what we can do with HOPE now.  For example, we can create multiple feed readers, all feeding into one list viewer:

And here's a sample listing:

But let's say you want a list just for Code Project.  We can do that with a new feature of HOPE called "membranes."  While I'm not going to go into the full details of membranes yet, you can read up on the idea under Membrane Computing.  An overview of the idea is this: carriers (the protocols and their signals) are contained within a membrane and can only permeate the membrane (moving in or moving out) if the membrane has been configured to be permeable to that protocol.  So, we can use membranes for "islands of computation:"

Resulting in separate feed item lists:
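The membrane rule described above can be sketched conceptually: a carrier's protocol crosses a membrane only if the membrane has been configured to be permeable to it.  This is an illustrative model only, not HOPE's actual membrane implementation:

```csharp
// Conceptual sketch of membrane permeability (illustrative only; HOPE's
// actual implementation differs). A protocol permeates the membrane only
// if it has been explicitly listed as permeable.
using System;
using System.Collections.Generic;

class Membrane
{
  readonly HashSet<string> permeable = new HashSet<string>();

  public void MakePermeable(string protocol)
  {
    permeable.Add(protocol);
  }

  public bool CanPermeate(string protocol)
  {
    return permeable.Contains(protocol);
  }
}

class Program
{
  static void Main()
  {
    var m = new Membrane();
    m.MakePermeable("RSSFeedItem");
    Console.WriteLine(m.CanPermeate("RSSFeedItem")); // True
    Console.WriteLine(m.CanPermeate("URL"));         // False
  }
}
```

Everything else stays bottled up inside, which is exactly what gives us "islands of computation."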

Working With Semantic Types

Another thing we can add to the viewer is the ability to emit semantic types when the user double-clicks on a line.  Remember that when we defined the RSSFeedItem semantic type, the URL was itself a semantic type:

<SemanticElement Name="URL"/>

We can look for all semantic type attributes and emit them, letting some other receptor do something with them.  We inspect the protocol the viewer listens to for semantic elements and add them to the emitter list:

// Add other semantic type emitters:
RemoveEmitProtocols();
ISemanticTypeStruct st = rsys.SemanticTypeSystem.GetSemanticTypeStruct(ProtocolName);
st.SemanticElements.ForEach(se => AddEmitProtocol(se.Name));

and, when we double click, the receptor iterates through semantic elements of the protocol it is representing and issues carriers whose signal is the value for that semantic element:

/// <summary>
/// Emit a semantic protocol with the value in the selected row and the column determined by the semantic element name.
/// </summary>
protected void OnCellContentDoubleClick(object sender, DataGridViewCellEventArgs e)
{
  ISemanticTypeStruct st = rsys.SemanticTypeSystem.GetSemanticTypeStruct(ProtocolName);

  st.SemanticElements.ForEach(se =>
  {
     CreateCarrier(se.Name, signal => se.SetValue(rsys.SemanticTypeSystem, signal, dvSignals[e.RowIndex][se.Name].ToString()));
  });
}

In the APOD web scraper article, I had created a simple receptor that listens for the semantic type "URL" and launches the browser with that URL, so we can re-use that receptor here:

Notice how we need only one URL receptor.  Each membrane is made permeable to the URL protocol:

This allows the URL protocol to permeate out of the membrane, thus connecting the carrier list viewer (which at runtime configured itself as emitting the URL protocol) to the URL receptor.  Now we have two separate feed item lists and a way to go to the feed item in the browser by double-clicking on an item in either list.

Protocol Semantic Sub-Elements

A new feature in HOPE is the ability to create carriers on semantic elements of a parent carrier.  For example, because the protocol RSSFeedItem contains the semantic element "URL", when the "RSSFeedItem" signal is emitted, a second carrier for the semantic element "URL" is created as well.  When this behavior of HOPE is enabled, you can immediately see the effects in our current feed reader applet:

Notice the additional pathways from the Feed Reader Receptor directly to the URL receptor.  This feature is experimental, but it is a useful and quite interesting way to explore the behavior of carrier protocol-signals.  Indeed, as implemented in the above configuration, it has the interesting effect of opening every feed item's page in the browser.  However, this is not what we want, so instead we'll create a child membrane around just the feed readers to prevent the URL from permeating the membrane and being received by the URL Receptor:

For each membrane around a Feed Reader Receptor, we configure it so that only the RSSFeedItem protocol permeates the membrane. 

This gives us the desired behavior -- only the Carrier List Viewer's emitting of the "URL" protocol is received by the URL Receptor.

Applying Natural Language Processing to the Feed Items

The NLP, however, can take advantage of the feature of creating carriers for semantic elements within a protocol -- here we definitely do want each feed item's URL to be processed.  As mentioned earlier, I'm using AlchemyAPI as the NLP service.  Notice that I combined the two feed readers on the right into a single child membrane, and how the Alchemy Receptor is now associated with the Feed Reader receptors because the Alchemy Receptor is listening for "URL" protocols:

Note that, because of how we've configured the feed readers into two separate "systems", it is not possible to have only a single Alchemy Receptor -- this would require allowing the URL protocol to permeate the feed reader membrane, which would then lead us back to the issue described earlier.  However, is this really an issue?  In fact, not necessarily, especially if you consider the advantages of a distributed system as well as leveraging asynchronous behaviors.  Furthermore, if the multiple instances are actually a problem, at some point the HOPE framework may allow you to specify logical receptors, which would then support a single instance (or more) in the underlying implementation.

The Alchemy API Receptor Code

AlchemyAPI provides three result sets in its more-or-less default configuration: Entities, Keywords, and Concepts, each having unique attributes, as illustrated in this screenshot from the article comparing three NLP services:

Salient points:

  • AlchemyAPI allows us to directly pass in the URL, as it has a built-in content scraper.  This saves us a lot of effort in either extracting the content ourselves (a daunting task) or using a third-party service.
  • To acquire the entities, keywords, and concepts, we have to make three separate calls.  Note how I'm increasing the limits of the entries returned (the default is 50) to the maximum, 250. 
  • Note that I have a "TEST" compiler conditional, as I don't want to hit AlchemyAPI during testing of the entire applet, nor do I want to wait the 4 or 5 seconds it takes AlchemyAPI to return with the data.  The test datasets were previously acquired and serialized. 
  • AlchemyAPI returns a very nicely formatted XML document that can be read directly into a .NET DataSet.  I'm ignoring some of the information in that DataSet, which you may wish to explore.

Here's the complete code for Alchemy Receptor:

public class Alchemy : BaseReceptor
{
  public override string Name { get { return "Alchemy"; } }
  public override bool IsEdgeReceptor { get { return true; } }

  protected AlchemyAPI.AlchemyAPI alchemyObj;

  public Alchemy(IReceptorSystem rsys)
    : base(rsys)
  {
    AddEmitProtocol("AlchemyEntity");
    AddEmitProtocol("AlchemyKeyword");
    AddEmitProtocol("AlchemyConcept");

    AddReceiveProtocol("URL",
      // cast is required to resolve Func vs. Action in parameter list.
      (Action<dynamic>)(signal => ParseUrl(signal)));
  }

  public override void Initialize()
  {
    base.Initialize();
    InitializeAlchemy();
  }

  protected void InitializeAlchemy()
  {
    alchemyObj = new AlchemyAPI.AlchemyAPI();
    alchemyObj.LoadAPIKey("alchemyapikey.txt");
  }

  /// <summary>
  /// Calls the AlchemyAPI to parse the URL. The results are 
  /// emitted to an NLP Viewer receptor and to the database for
  /// later querying.
  /// </summary>
  /// <param name="signal"></param>
  protected async void ParseUrl(dynamic signal)
  {
    string url = signal.Value;

    DataSet dsEntities = await Task.Run(() => { return GetEntities(url); });
    DataSet dsKeywords = await Task.Run(() => { return GetKeywords(url); });
    DataSet dsConcepts = await Task.Run(() => { return GetConcepts(url); });

    dsEntities.Tables["entity"].IfNotNull(t => Emit("AlchemyEntity", t));
    dsKeywords.Tables["keyword"].IfNotNull(t => Emit("AlchemyKeyword", t));
    dsConcepts.Tables["concept"].IfNotNull(t => Emit("AlchemyConcept", t));
  }

  protected void Emit(string protocol, DataTable data)
  {
    data.ForEach(row =>
      {
        CreateCarrierIfReceiver(protocol, signal =>
          {
            // Use the protocol as the driver of the fields we want to emit.
            ISemanticTypeStruct st = rsys.SemanticTypeSystem.GetSemanticTypeStruct(protocol);
            st.AllTypes.ForEach(se =>
              {
                object val = row[se.Name];

                if (val != null && val != DBNull.Value)
                {
                  se.SetValue(rsys.SemanticTypeSystem, signal, val);
                }
              });
          });
      });
  }

  protected DataSet GetEntities(string url)
  {
    DataSet dsEntities = new DataSet();
#if TEST
    // Using previously captured dataset
    dsEntities.ReadXml("alchemyEntityTestResponse.xml");
#else
    try
    {
      AlchemyAPI_EntityParams eparams = new AlchemyAPI_EntityParams();
      eparams.setMaxRetrieve(250);
      string xml = alchemyObj.URLGetRankedNamedEntities(url, eparams);
      TextReader tr = new StringReader(xml);
      XmlReader xr = XmlReader.Create(tr);
      dsEntities.ReadXml(xr);
      xr.Close();
      tr.Close();
    }
    catch(Exception ex)
    {
      EmitException("Alchemy Receptor", ex);
    }
#endif
    return dsEntities;
  }

  protected DataSet GetKeywords(string url)
  {
    DataSet dsKeywords = new DataSet();

#if TEST
    // Using previously captured dataset
    dsKeywords.ReadXml("alchemyKeywordsTestResponse.xml");
#else
    try
    {
      AlchemyAPI_KeywordParams eparams = new AlchemyAPI_KeywordParams();
      eparams.setMaxRetrieve(250);
      // Pass eparams so the 250-item limit actually takes effect.
      string xml = alchemyObj.URLGetRankedKeywords(url, eparams);
      TextReader tr = new StringReader(xml);
      XmlReader xr = XmlReader.Create(tr);
      dsKeywords.ReadXml(xr);
      xr.Close();
      tr.Close();
    }
    catch(Exception ex)
    {
      EmitException("Alchemy Receptor", ex);
    }
#endif
    return dsKeywords;
  }

  protected DataSet GetConcepts(string url)
  {
    DataSet dsConcepts = new DataSet();

#if TEST
    // Using previously captured dataset
    dsConcepts.ReadXml("alchemyConceptsTestResponse.xml");
#else
    try
    {
      AlchemyAPI_ConceptParams eparams = new AlchemyAPI_ConceptParams();
      eparams.setMaxRetrieve(250);
      // Pass eparams so the 250-item limit actually takes effect.
      string xml = alchemyObj.URLGetRankedConcepts(url, eparams);
      TextReader tr = new StringReader(xml);
      XmlReader xr = XmlReader.Create(tr);
      dsConcepts.ReadXml(xr);
      xr.Close();
      tr.Close();
    }
    catch(Exception ex)
    {
      EmitException("Alchemy Receptor", ex);
    }
#endif
    return dsConcepts;
  }
}

To display the results, we'll drop in Carrier List Viewer Receptors that list the NLP results from all feeds:

To accomplish this, we need to allow the AlchemyEntity, AlchemyKeyword, and AlchemyConcept protocols to permeate the membranes:

When we do this for both membranes surrounding the Alchemy Receptor, the visualizer then shows us that the Alchemy receptor is emitting protocols that the Carrier List Viewer Receptor is interested in.  Each Carrier List Viewer receptor on the bottom of the screenshot has been configured to receive the respective protocol.

Of course, we don't necessarily need to see all three types (entities, keywords, concepts) - this all depends on how you'd like to configure the applet.  You'll note above that I'm using three separate list viewers, one for each category of analysis.  Later on I'll be using a tabbed list viewer to manage all this information.

AlchemyAPI

This section specifically discusses the AlchemyAPI service.  Not everything that AlchemyAPI provides is discussed here -- just the most common features.  Specifically, "sentiment" and "relationships" are not covered, but you can read more about those on the AlchemyAPI website.

Given a document or URL, you can extract the semantic meaning into three categories: Entities, Keywords, and Concepts.

Entities

AlchemyAPI returns the following information for each entity:

text: this is the entity name (or, more specifically, the noun)

type: AlchemyAPI attempts to determine the entity type, which includes such labels as City, Company, Continent, Country, Crime, Degree, Facility, Field Terminology, Geographic Feature, Holiday, Job Title, Person, Operating System, Organization, PrintMedia, Product, Region, Sport, StateOrCounty, and Technology.  The complete list can be found here.

count: This is a count of the occurrences of the entity.  This count (common to all NLP services I've reviewed) utilizes a coreference feature called "anaphora resolution": "In the sentence Sally arrived, but nobody saw her, the pronoun her is anaphoric, referring back to Sally." (from Wikipedia)

relevance: A relevance score from 0.0 - 1.0, where 1.0 is the most relevant.  According to Steve Herschleb, API Evangelist at AlchemyAPI: "The relevance score for each keyword ranks the general importance of each extracted keyword. How the score is actually calculate involves some pretty complex statistics, but the algorithm includes things like the word's position within the text, the other words around it, how many times it's used, etc." (source from Quora website)

Keywords

Keywords consist of the keyword text and relevance.  "Keywords are the important topics in your content and can be used to index data, generate tag clouds or for searching. AlchemyAPI's keyword extraction API is capable of finding keywords in text and ranking them. The sentiment can then be determined for each extracted keyword." (source)  Note that I do not demonstrate sentiment in this applet -- performing sentiment analysis is a separate call that counts as a "transaction."

Concepts

Concepts are an interesting feature of AlchemyAPI:

"AlchemyAPI employs sophisticated text analysis techniques to concept tag documents in a manner similar to how humans would identify concepts. The concept tagging API is capable of making high-level abstractions by understanding how concepts relate, and can identify concepts that aren't necessarily directly referenced in the text.  For example, if an article mentions CERN and the Higgs boson, it will tag Large Hadron Collider as a concept even if the term is not mentioned explicitly in the page. By using concept tagging you can perform higher level analysis of your content than just basic keyword identification." (source)

One of the interesting things about AlchemyAPI's concepts is its data linking.  You can read more about Linked Data here.  From the above screenshot, you can see that there are three linked data results from DBpedia, Freebase, and opencyc.  Depending on the content, AlchemyAPI will link to several other knowledge bases as well.

AlchemyAPI Exceptions

The exception handling in AlchemyAPI is rather poor -- it does not actually report the error that the server produced, even though that error is present in the resulting XML.

A simple modification provides a much more meaningful result (in AlchemyAPI.cs, starting on line 955):

// OLD:
/*
if (status.InnerText != "OK")
{
  System.ApplicationException ex = new System.ApplicationException ("Error making API call.");

  throw ex;
}*/

// MTC 7/14/2014
// Much better, as it gives me the error message from the server.
if (status.InnerText != "OK")
{
  string errorMessage = "Error making API call.";

  try
  {
    XmlNode statusInfo = root.SelectSingleNode("/results/statusInfo");
    errorMessage = statusInfo.InnerText;
  }
  catch
  {
    // some problem with the statusInfo. Return the generic message.
  }

  System.ApplicationException ex = new System.ApplicationException(errorMessage);

  throw ex;
}

Happily, this fix will soon be incorporated into the API provided by AlchemyAPI.

Caching Content

Ideally, we don't want to repeatedly scrape the same pages, so for the moment (because I don't want to add the whole persistence piece in this article), I've added a simple caching mechanism to avoid exceeding one's daily limit of 1000 transactions:

/// <summary>
/// Return true if cached and populate the referenced DataSet parameter.
/// </summary>
protected bool Cached(string prefix, string url, ref DataSet ds)
{
  string urlHash = url.GetHashCode().ToString();
  string fn = prefix + "-" + urlHash + ".xml";

  bool cached = File.Exists(fn);

  if (cached)
  {
    ds.ReadXml(fn);
  }

  return cached;
}

/// <summary>
/// Cache the dataset.
/// </summary>
protected void Cache(string prefix, string url, DataSet ds)
{
  string urlHash = url.GetHashCode().ToString();
  string fn = prefix + "-" + urlHash + ".xml";
  ds.WriteXml(fn);
}

This is only a temporary measure; true data persistence to a database will be covered in Part 2.
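
As a side note, String.GetHashCode() produces only a 32-bit value, so distinct URLs can theoretically collide on the same cache file, and on later runtimes (.NET Core and up) string hash codes are randomized per process, which would break a GetHashCode()-based cache across restarts.  Here is a minimal sketch (the CacheKey class and ForUrl method are illustrative names, not part of the applet) of a stable, collision-resistant alternative using SHA-256:

```csharp
using System.Security.Cryptography;
using System.Text;

public static class CacheKey
{
  /// <summary>
  /// Build a deterministic, filesystem-safe cache file name from a URL.
  /// </summary>
  public static string ForUrl(string prefix, string url)
  {
    using (SHA256 sha = SHA256.Create())
    {
      byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(url));

      // Hex-encode the digest; unlike GetHashCode(), this is stable
      // across processes and effectively collision-free for URLs.
      StringBuilder sb = new StringBuilder();

      foreach (byte b in hash)
      {
        sb.Append(b.ToString("x2"));
      }

      return prefix + "-" + sb.ToString() + ".xml";
    }
  }
}
```

Swapping this in for the GetHashCode() line in both Cached and Cache would be enough; everything else stays the same.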

Content Limit Size

An error that you may also get is "content exceeds size limit".  I'll update this article once I know the exact limit.

Retrieve Limits

The default number of entities, keywords, and concepts retrieved by AlchemyAPI is 50.  You can increase this limit to a maximum of 250 as I've done in the Alchemy Receptor, for example with entities:

AlchemyAPI_EntityParams eparams = new AlchemyAPI_EntityParams();
eparams.setMaxRetrieve(250);
string xml = alchemyObj.URLGetRankedNamedEntities(url, eparams);

This is an important parameter with which to experiment, as I'm not sure how useful it is to increase this limit.  For example, when processing this Wikipedia page on computer science, AlchemyAPI extracts 147 total entities.  This compares well with OpenCalais (155 entities), which has no default limit.  By contrast, Semantria defaults to 5 entities with a maximum retrieval of 50.
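
The keyword and concept limits can be raised the same way.  Assuming the SDK exposes parameter classes analogous to AlchemyAPI_EntityParams (check your SDK version for the exact class names), the calls look like:

```csharp
// Sketch, assuming keyword/concept parameter classes that mirror
// AlchemyAPI_EntityParams -- verify the names against your SDK version.
AlchemyAPI_KeywordParams kparams = new AlchemyAPI_KeywordParams();
kparams.setMaxRetrieve(250);
string keywordXml = alchemyObj.URLGetRankedKeywords(url, kparams);

AlchemyAPI_ConceptParams cparams = new AlchemyAPI_ConceptParams();
cparams.setMaxRetrieve(250);
string conceptXml = alchemyObj.URLGetRankedConcepts(url, cparams);
```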

More With Receptors

To achieve my primary goal in this article, filtering feeds from the NLP results, we need to add some further behaviors, the first of which is simply a tabbed list viewer receptor that will enable easier management of all these lists.

Tabbed List Viewer Receptor

I'm not going to show the code (it's very similar to the Carrier List Viewer Receptor above), instead I'll just walk through the configuration and usage.

Configuration

After dropping the tabbed list viewer receptor onto the surface, we double-click on it and configure the tabs we want and the protocols that it lists.  The astute reader may realize that this will not work for RSSFeedItem protocols -- there is nothing to distinguish the feed items of one RSS feed from those of another.  This can only be accomplished by qualifying the signal's data, in this case with the feed name.  This feature is not currently implemented because it needs to be done in a general-purpose manner.

Wiring it up

Once the protocols are defined, we can see how it is connected:

Results

The NLP results now display in a tabbed list form rather than in discrete list forms:

Associating the URL with NLP Results

The NLP result isn't very useful by itself.  We need to associate the URL for each result, which we can do by adding the semantic element to the Alchemy protocols:

<SemanticElement Name="URL"/>

and of course assigning that property to each result record that is emitted by the Alchemy Receptor:

signal.URL.Value = url;  // .Value because this is a semantic element and Value drills into the implementing native type.

Notice immediately what now happens:

Because the Alchemy protocols now include the semantic element "URL", the list viewer receptor and URL Receptor are now auto-magically wired up (well, it was implemented in a couple of lines of code, as illustrated above in the single list viewer) such that, when the user double-clicks on an entry in the tabbed viewer, it emits all known semantic elements, of which "URL" is one (and the only one right now).  Again, the astute reader will say, "but what about the URLs that are part of the Linked Data content, such as DBpedia?"  And that is a very good question, which is not addressed in this article.

As a side note, the above behavior illustrates the beauty of the HOPE architecture: the capability of the system is defined as much (if not more, actually) by the semantics of the protocols as by the receptors themselves -- the richer your semantics become, the more interesting the behaviors that can be created to work with those semantics.

A Filter Receptor

We finally get to the crux of the matter -- filtering feeds based on the NLP results.  To make this somewhat sophisticated, I'm going to use the NCalc Expression Evaluator so that we can do interesting things such as filtering entities or concepts not just by keywords but by a relevance threshold as well.  We'll do this as generically as possible.  First, the filtered protocol is emitted exactly as received; however, it is necessary to use a different semantic protocol to avoid ambiguity between unfiltered and filtered results.  To some extent, this can be viewed as a potential flaw in the HOPE architecture, but this problem is common in publisher/subscriber systems, which is one aspect of HOPE.  We will look at this issue at some point in the future.

Working with NCalc is very simple.  This code snippet of the Filter Receptor demonstrates setting up "variables" in NCalc and creating a custom function "contains":

protected void FilterSignal(string protocol, dynamic signal, List<string> filters)
{
  filters.ForEach(filter =>
  {
    try
    {
      Expression exp = new Expression(filter);

      // Assign the types in the semantic structure as variables.
      ISemanticTypeStruct st = rsys.SemanticTypeSystem.GetSemanticTypeStruct(protocol);

      st.AllTypes.ForEach(t =>
      {
        exp.Parameters[t.Name] = t.GetValue(rsys.SemanticTypeSystem, signal);
      });

      // Allow parsing of additional functions.
      exp.EvaluateFunction += OnEvaluateFunction;

      object result = exp.Evaluate();

      if (result is bool)
      {
        if ((bool)result)
        {
          // Copy the input signal to the Filtered[protocol] signal for emission.
          CreateCarrier("Filtered" + protocol, outSignal =>
          {
            st.AllTypes.ForEach(t =>
            {
              t.SetValue(rsys.SemanticTypeSystem, outSignal, t.GetValue(rsys.SemanticTypeSystem, signal));
            });
          });
        }
      }
    }
    catch (Exception ex)
    {
      EmitException(ex.Message + " with filter " + filter);
    }
  });
}

protected void OnEvaluateFunction(string name, FunctionArgs args)
{
  if (name.ToLower() == "contains")
  {
    string v1 = args.Parameters[0].Evaluate().ToString().ToLower();
    string v2 = args.Parameters[1].Evaluate().ToString().ToLower();

    args.Result = v1.Contains(v2);
  }
}
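
With the custom "contains" function in place, filters are just NCalc expressions whose variable names match the semantic types of the protocol being filtered.  Assuming (hypothetically) an entity protocol that exposes Name and Relevance, a standalone evaluation would look like this:

```csharp
// Hypothetical filter -- the variable names (Name, Relevance) must match
// the semantic types of the protocol being filtered.
Expression exp = new Expression("contains(Name, 'visual studio') and Relevance > 0.5");

// In FilterSignal these parameters come from the signal's semantic types;
// here they are assigned by hand for illustration.
exp.Parameters["Name"] = "Visual Studio 14 CTP";
exp.Parameters["Relevance"] = 0.87;

// Wire in the custom "contains" function shown above.
exp.EvaluateFunction += OnEvaluateFunction;

bool matches = (bool)exp.Evaluate();   // true for this sample signal
```

Because "contains" lower-cases both of its arguments, the filter text is effectively case-insensitive.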

Configuration

The above screenshot illustrates a sample configuration of filtering protocols.  Certainly, more filters on the same protocols (or other protocols) can be added.

We display the filtered list in a tabbed list view, configured as such:

Wiring it up

Membranes are again used to ensure that protocols are received and emitted in a controlled manner:

We can now view both unfiltered and filtered feed items (note that I changed the filter criteria from the above screenshot):

Conclusion

Natural Language Processing is a unique way of parsing "big data", providing semantic meaning suitable for potentially complex machine processing that results in information specifically tailored for delivery to us humans.  However, such a lofty statement can only be achieved with the development of algorithms that process this information into something that actually has "meaning."  What I've demonstrated here is a very rudimentary process, little better than keyword filtering, but hopefully it may inspire someone to use these services to develop the ideas further!

While I've entangled the discussion of NLP with the Higher Order Programming Environment framework, I hope this also inspires others to develop processing receptors and visualizations beyond simple lists.

There is still more work to be done in this demo which will be the focus of Part 2: persisting the NLP data to a database, querying, and improving the usability such as displaying whether a feed item is new or has been already read.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Marc Clifton
United States United States
Marc is the creator of two open source projects, MyXaml, a declarative (XML) instantiation engine and the Advanced Unit Testing framework, and Interacx, a commercial n-tier RAD application suite.  Visit his website, www.marcclifton.com, where you will find many of his articles and his blog.

Marc lives in Philmont, NY.
