
CodeProject Article Scraping

John Simmons / outlaw programmer


11 Dec 2010

Scrape the My Articles page here on CodeProject to keep an eye on your articles.

Download CPAM - 365 KB

NOTICE - The code in this article is no longer viable due to recent (and somewhat radical) changes in the format of the CodeProject pages that are being scraped. For this reason, I have written a completely new article that exploits the new format. That article is here:

CodeProject Article Scraper, Revisited

I left this article on the site to give folks the opportunity to compare coding styles, structure changes, and even scraping methodology.

Introduction

This article describes a method for scraping data off of the CodeProject My Articles page. There is currently no CodeProject API for retrieving this data, so this is the only way to get the info. Unfortunately, the format of this page could change at any time, and may break this code, so it's up to you to stay on top of this issue. This should be quite easy since I've done all the hard work for you - all you have to do is maintain it.

IMPORTANT NOTE: Check the History section at the bottom of this article and make sure you implement the bug fix(es) shown there.

The ArticleData Class

The ArticleData class contains the data for each article scraped off the web page. The most interesting aspect of this class is that it implements IComparable so that the generic list that contains the ArticleData objects can be sorted on any of the scraped values. There are several ways to sort a generic list, and I used the one that kept the referring code the cleanest. What I'm trying to say is that you should pick the way you want to do it. No method is more correct than any other; the choice is more a matter of programmer style and preference than anything else.

The Way I Did It

I chose to have the ArticleData class implement IComparable, and to write the functions necessary to perform the sorting. This keeps the referencing code free of needless clutter, thus making the code easier to read. This is the way I like to do things. In my humble opinion, there is no point in bothering the programmer with needless minutiae. Instead of posting the entire class in this article, I'll simply show you two of the sorting functions:

public class ArticleData : IComparable<ArticleData>
{
	// DATA MEMBERS

	// PROPERTIES

	#region Comparison delegates
	/// <summary>
	/// Title comparison for sort function
	/// </summary>
	public static Comparison<ArticleData> TitleCompare = delegate(ArticleData p1, ArticleData p2)
	{
		return (p1.SortAscending) ? p1.m_title.CompareTo(p2.m_title) : p2.m_title.CompareTo(p1.m_title);
	};
	/// <summary>
	/// Page views comparison for sort function
	/// </summary>
	public static Comparison<ArticleData> PageViewsCompare = delegate(ArticleData p1, ArticleData p2)
	{
		return (p1.SortAscending) ? p1.m_pageViews.CompareTo(p2.m_pageViews) : p2.m_pageViews.CompareTo(p1.m_pageViews);
	};


	// there are more comparison delegates here

	/// <summary>
	/// Default comparison (compares article ID) for sort function
	/// </summary>
	public int CompareTo(ArticleData other)
	{
		return ArticleID.CompareTo(other.ArticleID);
	}
	#endregion Comparison delegates

	// the rest of the class is omitted here
}
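To show how the calling code stays uncluttered, here is a minimal sketch of sorting a generic list with one of these delegates. ArticleSketch and SortDemo are hypothetical, cut-down stand-ins written for this illustration (they are not the classes from the download), and the sample titles are made up.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, cut-down stand-in for ArticleData: only what the sort needs.
public class ArticleSketch : IComparable<ArticleSketch>
{
    public int ArticleID;
    public string Title;
    public bool SortAscending = true;

    // Same shape as the TitleCompare delegate shown above.
    public static Comparison<ArticleSketch> TitleCompare =
        delegate(ArticleSketch p1, ArticleSketch p2)
        {
            return p1.SortAscending ? p1.Title.CompareTo(p2.Title)
                                    : p2.Title.CompareTo(p1.Title);
        };

    // Default comparison, used by List<T>.Sort() when no delegate is passed.
    public int CompareTo(ArticleSketch other)
    {
        return ArticleID.CompareTo(other.ArticleID);
    }
}

public static class SortDemo
{
    public static List<ArticleSketch> BuildAndSort(bool ascending)
    {
        List<ArticleSketch> articles = new List<ArticleSketch>
        {
            new ArticleSketch { ArticleID = 2, Title = "Beta" },
            new ArticleSketch { ArticleID = 1, Title = "Gamma" },
            new ArticleSketch { ArticleID = 3, Title = "Alpha" }
        };

        // The delegate reads the direction from its first operand, so the
        // flag has to be set consistently on every item before sorting.
        foreach (ArticleSketch article in articles)
        {
            article.SortAscending = ascending;
        }

        // One line at the call site; all comparison logic lives in the class.
        articles.Sort(ArticleSketch.TitleCompare);
        return articles;
    }
}
```

Calling articles.Sort() with no argument would fall back to CompareTo and order the list by article ID instead.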

The ArticleUpdate Class

This class is derived from the ArticleData class, and at first blush, it appears to be an exact duplicate of the ArticleData class, but that's not the case. To make the code truly useful, you need a way to identify changes since your last data scrape. For the purposes of this demo, that's what this class enables. I recognize that you might have different reasons for scraping the My Articles page, so you should be prepared to write your own class that performs the functionality your application requires. It's my guess that your implementation will be more extensive than my own.

The class has its own sort delegates. They're similar enough that I decided not to show them in this article because I think it would be redundant. The truly interesting methods in this class are:

ApplyChanges

This method is called from the scraper manager object (covered in the next section) when an article is scraped off the web page. If the article exists in the list of existing articles, we call this method to move the previously scraped ("latest") values into the current values, and then store the newly scraped values as the latest ones. If ANYTHING has changed for the article, this method returns true.

public bool ApplyChanges(ArticleUpdate item, DateTime timeOfUpdate, bool newArticle)
{
	bool changed = false;

	// make them all the same
	this.m_title			= m_latestTitle;
	this.m_link			= m_latestLink;
	this.m_lastUpdated		= m_latestLastUpdated;
	this.m_description		= m_latestDescription;
	this.m_pageViews		= m_latestPageViews;
	this.m_rating			= m_latestRating;
	this.m_votes			= m_latestVotes;
	this.m_popularity		= m_latestPopularity;
	this.m_bookmarks		= m_latestBookmarks;

	// set new info
	this.m_latestTitle		= item.m_latestTitle;
	this.m_latestLink		= item.m_latestLink;
	// the last-updated date is compared below, so it needs to be copied as well
	this.m_latestLastUpdated	= item.m_latestLastUpdated;
	this.m_latestDescription	= item.m_latestDescription;
	this.m_latestPageViews		= item.m_latestPageViews;
	this.m_latestRating		= item.m_latestRating;
	this.m_latestVotes		= item.m_latestVotes;
	this.m_latestPopularity		= item.m_latestPopularity;
	this.m_latestBookmarks		= item.m_latestBookmarks;

	// make a note of the last update time stamp
	this.m_timeUpdated		= timeOfUpdate;
	this.m_newArticle		= newArticle;

	// see if anything changed since the last update
	changed = (this.m_title		!= m_latestTitle	||
		   this.m_link		!= m_latestLink		||
		   this.m_lastUpdated	!= m_latestLastUpdated	||
		   this.m_description	!= m_latestDescription	||
		   this.m_pageViews	!= m_latestPageViews	||
		   this.m_rating	!= m_latestRating	||
		   this.m_votes		!= m_latestVotes	||
		   this.m_popularity	!= m_latestPopularity	||
		   this.m_bookmarks	!= m_latestBookmarks	||
		   this.m_newArticle	== true);

	m_changed = changed;

	return changed;
}

PropertyChanged

The PropertyChanged method allows you to see if a specific property has changed. Simply provide the property name, and handle the return value (true if the property's value changed).

public bool PropertyChanged(string property)
{
	string originalProperty = property;
	property = property.ToLower();
	switch (property)
	{
		case "title"		: return (Title != LatestTitle);
		case "link"		: return (Link != LatestLink);
		case "description"	: return (Description != LatestDescription);
		case "pageviews"	: return (PageViews != LatestPageViews);
		case "rating"		: return (Rating != LatestRating);
		case "votes"		: return (Votes != LatestVotes);
		case "popularity"	: return (Popularity != LatestPopularity);
		case "bookmarks"	: return (Bookmarks != LatestBookmarks);
		case "lastupdated"	: return (LastUpdated != LatestLastUpdated);
	}
	// if we get here, the property is invalid
	throw new Exception(string.Format("Unknown article property - '{0}'", originalProperty));
}

HowChanged

This method accepts a property name, and returns a ChangeType enumeration value indicating whether the new value is equal to, greater than, or less than the last value that was scraped.

public ChangeType HowChanged(string property)
{
	ChangeType changeType = ChangeType.None;

	string originalProperty = property;
	property = property.ToLower();

	switch (property)
	{
		case "title": 
			break;

		case "link": 
			break;

		case "description": 
			break;

		case "pageviews": 
			{
				if (PageViews != LatestPageViews)
				{
					changeType = ChangeType.Up;
				}
			}
			break;

		case "rating": 
			{
				if (Rating > LatestRating)
				{
					changeType = ChangeType.Down;
				}
				else
				{
					if (Rating < LatestRating)
					{
						changeType = ChangeType.Up;
					}
				}
			}
			break;

		case "votes": 
			{
				if (Votes != LatestVotes)
				{
					changeType = ChangeType.Up;
				}
			}
			break;

		case "popularity": 
			{
				if (Popularity > LatestPopularity)
				{
					changeType = ChangeType.Down;
				}
				else
				{
					if (Popularity < LatestPopularity)
					{
						changeType = ChangeType.Up;
					}
				}
			}
			break;

		case "bookmarks": 
			{
				if (Bookmarks > LatestBookmarks)
				{
					changeType = ChangeType.Down;
				}
				else
				{
					if (Bookmarks < LatestBookmarks)
					{
						changeType = ChangeType.Up;
					}
				}
			}
			break;

		case "lastupdated": 
			break;

		default : throw new Exception(
				string.Format("Unknown article property - '{0}'", 
						originalProperty));
	}

	return changeType;
}
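The numeric cases above all follow the same pattern: compare the previous value with the latest one and report the direction. As a hedged aside, that pattern could be folded into one generic helper. The sketch below is an alternative I am suggesting for illustration, not code from the download; ChangeDirection and ChangeHelper are hypothetical names, with ChangeDirection standing in for the ChangeType enumeration.

```csharp
using System;

// Hypothetical enum mirroring the values HowChanged returns (None, Up, Down).
public enum ChangeDirection { None, Up, Down }

public static class ChangeHelper
{
    // previous = value from the last scrape, latest = value just scraped.
    public static ChangeDirection Compare<T>(T previous, T latest)
        where T : IComparable<T>
    {
        int result = latest.CompareTo(previous);
        if (result > 0)
        {
            return ChangeDirection.Up;    // the latest value is larger
        }
        if (result < 0)
        {
            return ChangeDirection.Down;  // the latest value is smaller
        }
        return ChangeDirection.None;      // no change
    }
}
```

With a helper like this, a case such as "rating" would reduce to a single call along the lines of ChangeHelper.Compare(Rating, LatestRating).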

The ArticleScraper Class

To make things easy on myself, I put all of the scraping code into this class. The web page is requested, and then parsed to within an inch of its life. For purposes of this article, I placed no value in determining the category/sub-category under which the article is posted.

The RetrieveArticles method is responsible for making the page request, and managing the parsing chores, which are themselves broken up into manageable chunks. During testing of the scraping code, I went to the My Articles page in a web browser, and saved the source code to a file. This allowed me to test without having to repeatedly hammer CodeProject during initial development of the parsing code. I decided to leave the code in the class to allow other programmers the same luxury. Here are the important bits (the text file specified in the code is provided with this article's download file):

	if (this.ArticleSource == ArticleSource.CodeProject)
	{
		// this code actually hits the codeproject website
		string url = string.Format("{0}{1}{2}", 
					   "http://www.codeproject.com/script/",
					   "Articles/MemberArticles.aspx?amid=", 
					   this.UserID);
		Uri uri = new Uri(url);
		WebClient webClient = new WebClient();
		string response = "";
		try
		{
			// added proxy support for those that need it - many thanks to Pete 
			// O'Hanlon for pointing this out.
			webClient.Proxy = WebRequest.DefaultWebProxy;
			webClient.Proxy.Credentials = CredentialCache.DefaultCredentials;
			// get the web page
			response = webClient.DownloadString(uri);
		}
		catch (Exception)
		{
			// rethrow to the caller without resetting the stack trace
			throw;
		}
		pageSource = response;
	}
	else
	{
		// this code loads a sample page source from a local text file
		StringBuilder builder = new StringBuilder("");
		string filename = System.IO.Path.Combine(Application.StartupPath, 
							"MemberArticles.txt");
		StreamReader reader = null;
		try
		{
			reader = File.OpenText(filename);
			string input = null;
			while ((input = reader.ReadLine()) != null)
			{
				builder.Append(input);
			}
		}
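		catch (Exception)
		{
			// rethrow to the caller without resetting the stack trace
			throw;
		}
		finally
		{
			// reader is still null if File.OpenText failed
			if (reader != null)
			{
				reader.Close();
			}
		}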

		pageSource = builder.ToString();
	}

Note - The line in the code above that builds the url string is formatted to prevent the containing <pre> tag from potentially forcing this article's page to require horizontal scrolling.
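For the local-file branch, a shorter variant is possible. The sketch below is only an illustration under my own assumptions (PageSourceLoader and LoadFromFile are hypothetical names): it wraps the reader in a using block so the file is closed even if reading fails, and it reads the whole file in one call.

```csharp
using System;
using System.IO;

public static class PageSourceLoader
{
    // Reads a saved copy of the page source from disk.
    public static string LoadFromFile(string filename)
    {
        // The using block disposes the reader even when an exception is thrown.
        using (StreamReader reader = File.OpenText(filename))
        {
            return reader.ReadToEnd();
        }
    }
}
```

Unlike the line-by-line loop above, ReadToEnd keeps the line breaks in the returned string. The CleanData method shown later strips them anyway, but keep the difference in mind if you swap this in.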

After getting the web page, the pageSource variable should contain something. If it does, we hit the following code (and we're still in the RetrieveArticles method):

	int articleNumber = 0;
	bool found = true;

	while (found)
	{
		// establish our trigger points
		string articleStart = string.Format("<span id=\"ctl00_MC_AR_ctl{0}_MAS", 
					string.Format("{0:00}", articleNumber));
		// we use the beginning of the next article as the 
		// end of the current one
		string articleEnd   = string.Format("<span id=\"ctl00_MC_AR_ctl{0}_MAS", 
					string.Format("{0:00}", articleNumber + 1));

		// get the index of the start of the next article
		int startIndex = pageSource.IndexOf(articleStart);

		if (startIndex >= 0)
		{
			// delete everything that came before the starting index
			pageSource = pageSource.Substring(startIndex);
			startIndex = 0;

			// find the end of our articles data
			int endIndex = pageSource.IndexOf(articleEnd);

			// If we don't have an endIndex, then we've arrived 
			// at the final article in our list. 
			if (endIndex == -1)
			{
				endIndex = pageSource.IndexOf("<table");
				if (endIndex == -1)
				{
					endIndex = pageSource.Length - 1;
				}
			}

			// get the substring
			string data = pageSource.Substring(0, endIndex);

			// if we have data, process it
			if (data != "")
			{
				ProcessArticle(data, articleNumber);
			}
			else
			{
				found = false;
			}
			articleNumber++;
		}
		else
		{
			found = false;
		}
	} // while (found)

	CalculateAverages();

I guess I could have used LINQ to scrounge around in the XML, but when you get right down to it, we can't count on the HTML being valid, so it's simply more reliable to parse the text this way. I know, Chris, et al., work hard at making sure everything is just so, but they are merely human, and we know we can't count on humans to do it right every single time.
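To make the marker-based splitting easier to follow in isolation, here is a stripped-down sketch of the same idea. ArticleSplitter, Split, and the markerFormat parameter are hypothetical names used only for this illustration, and the method returns the chunks instead of processing them in place.

```csharp
using System;
using System.Collections.Generic;

public static class ArticleSplitter
{
    // markerFormat is a composite format string with one numeric placeholder,
    // for example "<span id=\"item{0:00}\"" (an invented marker, not the real one).
    public static List<string> Split(string pageSource, string markerFormat)
    {
        List<string> chunks = new List<string>();
        int number = 0;
        int start = pageSource.IndexOf(string.Format(markerFormat, number),
                                       StringComparison.Ordinal);
        while (start >= 0)
        {
            // The next article's marker doubles as the end of the current one.
            int end = pageSource.IndexOf(string.Format(markerFormat, number + 1),
                                         start + 1, StringComparison.Ordinal);
            if (end < 0)
            {
                // Last article: take everything that is left.
                chunks.Add(pageSource.Substring(start));
            }
            else
            {
                chunks.Add(pageSource.Substring(start, end - start));
            }
            start = end;
            number++;
        }
        return chunks;
    }
}
```

One difference from the loop above: for the final article this sketch keeps the rest of the page, whereas the original cuts it off at the next "<table" tag.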

Processing an Article

By "process", I mean parsing out the HTML and digging the actual data out of the article's div. While fairly simple, it is admittedly tedious. We start out by getting the article's URL, which is a straightforward operation:

private string GetArticleLink(string data)
{
	string result = data;
	// find the beginning of the desired text
	int hrefIndex = result.IndexOf("href=\"") + 6;
	//find the end of the desired text
	int endIndex = result.IndexOf("\">", hrefIndex);
	// snag it
	result = result.Substring(hrefIndex, endIndex - hrefIndex).Trim();
	// return it
	return result;
}
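Note that GetArticleLink assumes the href attribute is always present. If the page format changes, IndexOf returns -1 and the Substring call may throw or return garbage. A more defensive variant might look like the sketch below. LinkExtractor and TryGetFirstHref are hypothetical names, and it stops at the closing quote instead of searching for the "\">" sequence.

```csharp
using System;

public static class LinkExtractor
{
    // Returns the value of the first href attribute, or null if none is found.
    public static string TryGetFirstHref(string data)
    {
        const string marker = "href=\"";
        int markerIndex = data.IndexOf(marker, StringComparison.Ordinal);
        if (markerIndex < 0)
        {
            return null; // no link in this chunk
        }

        int start = markerIndex + marker.Length;
        int end = data.IndexOf('"', start);
        if (end < 0)
        {
            return null; // the attribute is never closed
        }

        return data.Substring(start, end - start).Trim();
    }
}
```

The caller then has to decide what a null result means, for example skipping the article or reporting that the page format has changed.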

Next, we clean the data, starting off by removing all of the HTML tags. A change was made to the source code to make the removal of HTML tags a little smarter. If the article title and/or description contains more than one stray pointy bracket, this method is almost guaranteed to return only a portion of the actual text of the item in question. If you like, you can google for (and use) one of the many exhaustive HTML parsers available on the net. IMHO, it's not worth the effort considering this class' primary usage and the consistently decent HTML we get from CodeProject.

private string RemoveHtmlTags(string data)
{
	int ltCount = CountChar(data, '<');
	int gtCount = CountChar(data, '>');

	// If the number of left and right pointy brackets are the same, we stand a 
	// reasonable chance that what we think are html tags really ARE html tags.
	if (ltCount == gtCount)
	{
		data = ForwardStrip(data);
	}
	else
	{
		// Otherwise, we have an errant pointy bracket, which we can almost 
		// always take care of depending on the order in which we search for 
		// tags.
		if (gtCount > ltCount)
		{
			data = BackwardStrip(ForwardStrip(data));
		}
		else
		{
			data = ForwardStrip(BackwardStrip(data));
		}
	}
	return data;
}


private int CountChar(string data, char value)
{
	int count = 0;
	for (int i = 0; i < data.Length; i++)
	{
		if (data[i] == value)
		{
			count++;
		}
	}
	return count;
}

private string ForwardStrip(string data)
{
	bool	found	= true;
	do
	{
		int tagStart = data.IndexOf("<");
		int tagEnd = data.IndexOf(">");
		if (tagEnd >= 0)
		{
			tagEnd += 1;
		}
		found = (tagStart >= 0 && tagEnd >= 0 && tagEnd-tagStart > 1);
		if (found)
		{
			string tag = data.Substring(tagStart, tagEnd - tagStart);
			data = data.Replace(tag, "");
		}
	} while (found);
	return data;
}

private string BackwardStrip(string data)
{
	bool	found	= true;
	do
	{
		int tagStart = data.LastIndexOf("<");
		int tagEnd = data.LastIndexOf(">");
		if (tagEnd >= 0)
		{
			tagEnd += 1;
		}
		found = (tagStart >= 0 && tagEnd >= 0 && tagEnd-tagStart > 1);
		if (found)
		{
			string tag = data.Substring(tagStart, tagEnd - tagStart);
			data = data.Replace(tag, "");
		}
	} while (found);
	return data;
}
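For comparison only, the same job can be approximated with a regular expression. This is a different technique from the IndexOf-based stripping above, not a drop-in replacement, and RegexTagStripper is a hypothetical name. The pattern only matches a '<' that is followed by non-bracket characters and then a '>', so a lone stray bracket is left in the text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexTagStripper
{
    // Matches "<", then one or more characters that are not brackets, then ">".
    private static readonly Regex TagPattern = new Regex("<[^<>]+>");

    public static string Strip(string data)
    {
        return TagPattern.Replace(data, string.Empty);
    }
}
```

It has its own blind spot: text such as "a < b and c > d" looks like one long tag to this pattern and would be removed.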

Then, we remove all the extra stuff left behind:

private string CleanData(string data)
{
	// get rid of the HTML tags
	data = RemoveHtmlTags(data);

	// get rid of the crap that's left behind
	// tabs become field delimiters; the non-breaking space entity is dropped
	data = data.Replace("\t", "^").Replace("&nbsp;", "");
	data = data.Replace("\n","").Replace("\r", "");
	data = data.Replace(" / 5", "");
	while (data.IndexOf("  ") >= 0)
	{
		data = data.Replace("  ", " ");
	}
	while (data.IndexOf("^ ^") >= 0)
	{
		data = data.Replace("^ ^", "^");
	}
	while (data.IndexOf("^^") >= 0)
	{
		data = data.Replace("^^", "^");
	}
	data = data.Substring(1);
	data = data.Substring(0, data.Length - 1);
	return data;
}

After this, we're left with a pure list of data that describes the article, delimited with caret characters. All that's left is to create an ArticleUpdate item and store it in our generic list.

private void ProcessArticle(string data, int articleNumber)
{
	string link	= GetArticleLink(data);
	data = CleanData(data);
	string[] parts = data.Split('^');
	string title = parts[0];
	string description = parts[7];
	string lastUpdated = GetDataField("Last Update", parts);
	string pageViews = GetDataField("Page Views", parts).Replace(",", "");
	string rating = GetDataField("Rating", parts);
	string votes = GetDataField("Votes", parts).Replace(",", "");
	string popularity = GetDataField("Popularity", parts);
	string bookmarks = GetDataField("Bookmark Count", parts);

	// create the ArticleUpdate item and add it to the list
	DateTime lastUpdatedDate;
	ArticleUpdate article = new ArticleUpdate();
	article.LatestLink = string.Format("http://www.codeproject.com{0}", link);
	article.LatestTitle = title;
	article.LatestDescription = description;
	if (DateTime.TryParse(lastUpdated, out lastUpdatedDate))
	{
		article.LatestLastUpdated = lastUpdatedDate;
	}
	else
	{
		article.LatestLastUpdated = new DateTime(1990, 1, 1);
	}
	article.LatestPageViews		= Convert.ToInt32(pageViews);
	article.LatestRating		= Convert.ToDecimal(rating);
	article.LatestVotes		= Convert.ToInt32(votes);
	article.LatestPopularity	= Convert.ToDecimal(popularity);
	article.LatestBookmarks		= Convert.ToInt32(bookmarks);

	AddOrChangeArticle(article);
}

private void AddOrChangeArticle(ArticleUpdate article)
{
	bool found = false;
	DateTime now = DateTime.Now;

	// apply changes
	for (int i = 0; i < m_articles.Count; i++)
	{
		ArticleUpdate item = m_articles[i];
		if (item.LatestTitle.ToLower() == article.LatestTitle.ToLower())
		{
			found = true;
			item.ApplyChanges(article, now, false);
			break;
		}
	}

	// if the article was not found, it must be new (or the title has changed), 
	// so we'll add it
	if (!found)
	{
		article.ApplyChanges(article, now, true);
		m_articles.Add(article);
	}

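	// remove all articles that weren't updated this time around - we need to 
	// traverse the list in reverse order so we don't lose track of our index
	// NOTE: as written, the "i == 0" test means the loop body only runs when 
	// the list holds exactly one article, so stale articles are never actually 
	// removed. Changing the test to "i >= 0" is not enough on its own, because 
	// "now" is taken once per call and articles processed earlier in the same 
	// scrape would then be removed as well.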
	for (int i = m_articles.Count - 1; i == 0; i--)
	{
		ArticleUpdate item = m_articles[i];
		if (item.TimeUpdated != now)
		{
			m_articles.RemoveAt(i);
		}
	}
}
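If you do want stale articles removed, one option is to take a single timestamp for the whole scrape, stamp every article that was found with it, and prune once after all articles have been processed. The sketch below only illustrates that idea: TrackedArticle and StalePruner are hypothetical names, and the approach is my suggestion, not something the download implements.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal item: only the fields the pruning step needs.
public class TrackedArticle
{
    public string Title;
    public DateTime TimeUpdated;
}

public static class StalePruner
{
    // scrapeTime must be the single value that was assigned to every article
    // found during the current scrape. Returns the number of items removed.
    public static int RemoveStale(List<TrackedArticle> articles, DateTime scrapeTime)
    {
        return articles.RemoveAll(delegate(TrackedArticle article)
        {
            return article.TimeUpdated != scrapeTime;
        });
    }
}
```

List<T>.RemoveAll takes care of the index bookkeeping, so no reverse loop is needed.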

The Sample Application

The sample application is admittedly a rudimentary affair, and is honestly intended to show only one possible way to use the scraping code. I decided to use a WebBrowser control, but about halfway through the app, I began to regret that decision. However, I was afraid I'd become bored with the whole thing, and determined to soldier on.

You'll see that I didn't go to heroic lengths to pretty things up. For instance, I used PNG files for the graphics instead of GIF files. This means the transparency in the PNG files isn't handled correctly on systems running IE6 or earlier.

The application allows you to select the data on which to sort, and in what direction (ascending or descending). The default is the date last updated in descending order so that the newest articles appear first.

The WebBrowser control displays the articles in a table, and uses icons to indicate changed data and certain statistical information regarding articles. The article titles are hyperlinks to the actual article's page, and that page is displayed within the WebBrowser control. To go back to the article display, you have to click the Sort button because I didn't implement any of the forward/back functionality you find in a normal web browser.

The icons used are as follows:

- Indicates a new article. All articles will display as new when you initially start the application.

- Indicates the article with the best rating.

- Indicates the article with the worst rating.

- Indicates the article with the most votes.

- Indicates the article with the most page views.

- Indicates the most popular article.

- Indicates the article with the most bookmarks.

- Indicates that the associated field increased in value.

- Indicates that the associated field decreased in value.

Other controls on the form include the following.

Show New Info Only

This checkbox allows you to filter the list of articles so that only new articles, and articles that have new data are displayed.

Show Icons

This checkbox allows you to turn the display of icons on and off.

Automatic Refresh

This checkbox allows you to turn the automatic refresh on and off. Once every hour, a BackgroundWorker object is used to refresh the article data.

Button - Refresh From CodeProject

This button allows you to manually refresh the article data (and this button is available even if auto-refresh is turned on).

Lastly, you can specify the user ID of the user for whom you would like to retrieve data. After specifying the new ID, hit the Refresh button.

Closing

This code is only intended to be used to retrieve your own articles - the scraper class accepts the user ID as a parameter, and that ID is currently set to my own. Make sure you change that before you start looking for your own articles.

I've tried to make this as maintainable as possible without forcing the programmer to do conceptual back-flips, but there's no way I can accommodate everyone's reading comprehension levels, so what I guess I'm saying is - you're pretty much on your own. I can't guarantee that I will be able to maintain this article in a timely fashion, but that shouldn't matter. We're all programmers here, and the stuff I've presented isn't rocket science. Besides, you have plenty of examples in the provided classes to modify and/or extend their functionality. Have fun.

Remember also, that the png files and css file need to be in the same folder as the executable, or it won't find them.

History

02/19/2010 (IMPORTANT!): There is a bug in the code that will cause the program to always tell you that it couldn't retrieve article information. After you download the code, make the following change: In the file ArticleScraper.cs, find the line that looks like this (in the ProcessArticle() method):

string rating = GetDataField("Rating", parts);

and change it to this:

string rating = GetDataField("Rating", parts).Replace("/5", "");

10/14/2008: Addressed the following:

Added support for retrieving the web page via a proxy (thanks Pete O'Hanlon!).
Added code to throw any exception encountered during the web page retrieval process (thanks again Pete O'Hanlon!).
Added a slightly more thorough HTML parse to handle errant < and > in the title or description of the article (thanks ChandraRam!).
Embedded the icons as resources in the exe file. They will be copied to the app folder the first time the exe is run.
Added a new statistic item at the top of the form - "Articles Displayed".
Enclosed the stuff at the top of the form in group boxes to make it look more organized.

10/13/2008: Addressed the following:

Added the forgotten mostvotes.png image.
Modified code to use the mostvotes image.
Added a textbox to the form to allow you to specify the userID.
Fixed the form resizing issues.
The zip file now includes the debug folder with the images and css file.

10/13/2008: Original article posted.

License

This article has no explicit license attached to it but may contain usage terms in the article text or the download files themselves. If in doubt please contact the author via the discussion board below.
