Introduction

This article describes a method for scraping data off of the CodeProject My Articles page. There is currently no CodeProject API for retrieving this data, so scraping is the only way to get it. Unfortunately, the format of this page could change at any time and break this code, so it's up to you to stay on top of this issue. This should be quite easy since I've done all the hard work for you - all you have to do is maintain it.

The ArticleData Class

The

The Way I Did It

I chose to have the ArticleData class implement IComparable<ArticleData>, and to write the functions necessary to perform the sorting. This keeps the referencing code free of needless clutter, thus making the code easier to read. This is the way I like to do things. In my humble opinion, there is no point in bothering the programmer with needless minutiae. Instead of posting the entire class in this article, I'll simply show you two of the sorting functions:

    public class ArticleData : IComparable<ArticleData>
    {
        // DATA MEMBERS
        // PROPERTIES

        #region Comparison delegates
        /// <summary>
        /// Title comparison for sort function
        /// </summary>
        public static Comparison<ArticleData> TitleCompare = delegate(ArticleData p1, ArticleData p2)
        {
            return (p1.SortAscending) ? p1.m_title.CompareTo(p2.m_title)
                                      : p2.m_title.CompareTo(p1.m_title);
        };

        /// <summary>
        /// Page views comparison for sort function
        /// </summary>
        public static Comparison<ArticleData> PageViewsCompare = delegate(ArticleData p1, ArticleData p2)
        {
            return (p1.SortAscending) ? p1.m_pageViews.CompareTo(p2.m_pageViews)
                                      : p2.m_pageViews.CompareTo(p1.m_pageViews);
        };

        // there are more comparison delegates here

        /// <summary>
        /// Default comparison (compares article ID) for sort function
        /// </summary>
        public int CompareTo(ArticleData other)
        {
            return ArticleID.CompareTo(other.ArticleID);
        }
        #endregion Comparison delegates
    }

The ArticleUpdate Class

This class is derived from the ArticleData class. The class has its own sort delegates. They're similar enough that I decided not to show them in this article because I think it would be redundant. The truly interesting methods in this class are:

ApplyChanges

This method is called from the scraper manager object (covered in the next section) when an article is scraped off the web page. If the article exists in the list of existing articles, we call this method to update its data. If ANYTHING has changed for the article, this method returns true.

    public bool ApplyChanges(ArticleUpdate item, DateTime timeOfUpdate, bool newArticle)
    {
        bool changed = false;

        // make them all the same
        this.m_title       = m_latestTitle;
        this.m_link        = m_latestLink;
        this.m_lastUpdated = m_latestLastUpdated;
        this.m_description = m_latestDescription;
        this.m_pageViews   = m_latestPageViews;
        this.m_rating      = m_latestRating;
        this.m_votes       = m_latestVotes;
        this.m_popularity  = m_latestPopularity;
        this.m_bookmarks   = m_latestBookmarks;

        // set new info
        this.m_latestTitle       = item.m_latestTitle;
        this.m_latestLink        = item.m_latestLink;
        this.m_latestLastUpdated = item.m_latestLastUpdated;
        this.m_latestDescription = item.m_latestDescription;
        this.m_latestPageViews   = item.m_latestPageViews;
        this.m_latestRating      = item.m_latestRating;
        this.m_latestVotes       = item.m_latestVotes;
        this.m_latestPopularity  = item.m_latestPopularity;
        this.m_latestBookmarks   = item.m_latestBookmarks;

        // make a note of the last update time stamp
        this.m_timeUpdated = timeOfUpdate;
        this.m_newArticle  = newArticle;

        // see if anything changed since the last update
        changed = (this.m_title       != m_latestTitle       ||
                   this.m_link        != m_latestLink        ||
                   this.m_lastUpdated != m_latestLastUpdated ||
                   this.m_description != m_latestDescription ||
                   this.m_pageViews   != m_latestPageViews   ||
                   this.m_rating      != m_latestRating      ||
                   this.m_votes       != m_latestVotes       ||
                   this.m_popularity  != m_latestPopularity  ||
                   this.m_bookmarks   != m_latestBookmarks   ||
                   this.m_newArticle  == true);
        m_changed = changed;
        return changed;
    }

PropertyChanged

This method accepts a property name, and returns true if the current value of that property differs from its latest value.

    public bool PropertyChanged(string property)
    {
        string originalProperty = property;
        property = property.ToLower();
        switch (property)
        {
            case "title"       : return (Title       != LatestTitle);
            case "link"        : return (Link        != LatestLink);
            case "description" : return (Description != LatestDescription);
            case "pageviews"   : return (PageViews   != LatestPageViews);
            case "rating"      : return (Rating      != LatestRating);
            case "votes"       : return (Votes       != LatestVotes);
            case "popularity"  : return (Popularity  != LatestPopularity);
            case "bookmarks"   : return (Bookmarks   != LatestBookmarks);
            case "lastupdated" : return (LastUpdated != LatestLastUpdated);
        }
        // if we get here, the property is invalid
        throw new Exception(string.Format("Unknown article property - '{0}'", originalProperty));
    }

HowChanged

This method accepts a property name, and returns a ChangeType value that indicates the direction in which the property changed.

    public ChangeType HowChanged(string property)
    {
        ChangeType changeType = ChangeType.None;
        string originalProperty = property;
        property = property.ToLower();
        switch (property)
        {
            case "title":
                break;
            case "link":
                break;
            case "description":
                break;
            case "pageviews":
                {
                    // page views only ever go up
                    if (PageViews != LatestPageViews)
                    {
                        changeType = ChangeType.Up;
                    }
                }
                break;
            case "rating":
                {
                    if (Rating > LatestRating)
                    {
                        changeType = ChangeType.Down;
                    }
                    else
                    {
                        if (Rating < LatestRating)
                        {
                            changeType = ChangeType.Up;
                        }
                    }
                }
                break;
            case "votes":
                {
                    // votes only ever go up
                    if (Votes != LatestVotes)
                    {
                        changeType = ChangeType.Up;
                    }
                }
                break;
            case "popularity":
                {
                    if (Popularity > LatestPopularity)
                    {
                        changeType = ChangeType.Down;
                    }
                    else
                    {
                        if (Popularity < LatestPopularity)
                        {
                            changeType = ChangeType.Up;
                        }
                    }
                }
                break;
            case "bookmarks":
                {
                    if (Bookmarks > LatestBookmarks)
                    {
                        changeType = ChangeType.Down;
                    }
                    else
                    {
                        if (Bookmarks < LatestBookmarks)
                        {
                            changeType = ChangeType.Up;
                        }
                    }
                }
                break;
            case "lastupdated":
                break;
            default :
                throw new Exception(
                    string.Format("Unknown article property - '{0}'", originalProperty));
        }
        return changeType;
    }

The ArticleScraper Class

To make things easy on myself, I put all of the scraping code into this class. The web page is requested, and then parsed to within an inch of its life.
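The parsing relies on numbered marker strings in the page source: each article's block begins at one marker and ends where the next marker begins. As a rough, self-contained sketch of that idea (the `MarkerSplitDemo` class, the `SplitByMarkers` helper, and the `art{0:00}` marker format are hypothetical names made up for illustration, not part of the article's download or of CodeProject's real markup):

```csharp
using System;
using System.Collections.Generic;

public static class MarkerSplitDemo
{
    // Splits source into chunks that each begin with a numbered marker.
    // markerFormat is a format string such as "<span id=\"art{0:00}\">"
    // (a hypothetical marker, chosen only for this illustration).
    public static List<string> SplitByMarkers(string source, string markerFormat)
    {
        var chunks = new List<string>();
        int number = 0;

        // locate the first marker (number 00)
        int start = source.IndexOf(string.Format(markerFormat, number));
        while (start >= 0)
        {
            // the beginning of the next marker is the end of the current chunk
            int next = source.IndexOf(string.Format(markerFormat, number + 1), start);
            int end = (next >= 0) ? next : source.Length;

            chunks.Add(source.Substring(start, end - start));

            // move on to the next chunk; a negative index ends the loop
            start = next;
            number++;
        }
        return chunks;
    }
}
```

Calling `MarkerSplitDemo.SplitByMarkers(page, "<span id=\"art{0:00}\">")` on a string containing markers `art00` and `art01` yields two chunks, and any text before the first marker is discarded.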
For purposes of this article, I placed no value in determining the category/sub-category under which the article is posted. The

    if (this.ArticleSource == ArticleSource.CodeProject)
    {
        // this code actually hits the codeproject website
        string url = string.Format("{0}{1}{2}",
                                   "http://www.codeproject.com/script/",
                                   "Articles/MemberArticles.aspx?amid=",
                                   this.UserID);
        Uri uri = new Uri(url);
        WebClient webClient = new WebClient();
        string response = "";
        try
        {
            // added proxy support for those that need it - many thanks to Pete
            // O'Hanlon for pointing this out.
            webClient.Proxy = WebRequest.DefaultWebProxy;
            webClient.Proxy.Credentials = CredentialCache.DefaultCredentials;
            // get the web page
            response = webClient.DownloadString(uri);
        }
        catch (Exception)
        {
            // rethrow without resetting the stack trace
            throw;
        }
        pageSource = response;
    }
    else
    {
        // this code loads a sample page source from a local text file
        StringBuilder builder = new StringBuilder("");
        string filename = System.IO.Path.Combine(Application.StartupPath, "MemberArticles.txt");
        StreamReader reader = null;
        try
        {
            reader = File.OpenText(filename);
            string input = null;
            while ((input = reader.ReadLine()) != null)
            {
                builder.Append(input);
            }
        }
        catch (Exception)
        {
            // rethrow without resetting the stack trace
            throw;
        }
        finally
        {
            // the reader is still null if the file could not be opened
            if (reader != null)
            {
                reader.Close();
            }
        }
        pageSource = builder.ToString();
    }

Note - The line in the code above that builds the

After getting the web page, the

    int articleNumber = 0;
    bool found = true;
    while (found)
    {
        // establish our trigger points
        string articleStart = string.Format("<span id=\"ctl00_MC_AR_ctl{0}_MAS",
                                            string.Format("{0:00}", articleNumber));
        // we use the beginning of the next article as the
        // end of the current one
        string articleEnd = string.Format("<span id=\"ctl00_MC_AR_ctl{0}_MAS",
                                          string.Format("{0:00}", articleNumber + 1));

        // get the index of the start of the next article
        int startIndex = pageSource.IndexOf(articleStart);
        if (startIndex >= 0)
        {
            // delete everything that came before the starting index
            pageSource = pageSource.Substring(startIndex);
            startIndex = 0;

            // find the end of our article's data
            int endIndex = pageSource.IndexOf(articleEnd);

            // If we don't have an endIndex, then we've arrived
            // at the final article in our list.
            if (endIndex == -1)
            {
                endIndex = pageSource.IndexOf("<table");
                if (endIndex == -1)
                {
                    endIndex = pageSource.Length - 1;
                }
            }

            // get the substring
            string data = pageSource.Substring(0, endIndex);

            // if we have data, process it
            if (data != "")
            {
                ProcessArticle(data, articleNumber);
            }
            else
            {
                found = false;
            }
            articleNumber++;
        }
        else
        {
            found = false;
        }
    } // while (found)

    CalculateAverages();

I guess I could have used LINQ to scrounge around in the XML, but when you get right down to it, we can't count on the HTML being valid, so it's simply more reliable to parse the text this way. I know Chris, et al., work hard at making sure everything is just so, but they are merely human, and we know we can't count on humans to do it right every single time.

Processing an Article

By "process", I mean parsing the HTML and digging the actual data out of the article's div. While fairly simple, it is admittedly tedious. We start out by getting the article's URL, which is a straightforward operation:

    private string GetArticleLink(string data)
    {
        string result = data;
        // find the beginning of the desired text
        int hrefIndex = result.IndexOf("href=\"") + 6;
        // find the end of the desired text
        int endIndex = result.IndexOf("\">", hrefIndex);
        // snag it
        result = result.Substring(hrefIndex, endIndex - hrefIndex).Trim();
        // return it
        return result;
    }

Next, we clean the data, starting off by removing all of the HTML tags. A change was made to the source code to make the removal of HTML tags a little smarter. Even so, if the article title and/or description contains more than one stray pointy bracket, this method is almost guaranteed to return only a portion of the actual text of the item in question. If you like, you can google for (and use) one of the many exhaustive HTML parsers available on the net.
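To make the stray-bracket limitation concrete, here is a small sketch of a naive left-to-right tag stripper. `TagStripDemo` and `StripTags` are hypothetical names used only for illustration, and this is not the article's own implementation; it simply assumes that everything between a '<' and the next '>' is a tag.

```csharp
using System;
using System.Text;

public static class TagStripDemo
{
    // Removes every "<...>" span, scanning from left to right.
    // Hypothetical helper for illustration only.
    public static string StripTags(string data)
    {
        var result = new StringBuilder();
        int pos = 0;
        while (pos < data.Length)
        {
            int tagStart = data.IndexOf('<', pos);
            if (tagStart < 0)
            {
                // no more opening brackets: keep the rest as-is
                result.Append(data, pos, data.Length - pos);
                break;
            }
            int tagEnd = data.IndexOf('>', tagStart + 1);
            if (tagEnd < 0)
            {
                // an opening bracket with no closing bracket: keep the rest as-is
                result.Append(data, pos, data.Length - pos);
                break;
            }
            // keep the text before the tag, then skip past the tag itself
            result.Append(data, pos, tagStart - pos);
            pos = tagEnd + 1;
        }
        return result.ToString();
    }
}
```

With well-formed input such as `<b>Hello</b> world`, the sketch returns `Hello world`. With a stray bracket, as in `a < b and <i>c</i>`, it returns `a c`, because `< b and <i>` is mistaken for a single tag and the text inside it is lost.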
IMHO, a full HTML parser is not worth the effort considering this class's primary usage and the consistently decent HTML we get from CodeProject.

    private string RemoveHtmlTags(string data)
    {
        int ltCount = CountChar(data, '<');
        int gtCount = CountChar(data, '>');

        // If the number of left and right pointy brackets are the same, we stand a
        // reasonable chance that what we think are html tags really ARE html tags.
        if (ltCount == gtCount)
        {
            data = ForwardStrip(data);
        }
        else
        {
            // Otherwise, we have an errant pointy bracket, which we can almost
            // always take care of depending on the order in which we search for
            // tags.
            if (gtCount > ltCount)
            {
                data = BackwardStrip(ForwardStrip(data));
            }
            else
            {
                data = ForwardStrip(BackwardStrip(data));
            }
        }
        return data;
    }

    private int CountChar(string data, char value)
    {
        int count = 0;
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] == value)
            {
                count++;
            }
        }
        return count;
    }

    private string ForwardStrip(string data)
    {
        bool found = true;
        do
        {
            int tagStart = data.IndexOf("<");
            int tagEnd   = data.IndexOf(">");
            if (tagEnd >= 0)
            {
                tagEnd += 1;
            }
            found = (tagStart >= 0 && tagEnd >= 0 && tagEnd - tagStart > 1);
            if (found)
            {
                string tag = data.Substring(tagStart, tagEnd - tagStart);
                data = data.Replace(tag, "");
            }
        } while (found);
        return data;
    }

    private string BackwardStrip(string data)
    {
        bool found = true;
        do
        {
            int tagStart = data.LastIndexOf("<");
            int tagEnd   = data.LastIndexOf(">");
            if (tagEnd >= 0)
            {
                tagEnd += 1;
            }
            found = (tagStart >= 0 && tagEnd >= 0 && tagEnd - tagStart > 1);
            if (found)
            {
                string tag = data.Substring(tagStart, tagEnd - tagStart);
                data = data.Replace(tag, "");
            }
        } while (found);
        return data;
    }

Then, we remove all the extra stuff left behind:

    private string CleanData(string data)
    {
        // get rid of the HTML tags
        data = RemoveHtmlTags(data);
        // get rid of the crap that's left behind
        data = data.Replace("\t", "^").Replace("&nbsp;", "");
        data = data.Replace("\n", "").Replace("\r", "");
        data = data.Replace(" / 5", "");
        // collapse runs of spaces into a single space
        while (data.IndexOf("  ") >= 0)
        {
            data = data.Replace("  ", " ");
        }
        // collapse runs of delimiters into a single caret
        while (data.IndexOf("^ ^") >= 0)
        {
            data = data.Replace("^ ^", "^");
        }
        while (data.IndexOf("^^") >= 0)
        {
            data = data.Replace("^^", "^");
        }
        // trim the leading and trailing characters
        data = data.Substring(1);
        data = data.Substring(0, data.Length - 1);
        return data;
    }

After this, we're left with a pure list of data that describes the article, delimited with caret characters. All that's left is to create an ArticleUpdate item and add it to the list:

    private void ProcessArticle(string data, int articleNumber)
    {
        string link = GetArticleLink(data);
        data = CleanData(data);

        string[] parts     = data.Split('^');
        string title       = parts[0];
        string description = parts[7];
        string lastUpdated = GetDataField("Last Update", parts);
        string pageViews   = GetDataField("Page Views", parts).Replace(",", "");
        string rating      = GetDataField("Rating", parts);
        string votes       = GetDataField("Votes", parts).Replace(",", "");
        string popularity  = GetDataField("Popularity", parts);
        string bookmarks   = GetDataField("Bookmark Count", parts);

        // create the ArticleUpdate item and add it to the list
        DateTime lastUpdatedDate;
        ArticleUpdate article = new ArticleUpdate();
        article.LatestLink        = string.Format("http://www.codeproject.com{0}", link);
        article.LatestTitle       = title;
        article.LatestDescription = description;
        if (DateTime.TryParse(lastUpdated, out lastUpdatedDate))
        {
            article.LatestLastUpdated = lastUpdatedDate;
        }
        else
        {
            // fall back to a fixed date if the field could not be parsed
            article.LatestLastUpdated = new DateTime(1990, 1, 1);
        }
        article.LatestPageViews  = Convert.ToInt32(pageViews);
        article.LatestRating     = Convert.ToDecimal(rating);
        article.LatestVotes      = Convert.ToInt32(votes);
        article.LatestPopularity = Convert.ToDecimal(popularity);
        article.LatestBookmarks  = Convert.ToInt32(bookmarks);

        AddOrChangeArticle(article);
    }

    private void AddOrChangeArticle(ArticleUpdate article)
    {
        bool found = false;
        DateTime now = DateTime.Now;

        // apply changes
        for (int i = 0; i < m_articles.Count; i++)
        {
            ArticleUpdate item = m_articles[i];
            if (item.LatestTitle.ToLower() == article.LatestTitle.ToLower())
            {
                found = true;
                item.ApplyChanges(article, now, false);
                break;
            }
        }

        // if the article was not found, it must be new (or the title has changed),
        // so we'll add it
        if (!found)
        {
            article.ApplyChanges(article, now, true);
            m_articles.Add(article);
        }

        // remove all articles that weren't updated this time around - we need to
        // traverse the list in reverse order so we don't lose track of our index
        // NOTE: with the condition "i == 0", this loop body only runs when the list
        // holds exactly one article. Simply changing it to "i >= 0" is not enough,
        // because "now" is captured on every call, so articles updated by earlier
        // calls could be removed; a time stamp shared by the whole scrape would be
        // needed first.
        for (int i = m_articles.Count - 1; i == 0; i--)
        {
            ArticleUpdate item = m_articles[i];
            if (item.TimeUpdated != now)
            {
                m_articles.RemoveAt(i);
            }
        }
    }

The Sample Application

The sample application is admittedly a rudimentary affair, and is honestly intended to show only one possible way to use the scraping code. I decided to use a

You'll see that I didn't go to heroic lengths to pretty things up. For instance, I used PNG files for the graphics instead of GIF files. This means the transparency in the PNG files isn't handled correctly on systems running IE6 or earlier.

The application allows you to select the data on which to sort, and in what direction (ascending or descending). The default is the date last updated in descending order so that the newest articles appear first. The

The icons used are as follows:
Other controls on the form include the following.

Show New Info Only

This checkbox allows you to filter the list of articles so that only new articles, and articles that have new data, are displayed.

Show Icons

This checkbox allows you to turn the display of icons on and off.

Automatic Refresh

This checkbox allows you to turn the automatic refresh on and off. Once every hour, a

Button - Refresh From CodeProject

This button allows you to manually refresh the article data (and this button is available even if auto-refresh is turned on).

Lastly, you can specify the user ID of the user for whom you would like to retrieve data. After specifying the new ID, hit the Refresh button.

Closing

This code is only intended to be used to retrieve your own articles - the scraper class accepts the user ID as a parameter, and that ID is currently set to my own. Make sure you change that before you start looking for your own articles.

I've tried to make this as maintainable as possible without forcing the programmer to do conceptual back-flips, but there's no way I can accommodate everyone's reading comprehension levels, so what I guess I'm saying is - you're pretty much on your own.

I can't guarantee that I will be able to maintain this article in a timely fashion, but that shouldn't matter. We're all programmers here, and the stuff I've presented isn't rocket science. Besides, you have plenty of examples in the provided classes to modify and/or extend their functionality. Have fun.

Remember also that the PNG files and CSS file need to be in the same folder as the executable, or it won't find them.

History

10/14/2008: Addressed the following:
10/13/2008: Original article posted.