
The Code Project Forum Analyzer : Find out how much of a life you don't have!

Nish Nishant, 26 Mar 2011, CPOL

Figure 01 - The app in action, analyzing Lounge posts

Figure 02 - Charting the exported CSV data in Excel


This is an unofficial Code Project application that can analyze forums over a range of posts to retrieve posting statistics for individual members. Like my other Code Project applications (and those of others like John and Luc), this app uses HTML scraping and parsing to extract the required information, so any change to the site layout or CSS can potentially break it. There's no workaround for that until Code Project provides an official web service that exposes this data.

Using the app

I've got a hardcoded list of forums that are shown in a combo-box. I chose the more important forums, and also the ones with a decent amount of posts. You can also choose the number of posts you want to fetch and analyze: the app currently supports fetching 1,000, 5,000, or 10,000 posts. Anything more than 10,000 is not safe, as you will then start seeing the effects of the heavily loaded Code Project database servers, which means timeouts and lost pages. This won't break the app, but it will be forced to skip pages, which reduces statistical accuracy. Even with 10,000 posts you can still hit this on forums like Bugs/Suggestions or C++/CLI, because some of the older pages have posts with malformed HTML that breaks the HTML parser. There is a status log at the bottom of the UI that lists such parsing errors and, at the end of the analysis, tells you how many posts were skipped.

Figure 03 - Forums with malformed HTML will result in skipped posts (49 in the screenshot)

CP has added stricter checks on the HTML that's allowed in posts, so you should see this less frequently as time progresses. Once the analysis is done, you can use the Export feature to save the results in a CSV file. You can now open this CSV file in Excel and do further analysis and statistical charting.

Handy Tip

If you hover the mouse over a display name, it gets highlighted and the mouse cursor turns into a hand. You can then click the display name to open the user's CP profile in the default browser, and you can do this even while an analysis is in progress.
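The click handling itself is straightforward: hand the profile URL to the shell, which opens it in the default browser. A minimal sketch of that idea follows; note that the handler name and the profile URL format are my illustrative assumptions, not taken from the app's actual source:

```csharp
using System.Diagnostics;

// Hypothetical click handler: open a member's CP profile in the
// default browser. The URL format below is an assumption.
private void OnDisplayNameClick(int memberId)
{
    // Passing a URL to Process.Start defers to the shell,
    // which launches the user's default browser.
    Process.Start("http://www.codeproject.com/Members/" + memberId);
}
```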

Exporting to CSV and foreign language member-names

It should correctly choose the list separator based on your current locale (thanks to Mika Wendelius for helping me get this right), but I do not save the file as Unicode. That's because Excel seems to have trouble with it and treats it as one big single column (instead of 3 separate columns). So right now I am using Encoding.Default, which is a tad better than not using any, but you may run into weirdness in Excel if any of the member display names have Unicode characters. I have not figured out how to work around that, and at the moment I don't know if I want to spend time researching a fix. If any of you know how to resolve it, I'd appreciate your suggestions.
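For what it's worth, the workaround most commonly suggested for this problem is to write UTF-8 *with* a byte-order mark, which Excel reportedly uses to detect the encoding and split the columns correctly. Here is a sketch of what the export could look like under that assumption; the file name, the `results` collection, and the column names are illustrative and not the app's actual export code:

```csharp
using System.Globalization;
using System.IO;
using System.Text;

// Hedged sketch: new UTF8Encoding(true) emits a BOM, which is said to
// let Excel recognize the file as UTF-8. I haven't verified this
// against every Excel version.
string separator = CultureInfo.CurrentCulture.TextInfo.ListSeparator;
using (var writer = new StreamWriter(
    "results.csv", false, new UTF8Encoding(true)))
{
    writer.WriteLine(string.Join(separator, "Id", "DisplayName", "PostCount"));
    foreach (var m in results)
    {
        writer.WriteLine(string.Join(separator,
            m.Id, m.DisplayName, m.PostCount));
    }
}
```

Using `TextInfo.ListSeparator` keeps the locale-aware separator behavior described above while changing only the encoding.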

Implementation details

The core class that fetches all the data from the website is the ForumAnalyzer class. It uses the excellent HtmlAgilityPack for HTML parsing. Here are some of the more interesting methods in this class.

private void InitMaxPosts()
{
    string html = GetHttpPage(GetFetchUrl(1), this.timeOut);

    HtmlDocument document = new HtmlDocument();
    document.LoadHtml(html);

    // The XPath argument was truncated in the original listing
    HtmlNode trNode = document.DocumentNode.SelectSingleNode(/* XPath elided */);
    if (trNode != null)
    {
        if (trNode.ChildNodes.Count > 2)
        {
            var node = trNode.ChildNodes[2];
            string data = node.InnerText;
            int start = data.IndexOf("of");
            if (start != -1)
            {
                int end = data.IndexOf('(', start);
                if (end != -1)
                {
                    if (end - start - 2 > 0)
                    {
                        var extracted = data.Substring(start + 2, end - start - 2);
                        int.TryParse(extracted, NumberStyles.Integer,
                            CultureInfo.InvariantCulture, out maxPosts);
                    }
                }
            }
        }
    }
}

That's used to determine the maximum number of posts in the forum. Since the pages are dynamic, you won't get an error if you try fetching pages beyond the last page, but it will waste time and bandwidth and also mess up the statistics. So it's important to know the maximum number of pages we can fetch safely.
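To make the substring arithmetic above concrete, here is the same extraction applied to a made-up pager string (the exact text CP renders is an assumption; only the "of <total> (" shape matters):

```csharp
using System.Globalization;

// Hypothetical pager text for illustration
string data = "1 to 25 of 10412 (417 pages)";
int start = data.IndexOf("of");      // index of "of"
int end = data.IndexOf('(', start);  // opening paren after the total
// Substring(start + 2, ...) yields " 10412 " including the spaces
string extracted = data.Substring(start + 2, end - start - 2);
int maxPosts;
// NumberStyles.Integer tolerates the leading/trailing whitespace
int.TryParse(extracted, NumberStyles.Integer,
    CultureInfo.InvariantCulture, out maxPosts);
// maxPosts is now 10412
```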

public ICollection<Member> FetchPosts(int from)
{
    if (from > maxPosts)
        throw new ArgumentOutOfRangeException("from");

    string html = GetHttpPage(GetFetchUrl(from), this.timeOut);

    HtmlDocument document = new HtmlDocument();
    document.LoadHtml(html);

    Collection<Member> members = new Collection<Member>();

    // The XPath argument was truncated in the original listing; it
    // selects the nodes carrying the Frm_MsgAuthor CSS class
    foreach (HtmlNode tdNode in
        document.DocumentNode.SelectNodes(/* XPath elided */))
    {
        if (tdNode.ChildNodes.Count > 0)
        {
            var aNode = tdNode.ChildNodes[0];
            int id;
            if (aNode.Attributes.Contains("href")
                && TryParse(aNode.Attributes["href"].Value, out id))
            {
                members.Add(new Member(id, aNode.InnerText));
            }
        }
    }
    return members;
}

This is where the post data is extracted. It takes advantage of the Frm_MsgAuthor CSS class that Code Project uses. Now that I've mentioned this here, I bet Murphy's laws will reveal themselves and Chris will rename that class arbitrarily. Note that this class just returns post data in big chunks, so it's up to the caller to actually analyze the data; I do that in my application's view-model. Some people may feel that a separate wrapper should have done the calculations, with the VM merely accessing the results. If so, yeah, they're probably right, but for such a simple app I went for simplicity over design purity.

The code that uses the ForumAnalyzer class is called from a background worker thread, and when it completes, a second background worker is spawned to sort the results. I've also used Parallel.For from the Task Parallel Library, which gave me a significant speed boost. Initial runs took 8-9 minutes on my connection (15 Mbps, boosted to 24 Mbps), but once I added the Parallel.For, this went down to a little over a minute. Going from around 8 minutes to 1 minute was quite impressive! There were a few side effects, though, which I will talk about after this code listing.

private void Fetch()
{
    canFetch = false;
    canExport = false;

    var dispatcher = Application.Current.MainWindow.Dispatcher;
    Stopwatch stopWatch = new Stopwatch();
    BackgroundWorker worker = new BackgroundWorker();
    worker.DoWork += (sender, e) =>
    {
        ForumAnalyzer analyzer = new ForumAnalyzer(this.SelectedForum);

        dispatcher.Invoke((Action)(() =>
        {
            this.TimeElapsed = TimeSpan.FromSeconds(0).ToString(timeSpanFormat);
            this.Total = 0;
            AddLog(new LogInfo("Started fetching posts..."));
        }));

        Dictionary<int, MemberPostInfo> results =
            new Dictionary<int, MemberPostInfo>();

        stopWatch.Start();

        // Concurrency is capped at 8 to stay under CP's flood protection
        ParallelOptions options = new ParallelOptions()
            { MaxDegreeOfParallelism = 8 };
        Parallel.For(0, Math.Min((int)(PostCount)this.PostsToFetch,
            analyzer.MaxPosts) / postsPerPage, options, (i) =>
        {
            ICollection<Member> members = null;
            int trials = 0;

            // Retry each page up to 5 times before giving up on it
            while (members == null && trials < 5)
            {
                trials++;
                try
                {
                    members = analyzer.FetchPosts(i * postsPerPage + 1);
                }
                catch (WebException)
                {
                    // Transient HTTP failure - loop around and retry
                }
            }

            if (members == null)
            {
                dispatcher.Invoke((Action)(() =>
                    AddLog(new LogInfo(
                        "Http connection failure", i, postsPerPage))));
                return; // skip this page entirely
            }

            if (members.Count < postsPerPage)
            {
                dispatcher.Invoke((Action)(() =>
                    AddLog(new LogInfo(
                        "Html parser failure", i, postsPerPage - members.Count))));
            }

            lock (results)
            {
                foreach (var member in members)
                {
                    if (!results.ContainsKey(member.Id))
                    {
                        results[member.Id] = new MemberPostInfo()
                            { Id = member.Id, DisplayName = member.DisplayName,
                                PostCount = 1 };
                    }
                    else
                    {
                        results[member.Id].PostCount++;
                    }
                }

                dispatcher.Invoke((Action)(() =>
                {
                    this.Total += members.Count;
                    this.TimeElapsed = stopWatch.Elapsed.ToString(timeSpanFormat);
                }));
            }
        });

        // Hand the aggregated data to the VM (this assignment was
        // elided in the original listing)
        this.results = results.Values;
    };

    worker.RunWorkerCompleted += (s, e) =>
    {
        BackgroundWorker sortWorker = new BackgroundWorker();
        sortWorker.DoWork += (sortSender, sortE) =>
        {
            var temp = this.results.OrderByDescending(
                ks => ks.PostCount).ToArray();

            dispatcher.Invoke((Action)(() =>
            {
                AddLog(new LogInfo("Sorting results..."));
                foreach (var item in temp)
                {
                    // Populate the UI-bound collection
                    // (body elided in the original listing)
                }
            }));
        };

        sortWorker.RunWorkerCompleted += (sortSender, sortE) =>
        {
            AddLog(new LogInfo("Task completed!"));
            canFetch = true;
            canExport = true;
        };

        sortWorker.RunWorkerAsync();
    };

    worker.RunWorkerAsync();
}


When I added the Parallel.For, the first thing I noticed was that the number of errors and timeouts went up significantly, to the point where the results were almost useless. I was fighting CP's built-in flood protection, and I realized that if I spawned too many connections in parallel, this just would not work. With some trial and error, I finally reduced the maximum level of concurrency to 8, which gave me the best results. Coincidentally, I have 4 cores with hyper-threading, so that's 8 virtual CPUs, which made this perfect for me, I guess. Note that this is pure coincidence: the reason I had to reduce the parallelism was the CP flood-prevention system, not the number of cores I have.

A major side effect of using Parallel.For was that I lost the ability to fetch pages in serial order. Had I not used the parallel loops, I could detect an HTML parsing error, skip that one post, and continue with the post after it. But with concurrent loops, if I hit an error, I am forced to skip the rest of the page. Of course, it's not impossible to handle this correctly by spawning off a side-task that fetches just the skipped posts minus the malformed one, but that would have seriously increased the complexity of the code. I decided that I can live with 50-100 lost posts out of 10,000. That's less than a 1% deviation in accuracy, which I thought was acceptable for the application.

Well, that's it. Thanks for reading this article and for trying out the application. As usual, I appreciate any and all feedback, criticism and comments.



Thanks to the following CPians for their help with testing the application! Really, really appreciate that. I only listed the first 3 folks, but many others helped too. So thanks go to all of them, and I apologize for not listing everybody here (I didn't realize so many would be so helpful)!


  • March 26, 2011 - Article first published


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Nish Nishant

United States
Nish Nishant is a Software Architect/Consultant based out of Columbus, Ohio. He has over 15 years of software industry experience in various roles including Lead Software Architect, Principal Software Engineer, and Product Manager. Nish is a recipient of the annual Microsoft Visual C++ MVP Award since 2002 (13 consecutive awards as of 2014).

Nish is an industry-acknowledged expert in the Microsoft technology stack. He authored C++/CLI in Action for Manning Publications in 2005, and had previously co-authored Extending MFC Applications with the .NET Framework for Addison Wesley in 2003. In addition, he has over 140 published technology articles on Code Project and another 250+ blog articles on his WordPress blog. Nish is vastly experienced in team management, mentoring teams, and directing all stages of software development.

Contact Nish : You can reach Nish on his google email id voidnish.

Website and Blog
