
How to create a spam filter or automatic category sort algorithm with your mail application

29 Jul 2012, MIT License
This article describes how to build automatic category filters for a mail application.


In this article, I will show the basic ideas behind creating a spam filter for your mail application. As an advanced topic, I will also show how to filter mail by subject: whether a message is about sports or not (or technology, business, and so on).

You can get all libraries that include mail, Twitter, Facebook, dropbox, Windows Live at

Bayesian filter

How do you decide whether a mail is spam or not? Please read this article: A-Naive-Bayesian-Spam-Filter-for-C.

With this library, you can filter spam easily. The one thing you must do yourself is prepare two sets of text data: one that is spam, and one that is not spam.
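To make the role of the two data sets concrete, here is a minimal sketch of a naive Bayes classifier. It is not the code of the library mentioned above; the class name NaiveBayesSketch, the simple tokenizer, the equal class priors, and the add-one smoothing are all assumptions chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative two-class naive Bayes text classifier (hypothetical, not the library's API).
public class NaiveBayesSketch
{
    private readonly Dictionary<string, int> spamCounts = new Dictionary<string, int>();
    private readonly Dictionary<string, int> hamCounts = new Dictionary<string, int>();
    private int spamTotal;
    private int hamTotal;

    // Very simple tokenizer: lower-case the text and split on whitespace and punctuation.
    private static string[] Tokenize(string text)
    {
        return text.ToLowerInvariant().Split(
            new[] { ' ', '\t', '\r', '\n', '.', ',', '!', '?', ';', ':', '"' },
            StringSplitOptions.RemoveEmptyEntries);
    }

    // Count every word of a training text in the spam or the normal ("ham") table.
    public void Train(string text, bool isSpam)
    {
        var counts = isSpam ? spamCounts : hamCounts;
        foreach (var word in Tokenize(text))
        {
            counts.TryGetValue(word, out int n);
            counts[word] = n + 1;
            if (isSpam) { spamTotal++; } else { hamTotal++; }
        }
    }

    // Sum of per-word log likelihood ratios; a positive value means "more like spam".
    // Equal priors for both classes are assumed.
    public double SpamLogOdds(string text)
    {
        int vocabulary = spamCounts.Keys.Union(hamCounts.Keys).Count();
        double score = 0.0;
        foreach (var word in Tokenize(text))
        {
            spamCounts.TryGetValue(word, out int s);
            hamCounts.TryGetValue(word, out int h);
            // Add-one (Laplace) smoothing so unseen words do not produce log(0).
            double pSpam = (s + 1.0) / (spamTotal + vocabulary);
            double pHam = (h + 1.0) / (hamTotal + vocabulary);
            score += Math.Log(pSpam / pHam);
        }
        return score;
    }

    public bool IsSpam(string text)
    {
        return SpamLogOdds(text) > 0;
    }
}

public static class NaiveBayesDemo
{
    public static void Main()
    {
        var filter = new NaiveBayesSketch();
        filter.Train("win free money now claim your prize", true);
        filter.Train("the meeting is moved to monday please see the agenda", false);

        Console.WriteLine(filter.IsSpam("claim your free prize"));         // True
        Console.WriteLine(filter.IsSpam("agenda for the monday meeting")); // False
    }
}
```

With only one training text per class the result is of course fragile; the quality of the filter depends almost entirely on how much representative spam and normal text you feed into Train.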

How to create spam data?

Here is my approach:

  1. Create a mailbox on Gmail
  2. Register your mail address at a fishy site
  3. Wait for spam mail

After a few days, you will probably start receiving spam in this mailbox. You can then read all the mails and save them to a text file. Here is some sample code.

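private static void CreateSpamDataFromMailbox()
{
    Console.WriteLine("Update spam data? Press y");
    if (Console.ReadLine() != "y") { return; }

    MailMessage mg = null;
    List<MailMessage> spamList = new List<MailMessage>();
    StringBuilder sb = new StringBuilder(1024 * 32);

    // Server name, port, user name, and password (left blank here).
    using (Pop3Client cl = new Pop3Client("", 995, "", ""))
    {
        cl.Ssl = true;
        cl.AuthenticateMode = Pop3AuthenticateMode.Auto;
        var bl = cl.Authenticate();
        if (bl == true)
        {
            var l = cl.ExecuteList();
            // Read at most the first 100 messages in the mailbox.
            for (int i = 0; i < l.Count && i < 100; i++)
            {
                mg = cl.GetMessage(l[i].MailIndex);
                spamList.Add(mg);
                // Assumed: the message body is collected as spam training text.
                sb.AppendLine(mg.BodyText);
            }
        }
    }

    String fileName = "Spam.txt";
    if (File.Exists(fileName) == true)
    {
        // Assumed: the previous data file is removed before writing the new one.
        File.Delete(fileName);
    }
    File.WriteAllText(fileName, sb.ToString());
}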

How to create normal mail data?

My idea is to gather it from the sent mail folder. If you are building a mail application, it probably has a send function. When the user sends a mail, save a copy to the local disk or a database, and create a service that collects those copies as normal mail data.

Advanced filter to find if the mail is about sports or not

A Bayesian filter can also decide whether a text is about sports or not. But how do we gather data about sports? My idea is to gather it from popular news sites. I will show you how to gather data from the BBC.

To get sports data from the BBC, this page is a good starting point: BBC Sports.

You can get the HTML of this page as shown below:

HttpClient cl = new HttpClient();
HttpRequestCommand cm = new HttpRequestCommand("");
cm.MethodName = HttpMethodName.Get;
String htmlText = cl.GetBodyText(cm);

This page itself includes very little text about sports. The actual text is in the sub pages linked from the headlines.

Image 1

Click "Cavendish targets gold with GB 'dream team'", and you can see this page:

Image 2

This page includes actual text about sports.


First, you must gather the list of URLs linked from the headlines. How can we do that? I recommend HtmlAgilityPack, a great library for parsing HTML text.

You can get it from here: HtmlAgilityPack.

You can extract the URL list from the BBC Sports top page with HtmlAgilityPack. The page and its HTML DOM look like this:

Image 3

Image 4

Here is the code to get the URL list from this HTML. Compare the red square in the image above with the sample code below.

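HttpClient cl = new HttpClient();
HttpRequestCommand cm = new HttpRequestCommand("");
cm.MethodName = HttpMethodName.Get;
String htmlText = cl.GetBodyText(cm);

HtmlDocument doc = new HtmlDocument();
// Parse the downloaded HTML before querying it.
doc.LoadHtml(htmlText);

HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(/* XPath expression for the headline links */);

List<String> urlList = new List<string>();
foreach (HtmlNode node in nodes)
{
    // Assumed: the link target is read from the href attribute of each a tag.
    urlList.Add(node.Attributes["href"].Value);
}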

You can get the URL list by using the SelectNodes method.

HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(/* XPath expression for the headline links */);

Note the two parts of the XPath expression: class="type-a-headline-list-1" and a[@data-published]. HtmlAgilityPack parses the HTML and first selects the div tag whose class attribute is "type-a-headline-list-1". It then selects the a tags inside that div that have a data-published attribute.
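The selection described above can be sketched as follows. The class name and the data-published attribute come from the text; the exact shape of the XPath expression, the method name ExtractHeadlineUrls, and the sample HTML are assumptions made up for illustration, and the real page structure may need a different path.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class HeadlineLinkSketch
{
    // Returns the href of every <a data-published="..."> inside the headline list div.
    public static List<string> ExtractHeadlineUrls(string htmlText)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(htmlText);

        // Assumed XPath: any a tag with a data-published attribute below the div.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(
            "//div[@class='type-a-headline-list-1']//a[@data-published]");

        var urls = new List<string>();
        if (nodes == null)
        {
            // SelectNodes returns null when nothing matches.
            return urls;
        }
        foreach (HtmlNode node in nodes)
        {
            string href = node.GetAttributeValue("href", "");
            if (href.Length > 0)
            {
                urls.Add(href);
            }
        }
        return urls;
    }

    public static void Main()
    {
        // Made-up sample markup; the paths are placeholders, not real article URLs.
        string sample =
            "<html><body>" +
            "<div class='type-a-headline-list-1'><ul>" +
            "<li><a href='/sport/example-story-1' data-published='2012-07-27'>Story 1</a></li>" +
            "<li><a href='/sport/example-story-2' data-published='2012-07-27'>Story 2</a></li>" +
            "<li><a href='/sport/no-date'>Not a headline</a></li>" +
            "</ul></div>" +
            "<div class='other'><a href='/news/elsewhere' data-published='2012-07-27'>Other</a></div>" +
            "</body></html>";

        foreach (string url in ExtractHeadlineUrls(sample))
        {
            Console.WriteLine(url);
        }
    }
}
```

Checking for null before the loop matters here, because a layout change on the news site would otherwise surface as a NullReferenceException rather than an empty list.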

With this URL, you can get the HTML of the detail article.
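The href values collected from a page may be relative paths rather than full URLs. That is an assumption here, since the original markup is only visible in the image. If so, they need to be resolved against the address of the page they came from before they can be requested. A minimal sketch using System.Uri, with example.com as a placeholder address and ToAbsoluteUrl as a hypothetical helper name:

```csharp
using System;

public static class UrlSketch
{
    // Resolves href against the URL of the page it was found on.
    // An href that is already absolute is returned unchanged.
    public static string ToAbsoluteUrl(string pageUrl, string href)
    {
        var baseUri = new Uri(pageUrl);
        return new Uri(baseUri, href).AbsoluteUri;
    }

    public static void Main()
    {
        string page = "http://example.com/sport/"; // placeholder, not the real site

        Console.WriteLine(ToAbsoluteUrl(page, "/sport/story-1"));
        // http://example.com/sport/story-1
        Console.WriteLine(ToAbsoluteUrl(page, "story-2"));
        // http://example.com/sport/story-2
        Console.WriteLine(ToAbsoluteUrl(page, "http://other.example.org/x"));
        // http://other.example.org/x
    }
}
```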

Image 5

Now parse this page in the same way. This page's HTML DOM is shown here.

Image 6

To get the article text, write code like this:

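private static String GetArticleText(String url)
{
    StringBuilder sb = new StringBuilder(8096);
    HttpClient cl = new HttpClient();
    HttpRequestCommand cm = new HttpRequestCommand(url);
    cm.MethodName = HttpMethodName.Get;
    String htmlText = cl.GetBodyText(cm);

    HtmlDocument doc = new HtmlDocument();
    // Parse the downloaded HTML before querying it.
    doc.LoadHtml(htmlText);

    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(/* XPath expression for the article text nodes */);

    if (nodes != null)
    {
        foreach (HtmlNode node in nodes)
        {
            // InnerText is HTML encoded, so decode it before storing it.
            sb.AppendLine(HttpUtility.HtmlDecode(node.InnerText));
        }
    }
    return sb.ToString();
}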

Note that node.InnerText is HTML encoded, so you must decode it to get the correct text. I use the System.Web.HttpUtility class to do this.
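As a small illustration of the decoding step, the sketch below uses System.Net.WebUtility.HtmlDecode, which performs the same kind of decoding without a reference to System.Web. Swapping in WebUtility is my choice for a self-contained example, not the article's method, and the helper name DecodeInnerText is hypothetical.

```csharp
using System;
using System.Net;

public static class DecodeSketch
{
    // Turns HTML entities such as &quot; and &#39; back into plain characters
    // and trims surrounding whitespace.
    public static string DecodeInnerText(string innerText)
    {
        return WebUtility.HtmlDecode(innerText).Trim();
    }

    public static void Main()
    {
        string encoded = "&quot;It&#39;s doable,&quot; he said. ";
        Console.WriteLine(DecodeInnerText(encoded));
        // prints: "It's doable," he said.
    }
}
```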

Now you have text about sports on your local disk, like this:

    Mark Cavendish is confident that he can win gold at London 2012, with the help of his "dream team".

    The Manx-born cyclist is in action in Saturday's road race, less than a week after winning his third stage at this year's Tour de France.

    "It's doable," he said. "I couldn't do it if I was doing this alone but I need four of the strongest bike riders in the world to help me.

    "And I have got four of the strongest bike readers in the world to help me."

    Bradley Wiggins, who made history in Paris when he became the first Briton to win the Tour, Ian Stannard,


German GP 2012: Alonso wins at Hockenheim
    But after arriving in Hungary, Vettel denied calling Hamilton stupid and blamed the media, saying journalists had misheard him. 

    "If I say after the race that I thought it was unnecessary and then it gets quoted that 
    I said he is stupid, it's quite disappointing because sometimes I have a mouth, I say 
    a couple of words, you have ears, and in that process it seems mistakes sometimes happen," Vettel said. 

    "If you look at the rules, it's clear you are allowed to do it [unlap yourself]. I said it was unnecessary. 

    "I was hunting Fernando, it was a couple of laps to the stop, it didn't help me, it probably helped Jenson, but that's racing. 

    "I'm not complaining. I said it was unnecessary from a racing point of view to distract the leaders no matter who it was, and that's it."

With this data, you can filter whether a mail is about sports or not. You can apply the same idea to other categories such as business, technology, and so on.
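Extending the idea to several categories could look like the sketch below: one word-count table per category, and the category with the highest smoothed log likelihood wins. The class name CategoryClassifierSketch, the tokenizer, the equal priors, and the one-sentence training texts are all assumptions for illustration; in practice each category would be trained on the article text gathered as described above.

```csharp
using System;
using System.Collections.Generic;

// Illustrative multi-category naive Bayes classifier (hypothetical, not a library API).
public class CategoryClassifierSketch
{
    private readonly Dictionary<string, Dictionary<string, int>> counts =
        new Dictionary<string, Dictionary<string, int>>();
    private readonly Dictionary<string, int> totals = new Dictionary<string, int>();
    private readonly HashSet<string> vocabulary = new HashSet<string>();

    private static string[] Tokenize(string text)
    {
        return text.ToLowerInvariant().Split(
            new[] { ' ', '\t', '\r', '\n', '.', ',', '!', '?', ';', ':', '"' },
            StringSplitOptions.RemoveEmptyEntries);
    }

    // Adds the words of one training text to the given category.
    public void Train(string category, string text)
    {
        if (!counts.ContainsKey(category))
        {
            counts[category] = new Dictionary<string, int>();
            totals[category] = 0;
        }
        foreach (var word in Tokenize(text))
        {
            counts[category].TryGetValue(word, out int n);
            counts[category][word] = n + 1;
            totals[category]++;
            vocabulary.Add(word);
        }
    }

    // Returns the category with the highest log likelihood, or null if untrained.
    public string Classify(string text)
    {
        string best = null;
        double bestScore = double.NegativeInfinity;
        string[] words = Tokenize(text);
        foreach (var category in counts.Keys)
        {
            double score = 0.0; // equal priors assumed for all categories
            foreach (var word in words)
            {
                counts[category].TryGetValue(word, out int n);
                // Add-one smoothing keeps unseen words from giving log(0).
                score += Math.Log((n + 1.0) / (totals[category] + vocabulary.Count));
            }
            if (score > bestScore)
            {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }
}

public static class CategoryDemo
{
    public static void Main()
    {
        var classifier = new CategoryClassifierSketch();
        classifier.Train("sports", "cyclist wins gold in the road race");
        classifier.Train("business", "shares fell as the bank reported a quarterly loss");
        classifier.Train("technology", "the new phone ships with a faster processor");

        Console.WriteLine(classifier.Classify("gold for the cyclist")); // sports
        Console.WriteLine(classifier.Classify("bank shares fell"));     // business
    }
}
```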


This article showed how to gather data for a Bayesian filter. Low quality data causes classification errors, so the most important thing is to gather high quality data that matches your requirements. Spam will also keep changing in order to get past our filters, so you need to keep updating your data.




  • 2012/07/27: First post.


This article, along with any associated source code and files, is licensed under The MIT License


About the Author

Web Developer
Japan
I work at a software company in Tokyo.


Posted 28 Jul 2012
