
How to create a spam filter or automatic category-sorting algorithm in your mail application

29 Jul 2012
This article describes how to build automatic category filters for mail applications.

Introduction

In this article, I will show you the concepts behind building a spam filter for your mail application. As an advanced topic, I will also show how to filter mail by category, for example whether a message is about sport or not (or technology, business, and so on).

You can get all the libraries, covering mail, Twitter, Facebook, Dropbox, and Windows Live, at http://higlabo.codeplex.com/.

Bayesian filter

How do you filter mail based on whether it is spam or not? Please read this article: A-Naive-Bayesian-Spam-Filter-for-C.

With this library, you can easily filter spam. The one thing you must do is prepare two sets of text data: one containing spam, the other containing mail that is not spam.
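To make the underlying idea concrete, here is a minimal naive Bayes scorer sketched in Python. This is my own illustration, not code from the linked article or library: the function names and the add-one smoothing are assumptions, and the two training strings stand in for the spam and non-spam corpus files described below.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split into simple word tokens.
    return re.findall(r"[a-z']+", text.lower())

def train(spam_text, ham_text):
    # Count word frequencies in each corpus.
    spam_counts = Counter(tokenize(spam_text))
    ham_counts = Counter(tokenize(ham_text))
    vocab = set(spam_counts) | set(ham_counts)
    return spam_counts, ham_counts, vocab

def spam_score(message, spam_counts, ham_counts, vocab):
    # Log-likelihood ratio with add-one (Laplace) smoothing:
    # positive means spam-like, negative means ham-like.
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    v = len(vocab)
    score = 0.0
    for word in tokenize(message):
        p_spam = (spam_counts[word] + 1) / (spam_total + v)
        p_ham = (ham_counts[word] + 1) / (ham_total + v)
        score += math.log(p_spam / p_ham)
    return score

spam_counts, ham_counts, vocab = train(
    "win free money now click here free offer",
    "meeting tomorrow about the project schedule")
print(spam_score("free money offer", spam_counts, ham_counts, vocab) > 0)  # True
```

The quality of the two corpora decides everything here, which is why the rest of the article is about gathering that data.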

How to create spam data?

My idea is as follows:

  1. Create a mailbox on Gmail
  2. Register the mail address at a fishy site
  3. Wait for spam mail

After a few days, you will probably start receiving spam in that mailbox. You can then read all the messages and save them to a text file. Here is some sample code.

private static void CreateSpamDataFromMailbox()
{
    Console.WriteLine("Update spam data? Press y");
    if (Console.ReadLine() != "y") { return; }

    MailMessage mg = null;
    StringBuilder sb = new StringBuilder(1024 * 32);

    // Connect to the Gmail mailbox over POP3 with SSL.
    // Supply your own user name and password.
    using (Pop3Client cl = new Pop3Client("pop.gmail.com", 995, "", ""))
    {
        cl.Ssl = true;
        cl.AuthenticateMode = Pop3AuthenticateMode.Auto;
        if (cl.Authenticate())
        {
            // Read up to 100 messages and collect their body text,
            // separated by blank lines.
            var l = cl.ExecuteList();
            for (int i = 0; i < l.Count && i < 100; i++)
            {
                mg = cl.GetMessage(l[i].MailIndex);
                sb.AppendLine(mg.BodyText);
                sb.AppendLine();
            }
        }
    }

    // Replace the spam corpus file with the collected text.
    String fileName = "Spam.txt";
    if (File.Exists(fileName))
    {
        File.Delete(fileName);
    }
    File.WriteAllText(fileName, sb.ToString());
}

How to create normal mail data?

My idea is to gather some messages from the Sent Mail folder. If you create a mail application, it probably has functionality to send mail. When a mail is sent, you can save it to local disk or a database, and create a service that gathers this data as normal mail data.
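As a sketch of that send-mail hook (the file name and function here are my own illustration, not part of the article's library), you could simply append each outgoing body to a "not spam" corpus file, mirroring the spam collector above:

```python
HAM_FILE = "NotSpam.txt"  # hypothetical corpus file name

def save_sent_mail_as_ham(body_text, ham_file=HAM_FILE):
    # Append the body of each sent mail to the normal-mail corpus,
    # separated by a blank line, matching the spam file's layout.
    with open(ham_file, "a", encoding="utf-8") as f:
        f.write(body_text)
        f.write("\n\n")

save_sent_mail_as_ham("Hi team, the meeting is moved to 3pm.")
```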

Advanced filter to find if the mail is about sports or not

A Bayesian filter can also tell whether an article is about sports or not. But how do we gather training data about sports? My idea is to gather it from popular news sites. I will show you how to gather data from the BBC.

To get sports data from the BBC, this page is good: BBC Sports.

You can get the HTML of this page as shown below:

HttpClient cl = new HttpClient();
HttpRequestCommand cm = new HttpRequestCommand("http://www.bbc.co.uk/sport/0/");
cm.MethodName = HttpMethodName.Get;
String htmlText = cl.GetBodyText(cm);

This page itself contains very little text about sports. The actual text is in the sub-pages linked from the headlines.

Click "Cavendish targets gold with GB 'dream team'", and you can see this page:

This page includes the actual text about sports.

HtmlAgilityPack

First, you must gather the list of URLs linked from the headlines. How can we achieve this? I recommend HtmlAgilityPack, which is a great library for parsing HTML.

You can get it from here: HtmlAgilityPack.

You can parse the URL list from the BBC Sports top page with HtmlAgilityPack. The page and its HTML DOM look like the image below:

Here is the code to get the URL list from this HTML. See the red square in the image above and the sample code below.

HttpClient cl = new HttpClient();
HttpRequestCommand cm = new HttpRequestCommand("http://www.bbc.co.uk/sport/0/");
cm.MethodName = HttpMethodName.Get;
String htmlText = cl.GetBodyText(cm);

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlText);

HtmlNodeCollection nodes = 
  doc.DocumentNode.SelectNodes(@"//div[@class=""type-a-headline-list-1""]//li//a[@data-published]");

List<String> urlList = new List<string>();
foreach (HtmlNode node in nodes)
{
    urlList.Add(node.Attributes["href"].Value);
}

You can get the URL list by using the SelectNodes method.

HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(
    @"//div[@class=""type-a-headline-list-1""]//li//a[@data-published]");

Note the class=""type-a-headline-list-1"" and a[@data-published] parts. HtmlAgilityPack parses the HTML and filters for div tags whose class attribute is "type-a-headline-list-1". It then filters for a tags inside those div tags that have a "data-published" attribute.
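The same selection logic can be sketched with Python's standard xml.etree module on a simplified, well-formed stand-in for the BBC markup (real HTML generally needs a proper HTML parser such as HtmlAgilityPack; this only illustrates the attribute-filtering idea, and the snippet below is invented):

```python
import xml.etree.ElementTree as ET

# A simplified, well-formed stand-in for the BBC headline markup.
snippet = """
<div class="type-a-headline-list-1">
  <ul>
    <li><a data-published="1" href="/sport/cycling/cavendish">Cavendish targets gold</a></li>
    <li><a href="/sport/other">No data-published attribute here</a></li>
  </ul>
</div>
"""

root = ET.fromstring(snippet)
urls = []
# Equivalent of //div[@class="type-a-headline-list-1"]//li//a[@data-published]:
# keep only anchors that carry a data-published attribute.
for a in root.iter("a"):
    if a.get("data-published") is not None:
        urls.append(a.get("href"))
print(urls)  # ['/sport/cycling/cavendish']
```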

With these URLs, you can get the HTML of each detail article.

Now you can parse each article page in the same way. The page's HTML DOM is shown here.

To get the article text, you can write code like the following:

private static String GetArticleText(String url)
{
    StringBuilder sb = new StringBuilder(8096);

    // Download the article page.
    HttpClient cl = new HttpClient();
    HttpRequestCommand cm = new HttpRequestCommand(url);
    cm.MethodName = HttpMethodName.Get;
    String htmlText = cl.GetBodyText(cm);

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(htmlText);

    // Select the paragraphs inside the story body.
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(
        @"//div[@class=""story-body""]//div[@class=""article""]//p");

    if (nodes != null)
    {
        foreach (HtmlNode node in nodes)
        {
            // InnerText is HTML encoded, so decode it first.
            sb.AppendLine(HttpUtility.HtmlDecode(node.InnerText));
        }
    }
    return sb.ToString();
}

You must note that node.InnerText is HTML encoded, so you must decode it to get the correct text. I use the System.Web.HttpUtility class to do this.
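For comparison, the same decoding step in Python uses the standard html module; this is an analogy to HttpUtility.HtmlDecode, not the article's C# code, and the sample string is invented:

```python
import html

# Text extracted from a parsed node may still contain entities
# such as &amp; or &quot; that must be decoded.
encoded = "Cavendish targets gold with GB &quot;dream team&quot; &amp; more"
decoded = html.unescape(encoded)
print(decoded)  # Cavendish targets gold with GB "dream team" & more
```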

Now you can save text about sports to your local disk, like this:

    Mark Cavendish is confident that he can win gold at London 2012, with the help of his "dream team".

    The Manx-born cyclist is in action in Saturday's road race, less than a week after winning his third stage at this year's Tour de France.

    "It's doable," he said. "I couldn't do it if I was doing this alone but I need four of the strongest bike riders in the world to help me.

    "And I have got four of the strongest bike riders in the world to help me."

    Bradley Wiggins, who made history in Paris when he became the first Briton to win the Tour, Ian Stannard, ...

    ...

German GP 2012: Alonso wins at Hockenheim

    But after arriving in Hungary, Vettel denied calling Hamilton stupid and blamed the media, saying journalists had misheard him.

    "If I say after the race that I thought it was unnecessary and then it gets quoted that I said he is stupid, it's quite disappointing because sometimes I have a mouth, I say a couple of words, you have ears, and in that process it seems mistakes sometimes happen," Vettel said.

    "If you look at the rules, it's clear you are allowed to do it [unlap yourself]. I said it was unnecessary.

    "I was hunting Fernando, it was a couple of laps to the stop, it didn't help me, it probably helped Jenson, but that's racing.

    "I'm not complaining. I said it was unnecessary from a racing point of view to distract the leaders no matter who it was, and that's it."

Now you can filter whether a mail is about sports or not with this data. You can apply the same idea to other categories such as business, technology, and so on.
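One way to extend the binary spam test to several categories is to score a message against each category's corpus and pick the best match. Here is a sketch in Python; the tiny corpus strings are invented stand-ins for the text files gathered by the crawler above, and the function is my own illustration:

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def classify(message, corpora):
    # corpora: dict mapping category name -> training text gathered for it.
    # Score each category by summed log-probability with Laplace smoothing
    # over a shared vocabulary, and return the highest-scoring category.
    vocab = set()
    counts = {}
    for category, text in corpora.items():
        counts[category] = Counter(tokenize(text))
        vocab |= set(counts[category])
    v = len(vocab)
    best_category, best_score = None, float("-inf")
    for category, c in counts.items():
        total = sum(c.values())
        score = sum(math.log((c[w] + 1) / (total + v)) for w in tokenize(message))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

corpora = {
    "sports": "race gold cyclist stage tour team win",
    "business": "market shares profit company quarter revenue",
}
print(classify("the cyclist won the race", corpora))  # sports
```

With a corpus file per category (Sports.txt, Business.txt, and so on), the same scoring loop sorts incoming mail into whichever category fits best.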

Considerations

This article shows how to gather data for a Bayesian filter. Low-quality data causes errors, so the most important thing is to gather high-quality data that matches your requirements in order to improve accuracy. Spam will also keep changing to get past our filters, so you need to make a continued effort to keep your data up to date.


History

  • 2012/07/27: First post.

License

This article, along with any associated source code and files, is licensed under The MIT License.


About the Author

Higty
Web Developer
Japan
I'm working at a software company in Tokyo.

Comments and Discussions

 
General: Nice work — Shuqian Ying, 29-Jul-12 19:17
First, I think it is a nice technical article teaching how statistical filters work and sharing the code.

From a practical view, however, I believe your approach works better for spam messages, which try hard to hide their nature of being spam. Messages with specific topics usually come from subscriptions to particular sites; they usually have rich meta information about the topic and very regular patterns in their meta-properties and/or contents, so it would be better to use a SQL category of query language to filter, because that is much more accurate than statistical filters. If there is a perception that a structured-query user interface usable by ordinary users is hard to develop, that is no longer true. If your approach could be combined with the structured-query method, it would give users more control over how to filter...

Try to download and install the following program for a "test drive" to see what I mean. I believe you will never want to go back to the old ways of filtering after using it :) Of course I could be wrong. But anyway...
Having way too many emails to deal with? Try our SQLized solution: Email Aggregation Manager, which gets your email sorted, found, and organized beyond known precision.

General: Re: Nice work — Higty, 26-Aug-12 14:55
General: Re: Nice work — Shuqian Ying, 26-Aug-12 16:48


Last Updated 30 Jul 2012
Article Copyright 2012 by Higty