How to Create a Spam Filter or Automatic Category Sort Algorithm with Your Mail Application

Higty

Jul 28, 2012

MIT

This article describes automatic category filters in mail applications.

Download source - 1.7 MB

Introduction

In this article, I present an approach to building a spam filter for your mail application. As an advanced topic, I also show how to classify mail by whether it is about sports (or technology, business, and so on).

The libraries used here, which cover mail, Twitter, Facebook, Dropbox, and Windows Live, are available at http://higlabo.codeplex.com/.

Bayesian Filter

How do you decide whether a mail is spam? Please read this article: A-Naive-Bayesian-Spam-Filter-for-C.

With this library, you can easily filter spam. The one thing you must do yourself is prepare two sets of text data: one containing spam, the other containing mail that is not spam.
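To make the role of the two data sets concrete, here is a minimal sketch of a two-class naive Bayes classifier. It is not the API of the library referenced above: the class name NaiveBayesSketch and its methods are hypothetical, and the sketch assumes equal prior probability for both classes and a simple tokenizer that splits on whitespace and punctuation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative two-class naive Bayes text classifier (hypothetical names,
// not the API of the library referenced in the article).
public class NaiveBayesSketch
{
    private readonly Dictionary<string, int> _spamCounts = new Dictionary<string, int>();
    private readonly Dictionary<string, int> _normalCounts = new Dictionary<string, int>();
    private int _spamTotal;
    private int _normalTotal;

    // Split text into lower-case word tokens.
    private static IEnumerable<string> Tokenize(string text)
    {
        char[] separators = { ' ', '\t', '\r', '\n', '.', ',', '!', '?', ';', ':', '"' };
        return text.ToLowerInvariant().Split(separators, StringSplitOptions.RemoveEmptyEntries);
    }

    // Count every token of a training text under its class.
    public void Train(string text, bool isSpam)
    {
        Dictionary<string, int> counts = isSpam ? _spamCounts : _normalCounts;
        foreach (string token in Tokenize(text))
        {
            int n;
            counts.TryGetValue(token, out n);
            counts[token] = n + 1;
            if (isSpam) { _spamTotal++; } else { _normalTotal++; }
        }
    }

    // Returns true when the text is more likely under the spam data
    // than under the normal data (equal priors assumed).
    public bool IsSpam(string text)
    {
        int vocabulary = _spamCounts.Keys.Union(_normalCounts.Keys).Count();
        double spamScore = 0.0;
        double normalScore = 0.0;
        foreach (string token in Tokenize(text))
        {
            spamScore += LogLikelihood(_spamCounts, _spamTotal, vocabulary, token);
            normalScore += LogLikelihood(_normalCounts, _normalTotal, vocabulary, token);
        }
        return spamScore > normalScore;
    }

    // Add-one (Laplace) smoothing; the extra 1 in the denominator reserves
    // one slot for words that were never seen during training.
    private static double LogLikelihood(Dictionary<string, int> counts, int total,
                                        int vocabulary, string token)
    {
        int n;
        counts.TryGetValue(token, out n);
        return Math.Log((n + 1.0) / (total + vocabulary + 1.0));
    }
}
```

In use, you would call Train once per spam text with isSpam set to true, once per normal text with isSpam set to false, and then call IsSpam on each incoming mail body. Log-probabilities are summed instead of multiplying raw probabilities so that long mails do not underflow to zero.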

How to Create Spam Data?

Here is my idea:

Create a mailbox on Gmail.
Register your mail address at a fishy site.
Wait for spam mail.

After a few days, you will probably start receiving spam in that mailbox. You can then read all the mails and save their text to a file. Here is some sample code.

private static void CreateSpamDataFromMailbox()
{
    Console.WriteLine("Update spam data? Press y.");
    if (Console.ReadLine() != "y") { return; }

    StringBuilder sb = new StringBuilder(1024 * 32);

    // Fill in your own user name and password (left empty here).
    using (Pop3Client cl = new Pop3Client("pop.gmail.com", 995, "", ""))
    {
        cl.Ssl = true;
        cl.AuthenticateMode = Pop3AuthenticateMode.Auto;
        if (cl.Authenticate())
        {
            // Read at most the first 100 messages in the mailbox.
            var l = cl.ExecuteList();
            for (int i = 0; i < l.Count && i < 100; i++)
            {
                MailMessage mg = cl.GetMessage(l[i].MailIndex);
                sb.AppendLine(mg.BodyText);
                sb.AppendLine();
            }
        }
    }

    // File.WriteAllText overwrites an existing file, so no delete is needed.
    File.WriteAllText("Spam.txt", sb.ToString());
}

How to Create Normal Mail Data?

My idea is to gather it from the Sent Mail folder. If you are building a mail application, it probably has functionality to send mail. Whenever a mail is sent, you can save a copy to local disk or a database, and then create a service that collects these copies as normal mail data.
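As a rough sketch of the "save it when sending" idea, the helper below appends the body of each sent mail to a text file. The class name NormalMailStore, the method AppendSentMail, and the file layout (bodies separated by a blank line) are assumptions for illustration; a real application might store the data in a database instead.

```csharp
using System;
using System.IO;

// Hypothetical helper that collects sent mail bodies as "normal" training data.
public static class NormalMailStore
{
    // Appends one sent mail body to the data file, followed by a blank line.
    // Empty bodies are skipped because they add nothing to the training data.
    public static void AppendSentMail(string fileName, string bodyText)
    {
        if (string.IsNullOrWhiteSpace(bodyText)) { return; }
        File.AppendAllText(fileName, bodyText + Environment.NewLine + Environment.NewLine);
    }
}
```

You would call AppendSentMail from the code path that sends a mail, passing the same body text that was sent.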

Advanced Filter to Find if the Mail Is About Sports or Not

A Bayesian filter can also decide whether a text is about sports. But how do we gather data about sports? My idea is to gather it from popular news sites. I will show you how to gather data from the BBC.

To get sports data from the BBC, this page is a good starting point: BBC Sports.

You can get the HTML of this page as shown below:

HttpClient cl = new HttpClient();
HttpRequestCommand cm = new HttpRequestCommand("http://www.bbc.co.uk/sport/0/");
cm.MethodName = HttpMethodName.Get;
String htmlText = cl.GetBodyText(cm);

This page itself includes very little text about sports. The actual text is in the sub-pages linked from the headlines.

For example, clicking "Cavendish targets gold with GB 'dream team'" opens an article page that contains the actual text about sports.

HtmlAgilityPack

First, you must gather the list of URLs linked from the headlines. How can we do that? I recommend HtmlAgilityPack, a great library for parsing HTML text.

You can get it from here: HtmlAgilityPack.

You can parse the URL list from the BBC Sports top page with HtmlAgilityPack. In the page's HTML DOM, the headline links sit in a list inside a div element (marked with a red square in the original screenshot).

Here is the code to get the URL list from this HTML.

HttpClient cl = new HttpClient();
HttpRequestCommand cm = new HttpRequestCommand("http://www.bbc.co.uk/sport/0/");
cm.MethodName = HttpMethodName.Get;
String htmlText = cl.GetBodyText(cm);

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlText);

HtmlNodeCollection nodes = 
  doc.DocumentNode.SelectNodes
  (@"//div[@class=""type-a-headline-list-1""]//li//a[@data-published]");

List<String> urlList = new List<String>();
// SelectNodes returns null when nothing matches, so check before iterating.
if (nodes != null)
{
    foreach (HtmlNode node in nodes)
    {
        urlList.Add(node.Attributes["href"].Value);
    }
}

You can get the URL list by using the SelectNodes method.

HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(
    @"//div[@class=""type-a-headline-list-1""]//li//a[@data-published]");

Look at @class=""type-a-headline-list-1"" and a[@data-published] in the XPath expression. HtmlAgilityPack parses the HTML and selects the div tag whose class attribute is "type-a-headline-list-1". It then selects the a tags inside that div that have a "data-published" attribute.
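To see what this XPath expression matches without depending on the live BBC page, here is a small sketch that runs the same expression against a hand-written snippet. It deliberately swaps HtmlAgilityPack for the standard System.Xml.Linq and System.Xml.XPath classes, which only works because the sample is well-formed XML; real-world HTML still needs HtmlAgilityPack. The class name, method name, and sample markup are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Illustrative only: evaluates the article's XPath on well-formed markup
// using the standard library instead of HtmlAgilityPack.
public static class XPathSketch
{
    public static List<string> GetHeadlineUrls(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml);
        return doc
            .XPathSelectElements("//div[@class='type-a-headline-list-1']//li//a[@data-published]")
            .Select(a => (string)a.Attribute("href"))   // null when href is absent
            .Where(href => !string.IsNullOrEmpty(href))
            .ToList();
    }
}
```

Given a snippet with one matching link, one link without a data-published attribute, and one link outside the div, only the first link's href is returned. This mirrors the two filtering steps described above: first the div by class, then the a tags by attribute.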

With this URL, you can get the HTML of the detail article.
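One practical detail: if the href values turn out to be relative (for example, starting with "/sport/"), they must be combined with the address of the page they came from before you can request them. The sketch below uses the standard System.Uri class for this. The class and method names are hypothetical, and whether the BBC links are relative is an assumption, not something the article states.

```csharp
using System;

// Hypothetical helper: turns an href taken from a page into an absolute URL.
public static class UrlSketch
{
    public static string ToAbsoluteUrl(string pageUrl, string href)
    {
        Uri baseUri = new Uri(pageUrl);
        // Uri(Uri, string) resolves relative references against the base
        // and leaves an already absolute href unchanged.
        return new Uri(baseUri, href).AbsoluteUri;
    }
}
```

Because absolute links pass through unchanged, you can apply this helper to every entry in the URL list without checking its form first.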

Now parse this page in the same way. In its HTML DOM, the article paragraphs are p tags inside a div with class "article", which is itself inside a div with class "story-body".

To get the article text, write code like this:

private static String GetArticleText(String url)
{
    StringBuilder sb = new StringBuilder(8096);
    HttpClient cl = new HttpClient();
    HttpRequestCommand cm = new HttpRequestCommand(url);
    cm.MethodName = HttpMethodName.Get;
    String htmlText = cl.GetBodyText(cm);

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(htmlText);

    HtmlNodeCollection nodes = 
      doc.DocumentNode.SelectNodes(
      @"//div[@class=""story-body""]//div[@class=""article""]//p");

    if (nodes != null)
    {
        foreach (HtmlNode node in nodes)
        {
            sb.AppendLine(HttpUtility.HtmlDecode(node.InnerText));
        }
    }
    return sb.ToString();
}

Note that node.InnerText is HTML-encoded, so you must decode it to get the correct text. I use the System.Web.HttpUtility class to do this.
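To illustrate what decoding does, here is a short sketch. It uses System.Net.WebUtility.HtmlDecode, a standard alternative that does not require a reference to System.Web; this is a substitution for illustration, not the call used in the article's code. The class and method names are hypothetical.

```csharp
using System;
using System.Net;

// Illustrative only: decodes HTML entities in text taken from InnerText.
public static class DecodeSketch
{
    public static string DecodeInnerText(string innerText)
    {
        // Converts entities such as &quot;, &amp; and &#39; back to characters.
        return WebUtility.HtmlDecode(innerText);
    }
}
```

For example, the encoded text `&quot;It&#39;s doable,&quot; he said.` decodes to `"It's doable," he said.` Without this step, the entity names themselves would end up as tokens in the training data.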

Now you can save text about sports to local disk, like this:

Mark Cavendish is confident that he can win gold at London 2012,
with the help of his "dream team".

The Manx-born cyclist is in action in Saturday's road race,
less than a week after winning his third stage at this year's Tour de France.

"It's doable," he said. "I couldn't do it if I was doing this alone
but I need four of the strongest bike riders in the world to help me.

"And I have got four of the strongest bike riders in the world to help me."

Bradley Wiggins, who made history in Paris when he became the first Briton to win the Tour,
Ian Stannard,

........................................................................

German GP 2012: Alonso wins at Hockenheim
But after arriving in Hungary, Vettel denied calling Hamilton stupid
and blamed the media, saying journalists had misheard him.

"If I say after the race that I thought it was unnecessary and then it gets quoted that
I said he is stupid, it's quite disappointing because sometimes I have a mouth, I say
a couple of words, you have ears, and in that process it seems mistakes
sometimes happen," Vettel said.

"If you look at the rules, it's clear you are allowed to do it [unlap yourself].
I said it was unnecessary.

"I was hunting Fernando, it was a couple of laps to the stop,
it didn't help me, it probably helped Jenson, but that's racing.

"I'm not complaining. I said it was unnecessary from a racing point of view
to distract the leaders no matter who it was, and that's it."

With this data, you can now decide whether a mail is about sports. You can apply the same idea to other categories such as business, technology, and so on.
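One way to organize the data for several categories is to write one text file per category. The sketch below assumes that layout. The names TrainingDataSketch and SaveCategoryData are hypothetical, and the download step is passed in as a delegate so that a function like GetArticleText from above can be plugged in.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper: collects article text for one category into one file.
public static class TrainingDataSketch
{
    // Downloads each article with getArticleText and writes all non-empty
    // results to "<directory>/<category>.txt". Returns the number saved.
    public static int SaveCategoryData(string category, IEnumerable<string> articleUrls,
                                       Func<string, string> getArticleText, string directory)
    {
        StringBuilder sb = new StringBuilder();
        int saved = 0;
        foreach (string url in articleUrls)
        {
            string text = getArticleText(url);
            // Skip pages where no article text was found.
            if (string.IsNullOrWhiteSpace(text)) { continue; }
            sb.AppendLine(text.Trim());
            sb.AppendLine();
            saved++;
        }
        File.WriteAllText(Path.Combine(directory, category + ".txt"), sb.ToString());
        return saved;
    }
}
```

A call such as SaveCategoryData("Sports", urlList, GetArticleText, ".") would then produce Sports.txt, and repeating it with URL lists from other sections would produce one file per category.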

Considerations

This article shows how to gather data for a Bayesian filter. Low-quality data causes classification errors, so the most important thing is to gather high-quality data that matches your requirements. Spam will also keep changing to evade the filter, so you need to keep updating your data.

History

27th July, 2012: First post