Click here to Skip to main content
15,860,859 members
Articles / Web Development / ASP.NET
Article

A Naive Bayesian Spam Filter for C#

Rate me:
Please Sign up or sign in to vote.
4.87/5 (37 votes)
6 Feb 2008CPOL5 min read 266.2K   12.7K   103   69
A C# implementation of Paul Graham's Naive Bayesian Spam Filter algorithm.
BayesianCS demo screenshot

Introduction

This is a C# implementation of Paul Graham's Naive Bayesian Spam Filter algorithm. It is suitable for incorporation into an ASP.NET Blogging, Forum, Email or Wiki application.

Background

I run a little Travel Blogging website called Blogabond that has been getting more and more attention from spammers over the years. At first, I was able to stem the tide with simple anti-robot measures to reject posts from things that were obviously not Web browsers. Soon after, I had to implement a simple silent human-detection script to run behind the scenes and ensure that a real person was sitting at a real keyboard and typing blog entries in by hand.

This approach worked really well for a long time. Every once in a while, some ambitious travel agency would start posting advertisements that I would have to delete by hand until they got the message that it wasn't working. Still, behind the scenes, about 10,000 automated comment spams were getting knocked out of the sky every day. Not bad.

It's 2008 now, and the game has changed. We're starting to see a new breed of spam showing up on Blogabond, and it's getting worse every day. This is human-powered spam. Delivered in person by a real person behind a real keyboard someplace where wages are low enough that advertisers can afford to hire rooms of workers to copy/paste comment spam by hand. None of the automated human-detection tricks work against this, because it's not automated anymore.

Time to get Bayesian

Modern email clients all use Bayesian spam filtering, so that's what I figured I needed to implement. Googling up "Bayesian C#", I was amazed to find that nobody has put out a Naive Bayesian Spam Filter for C# that you can simply drop into your codebase. What is the story here? The technology has been around since 2002. Is it really that scary to implement? Must be. Still, it's getting really annoying having to moderate Blogabond by hand. I think I'll give it a shot. You know what? It wasn't really that hard.

I'm not going to go into detail on the algorithm itself. After all, mine is just a straight implementation of Paul Graham's original Naive Bayesian Spam filtering algorithm, and I don't pretend to have anything interesting to add to his analysis.

Using the Code

In the zip file attached to this article, you'll find two classes that make the whole thing work. There's a Corpus class that holds lists of words, along with counts of how often they appear in a given piece of text. There's also a SpamFilter class that takes two of those Corpuses (Corpi? Corpuses's?) and crashes them against each other to produce a list of probabilities that a document containing a given word will be spam.

Once you've populated the SpamFilter, you can feed it other documents and ask it if it thinks it's looking at spam or not. In my testing, I found that it's actually pretty good at it. It found about 6% false negatives (Spams that didn't get flagged as such), and only 0.2% false positives (good messages mistakenly flagged as spam) out of the 10,000 blog entries I fed it. Actually, it did such a good job that all but one of its "false positives" were actually genuine spam that had slipped past my attempts at moderation.

I've included a sample WinForms application so that you can see the filter in action. It has a couple of text files that it reads in to populate the SpamFilter with enough good and bad content to make a useful demonstration. I've also provided 3 sample blog entries to test against. One is an actual blog entry, one is an obvious spam, and another is a well written spam that snuck through as a false negative. The sample application lets you edit the text of a message to see the effect of adding more or less "bad" content.

Putting it into Production

This is all great for testing purposes, but how do you go about putting this cool filtering live? Let me give you a quick rundown on how we're doing this for Blogabond today:

We keep a static SpamFilter object living in memory on the server, thus saving us the trouble of rebuilding one every time we need it. Once a day, a job runs that rebuilds the SpamFilter from the database contents, stores it in memory, and saves out a backup copy via the .ToFile() method. If we ever find the SpamFilter object missing (as a result of the server cycling itself), we'll just pull up the last state using the .FromFile() method.

When a new blog entry is saved, we run it through the SpamFilter and set an IsSpam bit on the blog entry as necessary. All of our display code knows to check this bit and either suppress display or deliver a 404 response for entries that have been flagged as spam. With one exception: we have a one-minute window after a new spam entry is posted where we'll display it as though it weren't spam. This is enough time for the Spammer to review the entry and congratulate himself on a job well done, but not enough time for the page to be indexed by search engines. We'll naturally also exclude any spam entries from RSS feeds, sitemaps, and Blog Ping service requests.

As part of the administration tools for the site, we have a list of all recent posts along with their status. This lets us quickly flip false positives and negatives back to their correct state, and generally keep a handle on what's going on with the site.

Conclusion

I've dumped this code out onto the Internet for general consumption in the hopes that people will find it useful. I see a lot of blogging and forum sites that are clearly running ASP.NET and are hopelessly overrun by comment spam. With luck, somebody might pick up these two simple classes and turn them into something useful. If you do so, please let me know how it works for you!

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Founder Expat Software
United States United States
Jason Kester is the founder of Expat Software, a small development and consulting house staffed by expatriate Americans in various remote yet comfortable parts of the world. He takes 9 months vacation every year and so should you.

Comments and Discussions

 
GeneralRe: how to store tokens ? Pin
fcis200823-Apr-08 11:06
fcis200823-Apr-08 11:06 
GeneralRe: how to store tokens ? Pin
Jason Kester26-Apr-08 11:42
Jason Kester26-Apr-08 11:42 
Generalgreat Pin
yantingting9-Apr-08 20:29
yantingting9-Apr-08 20:29 
Generaltraining the filter Pin
fcis20084-Apr-08 5:59
fcis20084-Apr-08 5:59 
GeneralRe: training the filter Pin
Steve Goodwin4-Apr-08 15:52
Steve Goodwin4-Apr-08 15:52 
GeneralRe: training the filter Pin
fcis20082-Jul-08 8:16
fcis20082-Jul-08 8:16 
GeneralRe: training the filter Pin
Steve Goodwin2-Jul-08 13:27
Steve Goodwin2-Jul-08 13:27 
GeneralI think there is a mistake of your algorithm Pin
Mironcito3-Apr-08 5:18
Mironcito3-Apr-08 5:18 
Hey,
At your code, there are the two lines:
double goodfactor = Min(1, (double)g / (double)_ngood);
double badfactor = Min(1, (double)b / (double)_nbad);
where _ngood and _nbad are the number of the tokens of each kind.
In the original algorithm, those numbers represent the numbers of the parsed email messages, and not the tokens count.
Am I right ?
GeneralRe: I think there is a mistake of your algorithm Pin
fcis20083-Apr-08 10:07
fcis20083-Apr-08 10:07 
GeneralRe: I think there is a mistake of your algorithm Pin
Jason Kester30-Apr-08 6:32
Jason Kester30-Apr-08 6:32 
GeneralRe: I think there is a mistake of your algorithm Pin
Jason Kester27-Jun-08 23:50
Jason Kester27-Jun-08 23:50 
GeneralYour got my five Pin
Mironcito2-Apr-08 6:20
Mironcito2-Apr-08 6:20 
Questionhow to implement it into email? Pin
inderabangsawan1-Mar-08 1:28
inderabangsawan1-Mar-08 1:28 
GeneralRe: how to implement it into email? Pin
fcis200828-Mar-08 13:31
fcis200828-Mar-08 13:31 
GeneralHElp Pin
fcis200818-Feb-08 12:49
fcis200818-Feb-08 12:49 
GeneralRe: HElp Pin
Jason Kester19-Feb-08 0:16
Jason Kester19-Feb-08 0:16 
GeneralHelp Pin
fcis200816-Feb-08 15:52
fcis200816-Feb-08 15:52 
GeneralRe: Help Pin
Jason Kester17-Feb-08 3:13
Jason Kester17-Feb-08 3:13 
GeneralRe: Help Pin
fcis200817-Feb-08 7:42
fcis200817-Feb-08 7:42 
GeneralGreat article! Pin
Eduard Keilholz15-Feb-08 4:24
Eduard Keilholz15-Feb-08 4:24 
QuestionRelated project? Pin
Gerard van Ewijk12-Feb-08 7:34
Gerard van Ewijk12-Feb-08 7:34 
AnswerRe: Related project? Pin
Jason Kester17-Feb-08 3:09
Jason Kester17-Feb-08 3:09 
GeneralNice job! Any advice on how to mark somethings as "not spam" Pin
John Schroedl12-Feb-08 3:40
professionalJohn Schroedl12-Feb-08 3:40 
GeneralRe: Nice job! Any advice on how to mark somethings as "not spam" Pin
Jason Kester17-Feb-08 3:05
Jason Kester17-Feb-08 3:05 
GeneralThanks! Pin
Jeepy7-Feb-08 7:17
Jeepy7-Feb-08 7:17 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.