A Naive Bayesian Spam Filter for C#

Jason Kester, 6 Feb 2008
A C# implementation of Paul Graham's Naive Bayesian Spam Filter algorithm.
[Screenshot: BayesianCS demo]

Introduction

This is a C# implementation of Paul Graham's Naive Bayesian Spam Filter algorithm. It is suitable for incorporation into an ASP.NET blogging, forum, email, or wiki application.

Background

I run a little travel blogging website called Blogabond that has been getting more and more attention from spammers over the years. At first, I was able to stem the tide with simple anti-robot measures to reject posts from things that were obviously not Web browsers. Soon after, I had to implement a simple silent human-detection script to run behind the scenes and ensure that a real person was sitting at a real keyboard and typing blog entries in by hand.

This approach worked really well for a long time. Every once in a while, some ambitious travel agency would start posting advertisements that I would have to delete by hand until they got the message that it wasn't working. Still, behind the scenes, about 10,000 automated comment spams were getting knocked out of the sky every day. Not bad.

It's 2008 now, and the game has changed. We're starting to see a new breed of spam showing up on Blogabond, and it's getting worse every day. This is human-powered spam, delivered by a real person behind a real keyboard, someplace where wages are low enough that advertisers can afford to hire rooms of workers to copy and paste comment spam by hand. None of the automated human-detection tricks work against this, because it's not automated anymore.

Time to get Bayesian

Most modern email clients use some form of Bayesian spam filtering, so that's what I figured I needed to implement. Googling up "Bayesian C#", I was amazed to find that nobody had put out a Naive Bayesian Spam Filter for C# that you can simply drop into your codebase. What is the story here? The technique has been around since 2002. Is it really that scary to implement? Must be. Still, it's getting really annoying having to moderate Blogabond by hand. I think I'll give it a shot. You know what? It wasn't really that hard.

I'm not going to go into detail on the algorithm itself. After all, mine is just a straight implementation of Paul Graham's original Naive Bayesian Spam filtering algorithm, and I don't pretend to have anything interesting to add to his analysis.

Using the Code

In the zip file attached to this article, you'll find two classes that make the whole thing work. There's a Corpus class that holds lists of words, along with counts of how often they appear in a given piece of text. There's also a SpamFilter class that takes two of those Corpuses (Corpi? Corpuses's?) and crashes them against each other to produce a list of probabilities that a document containing a given word will be spam.
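To make the idea concrete, here is a minimal sketch of the two steps described above: counting tokens in a piece of text, and turning a token's good and bad counts into a spam probability using the formula from Paul Graham's "A Plan for Spam". The names (GrahamSketch, CountTokens, TokenProbability) and the token regex are assumptions made for illustration; they are not the API of the Corpus and SpamFilter classes in the download. The essay also skips all-digit tokens, which this sketch does not do.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not the article's Corpus/SpamFilter classes.
public static class GrahamSketch
{
    // Count how often each token appears in the text (case-insensitive).
    // The token pattern (letters, digits, $, apostrophe, dash) is an assumption.
    public static Dictionary<string, int> CountTokens(string text)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in Regex.Matches(text, @"[A-Za-z0-9$'-]+"))
        {
            int n;
            counts.TryGetValue(m.Value, out n); // n stays 0 if the token is new
            counts[m.Value] = n + 1;
        }
        return counts;
    }

    // Per-token spam probability following "A Plan for Spam":
    // good counts are doubled to bias against false positives, tokens seen
    // fewer than 5 times (after doubling) are ignored, and the result is
    // clamped to [0.01, 0.99]. Returns null for ignored tokens.
    public static double? TokenProbability(int goodCount, int badCount,
                                           int goodMessages, int badMessages)
    {
        if (goodMessages <= 0) throw new ArgumentOutOfRangeException("goodMessages");
        if (badMessages <= 0) throw new ArgumentOutOfRangeException("badMessages");

        int g = 2 * goodCount;
        int b = badCount;
        if (g + b < 5) return null; // too rare to say anything useful

        double goodFreq = Math.Min(1.0, (double)g / goodMessages);
        double badFreq = Math.Min(1.0, (double)b / badMessages);
        double p = badFreq / (goodFreq + badFreq);
        return Math.Max(0.01, Math.Min(0.99, p));
    }
}
```

Because of the clamp, a value produced this way can never reach 0 or 1, which keeps the later combining step from dividing by zero.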

Once you've populated the SpamFilter, you can feed it other documents and ask it whether it thinks it's looking at spam. In my testing, I found that it's actually pretty good at it. It produced about 6% false negatives (spam that didn't get flagged as such), and only 0.2% false positives (good messages mistakenly flagged as spam) out of the 10,000 blog entries I fed it. In fact, it did such a good job that all but one of its "false positives" turned out to be genuine spam that had slipped past my attempts at moderation.
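For readers who want to see how a document gets scored, the following sketch shows the combining step from "A Plan for Spam": take the 15 tokens whose probabilities are farthest from 0.5, treat unknown tokens as 0.4, combine them, and call the message spam above 0.9. The class and method names are hypothetical, and de-duplicating tokens before ranking is a simplification I am assuming here; the code in the download may differ on both points.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical scoring helper for illustration; not the article's SpamFilter API.
public static class CombineSketch
{
    public const double UnknownTokenProbability = 0.4; // value used in Graham's essay
    public const int InterestingTokenLimit = 15;       // value used in Graham's essay
    public const double SpamThreshold = 0.9;           // value used in Graham's essay

    // tokenProbabilities is assumed to hold values strictly between 0 and 1
    // (for example, clamped to [0.01, 0.99]); otherwise the division can yield NaN.
    public static double Score(IEnumerable<string> messageTokens,
                               IDictionary<string, double> tokenProbabilities)
    {
        var probabilities = new List<double>();
        foreach (string token in messageTokens.Distinct())
        {
            double p;
            if (!tokenProbabilities.TryGetValue(token, out p))
                p = UnknownTokenProbability;
            probabilities.Add(p);
        }

        // Keep only the most "interesting" tokens: those farthest from a neutral 0.5.
        var interesting = probabilities
            .OrderByDescending(p => Math.Abs(p - 0.5))
            .Take(InterestingTokenLimit);

        double spamProduct = 1.0;
        double hamProduct = 1.0;
        foreach (double p in interesting)
        {
            spamProduct *= p;
            hamProduct *= 1.0 - p;
        }
        // With no tokens at all, both products are 1 and the score is a neutral 0.5.
        return spamProduct / (spamProduct + hamProduct);
    }

    public static bool IsSpam(double score)
    {
        return score > SpamThreshold;
    }
}
```

A score computed this way always lies between 0 and 1, so a result outside that range would point to a bug in the probability table rather than in the combining step.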

I've included a sample WinForms application so that you can see the filter in action. It has a couple of text files that it reads in to populate the SpamFilter with enough good and bad content to make a useful demonstration. I've also provided three sample blog entries to test against. One is an actual blog entry, one is an obvious spam, and another is a well-written spam that snuck through as a false negative. The sample application lets you edit the text of a message to see the effect of adding more or less "bad" content.

Putting it into Production

This is all great for testing purposes, but how do you go about taking this filter live? Let me give you a quick rundown on how we're doing this for Blogabond today:

We keep a static SpamFilter object living in memory on the server, thus saving us the trouble of rebuilding one every time we need it. Once a day, a job runs that rebuilds the SpamFilter from the database contents, stores it in memory, and saves out a backup copy via the .ToFile() method. If we ever find the SpamFilter object missing (as a result of the server cycling itself), we'll just pull up the last state using the .FromFile() method.
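The arrangement above could be sketched roughly as follows. FilterHolder and its three delegates are hypothetical names introduced for illustration: the rebuild delegate stands in for the daily database job, and the save and load delegates stand in for calls such as .ToFile() and .FromFile(), whose exact signatures are not shown in this article. The holder is generic so that it makes no assumptions about the filter type itself.

```csharp
using System;

// Hypothetical in-memory holder; TFilter stands in for your filter type.
public sealed class FilterHolder<TFilter> where TFilter : class
{
    private readonly object _gate = new object();
    private readonly Func<TFilter> _rebuildFromDatabase; // assumed: builds a fresh filter
    private readonly Func<TFilter> _loadBackup;          // assumed: reads the last saved copy
    private readonly Action<TFilter> _saveBackup;        // assumed: writes a backup copy
    private TFilter _current;

    public FilterHolder(Func<TFilter> rebuildFromDatabase,
                        Func<TFilter> loadBackup,
                        Action<TFilter> saveBackup)
    {
        if (rebuildFromDatabase == null) throw new ArgumentNullException("rebuildFromDatabase");
        if (loadBackup == null) throw new ArgumentNullException("loadBackup");
        if (saveBackup == null) throw new ArgumentNullException("saveBackup");
        _rebuildFromDatabase = rebuildFromDatabase;
        _loadBackup = loadBackup;
        _saveBackup = saveBackup;
    }

    // Intended to be called by the once-a-day job.
    public void Rebuild()
    {
        TFilter fresh = _rebuildFromDatabase(); // slow work happens outside the lock
        _saveBackup(fresh);
        lock (_gate) { _current = fresh; }
    }

    // Returns the in-memory filter, restoring the last backup if the
    // process has restarted and the filter is missing.
    public TFilter Get()
    {
        lock (_gate)
        {
            if (_current == null) _current = _loadBackup();
            return _current;
        }
    }
}
```

In an ASP.NET application, one instance of such a holder would typically live in a static field so that every request shares the same filter.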

When a new blog entry is saved, we run it through the SpamFilter and set an IsSpam bit on the blog entry as necessary. All of our display code knows to check this bit and either suppress display or deliver a 404 response for entries that have been flagged as spam, with one exception: we have a one-minute window after a new spam entry is posted where we'll display it as though it weren't spam. This is enough time for the spammer to review the entry and congratulate himself on a job well done, but not enough time for the page to be indexed by search engines. We'll naturally also exclude any spam entries from RSS feeds, sitemaps, and Blog Ping service requests.
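The display rule described above is small enough to sketch directly. SpamDisplayPolicy, ShouldDisplay, and ShouldSyndicate are hypothetical names, and passing the current time in as a parameter is my own choice to keep the rule easy to test; the production code on Blogabond is not shown in this article.

```csharp
using System;

// Hypothetical policy helper illustrating the one-minute grace window.
public static class SpamDisplayPolicy
{
    public static readonly TimeSpan GraceWindow = TimeSpan.FromMinutes(1);

    // Page display: non-spam is always shown; spam is shown only while
    // it is younger than the grace window.
    public static bool ShouldDisplay(bool isSpam, DateTime postedUtc, DateTime nowUtc)
    {
        if (!isSpam) return true;
        return nowUtc - postedUtc < GraceWindow;
    }

    // RSS feeds, sitemaps, and ping requests: spam is never included,
    // not even during the grace window.
    public static bool ShouldSyndicate(bool isSpam)
    {
        return !isSpam;
    }
}
```

A page handler would call ShouldDisplay with DateTime.UtcNow and return a 404 (or an empty result) when it comes back false.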

As part of the administration tools for the site, we have a list of all recent posts along with their status. This lets us quickly flip false positives and negatives back to their correct state, and generally keep a handle on what's going on with the site.

Conclusion

I've dumped this code out onto the Internet for general consumption in the hopes that people will find it useful. I see a lot of blogging and forum sites that are clearly running ASP.NET and are hopelessly overrun by comment spam. With luck, somebody might pick up these two simple classes and turn them into something useful. If you do so, please let me know how it works for you!

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL).

About the Author

Jason Kester
Founder, Expat Software
United States
Jason Kester is the founder of Expat Software, a small development and consulting house staffed by expatriate Americans in various remote yet comfortable parts of the world. He takes 9 months vacation every year and so should you.

Comments and Discussions

 
Maybe naive, but interesting enough. My 5
Sergey Alexandrovich Kryukov, 1-Jan-13 20:04
I just voted 5. The work is interesting enough.

I would ask you to take a look at one quite practical and important thing. I just published a short article devoted to the spamming problem we are experiencing right now on CodeProject. Its status is pending right now, but I hope it will be approved by the time you can take a look at it:

http://www.codeproject.com/Tips/519762/A-Plan-for-Spam

I have an idea for how to abate the spamming crisis. Could you consider the problem, review my idea, and tell me what you think? Perhaps you will get some other ideas and could provide some help. The spamming problem right now is really serious, and I'm thinking very seriously about how I can help.

Thank you for your attention to this important matter,
—SA

Sergey A Kryukov

Last Updated 6 Feb 2008
Article Copyright 2008 by Jason Kester