Posted 6 Feb 2008



A Naive Bayesian Spam Filter for C#



This is a C# implementation of Paul Graham's Naive Bayesian spam filter algorithm. It is suitable for incorporation into an ASP.NET blogging, forum, email, or wiki application.


I run a small travel blogging website called Blogabond that has been getting more and more attention from spammers over the years. At first, I was able to stem the tide with simple anti-robot measures that rejected posts from anything that was obviously not a web browser. Soon after, I had to add a silent human-detection script that runs behind the scenes and checks that a real person is sitting at a real keyboard, typing blog entries in by hand.

This approach worked really well for a long time. Every once in a while, some ambitious travel agency would start posting advertisements that I would have to delete by hand until they got the message that it wasn't working. Still, behind the scenes, about 10,000 automated comment spams were getting knocked out of the sky every day. Not bad.

It's 2008 now, and the game has changed. We're starting to see a new breed of spam showing up on Blogabond, and it's getting worse every day. This is human-powered spam, delivered by a real person behind a real keyboard someplace where wages are low enough that advertisers can afford to hire rooms of workers to copy and paste comment spam by hand. None of the automated human-detection tricks work against this, because it's not automated anymore.

Time to get Bayesian

Most modern email clients use some form of Bayesian spam filtering, so that's what I figured I needed to implement. Googling "Bayesian C#", I was amazed to find that nobody had put out a Naive Bayesian spam filter for C# that you can simply drop into your codebase. What is the story here? The technique has been around since 2002. Is it really that scary to implement? It must be. Still, moderating Blogabond by hand was getting really annoying, so I decided to give it a shot. You know what? It wasn't really that hard.

I'm not going to go into detail on the algorithm itself. After all, mine is just a straight implementation of Paul Graham's original Naive Bayesian spam filtering algorithm, and I don't pretend to have anything interesting to add to his analysis.

Using the Code

In the zip file attached to this article, you'll find two classes that make the whole thing work. There's a Corpus class that holds a list of words, along with counts of how often each one appears in a given body of text. There's also a SpamFilter class that takes two of those Corpuses (Corpora?), one built from good content and one from spam, and crashes them against each other to produce, for each word, the probability that a document containing it is spam.
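To make the two roles concrete, here is a minimal sketch of word counting and of the per-word probability as Paul Graham's "A Plan for Spam" essay describes it. It is not the code from the zip file: the `TokenStats` class, its method names, and the token pattern are all assumptions made for illustration, and the article's Corpus and SpamFilter classes may differ in detail.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, NOT the article's Corpus/SpamFilter classes.
public static class TokenStats
{
    // Count how often each token appears in a piece of text.
    // The token pattern (letters, digits, $, ', -) is an assumption.
    public static Dictionary<string, int> CountTokens(string text)
    {
        var counts = new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in Regex.Matches(text, @"[A-Za-z0-9$'\-]+"))
        {
            counts.TryGetValue(m.Value, out int n); // n stays 0 for a new token
            counts[m.Value] = n + 1;
        }
        return counts;
    }

    // Per-token spam probability following Graham's essay.
    // goodMessages and badMessages are the number of documents in each
    // corpus and are assumed to be greater than zero.
    // Returns null for tokens seen too rarely to be trusted.
    public static double? SpamProbability(
        int goodCount, int badCount, int goodMessages, int badMessages)
    {
        int g = 2 * goodCount; // good hits count double to bias against false positives
        int b = badCount;
        if (g + b < 5) return null;

        double goodRatio = Math.Min(1.0, (double)g / goodMessages);
        double badRatio = Math.Min(1.0, (double)b / badMessages);
        double p = badRatio / (goodRatio + badRatio);

        // Clamp so that no single word is ever treated as absolute proof.
        return Math.Max(0.01, Math.Min(0.99, p));
    }
}
```

A word seen only in spam comes out at the 0.99 ceiling, a word seen only in good content at the 0.01 floor, and a word that has barely been seen at all is skipped.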

Once you've populated the SpamFilter, you can feed it other documents and ask whether it thinks it's looking at spam. In my testing, I found that it's actually pretty good at it. Out of the 10,000 blog entries I fed it, it produced about 6% false negatives (spam that didn't get flagged as such) and only 0.2% false positives (good messages mistakenly flagged as spam). Actually, it did such a good job that all but one of its "false positives" were genuine spam that had slipped past my attempts at moderation.
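For readers curious how a document gets a single score, the sketch below combines per-word probabilities the way Graham's essay does: keep the words whose probabilities are farthest from a neutral 0.5, then combine them. The `SpamScore` class is hypothetical, and the 15-word limit and the 0.9 cutoff are the values from Graham's essay rather than anything confirmed from this article's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, NOT the article's SpamFilter class.
public static class SpamScore
{
    // Combine per-token probabilities into one document score in [0, 1].
    // Only the maxTokens most "interesting" tokens (farthest from 0.5) are used.
    public static double Combine(IEnumerable<double> tokenProbabilities, int maxTokens = 15)
    {
        var interesting = tokenProbabilities
            .OrderByDescending(p => Math.Abs(p - 0.5))
            .Take(maxTokens);

        double spamProduct = 1.0; // product of p
        double hamProduct = 1.0;  // product of (1 - p)
        foreach (double p in interesting)
        {
            spamProduct *= p;
            hamProduct *= 1.0 - p;
        }

        // With no tokens at all, both products are 1 and the score is a neutral 0.5.
        return spamProduct / (spamProduct + hamProduct);
    }

    // Threshold taken from Graham's essay; tune it against your own data.
    public static bool LooksLikeSpam(double score) => score > 0.9;
}
```

Because every per-word probability is clamped between 0.01 and 0.99 before it gets here, neither product can reach zero and the combined score always stays between 0 and 1.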

I've included a sample WinForms application so that you can see the filter in action. It reads in a couple of text files to populate the SpamFilter with enough good and bad content to make a useful demonstration. I've also provided three sample blog entries to test against. One is an actual blog entry, one is an obvious spam, and the third is a well-written spam that slipped through as a false negative. The sample application lets you edit the text of a message to see the effect of adding more or less "bad" content.

Putting it into Production

This is all great for testing purposes, but how do you go about putting this filter into production? Let me give you a quick rundown of how we're doing this for Blogabond today:

We keep a static SpamFilter object living in memory on the server, thus saving us the trouble of rebuilding one every time we need it. Once a day, a job runs that rebuilds the SpamFilter from the database contents, stores it in memory, and saves out a backup copy via the .ToFile() method. If we ever find the SpamFilter object missing (as a result of the server cycling itself), we'll just pull up the last state using the .FromFile() method.
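The caching scheme described above could be sketched roughly as follows. This is an illustration under assumptions, not Blogabond's code: `FilterCache<TFilter>` is a hypothetical class, and because the exact signatures of `.ToFile()` and `.FromFile()` aren't shown in the article, the save and load steps are passed in as delegates that would wrap those calls.

```csharp
using System;
using System.IO;

// Hypothetical holder; TFilter stands in for the article's SpamFilter class.
public sealed class FilterCache<TFilter> where TFilter : class
{
    private readonly object _gate = new object();
    private readonly Func<TFilter> _rebuild;        // e.g. rebuild from database contents
    private readonly Action<TFilter, string> _save; // e.g. wraps a ToFile-style call
    private readonly Func<string, TFilter> _load;   // e.g. wraps a FromFile-style call
    private readonly string _backupPath;
    private TFilter _current;

    public FilterCache(Func<TFilter> rebuild, Action<TFilter, string> save,
                       Func<string, TFilter> load, string backupPath)
    {
        _rebuild = rebuild;
        _save = save;
        _load = load;
        _backupPath = backupPath;
    }

    // Called by the daily job: rebuild, write the backup, then swap in the new filter.
    public void Rebuild()
    {
        TFilter fresh = _rebuild();
        _save(fresh, _backupPath);
        lock (_gate) { _current = fresh; }
    }

    // Called on every request: if the in-memory copy is gone (for example
    // after the server recycled), fall back to the last saved backup.
    public TFilter Get()
    {
        lock (_gate)
        {
            if (_current == null)
            {
                _current = File.Exists(_backupPath) ? _load(_backupPath) : _rebuild();
            }
            return _current;
        }
    }
}
```

In a web application, a single instance of a holder like this would typically live in a static field so that every request shares the same filter.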

When a new blog entry is saved, we run it through the SpamFilter and set an IsSpam bit on the blog entry as necessary. All of our display code knows to check this bit and either suppress display or deliver a 404 response for entries that have been flagged as spam. There is one exception: for one minute after a new spam entry is posted, we display it as though it weren't spam. This is enough time for the spammer to review the entry and congratulate himself on a job well done, but not enough time for the page to be indexed by search engines. Naturally, we also exclude any spam entries from RSS feeds, sitemaps, and blog ping service requests.
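The display rule above boils down to a small decision function. The sketch below is one way to write it; `SpamVisibility` and its members are hypothetical names, and only the IsSpam flag and the one-minute window come from the description above.

```csharp
using System;

// Hypothetical helper illustrating the display rule; not Blogabond's code.
public static class SpamVisibility
{
    // Window during which a freshly posted spam entry is still shown.
    public static readonly TimeSpan GracePeriod = TimeSpan.FromMinutes(1);

    // Page display: good entries are always shown; spam is shown only
    // while it is younger than the grace period.
    public static bool ShouldDisplay(bool isSpam, DateTime postedUtc, DateTime nowUtc)
    {
        if (!isSpam) return true;
        return nowUtc - postedUtc < GracePeriod;
    }

    // RSS feeds, sitemaps, and blog pings never include spam, even inside the window.
    public static bool IncludeInFeeds(bool isSpam) => !isSpam;
}
```

Passing the current time in as a parameter, rather than reading `DateTime.UtcNow` inside the method, keeps the rule easy to test.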

As part of the administration tools for the site, we have a list of all recent posts along with their status. This lets us quickly flip false positives and negatives back to their correct state, and generally keep a handle on what's going on with the site.


I've dumped this code out onto the Internet for general consumption in the hopes that people will find it useful. I see a lot of blogging and forum sites that are clearly running ASP.NET and are hopelessly overrun by comment spam. With luck, somebody might pick up these two simple classes and turn them into something useful. If you do so, please let me know how it works for you!


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By
Founder Expat Software
United States
Jason Kester is the founder of Expat Software, a small development and consulting house staffed by expatriate Americans in various remote yet comfortable parts of the world. He takes 9 months vacation every year and so should you.

