This is a C# implementation of Paul Graham's Naive Bayesian spam filter algorithm. It is suitable for incorporation into an ASP.NET blogging, forum, email, or wiki application.
I run a little Travel Blogging website called Blogabond that has been getting more and more attention from spammers over the years. At first, I was able to stem the tide with simple anti-robot measures to reject posts from things that were obviously not Web browsers. Soon after, I had to implement a simple silent human-detection script to run behind the scenes and ensure that a real person was sitting at a real keyboard and typing blog entries in by hand.
This approach worked really well for a long time. Every once in a while, some ambitious travel agency would start posting advertisements that I would have to delete by hand until they got the message that it wasn't working. Still, behind the scenes, about 10,000 automated comment spams were getting knocked out of the sky every day. Not bad.
It's 2008 now, and the game has changed. We're starting to see a new breed of spam showing up on Blogabond, and it's getting worse every day. This is human-powered spam. Delivered in person by a real person behind a real keyboard someplace where wages are low enough that advertisers can afford to hire rooms of workers to copy/paste comment spam by hand. None of the automated human-detection tricks work against this, because it's not automated anymore.
Time to get Bayesian
Modern email clients all use Bayesian spam filtering, so that's what I figured I needed to implement. Googling up "Bayesian C#", I was amazed to find that nobody has put out a Naive Bayesian Spam Filter for C# that you can simply drop into your codebase. What is the story here? The technology has been around since 2002. Is it really that scary to implement? Must be. Still, it's getting really annoying having to moderate Blogabond by hand. I think I'll give it a shot. You know what? It wasn't really that hard.
I'm not going to go into detail on the algorithm itself. After all, mine is just a straight implementation of Paul Graham's original Naive Bayesian Spam filtering algorithm, and I don't pretend to have anything interesting to add to his analysis.
Using the Code
In the zip file attached to this article, you'll find two classes that make the whole thing work. There's a Corpus class that holds lists of words, along with counts of how often they appear in a given piece of text. There's also a SpamFilter class that takes two of those Corpuses (Corpi? Corpora?) and crashes them against each other to produce a list of probabilities that a document containing a given word will be spam.
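To give a feel for what that probability table contains, here is a minimal sketch of Graham's per-word formula. This is an illustration of the algorithm from his original article, not the attached classes' actual API; the method and parameter names are my own, and the doubling of the good count is Graham's bias against false positives.

```csharp
using System;

class GrahamProbability
{
    // Probability that a document containing this word is spam.
    // goodCount/badCount: occurrences of the word in each corpus.
    // totalGood/totalBad: total message counts behind each corpus.
    public static double WordProbability(
        int goodCount, int badCount, int totalGood, int totalBad)
    {
        // Good occurrences are doubled, per Graham, to bias the
        // filter away from false positives.
        double g = Math.Min(1.0, 2.0 * goodCount / (double)totalGood);
        double b = Math.Min(1.0, badCount / (double)totalBad);

        if (g + b == 0.0)
            return 0.4; // Graham's default for words we know nothing about

        double p = b / (g + b);
        return Math.Max(0.01, Math.Min(0.99, p)); // clamp to [0.01, 0.99]
    }
}
```

So a word seen only in spam ends up clamped at 0.99, and a word seen only in good messages at 0.01.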
Once you've populated the SpamFilter, you can feed it other documents and ask whether it thinks it's looking at spam. In my testing, it was actually pretty good at this. Out of the 10,000 blog entries I fed it, about 6% were false negatives (spams that didn't get flagged as such) and only 0.2% were false positives (good messages mistakenly flagged as spam). In fact, it did such a good job that all but one of its "false positives" were actually genuine spam that had slipped past my attempts at moderation.
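The classification step can be sketched like this. This is my assumption about how Graham's combining rule looks in code, not the attached SpamFilter's internals: take the fifteen tokens whose probabilities sit furthest from the neutral 0.5, and combine them with Bayes' rule.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Combine
{
    // wordProbs: per-word spam probabilities for the tokens in a document.
    public static double SpamScore(IEnumerable<double> wordProbs)
    {
        // The fifteen most "interesting" tokens: furthest from 0.5.
        var interesting = wordProbs
            .OrderByDescending(p => Math.Abs(p - 0.5))
            .Take(15)
            .ToList();

        // Naive Bayes combination of the selected probabilities.
        double prod = interesting.Aggregate(1.0, (acc, p) => acc * p);
        double inv  = interesting.Aggregate(1.0, (acc, p) => acc * (1.0 - p));
        return prod / (prod + inv); // Graham treats > 0.9 as spam
    }
}
```

A document full of 0.99-probability words scores essentially 1.0, while a single neutral word scores exactly 0.5.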
I've included a sample WinForms application so that you can see the filter in action. It has a couple of text files that it reads in to populate the SpamFilter with enough good and bad content to make a useful demonstration. I've also provided three sample blog entries to test against: one is an actual blog entry, one is an obvious spam, and another is a well-written spam that snuck through as a false negative. The sample application lets you edit the text of a message to see the effect of adding more or less "bad" content.
Putting it into Production
This is all great for testing purposes, but how do you go about taking this filter live? Let me give you a quick rundown on how we're doing it on Blogabond today:
We keep a static SpamFilter object living in memory on the server, thus saving us the trouble of rebuilding one every time we need it. Once a day, a job runs that rebuilds the SpamFilter from the database contents, stores it in memory, and saves out a backup copy via the .ToFile() method. If we ever find the SpamFilter object missing (as a result of the server cycling itself), we just reload the last state from that saved file.
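That caching pattern can be sketched as follows. The Filter class here is a stand-in for the article's SpamFilter (only the bits the sketch needs), and the rebuild logic and backup path are assumptions for illustration:

```csharp
using System;
using System.IO;

// Stand-in for the article's SpamFilter, with a .ToFile()-style backup.
class Filter
{
    public string State;
    public void ToFile(string path) => File.WriteAllText(path, State);
    public static Filter FromFile(string path) =>
        new Filter { State = File.ReadAllText(path) };
}

static class FilterCache
{
    // Static: lives in memory for the lifetime of the application.
    public static Filter Current;
    static readonly string BackupPath =
        Path.Combine(Path.GetTempPath(), "spamfilter.bak");

    // Run once a day: rebuild from database contents and save a backup.
    public static void DailyRebuild(string databaseContents)
    {
        Current = new Filter { State = databaseContents };
        Current.ToFile(BackupPath);
    }

    // If the server recycled and the static object is gone, restore it
    // from the backup instead of rebuilding from the database.
    public static Filter Get()
    {
        if (Current == null)
            Current = Filter.FromFile(BackupPath);
        return Current;
    }
}
```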
When a new blog entry is saved, we run it through the SpamFilter and set an IsSpam bit on the blog entry as necessary. All of our display code knows to check this bit and either suppress display or deliver a 404 response for entries that have been flagged as spam. With one exception: for one minute after a new spam entry is posted, we display it as though it weren't spam. This is enough time for the spammer to review the entry and congratulate himself on a job well done, but not enough time for the page to be indexed by search engines. Naturally, we also exclude any spam entries from RSS feeds, sitemaps, and blog ping service requests.
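The display rule boils down to one small check. This is a sketch of the logic described above, not Blogabond's actual code; the names are mine:

```csharp
using System;

static class SpamDisplay
{
    // Should this entry be shown to the public right now?
    public static bool ShouldDisplay(
        bool isSpam, DateTime postedUtc, DateTime nowUtc)
    {
        if (!isSpam) return true;

        // One-minute grace window: long enough for the spammer to see
        // the post, too short for search engines to index it.
        return (nowUtc - postedUtc) < TimeSpan.FromMinutes(1);
    }
}
```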
As part of the administration tools for the site, we have a list of all recent posts along with their status. This lets us quickly flip false positives and negatives back to their correct state, and generally keep a handle on what's going on with the site.
I've dumped this code out onto the Internet for general consumption in the hopes that people will find it useful. I see a lot of blogging and forum sites that are clearly running ASP.NET and are hopelessly overrun by comment spam. With luck, somebody might pick up these two simple classes and turn them into something useful. If you do so, please let me know how it works for you!