Introduction

This is a C# implementation of Paul Graham's Naive Bayesian Spam Filter algorithm. It is suitable for incorporation into an ASP.NET Blogging, Forum, Email or Wiki application.

Background

I run a little Travel Blogging website called Blogabond that has been getting more and more attention from spammers over the years. At first, I was able to stem the tide with simple anti-robot measures to reject posts from things that were obviously not Web browsers. Soon after, I had to implement a simple silent human-detection script to run behind the scenes and ensure that a real person was sitting at a real keyboard and typing blog entries in by hand. This approach worked really well for a long time.

Every once in a while, some ambitious travel agency would start posting advertisements that I would have to delete by hand until they got the message that it wasn't working. Still, behind the scenes, about 10,000 automated comment spams were getting knocked out of the sky every day. Not bad.

It's 2008 now, and the game has changed. We're starting to see a new breed of spam showing up on Blogabond, and it's getting worse every day. This is human-powered spam. Delivered in person by a real person behind a real keyboard someplace where wages are low enough that advertisers can afford to hire rooms of workers to copy/paste comment spam by hand. None of the automated human-detection tricks work against this, because it's not automated anymore.

Time to get Bayesian

Modern email clients all use Bayesian spam filtering, so that's what I figured I needed to implement. Googling up "Bayesian C#", I was amazed to find that nobody has put out a Naive Bayesian Spam Filter for C# that you can simply drop into your codebase. What is the story here? The technology has been around since 2002. Is it really that scary to implement? Must be.

Still, it's getting really annoying having to moderate Blogabond by hand. I think I'll give it a shot.

You know what? It wasn't really that hard.
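For orientation only, the scoring step described in Paul Graham's essay "A Plan for Spam" is short enough to sketch here. This is a minimal illustration of that published formula, not the code from the attached zip file: the class name `GrahamScoreSketch` and both method names are hypothetical, and the sketch assumes both message collections are non-empty. Tokens seen too rarely are given the essay's "unknown word" value of 0.4.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not the classes shipped with the article.
public static class GrahamScoreSketch
{
    // Spam probability for one token, following "A Plan for Spam":
    // good counts are doubled to bias against false positives, rarely seen
    // tokens get 0.4, and the result is clamped to [0.01, 0.99].
    // Assumes goodMessages > 0 and badMessages > 0.
    public static double TokenProbability(int goodCount, int badCount,
                                          int goodMessages, int badMessages)
    {
        double g = 2.0 * goodCount;
        double b = badCount;
        if (g + b < 5) return 0.4; // too little evidence: treat as unknown

        double goodRatio = Math.Min(1.0, g / goodMessages);
        double badRatio = Math.Min(1.0, b / badMessages);
        double p = badRatio / (goodRatio + badRatio);
        return Math.Max(0.01, Math.Min(0.99, p));
    }

    // Combine the 15 token probabilities farthest from the neutral 0.5.
    // An empty input yields 0.5 (no evidence either way).
    public static double CombinedProbability(IEnumerable<double> tokenProbabilities)
    {
        List<double> interesting = tokenProbabilities
            .OrderByDescending(p => Math.Abs(p - 0.5))
            .Take(15)
            .ToList();

        double spamProduct = 1.0;
        double hamProduct = 1.0;
        foreach (double p in interesting)
        {
            spamProduct *= p;
            hamProduct *= 1.0 - p;
        }
        return spamProduct / (spamProduct + hamProduct);
    }
}
```

In the essay, a message is treated as spam when the combined value exceeds 0.9. The doubling of good counts and the clamping are what keep a single unusual word from condemning a legitimate post.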
I'm not going to go into detail on the algorithm itself. After all, mine is just a straight implementation of Paul Graham's original Naive Bayesian Spam filtering algorithm, and I don't pretend to have anything interesting to add to his analysis.

Using the Code

In the zip file attached to this article, you'll find two classes that make the whole thing work. There's a […]

Once you've populated the […]

I've included a sample WinForms application so that you can see the filter in action. It has a couple of text files that it reads in to populate the […]

Putting it into Production

This is all great for testing purposes, but how do you go about putting this cool filter live? Let me give you a quick rundown on how we're doing this for Blogabond today:

We keep a static […]

When a new blog entry is saved, we run it through the […]

As part of the administration tools for the site, we have a list of all recent posts along with their status. This lets us quickly flip false positives and negatives back to their correct state, and generally keep a handle on what's going on with the site.

Conclusion

I've dumped this code out onto the Internet for general consumption in the hopes that people will find it useful. I see a lot of blogging and forum sites that are clearly running ASP.NET and are hopelessly overrun by comment spam. With luck, somebody might pick up these two simple classes and turn them into something useful. If you do so, please let me know how it works for you!
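For anyone wiring something similar into their own site, the production workflow described under "Putting it into Production" might look roughly like the sketch below. Every name in it (`SpamGate`, `BlogPost`, `PostStatus`, `SetScorer`, `ClassifyOnSave`, `FlipStatus`) is hypothetical and is not taken from the attached code. A plain `Func<string, double>` stands in for the real filter object, and the 0.9 cutoff is an assumption borrowed from Paul Graham's essay.

```csharp
using System;

// All names here are hypothetical stand-ins, not the article's classes.
public enum PostStatus { Ok, Spam }

public class BlogPost
{
    public int Id { get; set; }
    public string Body { get; set; } = "";
    public PostStatus Status { get; set; }
}

public static class SpamGate
{
    // Assumed cutoff; Graham's essay treats scores above 0.9 as spam.
    public const double SpamThreshold = 0.9;

    private static readonly object Sync = new object();

    // One shared scorer for the whole application, built once and reused.
    // Defaults to a neutral score until a real filter is plugged in.
    private static Func<string, double> scorer = body => 0.5;

    // Swap in a freshly built filter, for example at application start.
    public static void SetScorer(Func<string, double> newScorer)
    {
        if (newScorer == null) throw new ArgumentNullException(nameof(newScorer));
        lock (Sync) { scorer = newScorer; }
    }

    // Called when a new entry is saved: score it and record the verdict.
    public static void ClassifyOnSave(BlogPost post)
    {
        Func<string, double> current;
        lock (Sync) { current = scorer; }
        post.Status = current(post.Body) > SpamThreshold
            ? PostStatus.Spam
            : PostStatus.Ok;
    }

    // Admin tool: flip a false positive or false negative to its correct state.
    public static void FlipStatus(BlogPost post)
    {
        post.Status = post.Status == PostStatus.Spam
            ? PostStatus.Ok
            : PostStatus.Spam;
    }
}
```

Holding the scorer in a static field means every request shares one trained filter instead of rebuilding it per post, and the lock keeps a swap from racing with a classification in progress.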