
A Plan for Spam

By Sergey Alexandrovich Kryukov, 2 Jan 2013
 

We are currently under heavy pressure from a narrow group of "TV and Media" spammers who cynically challenge our ability to resist this kind of abuse. Members of CodeProject are making a remarkable effort to exterminate these unwanted parasites, but the measures taken so far do not seem quite satisfactory. This short article grew out of a discussion between Chris Maunder and myself about what we can do:
http://www.codeproject.com/Messages/4462716/Re-Live-streamers.aspx,
http://www.codeproject.com/Messages/4462726/Re-Live-streamers.aspx.

Several hours later, a fresh idea came to my mind, a variant of the ideas we had already discussed. I would ask interested members to think about it, discuss it, and criticize or support it. Generally, we need some brainstorming to help Chris and others arm the site with improved protection against spam, in a way that does not threaten legitimate members and does not add too much overhead to using and maintaining the site.

I'm coming back to the idea of Bayesian filtering. I used it successfully on my e-mail a while ago but eventually replaced it with my own approach (this is not the place to discuss it, because it cannot be applied to the site). I think Bayesian filtering did not become dominant in e-mail services for some natural reasons, such as the overhead for the human operator or user and the unavoidable false negatives and false positives of the method. However, I'm starting to think that, with a special twist (which can be discussed further), we can apply it to the protection of CodeProject.

This short article is named after the article "A Plan for Spam" by Paul Graham: http://www.paulgraham.com/spam.html.

See also another article: http://www.paulgraham.com/better.html.

I think that after reading these articles the idea will be clear enough.

As to the implementation, please look at this open-source product: http://nbayes.codeplex.com.

And this is a CodeProject article: A Naive Bayesian Spam Filter for C#.

That was just to demonstrate that the implementation won't be a big problem.

Still, the problem is: how do we decide on the cancellation of a spammer's account? Don't we face the same problems: false negatives and false positives, and an excessive amount of intervention by the administrator? Remember that I pointed out the main problem with the workload put on a human administrator: the required chores are not automated, or not optimized to meet the goals.

Now, here is the main idea:

Let's invert the situation socially. Instead of making the decision to cancel an offender's account, let's make the potential offender apply for the "legalization" of a potentially spamming post. Hold on! Don't reject this idea from the very beginning, before I explain how it may look in practice. I'm going to demonstrate that this can be done gently enough.

First of all, let's remember the starting point. At the starting point, the filter is empty (or all available filters are empty), so, without intervention by members who care about exterminating spam, nothing is ever filtered out. The filters start to populate as members spot spam and report it as such. There should be a special reporting action for spam, which feeds the spam content into a filter. The filter starts populating and gradually acquires the ability to detect spam content automatically. Yes, with some false positives and false negatives. For the details of this process, please go back to the articles by Paul Graham.
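To make the training loop concrete, here is a minimal sketch of a filter that starts empty and learns from member reports. It is a simplified naive Bayes classifier with Laplace smoothing, not the exact formula from Paul Graham's articles, and every name in it (`SpamFilter`, `report_spam`, `report_ham`, `spam_probability`) is hypothetical. It also assumes that some legitimate posts are fed in as counter-examples, which the text above does not spell out.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9$'-]+")


def tokenize(text):
    """Lowercase the text and return the set of distinct tokens in it."""
    return set(TOKEN_RE.findall(text.lower()))


class SpamFilter:
    """Hypothetical filter fed by member reports; it starts empty."""

    def __init__(self):
        self.spam_tokens = Counter()  # token -> number of spam posts containing it
        self.ham_tokens = Counter()   # token -> number of legitimate posts containing it
        self.spam_posts = 0
        self.ham_posts = 0

    def report_spam(self, text):
        """Called by the special 'report as spam' action."""
        self.spam_tokens.update(tokenize(text))
        self.spam_posts += 1

    def report_ham(self, text):
        """Called for posts known to be legitimate (an assumption of this sketch)."""
        self.ham_tokens.update(tokenize(text))
        self.ham_posts += 1

    def spam_probability(self, text):
        """Estimate P(spam | text) under the naive independence assumption."""
        if self.spam_posts == 0 or self.ham_posts == 0:
            return 0.0  # an untrained filter never flags anything
        # Start from the prior log-odds, then add evidence from each token.
        log_odds = math.log(self.spam_posts / self.ham_posts)
        for token in tokenize(text):
            # Laplace smoothing keeps unseen tokens from producing log(0).
            p_spam = (self.spam_tokens[token] + 1) / (self.spam_posts + 2)
            p_ham = (self.ham_tokens[token] + 1) / (self.ham_posts + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so that math.exp cannot overflow on very long posts.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

A post would then be held back when `spam_probability` exceeds some threshold; the right threshold is a tuning decision and is not specified here.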

As a first step, content detected as potential spam is not placed on the CodeProject content page (Questions & Answers, or something else). Instead, the potential offender gets a message on a page, something like this:

CodeProject informs:
Sorry, we cannot place your post immediately. It contains some content detected by our filters as potential spam. The detection was based on previous spam reports by CodeProject members. If you believe this is not spam, you will need to post your explanation here: [URL]

The content goes to the database. At the request of the potential spammer, a page with the legalization form is generated, and the report goes to the database, where the status of the prospective post is stored. Again, this should not happen often, and legitimate members posting their messages will almost never get this message. I know this from my experience with Bayesian filtering for e-mail.
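The stored status of a prospective post could be modeled roughly as below. This is only a sketch: the status names, the `Post` record, the `SPAM_THRESHOLD` value and both functions are assumptions made for illustration, and a real implementation would keep these records in the site's database rather than in memory.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PostStatus(Enum):
    PUBLISHED = "published"            # visible on the site
    HELD = "held"                      # flagged by the filter, not shown
    APPEAL_PENDING = "appeal_pending"  # author has asked for legalization
    REMOVED = "removed"


@dataclass
class Post:
    author: str
    body: str
    status: PostStatus = PostStatus.PUBLISHED
    appeal_text: Optional[str] = None


SPAM_THRESHOLD = 0.9  # assumed cut-off, to be tuned


def submit(post, spam_probability):
    """Publish the post, or hold it if the filter considers it likely spam."""
    if spam_probability >= SPAM_THRESHOLD:
        post.status = PostStatus.HELD
    else:
        post.status = PostStatus.PUBLISHED
    return post


def request_legalization(post, explanation):
    """Record the author's explanation and queue the post for review."""
    if post.status is not PostStatus.HELD:
        raise ValueError("only held posts can be submitted for legalization")
    post.appeal_text = explanation
    post.status = PostStatus.APPEAL_PENDING
    return post
```

Held posts that nobody appeals simply stay in the `HELD` state, which matches the expectation that real spammers will rarely bother to fill in the form.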

Now, at the request of the administrator, all the filtered members' messages are generated on a single page. Usually, one glance at the messages will be enough to judge whether this is spam or not. Importantly, it is quite unlikely that real spammers will plead for legalization of their content. So, I think that the most typical action will be "Yes to all" (pretty much like in the movie "Bruce Almighty", 2003; no, this is not spam, I have no interest in promoting this commercial product and cited it only to illustrate the protection method; I plead for legalization of this post :) ). Of course, this "Yes to all" is applied to the posts awaiting approval/legalization. And it will be equally easy to have a single button "Remove all offending posts and member accounts" for all checked items.
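The two bulk actions on that page might be sketched as follows. The queue layout (a dictionary of post IDs to small records), the status strings and both function names are hypothetical; the point is only that each button maps to one short loop.

```python
def approve_all(queue):
    """'Yes to all': publish every post that is awaiting legalization.

    Returns the list of post IDs that were published.
    """
    approved = []
    for post_id, post in queue.items():
        if post["status"] == "appeal_pending":
            post["status"] = "published"
            approved.append(post_id)
    return approved


def remove_checked(queue, checked_ids):
    """Remove the checked posts and collect their authors for account removal.

    Returns the set of author names whose accounts should be closed.
    """
    offenders = set()
    for post_id in checked_ids:
        post = queue.get(post_id)
        if post is None:
            continue  # the post may have been handled by someone else already
        post["status"] = "removed"
        offenders.add(post["author"])
    return offenders
```

Note that `approve_all` deliberately skips posts whose authors never asked for legalization, so a held post is not published by accident.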

If you picture it clearly, you will see that this procedure will be much easier than what we have now.

Access to this approval/legalization and member extermination procedure is a matter for some discussion. This aspect is not as important. I would suggest that the right to the final extermination of offenders' accounts be left to the administration, while the right to legalize a post and the right to exterminate an offender's post (from this page; that right already exists on the question page in the Questions & Answers forum) could be granted to members with some level of reputation.
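One possible reading of this split of rights, as a sketch only: the reputation threshold `MODERATOR_REPUTATION`, the action names and the function are all invented for illustration, and the actual level is exactly the kind of thing that needs discussion.

```python
MODERATOR_REPUTATION = 5000  # hypothetical threshold, open to discussion


def allowed_actions(is_admin, reputation):
    """Return the set of review-page actions available to a member."""
    actions = set()
    if is_admin or reputation >= MODERATOR_REPUTATION:
        # Trusted members may legalize a post or remove an offending post.
        actions.update({"legalize_post", "remove_post"})
    if is_admin:
        # Closing an account stays with the administration.
        actions.add("remove_account")
    return actions
```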

Please discuss this idea and share your ideas. Maybe we can come up with some variant of my approach or something completely different.

Thank you for your attention to this rather unpleasant matter and for the effort already made to sustain the site.

—SA

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Sergey Alexandrovich Kryukov
Architect
United States
Member
No Biography provided


Comments and Discussions

 
General: My vote of 5 | mvp Michael Haephrati | 8 Mar '13 - 2:27
The idea is brilliant!
General: Re: My vote of 5 | mvp Sergey Alexandrovich Kryukov | 8 Mar '13 - 5:39
Thank you Michael.
 
In fact, there are questionable points which need to be thoroughly considered, but I hope that, with decent work and proper support, such a project could have everything it needs to succeed or to improve the situation considerably. All the issues are in the trade-off between the quality of filtering and the effort needed for support.
 
—SA

Sergey A Kryukov

Question: It's not a bad idea, but it can be taken further. | protector Pete O'Hanlon | 8 Jan '13 - 6:03
I've used Naive Bayes with some moderate success in the past, and it definitely has a place, but as you point out the big issue is how to decide on the automatic deletion of an account (poor Michael Martin had his account closed three times in the same day with the filter that Chris put in place last year).
 
Yes to a moderation queue, but this still leaves the issue that spammers create new accounts and post new messages. As an idea, to keep the noise down:
 
When a message is identified as spam, it goes into the moderation queue - but it is still visible, in place, for the account that created it. Other accounts don't see it in that place, they only see it in the moderation queue.
Sufficient votes - remove the message from the moderation queue. If the votes are for it not being spam, then all users get to see it in place. If the votes are for it being spam, then only the OP will be able to see it, don't remove it.
 
This isn't a new idea - FogBugz already implements a feature just like this - but it is surprisingly effective.

*pre-emptive celebratory nipple tassle jiggle* - Sean Ewington

"Mind bleach! Send me mind bleach!" - Nagy Vilmos

CodeStash - Online Snippet Management | My blog | MoXAML PowerToys | Mole 2010 - debugging made easier
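The visibility rules described in the comment above can be sketched in a few lines. This is an illustration under assumptions, not how FogBugz or CodeProject implement it: the state names, the `VOTES_NEEDED` value and the function names are all hypothetical.

```python
VOTES_NEEDED = 3  # assumed number of votes that settles a case


def visible_in_place(post, viewer):
    """A flagged post stays visible in place to its author only."""
    return post["state"] == "published" or viewer == post["author"]


def visible_in_queue(post):
    """Everyone with access to the moderation queue sees undecided posts."""
    return post["state"] == "in_queue"


def tally(post):
    """Move a post out of the moderation queue once enough votes are in."""
    if post["state"] != "in_queue":
        return post["state"]
    if post["not_spam_votes"] >= VOTES_NEEDED:
        post["state"] = "published"  # now visible to all users in place
    elif post["spam_votes"] >= VOTES_NEEDED:
        post["state"] = "spam"       # kept, but only the author can see it
    return post["state"]
```

Because a spam post is never deleted from its author's view, the spammer gets no signal that the post failed, which is presumably why the approach keeps the noise down.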

Answer: Re: It's not a bad idea, but it can be taken further. | mvp Sergey Alexandrovich Kryukov | 8 Mar '13 - 5:41
You have some good points here, thank you for explaining them. Perhaps we need to discuss it further.
 
—SA

Sergey A Kryukov

Question: Already tried - but maybe not well enough | admin Chris Maunder | 7 Jan '13 - 15:07
We actually implemented Bayesian spam filtering a couple of months back and it performed terribly. The main issue was that our filter needed to be "trained" on each application start, so we necessarily had to use a small set of messages for training, which led to predictably terrible results.
 
I do like the idea of putting "suspect" messages in an approval queue. Assuming members will be responsive to that queue, it goes a long way to solving the spam issue.
 
I'll be honest and say right now we have no resources to implement a decent spam filter. Would it make sense to open this up as a contest? Setting up an approval page should take me only a few hours, and plugging in a decent spam filter should be easy since we have the plumbing still in place. The only bits we'd need are the training, and the persistence of training results. I'm guessing that 5 minutes searching would get us all that.
cheers,
Chris Maunder
 
The Code Project | Co-founder
Microsoft C++ MVP
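The "persistence of training results" mentioned in the comment above could be as simple as writing the token counts to disk, so the filter does not have to be retrained on each application start. The sketch below assumes the filter keeps two token counters and two post totals; the JSON file layout and both function names are hypothetical.

```python
import json
from collections import Counter


def save_training(path, spam_tokens, ham_tokens, spam_posts, ham_posts):
    """Write the filter's counts to a JSON file."""
    state = {
        "spam_tokens": dict(spam_tokens),
        "ham_tokens": dict(ham_tokens),
        "spam_posts": spam_posts,
        "ham_posts": ham_posts,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f)


def load_training(path):
    """Read the counts back; return an empty filter state if no file exists."""
    try:
        with open(path, encoding="utf-8") as f:
            state = json.load(f)
    except FileNotFoundError:
        return Counter(), Counter(), 0, 0
    return (
        Counter(state["spam_tokens"]),
        Counter(state["ham_tokens"]),
        state["spam_posts"],
        state["ham_posts"],
    )
```

On a busy site a database table would likely be a better home for these counts than a file, but the idea is the same: load once at start-up, update incrementally as reports arrive.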

General: Re: Already tried - but maybe not well enough | member SoMad | 9 Jan '13 - 11:47
I noticed something the other day (Saturday? It's all a little blurry). There were spam posts flooding into QA and I have no idea how many I flagged and deleted, but a couple of times they got to sit for a while (I have to walk the dogs, you know :^) ) and when I came back and tried red-flagging them, they still had just 1 report.
 
At the same time, a couple of messages were posted in the forums, but they got killed off very rapidly. Did you have a filter running in the forums or were the few online members just more active there?
 

Soren Madsen
"When you don't know what you're doing it's best to do it quickly" - Jase #DuckDynasty

General: Re: Already tried - but maybe not well enough | mvp Sergey Alexandrovich Kryukov | 9 Jan '13 - 13:04
No, I did not use the filter in a forum, only for e-mail, and I have general experience with this kind of filter. Please look at my article. The point is different: it can be more effective for the forum. The problem is not that there is too much spam and not enough active members to withstand it, but that it needs some automation; otherwise it will overwhelm us. I think my proposal is reasonable.
 
And let's not be lulled too much by the present low spamming activity. It may get worse.
 
Thank you,
—SA

Sergey A Kryukov

General: Re: Already tried - but maybe not well enough | member SoMad | 9 Jan '13 - 13:14
That is not what I meant, I did read your article.
I was asking Chris if he had any kind of filter set up for the forums that day.
 
Soren Madsen
"When you don't know what you're doing it's best to do it quickly" - Jase #DuckDynasty

General: Re: Already tried - but maybe not well enough | mvp Sergey Alexandrovich Kryukov | 9 Jan '13 - 13:15
OK, I answered your question; I hope you can think about my suggestions and offer some ideas, criticism, etc.
—SA

Sergey A Kryukov

General: Re: Already tried - but maybe not well enough | admin Chris Maunder | 9 Jan '13 - 14:32
I was playing with some search-and-destroy scripts.
 
They were nicely effective.
cheers,
Chris Maunder
 
The Code Project | Co-founder
Microsoft C++ MVP

Last Updated 2 Jan 2013
Article Copyright 2013 by Sergey Alexandrovich Kryukov
Everything else Copyright © CodeProject, 1999-2013