February 24, 2010
by Justin
As far as I understand it, comment spam can be broken into three main methods:
- Human: a genuine, real live person actually sits there and types a spam message into your comment form – unsurprisingly this is the least common form
- Bot: an automated script scans your website, finds your comment form, generates a spam message and posts it to your site
- Harvester: yet another script scans your website, finds your comment form and copies it – days, weeks or months later the offsite copy of your form is used to start submitting spam messages to your site
We’ll look at some methods for combating each of these in turn.
Only human
Other than comment moderation (wherein you give yourself the power of God over each and every message posted to your site) there’s not a great deal that can be done about human-generated spam. Some filtering services, such as Akismet, will scan the content of a message to determine the likelihood of it being spam and deal with it accordingly. Akismet is closely associated with WordPress, but can be used on other platforms and is free for non-commercial use.
I personally haven’t implemented it on my blog, but it’s doubtless a handy addition for sites that attract heavy traffic. It’s probably worth pointing out that using such a third-party service takes some of the control out of your hands, but is likely a small price to pay if human-generated spam is causing real problems.
You could also set up a login system and force users to create accounts in order to post comments. This might deter spammers, but it might also deter users (it certainly deters me since the last thing I want to do is to create yet another account somewhere just so I can post a brief comment on someone else's site). However, there's always the possibility of using OpenID, which enables people to either use an existing account or create a generic account for all OpenID supported platforms (and here's an interesting article covering how easy it could be for people to use OpenID).
Death to the autobots
You’re probably all too familiar with the godawful Captcha screens (and variants) that appear on many web forms. And, after that sentence, you’re probably also well aware of my feelings towards such techniques. Captcha is based upon the Turing test principle, which is really just a way of testing whether someone’s a real person or a machine. The typical method is to include an element on your web form that requires cognition, as opposed to mere calculation – machines can calculate, but they’re rubbish at thinking, even the ones that look like Schwarzenegger.
Captcha works because you have to look at a picture and determine what words or letters are displayed on the picture. Captcha doesn’t work because, to ensure the picture can’t simply be scanned and OCRed by a machine, the words need to be obscured to such a degree that even a human can barely work out what they say half the time. These days I tend to view the employment of Captcha as a usability failure.
Nevertheless, the idea of using a Turing test is sound. For my blog I use a simple maths sum: I give the user two numbers and ask them to add the two numbers up. This might cause minor issues to those who have genuine problems with basic maths, but a calculator is never far away on any computer. On my initial attempts I stupidly included the answer as a hidden field on the form – needless to say the spam kept rolling in. I quickly fixed that and, to make it a bit harder for bots, the current version requests the answer in digits, but phrases the question using words. Here’s the basic script:
$numbers = array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine");
$a = rand(0,9);
$b = rand(0,9);
$answer = $a+$b;
The word values for the numbers are placed in an array – those familiar with array structures will immediately see that the key for each entry in the array equates to the value (with the first key in a standard array always being 0). In other words, the key for “zero” is 0, the key for “one” is 1, and so on.
On the second and third lines we generate two random numbers between 0 and 9 (as digits, this time). Note that we use a consistent range of 0 to 9 throughout. This range can always be expanded if needed, but the array and the range for the random selector need to match. On the fourth line we calculate the sum of the two random numbers selected: this value is used to verify the answer that the user enters on the form.
We do have a slight problem, however: we need to tell the processing script which answer was generated by the form script, otherwise how can it verify whether a user has entered the correct value? If your processing script is on the same page as your form then it’s not such an issue, but for added security I have the form and processor separated – therefore I need to send both the user-entered answer and the generated (correct) answer to my processing script so that they can be compared.
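To resolve this we use a secret value to ‘salt’ the answer and then we hash it with MD5 (a one-way hash rather than encryption, so the hidden field never contains the answer itself):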
$s_answer = md5($salt.$answer);
The salt value can be anything you want – in the case of my blog it’s partly derived from something specific to both my domain and the blog post itself. Either way, the important thing is to ensure that the value is the same on both the form and processor script.
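A minimal sketch of how such a salt might be derived, assuming a site-wide secret string, the domain name and a numeric post ID as ingredients. The function name `make_salt`, the secret and the example values are all hypothetical and are not the code used on this blog:

```php
<?php
// Hypothetical helper: builds a salt from a site-wide secret, the domain and the post ID.
// The ingredients and the separator are illustrative assumptions.
function make_salt($secret, $domain, $post_id) {
    return md5($secret . '|' . strtolower($domain) . '|' . (int)$post_id);
}

// Both the form script and the processor script call this with the same inputs,
// so they arrive at the same salt without it ever being sent to the browser.
$salt = make_salt('change-this-secret', 'example.com', 42);
```

Because the salt is rebuilt on the server each time, it never has to appear in the form itself.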
The relevant form field looks like this:
echo '<input name="answer" id="answer" />';
echo '<input type="hidden" name="s_answer" value="'.$s_answer.'" />';
echo "<p>Please enter the sum of <strong>".$numbers[$a]." plus ".$numbers[$b]."</strong> in digits (e.g. '19')</p>";
The first line is the input field for the user to type in their answer. The second line contains the ‘salted’ answer as a hidden field. The third line tells the user what they need to do and displays the numbers (from the $numbers array) that need to be added together. Note how we use the randomly generated $a and $b values to pull the equivalent words from the $numbers array. You’ll be able to see the real life example at the bottom of this page.
We process the posted values as follows:
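// Cast to int so only digits are compared; missing fields fall back to values that cannot match
$user_answer = isset($_POST['answer']) ? (int)$_POST['answer'] : -1;
$salt_answer = isset($_POST['s_answer']) ? (string)$_POST['s_answer'] : '';
// Strict comparison avoids PHP's loose-comparison quirks with numeric-looking hash strings
if( md5($salt.$user_answer) !== $salt_answer ) {
    $errmsg[] = "Please answer the security question correctly";
}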
Simply put, if the salted version of the user’s answer does not match the salted answer posted from the hidden field then an error message is generated and the comment is not submitted.
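For anyone who wants to try the idea in isolation, the form and processor halves can be sketched as two small functions. This is an illustration under assumptions rather than the code running on this blog: the names `make_question` and `check_answer`, the returned array layout and the example salt are all made up for the purpose.

```php
<?php
// Sketch only: function names, array keys and the salt are illustrative assumptions.

// Form side: pick two digits, word the question, and hash the salted sum.
function make_question($salt) {
    $numbers = array("zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine");
    $a = rand(0, 9);
    $b = rand(0, 9);
    return array(
        'a'        => $a,
        'b'        => $b,
        'text'     => $numbers[$a] . " plus " . $numbers[$b],
        's_answer' => md5($salt . ($a + $b)),   // goes into the hidden field
    );
}

// Processor side: hash the salted user answer and compare it strictly with the posted hash.
function check_answer($salt, $posted_answer, $posted_hash) {
    $user_answer = (int)$posted_answer;
    return md5($salt . $user_answer) === (string)$posted_hash;
}

$salt = 'example-salt';
$q = make_question($salt);
$ok = check_answer($salt, (string)($q['a'] + $q['b']), $q['s_answer']);  // true for the right sum
```

The processor only ever needs the salt and the two posted fields, which is what lets the form and the processor live in separate scripts.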
If you prefer not to resort to basic maths then you could use a similar technique to ask simple questions (e.g. “What color is the sun? Yellow”). However, this may cause problems if someone decides the sun is white, or if they can’t spell yellow correctly, or if they think you’re referring to the red sun of Krypton, and so on. Maths is less ambiguous.
Harvest of fail
The third main method of spamming blogs - if you’re still awake there - is to harvest (or copy) the comment form itself and subsequently post spam messages from a remote site. To combat this I originally used the HTTP_REFERER header to check that comments were being posted from my own domain, but since the HTTP_REFERER header can be easily altered or omitted (often, ironically, by internet security software) it’s not a reliable method. Accordingly I’ve had to implement a number of other techniques to try and prevent offsite spam.
The first is setting a session variable which contains the URL of the page containing the form:
// session_start() must already have been called on this page
$host = (substr($_SERVER["HTTP_HOST"],0,4) != "http") ? 'http://'.$_SERVER["HTTP_HOST"] : $_SERVER["HTTP_HOST"] ;
$request = $_SERVER["REQUEST_URI"];
$_SESSION['my_referring_page'] = $host.$request;
On your processing page you can simply check whether or not this session has been set and, if not, bounce the user out. As a further measure, should you be worried that the spammer might work out what you’ve named your session and duplicate it (and obviously I’ve used a different name in my actual code to that published above), you can easily check whether the referring page comes from the right domain or not using straightforward string comparison.
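A rough sketch of that processor-side check, assuming the session key `my_referring_page` from the snippet above. The helper name `came_from_own_form` and the host `example.com` are hypothetical, and the sketch uses PHP's `parse_url()` to pull out the host before comparing, rather than comparing the raw string:

```php
<?php
// Hypothetical helper: true only when the stored form URL exists and sits on the expected host.
// It takes the session array as a parameter so it can be tried without a live session.
function came_from_own_form(array $session, $expected_host) {
    if (!isset($session['my_referring_page'])) {
        return false; // the form page was never loaded in this session
    }
    $host = parse_url($session['my_referring_page'], PHP_URL_HOST);
    return is_string($host) && strtolower($host) === strtolower($expected_host);
}

// Usage on the processor page, after session_start():
// if (!came_from_own_form($_SESSION, 'example.com')) { $errmsg[] = "."; }
```

Comparing only the host means the check still passes whichever post the form was on, while a harvested copy posting without the session fails at the first test.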
My second technique involves setting a random key/value pair on the form which is then verified on the processor page. This, once again, uses the random ‘salt’ value defined above. Furthermore, in order to foil spammers who might harvest your form and wait some time before posting spam, the key/value pair is only valid for a maximum of two hours. (In truth, the time limitation was a throwback to when the key/value pair was posted via the form: I now pass the key/value pair in a session, but it does no harm to leave the time limit in.)
Here’s how we generate the key and value:
$now = date('HDY');
$key = md5($now.$salt);
$value = md5($salt.$now);
$_SESSION[$key] = $value;
// values from one hour ago in case user posts on the hour
$prev_now = date('HDY', strtotime('1 hour ago'));
$prev_key = md5($prev_now.$salt);
$prev_value = md5($salt.$prev_now);
$_SESSION[$prev_key] = $prev_value;
You can define the date format for $now and $prev_now in any way you like. In fact, the more obscure the better, as it makes it less likely that a spammer would work out the resulting key/value.
The value I generate for $now includes the hour in order to limit the window – you can opt for a day or a month if you wish, but including a value for minutes and/or seconds will make it almost impossible for anyone to post comments as the key/value pair will expire within minutes and/or seconds. As you’ll see above I also consider the fact that a user might post on or near the hour itself, therefore I also calculate the equivalent key/value for the previous hour.
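To see how the hour-based window behaves, here is a small sketch that builds the pair for an explicit timestamp. The function name `make_pair`, the `$timestamp` parameter and the example salt are assumptions added so the behaviour can be checked; the hashing itself follows the snippet above.

```php
<?php
// Sketch: builds the key/value pair for a given Unix timestamp.
// The function name and the explicit $timestamp parameter are illustrative assumptions.
function make_pair($salt, $timestamp) {
    $stamp = date('HDY', $timestamp);   // hour, short day name, year, e.g. "14Wed2010"
    return array(md5($stamp . $salt), md5($salt . $stamp));
}

$salt = 'example-salt';
$t = mktime(14, 5, 0, 2, 24, 2010);               // 24 Feb 2010, 14:05 in the server's timezone
list($key, $value) = make_pair($salt, $t);
list($same_key)    = make_pair($salt, $t + 600);  // 14:15, same hour, so the same key
list($next_key)    = make_pair($salt, $t + 3600); // 15:05, next hour, so a different key
```

One thing this makes visible: the 'HDY' format has no day of the month in it, so the same pair comes round again at the same hour a week later. Adding `d` and `m` to the format string would avoid that.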
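Remember – the key/value calculation above must also be repeated on your processor script so the pair can be verified, but without the two $_SESSION assignments (setting them again on the processor would make the check pass every time). And here’s how we do that part: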
if( isset($_SESSION[$key]) && $_SESSION[$key] === $value ) {
    // form passed antispam - unset the session
    unset($_SESSION[$key]);
    unset($_SESSION[$prev_key]);
} else {
    // give them an hour's grace
    if( isset($_SESSION[$prev_key]) && $_SESSION[$prev_key] === $prev_value ) {
        // ok we still passed security - unset the session
        unset($_SESSION[$key]);
        unset($_SESSION[$prev_key]);
    } else {
        // otherwise probably a spammer - don't tell 'em!
        $errmsg[] = ".";
    }
}
Notice that we don’t give a particularly detailed error message. We do want to generate an error in order to prevent the comment from being posted, but we don’t really want to explain to the spammer that he’s been caught, or indeed how he’s been caught – that would just help him work out how to circumvent the spam trap.
Final notes
I should point out that I don’t get particularly heavy traffic on my blog, so I can’t claim that these techniques will keep really determined, frequent, or excessive spammers at bay. However, even my blog, when I first set it up a few years ago, attracted several spam messages a month – sometimes more. I also noticed that once a spammer targeted a particular article more spam comments would almost certainly follow.
Now that I’ve implemented these techniques I don’t get any spam. I’ve possibly had one or two that are clearly human-generated spam, but the days of logging into my admin pages to see dozens of automated spam messages attached to my blog posts are long gone.
If you think there’s a flaw in any of the techniques described above I really, really want to hear about it – in fact, that’s one of the main reasons I’ve written this. Of course, I’d also love to hear if you think any of the above is likely to be of use to you. I haven’t attached the completed scripts to this article because you’d likely need to do so much work trimming them to fit your site that it would be a fruitless exercise. However, if there is demand I will look into uploading a basic version of the complete form and processor.
Posted:
February 24, 2010 at 16:19
Filed under:
Web Design
Author:
Justin (contact)
Last edit:
February 25, 2010 - 12:21
JayZee March 1, 2010 - 18:01
Brilliant article. I've been toying with the idea of using emotion as verification. For example, pointing the user to a news article (or forcing them to read a sentence) and asking them if it made them happy or sad. eg "Cute little puppies in a washing machine = sad" but "washing cute little puppies = happy"