
Makosuke
Original poster
Just wondering if folks have any favorites/recommendations for formmail-type scripts, in particular ones that have some captcha integration.

I've got a site with some relatively large forms that, to date, I've been feeding to a basic formmail app provided by my webhost. This, unsurprisingly, leads to preposterous amounts of spam. A nice little captcha would, theoretically at least, help with this, and a proper hard-coded script would avoid exposing an email address in the page code.

I like the look of ReCaptcha (not to mention that it actually tries to do something useful with the needless work), and a search turned up this script as one that has built-in ReCaptcha support. It seems mature enough that it's probably not a security risk, and it's almost perfect for what I want, except the advanced "manual form" option doesn't auto-refill the fields if the submit fails (that is, if the captcha is botched or a required field is omitted).

That's the deal breaker, since the forms I'm doing are pretty large, with a lot of inline instructions; I need the customization, but would hate for a small error to force a bunch of re-entry.

Suggestions?
 
I'm planning on writing up an article on this, but it won't be ready for a while since I get sidetracked often. In the meantime, I'll point to a resource if you're interested in programming the form yourself.

http://phpsec.org/projects/

There's a PDF there that describes the techniques spammers use on these forms, and ways to protect them without using CAPTCHA, since CAPTCHA tends to have accessibility issues. I personally use an equation-style CAPTCHA (though technically not a true CAPTCHA by definition), and I don't get any spam at all, though I admit there aren't many attempts either. I don't claim it's perfect, as nothing is, but I have built some very strong techniques for protecting forms. If I get my article written soon I'll post it. There are some good articles out there, though, if you're up for doing some programming; I'm not sure of your level of knowledge of, and comfort with, PHP.
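For anyone curious what an equation-style check looks like, here's a minimal sketch. It's in Python just to show the logic (you'd port it to PHP for a real form), and the function names are made up for illustration:

```python
import random

def make_challenge():
    """Generate a simple arithmetic question and its answer.

    The question text goes into the form page; the expected answer is
    stored server-side (e.g. in the session), never in the page source.
    """
    a, b = random.randint(1, 9), random.randint(1, 9)
    return f"What is {a} + {b}?", a + b

def check_answer(submitted, expected):
    """Accept the submission only if the stored answer matches."""
    try:
        return int(submitted) == expected
    except (TypeError, ValueError):
        return False

question, answer = make_challenge()
# Render `question` next to a text input; on submit, compare the posted
# value against the answer saved in the session with check_answer().
```

Because the answer never appears in the HTML, a bot that only scrapes the page source has nothing to copy from.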
 
Roll-your-own is definitely one of the things I've considered, but I'm not strong at all with PHP or Perl scripting. Moreover, even if I could muddle through it, the potential downside of a security flaw in such a script is large enough to make this an option of last resort for me, particularly given how much of a target this sort of thing has become.

Given the choice between sorting through a 100:1 spam-to-real submission ratio and a potential security breach, I can't say taking the risk isn't tempting, but better my annoyance than a real problem. That's why I'm looking at a simple captcha to reduce the noise.

I'm no fan of captcha myself, and I'm not married to the visual style at all, but it seems a reasonable means of reducing formspam (or whatever it's called).

Interesting side note on creative captchas: I've seen a few Japanese sites that have "anti-illiterate" captchas for certain features. Basically they're trying to avoid people trawling for images or video downloads who can't actually read the site (which I find annoying, if somewhat understandable), so they have a *very* simple pop-culture question that any Japanese person would be able to answer but would totally befuddle a machine translation, or even a functionally literate foreigner without much pop-culture exposure.
 
Just a small note concerning using other people's form scripts that you find: a lot of them do very little in terms of security, or at least reasonable security. You may have to pay for a decent script. Reading the article linked above will at least give you some idea of what a script should be checking for, so you can make a more informed decision about whether someone else's script is doing enough for you.
 
I realize this isn't exactly the criteria you were asking for, but you should check out Mollom. It was started by Dries Buytaert, the original creator of the Drupal CMS. Though initially only available for Drupal installations, they have since released the API, and modules for PHP, Ruby, .NET, and a bunch of others are available.

I really like their approach and the concept. See if there's something for you there.
 
In case you hadn't seen my reply, I thought I would bring Mollom back to people's attention; I'd be curious to know angelwatt's opinion of it. I've used it on a Drupal blog before, but as it's a smaller blog, the spam bots don't seem to have found it yet.

This service might be useful to others here too, so I thought it deserved a bump. Disclosure: I'm in no way affiliated with Mollom or Dries Buytaert, just a Drupal user.
 
I'd be curious to know angelwatt's opinion of it.

I haven't used Mollom myself, but from the feature list and what I could tell from their site, here's my opinion.

  • They host the service on their own servers, which I see as good and bad. Bad in that I don't like placing that much trust in a company, and also that connecting to another server could increase load times for my pages. It's good, though, in that as they make updates you don't have to re/install anything.
  • They only produce a CAPTCHA when it's unclear whether the message is spam. That's nice, but the CAPTCHA tends to be one of the stronger catches for spam, so serving it only after the filter is already suspicious could let more spam through.
  • They provide an audio CAPTCHA as well for visually impaired users, a big plus.
  • I like that they're coming up with a way of creating a quality score for messages. This is something I've been pondering tinkering with; it would potentially be powerful in detecting spam. I already make use of the idea in a way when I search for nonsense words using regular expressions. It works very well so far, even though my scoring is only boolean for the time being.
  • Their average efficiency percentage seems impressive, but with something like this a median score would have more meaning, since the average could be skewed by a lot of sites with very little exposure (very few spam bots trying to send spam through them). I'm not saying they're misleading anyone, but I've taken a whole lot of statistics courses, so I know how easy it is to show the results that best reflect what you want.
  • The mention of supporting OpenID, and using it to keep track of reputations, sounds interesting and is something to keep an eye on as it progresses for Mollom and other sites that take advantage of OpenID.
Overall it seems like a nice service. From the feature list alone I can't determine what kind of user experience it provides, which is very important, though that's still left in large part up to the site owner to create a good, usable form. Monitoring your site's statistics and reporting spam that got through also needs a good user experience, which I can't judge from the feature list; they have no screenshots.
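On the nonsense-word idea mentioned above: a boolean regex score can be surprisingly effective. Here's an illustrative sketch in Python (the thread's context is PHP, but the regexes translate directly); the specific patterns are hypothetical examples, and real ones would be tuned to the spam you actually receive:

```python
import re

# Hypothetical heuristics for flagging gibberish or link-stuffed text.
PATTERNS = [
    re.compile(r"[bcdfghjklmnpqrstvwxz]{6,}", re.I),  # long consonant runs
    re.compile(r"\[url=", re.I),                      # BBCode link markup in a plain form
    re.compile(r"(https?://\S+\s*){3,}", re.I),       # piles of raw links
]

def looks_like_spam(text):
    """Boolean score, as the post describes: any single hit flags the message."""
    return any(p.search(text) for p in PATTERNS)
```

A weighted version (each pattern contributing points toward a threshold) would be the natural next step toward the kind of quality score Mollom advertises.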
 
I personally think one of the most effective and simple ways to protect a form is to use invisible form elements as honeypots. It is completely transparent to the user, but a bot will go for it.

For example, say you have a form that collects name, email address, and a comment, and you want to avoid spam bots. Instead of adding a captcha, which is annoying to users and potentially readable by bots anyway, you set up your form like this:

Code:
<p>
<label for="xyzname">Name</label> <input type="text" id="xyzname" name="xyzname" />
</p>

<p>
<label for="xyzemail">Email</label> <input type="text" id="xyzemail" name="xyzemail" />
</p>

<p>
<label for="xyzcomment">Comment</label> <textarea id="xyzcomment" name="xyzcomment"></textarea>
</p>

<p class="hideme">
<input type="text" name="name" />
<input type="text" name="email" />
<textarea name="comment"></textarea>
</p>

Your CSS .hideme class will set display:none on that p tag, so those form elements won't be visible to a human, but they're still in the page source. The bots will go for them because they're named "name", "email", and "comment". On the page that processes the submission, check whether any of those fields were filled out -- if they were, the submission came from a bot. A human will only fill out the first three fields, which you can give completely random names if you want, then reassign the values to something more sensible once the submission passes the check.
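The processing-page check is just "were any honeypot fields filled?". A sketch of that logic, in Python for brevity (in PHP you'd read `$_POST` the same way; `form` here stands in for the dict of submitted values):

```python
# Field names match the HTML example above: the real fields carry the
# "xyz" prefix, and the bare names are the CSS-hidden honeypots.
HONEYPOTS = ("name", "email", "comment")

def is_bot(form):
    """True if any hidden honeypot field came back non-empty."""
    return any(form.get(field, "").strip() for field in HONEYPOTS)

def real_values(form):
    """Map the oddly named real fields back to sensible keys for processing."""
    return {
        "name": form.get("xyzname", ""),
        "email": form.get("xyzemail", ""),
        "comment": form.get("xyzcomment", ""),
    }
```

If `is_bot()` returns true, you can silently drop the submission (or log it) rather than emailing it on.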

This is potentially defeatable, but no more so than a captcha, I think, and it's less annoying to users (and simpler to implement). It works well for me in most situations.
 
This is potentially defeatable, but no more so than a captcha, I think, and it's less annoying to users (and simpler to implement). It works well for me in most situations.

Actually, this would be much simpler to defeat than a CAPTCHA: the CAPTCHAs are random, whereas your hidden fields are not. That's not to say it won't stop some bots. Using multiple techniques is the best strategy.
 
Actually, this would be much simpler to defeat than a CAPTCHA: the CAPTCHAs are random, whereas your hidden fields are not. That's not to say it won't stop some bots. Using multiple techniques is the best strategy.

I think you might be misunderstanding the point (or maybe I'm misunderstanding you). The idea of my strategy is that I *want* the bots to fill out the hidden fields -- and whenever bots see a "name", "email" field, etc, they always fill it out. Since I've named the *actual* fields I want humans to fill out something else, I'll know that if the field called "name" gets filled out it will be a bot -- because they're scraping the page code and can't tell that it's not visible, whereas a human will be looking at the page in a normal browser and CSS rules will hide those honeypot fields.
 
For 100% automated crawler bots, that hidden field method should work well, especially if there was no indication that a submit had failed. I like the concept, but it does have at least one usability issue--if someone is using an older browser without CSS, they're going to see the fields and get bot-ignored. I suppose you could put a hidden comment to ignore the fields in there, as well, which might take care of that.

I do wonder how much formspam gets verified by a human eye at some point, though; a significant percentage of what I get appears to have been tailored at least slightly to what it's being stuffed into. Maybe just smart bots, though--that sounds awfully time-consuming for a group as lazy as spammers. If the technique became at all popular it would fail, too, since the spammers would just have their bots check the CSS and ignore hidden fields.

On a related note, I ended up tweaking that DD script I linked a little so that I could use its generated forms for my needs, and I've gotten no spam thus far (compared with maybe 150 messages/day before). Maybe just not scraped yet, but ReCaptcha seems to be doing the job.
 
I think you might be misunderstanding the point (or maybe I'm misunderstanding you). The idea of my strategy is that I *want* the bots to fill out the hidden fields -- and whenever bots see a "name", "email" field, etc, they always fill it out. Since I've named the *actual* fields I want humans to fill out something else, I'll know that if the field called "name" gets filled out it will be a bot -- because they're scraping the page code and can't tell that it's not visible, whereas a human will be looking at the page in a normal browser and CSS rules will hide those honeypot fields.

I wasn't as clear as I could have been (though I was mainly saying that the technique is more easily defeated than CAPTCHA). Makosuke hit it, though: 100% automated bots generally won't be able to defeat the technique you described. I have noticed, however, that a few bots have a human component, where the spammer has checked out the form in advance to determine whether there are CSS-hidden form elements. I've received spam attempts that leave some fields empty (possibly the bot guessing, as it was a field with an uncommon name). Whether a spammer will take the time to tweak a bot for your site generally depends on the popularity of the site. I've noticed spam attempts on my form have been going down. One of my techniques is that once spam is detected I pause the PHP script for 10 minutes, which their bots choke on, and which has likely gotten my form taken off some of the spamming lists :D
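That pause trick is sometimes called a tarpit: once a submission is flagged, stall the response until the bot's client gives up. A minimal sketch of the idea, in Python for illustration (the 600-second default mirrors the post; the function name is made up):

```python
import time

def handle_submission(is_spam, delay_seconds=600):
    """On a flagged submission, stall the response so the bot times out.

    Bots typically use short socket timeouts, so a multi-minute delay
    makes the form look broken to them. Humans never hit this path,
    since the delay only applies to submissions already flagged as spam.
    """
    if is_spam:
        time.sleep(delay_seconds)  # bot's HTTP client gives up long before this
        return "rejected"
    return "accepted"
```

One caveat, if you try this in PHP: you'd likely need to raise the script's execution time limit (e.g. via set_time_limit) so the server doesn't kill the request before the sleep finishes; the post doesn't go into that detail.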

As for the concern Makosuke brought up about legitimate users whose browsers don't support CSS, I read an article at WebAIM that addresses this in part. Like you suggested, you supply a label for the field, also CSS-hidden, telling those users it should be left blank (the exact wording can vary). This also helps visually impaired users with screen readers, which sometimes read content that CSS is hiding.
 