Why Blacklists Are Bad
Spam, whether in email or in web site comments, has become a huge problem. In response, one approach used to limit this abuse is the blacklist. A blacklist is a list of IP addresses or domains that have been used to send spam in the past; email (or comments) coming from these addresses is then rejected flat out.
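The mechanism is about as simple as filtering gets. Here is a minimal sketch in Python (the set contents and function name are made up for illustration, not taken from any real mail server):

```python
# Known spam sources (illustrative addresses from the documentation range)
BLACKLIST = {"203.0.113.7", "198.51.100.22"}

def accept_message(sender_ip: str) -> bool:
    """Reject flat out if the sender's IP is on the blacklist."""
    return sender_ip not in BLACKLIST

print(accept_message("203.0.113.7"))  # a listed spammer: False (rejected)
print(accept_message("192.0.2.10"))   # unknown sender: True (accepted)
```

Note that the decision looks only at the address, never at the message itself, which is exactly where the problems below come from.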
In theory this sounds great - once you send out spam, you are banned for life. This likely does cut down on the amount of spam received. But there are several problems with this approach.
1. It’s retroactive - With a blacklist system, there is no way to block that “first spam” from a given offender (or even the first several, depending on your reaction time). Until they spam you once, you can’t identify the bad guys. But any human could easily identify that first piece of spam as such.
2. It can be bypassed - By simply switching to a different IP address, a spammer forces the blacklist to start over, and it will miss the first piece (or pieces) of spam from each new address.
3. Collateral damage - Innocent bystanders get caught up in this. Spammers frequently use shared IP addresses, since they need to change addresses often, and if your IP has been flagged, you can no longer send email or comments. This amounts to a presumption of guilt, jailing many innocent parties who cannot prove their innocence (ok, the analogy is a stretch, but it gets the point across… ;)
To get around some of these limitations (particularly #1), people have joined together to create massive blacklists so that you may benefit from the experience of others. This helps block some spammers before they even bother you individually, but it requires extra work as you constantly update your blacklist, and as the list grows it leads to increased loads on your web/email server.
Others have begun using CAPTCHAs - small tests that are easy for humans but difficult for computers. These are the “enter the word shown in this picture” sort of tests. This is fairly effective, but it shifts the burden onto the visitor. A small burden, granted, but a hassle nonetheless.
Instead, I think the more successful approaches to the spam problem will mimic how humans detect and ignore spam. Bayesian approaches evaluate the content of the message; if it looks like spam it can be flagged as such and may require human approval before being accepted. This avoids the collateral damage problem - a non-spam message coming from a host that has previously sent spam can still be identified as non-spam and allowed through. Anyone who has used the junk mail filters in Apple’s Mail program or in Thunderbird has used a Bayesian filter. They aren’t perfect, but after some initial training, they are very effective. Similar approaches have been tried on web comment systems as well.
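To make the idea concrete, here is a toy sketch of the technique - a tiny naive Bayes classifier with made-up training data, nowhere near what Mail or Thunderbird actually ship, but the same basic principle of scoring a message by its words:

```python
import math
from collections import Counter

# Tiny, made-up training sets - a real filter learns from your own mail
spam_docs = ["buy cheap pills now", "cheap pills cheap prices"]
ham_docs = ["meeting notes attached", "lunch at noon tomorrow"]

def train(docs):
    """Count word occurrences across a set of training documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return counts

spam_counts, ham_counts = train(spam_docs), train(ham_docs)
vocab = set(spam_counts) | set(ham_counts)

def log_prob(word, counts):
    # Laplace smoothing so unseen words don't zero out the whole score
    return math.log((counts[word] + 1) / (sum(counts.values()) + len(vocab)))

def is_spam(message):
    """Compare the message's likelihood under each word model."""
    words = message.split()
    spam_score = sum(log_prob(w, spam_counts) for w in words)
    ham_score = sum(log_prob(w, ham_counts) for w in words)
    return spam_score > ham_score

print(is_spam("cheap pills"))       # True
print(is_spam("meeting at noon"))   # False
```

The key property is visible even at this scale: the verdict depends on what the message says, not on where it came from, so a legitimate message from a “bad” host still gets through.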
Bayesian methods aren’t a perfect solution, but I prefer them to blacklists. Once trained, they block the bulk of the spam without imposing any additional burden on the email sender or web poster, and minimal (if any) burden on the recipient. I would much prefer to see such systems used more widely, replacing the many blacklist solutions out there…
Another approach is that used by this wiki (for the time being, anyways). A spammer can easily add spam to this site, but any visitor can easily remove it. In fact, I believe there have been 2 episodes of spam being added to this site in the 3 months or so that the wiki has been active. I do expect that this will increase as more people become familiar with OddMuse, but as traffic to my wiki increases, the number of people who might be willing to delete that spam will increase as well.
Additionally, I have installed the [[Despam Extension]], which checks submitted content for obvious spam and prevents it from being accepted. This is sort of a content blacklist, but that is preferable (IMHO) to an IP-based blacklist. As the blacklisted content consists primarily of URLs, I don’t anticipate a problem.
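A content blacklist of this sort is easy to picture. Here is a rough Python sketch of the idea - matching submissions against banned URL patterns - with patterns invented for illustration, not taken from the actual Despam Extension:

```python
import re

# Illustrative banned-URL patterns (a real list would come from observed spam)
BANNED_PATTERNS = [
    re.compile(r"https?://[\w.-]*casino-pills\.example", re.IGNORECASE),
    re.compile(r"https?://[\w.-]*cheap-meds\.example", re.IGNORECASE),
]

def reject_submission(text: str) -> bool:
    """Return True if the submitted content contains a blacklisted URL."""
    return any(pattern.search(text) for pattern in BANNED_PATTERNS)

print(reject_submission("Great post! Visit http://casino-pills.example"))  # True
print(reject_submission("Thanks for the helpful article."))                # False
```

Because the match is on the spammer’s payload (the URL they are trying to promote) rather than their address, changing IPs doesn’t help them, while ordinary visitors writing ordinary comments are never affected.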