Bogo Filter

08/02/2005 19:53:05

ComputerStuff

Bogofilter - Another Bayesian Spam Program

Previously, I had tried using POPFile as a spam filter. This program acts as a proxy server for your email accounts. Whenever you check email, POPFile scans each email and flags it as spam or not spam. Actually, you can train it to classify email into an arbitrary number of categories (at one point I had something like 25–30 categories).

POPFile works very well, but it is somewhat awkward to use, and it is slow if you try to run it as a command-line tool. Eventually, I decided that archiving my old email in categories was a bad idea, and now all my email is archived simply as incoming or outgoing. I dumped POPFile and began using the built-in spam features of Mail.app and Thunderbird.

But I really wanted a good spam filter that could be called by procmail. I wanted to be able to classify spam at that level, before a copy of all incoming mail is archived. It is possible to call POPFile from procmail, but it is slow and less than ideal.

Then I found Bogofilter. According to its authors, Bogofilter is written for speed, using C and Berkeley DB. It is also designed for use from the command line, which makes it more flexible. It can only classify email as spam or non-spam (and, optionally, unsure). But as I am no longer using the plethora of categories I used to have, this is fine.

Now, procmail calls bogofilter upon receiving email. Spam is routed to a separate folder for my review, and non-spam is routed to my inbox and a copy is made in an archive folder:

# Pass through bogofilter
:0 f
| /usr/local/bin/bogofilter -p
 
# File Junk mail without archiving a copy
:0:
* ^X-Bogosity:.*Yes
Junk
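
The recipe above shows only the junk branch. The remaining steps (the optional unsure class and the archive copy of non-spam) could look something like the following. This is a minimal sketch, not my exact configuration: the folder names Unsure and Archive are placeholders, and it assumes bogofilter is configured for three-state classification and writes Unsure into the X-Bogosity header.

```
# Hypothetical folder names; adjust to your own layout.

# Park unsure mail for manual review (only relevant if the unsure class is enabled)
:0:
* ^X-Bogosity:.*Unsure
Unsure

# Everything that reaches this point is treated as non-spam: archive a copy
:0c:
Archive

# Mail that falls off the end of the rcfile is delivered to the default inbox
```

The c flag makes procmail deliver a copy and keep processing, so the original message still reaches the inbox after the archive copy is written.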

After reviewing my junk folder, I put confirmed spam in a separate folder. A cron job reviews this folder periodically, and any messages left there are sent to spamcop for reporting. Additionally, if a piece of spam gets past bogofilter, it is automatically added to my bogofilter database when processed for spamcop.

This way, I am performing a “train on error”: bogofilter only updates its database when it makes a mistake. Sort of a “leave well enough alone” philosophy. From what little I have read about Bayesian programs, they typically work better when you train on errors than when you train on every piece of data processed. But I am not an expert.

The procmail recipe for this is:

# Find mail I have marked as junk, but was missed by bogofilter
 
:0c
* ^X-Bogosity:.*No
{
    # Remove previous markings
    :0f
    | formail -I "X-Bogosity:"
 
    # Then update the spam corpus with this message
    :0
    | /usr/local/bin/bogofilter -s
}

The only thing not covered so far is mail that is improperly flagged as spam. I simply created another folder and placed a copy of any mail there that needed to be reflagged as ham. Thus far, that has happened twice (I think). Not too bad.
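
The ham-correction folder can be processed with a mirror image of the spam recipe. The following is a sketch, assuming the folder is fed through its own rcfile in the same way as the confirmed-spam folder; the -n flag is bogofilter's option for registering a message as non-spam.

```
# Find mail I have marked as ham, but was flagged as spam by bogofilter
:0c
* ^X-Bogosity:.*Yes
{
    # Remove previous markings
    :0f
    | formail -I "X-Bogosity:"

    # Then update the non-spam corpus with this message
    :0
    | /usr/local/bin/bogofilter -n
}
```

Because training only happens on errors, the misfiled message was never registered as spam in the first place, so registering it as ham with -n should be all that is needed.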

Bogofilter does seem to be faster than POPFile, and the ability to run it easily from the command line makes it much more flexible. I also plan on looking into other uses for the engine (document classification, etc.) in the future. There are many uses for Bayesian analysis, including various medical decision-making tools…

Notes on Installation

I use fink (for OS X 10.3), and there is no bogofilter package for this version of fink. I could not install it manually until I installed a new copy of Berkeley DB by hand (i.e. NOT using fink). Once I did that, no problems!
