How The Spam Filter Works

It surprises me how many people are visiting these pages, and how many want to know how I did it. It's a sordid story of hacked-together shell scripts, but gather 'round and I'll tell you a chilling tale...

We have a mail server running FreeBSD. On that mail server runs procmail, a program that sorts mail with/using/by/whatever regular expressions. We filter for the obvious sort of spam. By obvious, I mean that body searches, generally, are out; too many and you end up taking fifteen minutes to grep some 10MB attachment that we'll reject anyway 'cos it's just too damned big, and meanwhile everything's backing up behind it, and the customers are crying, and the server's dropping packets, and the sysadmin starts beating me, and...Ahem. My point is that we filter more for Subject: lines than anything else.

Anyhow, I've set up a crontab on the mail server that, every fifteen minutes, counts how many messages are in /foo/spam and emails the results to Selenium here, a little 486 running Debian Linux. Procmail running on S. intercepts the mail, does some math, and outputs the total to a text file. Every fifteen minutes MRTG runs, cats the text file, and graphs the result. S. is running Boa as a web server (no particular reason for choosing Boa; I just thought it would be interesting to run something that wasn't Apache for a change, and it seems to be doing well so far).
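For the curious, here is a minimal sketch of the counting-and-graphing step. It is not the actual scripts. It assumes the spam folder holds one file per message, that the "math" is the difference from the previous count, and that MRTG reads the usual four lines (two values, an uptime string, a target name) from the file. The function names, the "unknown" uptime, and the "spam" label are all made up for illustration.

```shell
#!/bin/sh
# Sketch only: folder layout, function names, and the delta step are assumptions.

# Count messages, assuming one file per message under the given directory.
count_messages() {
    find "$1" -type f | wc -l | tr -d ' '
}

# "Some math": messages added since the previous run. If the folder was
# emptied in between, fall back to the current count instead of going negative.
delta() {
    current=$1
    previous=$2
    if [ "$current" -ge "$previous" ]; then
        echo $((current - previous))
    else
        echo "$current"
    fi
}

# Write the four lines MRTG expects from an external data source:
# first value, second value, uptime string, target name.
# The same number is used for both values here.
write_mrtg_file() {
    printf '%s\n%s\n%s\n%s\n' "$1" "$1" "unknown" "spam" > "$2"
}
```

Something like `write_mrtg_file "$(delta "$(count_messages /foo/spam)" "$last")" /some/file.txt` would tie the pieces together, with `$last` and the output path being whatever you saved from the previous run.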

It's all pretty ugly, I'm the first to admit. The sysadmin here has suggested I set it all up w/SNMP, which is doubtless a very good suggestion. However, I don't know from SNMP, and while I want to learn it, this is a quick way to get the graphing up and going in the meantime. I take care of the procmail filters here, so I wanted to see how well I'm doing.

Occasionally I'm asked if I'll provide the filters we use for procmail. The answer is sorry, but no. Three reasons:

  1. I don't want spammers knowing what we're filtering for.
  2. It changes all the time anyway.
  3. It's remarkably easy to catch a lot of spam if you follow my E-Z Bake Method listed here:

    tail -f /foo/log/procmail | vgrep

It really is that simple, if you're working at a place with any volume at all of email; we get something like one per second on average here, and it works quite well. You'll know when something isn't spam, and it's pretty easy to catch a lot of it. There's only so much variation that a spammer can put in their messages without losing that all-important volume, so most of it really is the same sort of thing over and over again.
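If eyeballing the log gets tiring, the "same thing over and over" part can be roughly tallied. The sketch below is a supplement, not the method described above. It assumes procmail's default log format, where each delivery records a line starting with " Subject:"; the function name top_subjects and the default of 20 lines are made up for illustration.

```shell
#!/bin/sh
# Sketch only: assumes procmail's default log format, in which each
# delivery is logged with a line beginning " Subject:".

# Print the N most frequent subjects in a log file, most common first.
# $1 = log file, $2 = how many to show (defaults to 20).
top_subjects() {
    grep '^ Subject:' "$1" \
        | sed 's/^ Subject: *//' \
        | sort \
        | uniq -c \
        | sort -rn \
        | head -n "${2:-20}"
}
```

Running `top_subjects /foo/log/procmail 10` would show the ten subjects that turn up most often, which is usually a decent list of candidates to look at by hand.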

Thanks for visiting. Any questions, feel free to write me. And now that you found this page, have a look at my own site over here:

saintaardvarkthecarpeted.com