Anti-Bayesian spam

If you get spam, you’ll probably have noticed the recent trend towards spam full of random words. This is intended to defeat Bayesian anti-spam filters, which mark emails as spam if they contain words found in other spam. And it does generally seem to get through SpamAssassin, which is the filter I’m using on my mail server.
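To make the dilution effect concrete, here is a minimal sketch of Bayesian-style word scoring. It is not how SpamAssassin actually works; the word table, the `UNSEEN` value and the `spam_score` function are all made up for illustration.

```python
import math

# Hypothetical per-word spam probabilities, as if learned from earlier mail.
WORD_SPAM_PROB = {"viagra": 0.99, "mortgage": 0.95, "free": 0.90, "meeting": 0.05}

# Assumed probability for a word the filter has never seen before.
UNSEEN = 0.4


def spam_score(text):
    """Combine per-word probabilities into one spam probability (naive Bayes)."""
    log_spam = 0.0
    log_ham = 0.0
    for word in text.lower().split():
        p = WORD_SPAM_PROB.get(word, UNSEEN)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # Turn the two log products back into a single probability.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


plain = "free viagra mortgage"
padded = plain + " storm antiquated biaxial genevieve askew evensong compressor"
print(spam_score(plain), spam_score(padded))
```

Under these assumed numbers, every unknown word nudges the score towards "not spam", so padding a message with enough random words drags it below whatever threshold the filter uses.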

The reason it gets by is that the random words, e.g. “storm antiquated biaxial genevieve askew evensong compressor foothill ludwig eyeglass irwin delano narcissist calumny messrs dan begin oratorical depict platitude”, are not in themselves spam-like. However, real non-spam email never has strings of words like that. Of the anti-Bayesian spam I’ve seen, none actually looks like real mail; there’s no grammar and too many consecutive long words (or maybe the people who send me email have small vocabularies?). There are too few occurrences of “the”, “it”, “and” etc.

The next step towards defeating this spam could be to perform basic lexical analysis of the content, to see whether it looks like real text. There are a number of problems with this though:

  • Non-English languages may be much harder to handle.
  • The spammers may start including random sections of real text.
  • It’s yet another load on the spam filters
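As a rough illustration of the kind of check I have in mind, the sketch below measures the two signals mentioned earlier: the share of very common words and the longest run of long words. The word list, the thresholds and the function names are my own guesses rather than anything taken from a real filter, and it is English-only, which is exactly the first problem in the list above.

```python
import re

# Small, assumed list of common English function words.
FUNCTION_WORDS = {"the", "it", "and", "a", "of", "to", "in", "is", "that", "i"}


def text_stats(text, long_len=7):
    """Return (share of function words, longest run of consecutive long words)."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0, 0
    common = sum(1 for w in words if w in FUNCTION_WORDS)
    longest_run = run = 0
    for w in words:
        # Extend the current run on a long word, otherwise reset it.
        run = run + 1 if len(w) >= long_len else 0
        longest_run = max(longest_run, run)
    return common / len(words), longest_run


def looks_like_word_salad(text, min_ratio=0.1, max_run=4):
    """Flag text with too few function words or too long a run of long words."""
    ratio, run = text_stats(text)
    return ratio < min_ratio or run > max_run
```

Very short messages will trip this easily, so a check like this would probably have to be one score among many rather than a verdict on its own.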

The blacklists have taken a battering over the past few months, with various viruses being targeted at bringing them down, but they’re still one of the best weapons we have to stop spam.

Most of my spam arrives through Demon, as my mail server blocks blacklisted hosts. I use fetchmail to pull my Demon mail onto my server, where it’s passed through SpamAssassin. SpamAssassin marks the stuff that it thinks is spam, which is then dumped into an IMAP folder for later checking. With Demon finally announcing that they’re putting in some spam filtering, the level of spam I get should drop off even more.

Update: Just after posting this I got my first piece of spam marked with the Habeas mark, which is a short verse of trademark poetry used to indicate that a sender is trustworthy. I’ve reported the spam to Habeas; the idea is that illegal use of their trademark means that they can sue the spammer. We’ll see…