Posted on November 15, 2015

I have spent the last two months working on bfilter, a standalone naive Bayesian filter program, written in C by the late Chris Lightfoot (aka Oggie).

I was drawn to bfilter because it is a tiny program that seemed to perform well. Oggie’s last (unpublished) version was 766 lines. (I’m afraid I’ve managed to push that up to 991 already, although I think some of that is redundant code that I haven’t yet removed.) Along the way, I found out a great deal about how Bayesian filters work. I refactored and rewrote vast swathes of bfilter, changing it into a generic multinomial naive Bayes classifier. This means that it can now sort documents into an arbitrary number of classes, instead of being limited to spam and real.

My intention now is to plug bfilter into flare, and see if it can be trained to sort my incoming mail (important classes will be spam, haskell, fedora, and everything-else).

Another thing I really ought to do, while it’s reasonably fresh in my mind, is write a critique of Paul Graham’s A Plan for Spam and its followups.

Oggie had followed the “plan” described by Graham very closely, and bfilter appeared to do its job reasonably well. Unfortunately, I discovered several surprises as I tweaked the program, and ended up concluding that some of Graham’s suggestions were insufficient or unhelpful. This all ought to be documented.

I’d also like to examine, 13 years on, Graham’s central thesis. He claimed that Bayesian filtering would protect most users from ever seeing spam, and this seems largely to have come to pass. (I’m sure that the commodity mail providers, Google, Yahoo, etc., use a variety of techniques to identify spam with near 100% success, but I’m also sure that Bayesian filters are one of the mainstays.) So the plan for spam has, in that sense, been a resounding success. As far as I can tell, the vast majority of mail users now see spam incredibly rarely.

However, Graham believed that once this happy state of affairs had come to pass, it would no longer be worthwhile for the spammers to continue sending their messages. And this very clearly has not happened: it is still the case that a majority of email messages transmitted are spam. I’m not entirely sure why this should be so, but I have some ideas. Since the MPv6 project is, among other things, a new plan for spam, I should endeavour to examine this.

Anyway, whether or not I ever get round to doing that writing, the first thing is to actually get bfilter up and running on a real mailbox: mine.