crm114

Using CRM114 for spam filtering on Debian GNU/Linux

I've been using CRM114 as spam filter for a while now, and I'm quite happy with it. Due to bug #529720 though (incompatible upstream file format changes) I decided to start my setup from scratch with a recent CRM114 version from unstable. Here's a short HOWTO, hope it's useful for others.

First you need to install crm114 and set up a few files in your $HOME directory.

  $ sudo apt-get install crm114
  $ mkdir ~/.crm114
  $ cd ~/.crm114
  $ cp /usr/share/doc/crm114/examples/mailfilter.cf.gz .
  $ gunzip mailfilter.cf.gz
  $ cp /usr/share/crm114/mailtrainer.crm .
  $ touch rewrites.mfp priolist.mfp

Edit ~/.crm114/mailfilter.cf and set the following variables (some are optional, but that's what I currently use):

  :spw: /mypassword/
  :add_verbose_stats: /no/
  :add_extra_stuff: /no/
  :rewrites_enabled: /no/
  :spam_flag_subject_string: //
  :unsure_flag_subject_string: //
  :log_to_allmail.txt: /no/

The :log_to_allmail.txt: /no/ option should probably stay at "yes" for the first few days until you have tested your setup and everything works OK. The ~/.crm114/allmail.txt file will contain all your mails, in case something goes wrong.

Now set up empty spam and nonspam files like this:

  $ cssutil -b -r spam.css
  $ cssutil -b -r nonspam.css

Test the setup by invoking mailreaver.crm as follows, typing some test text and then pressing CTRL+d:

  $ /usr/share/crm114/mailreaver.crm -u ~/.crm114
  test
  [CTRL-d]
  ** ACCEPT: CRM114 PASS osb unique microgroom Matcher **
  CLASSIFY fails; success probability: 0.5000  pR: 0.0000
  Best match to file #0 (nonspam.css) prob: 0.5000  pR: 0.0000
  Total features in input file: 8
  #0 (nonspam.css): features: 1, hits: 0, prob: 5.00e-01, pR:   0.00
  #1 (spam.css): features: 1, hits: 0, prob: 5.00e-01, pR:   0.00
  X-CRM114-Version: 200904023-BlameSteveJobs ( TRE 0.7.6 (BSD) ) MF-35EB8B9A [pR: 0.0000]
  X-CRM114-CacheID: sfid-20090920_151224_574131_D290E589
  X-CRM114-Status: UNSURE (0.0000) This message is 'unsure'; please train it!

The output should look similar to the above. If there are errors instead, you should check your settings in ~/.crm114/mailfilter.cf.

Now you have to setup a procmail rule for crm114:

  :0fw: crm114.lock
  | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114

  :0:
  * ^X-CRM114-Status: SPAM.*
  IN.spam-crm114

In my case this rule is also followed by a spamassassin rule, so all my mail goes through two different spam filters (will look into dspam and bogofilter also I guess, the more the better).

Finally, in .muttrc I have the following configs so I can press SHIFT+x to mark a mail as spam, and SHIFT+h to mark it as non-spam (ham).

macro index X '| formail -I X-CRM114-Status -I X-CRM114-Action -I X-CRM114-Version | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114/ --spam'
macro index H '| formail -I X-CRM114-Status -I X-CRM114-Action -I X-CRM114-Version | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114/ --good'
macro pager X '| formail -I X-CRM114-Status -I X-CRM114-Action -I X-CRM114-Version | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114/ --spam'
macro pager H '| formail -I X-CRM114-Status -I X-CRM114-Action -I X-CRM114-Version | /usr/share/crm114/mailreaver.crm -u /home/uwe/.crm114/ --good'

Important: crm114 is most effective if you start with empty CSS files (as shown above) and only train it by marking mails as spam/ham when it gets them wrong. The process will take a few hours or maybe a day (depending on how many mails per day you get), then the misclassification rate gets very low...

Update 2009-09-23: Changed --spam/--nonspam to the correct options for mailreaver/mailtrainer, --spam/--good.

Syndicate content