SpamAssassin Setup

Introduction

This document describes how I use SpamAssassin to keep spam out of my inbox. SpamAssassin is a program for tagging email. It reads in an email message, classifies the email as either legitimate (ham) or illegitimate (spam), and writes the email out again with specific mail headers added. The added headers tell you whether SpamAssassin thinks the email is ham or spam. SpamAssassin classifies email using predefined rules, each associated with a positive or negative score. All rules that match a particular email contribute to that email’s total score. If the total score reaches a configurable value, the email is identified as spam. In addition to its rules, SpamAssassin includes a Bayesian filter, which gives mail administrators a convenient way to tailor SpamAssassin to the specific email processed by their systems. The score assigned to an email can also be influenced by SpamAssassin’s Auto-Whitelist feature. The Auto-Whitelist keeps track of the average score for email on a per-sender basis and pushes the score of subsequent email towards the sender’s recorded average.
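
The scoring described above can be sketched in a few lines of shell. This is only an illustration: the function names, rule names, and scores are invented, and the Auto-Whitelist formula (move the score a fraction of the way towards the sender’s average, with 0.5 as a typical fraction) is an assumption about the default behaviour, not something taken from SpamAssassin’s code.

```sh
#!/bin/sh
# Hypothetical helpers; none of these names exist in SpamAssassin itself.

# total_score: sum the second column of "RULE SCORE" lines read from stdin.
total_score() {
    awk '{ sum += $2 } END { printf "%.1f\n", sum }'
}

# is_spam SCORE THRESHOLD: succeed when SCORE reaches THRESHOLD.
is_spam() {
    awk -v s="$1" -v t="$2" 'BEGIN { exit !(s + 0 >= t + 0) }'
}

# awl_adjust SCORE MEAN FACTOR: move SCORE part of the way towards the
# sender's historical MEAN (assumed formula, see the note above).
awl_adjust() {
    awk -v s="$1" -v m="$2" -v f="$3" \
        'BEGIN { printf "%.1f\n", s + (m - s) * f }'
}
```

For example, piping the three lines "BAYES_99 4.5", "HTML_MESSAGE 1.8" and "AWL -0.5" into total_score gives a total that can be passed to is_spam together with the threshold of 5.0.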

I’ve managed to make use of many of SpamAssassin’s features. Herein you will find descriptions of:

  • Bayesian filter
  • Auto-Whitelist
  • Razor-agents (Vipul’s Razor)
  • MySQL storage of Bayesian and Auto-Whitelist data
  • Use of sa-learn to train Bayesian filter
  • Use of sa-update to retrieve SpamAssassin rulesets
  • Use of gpg keys with sa-update

Because SpamAssassin is never used in isolation, I also describe how I’ve integrated it with my MTA, netqmail, and my MDA, maildrop.

Preliminaries

SpamAssassin does no routing of email based on its classification; all it does is classify. For this reason, SpamAssassin is always used with other programs that take advantage of the email headers it adds.

In the Arda Network, SpamAssassin is installed on Callisto, my mail server. It is integrated with netqmail, my MTA (Mail Transfer Agent), and maildrop, my MDA (Mail Delivery Agent). I also use SquirrelMail for webmail access and I use one of its many plugins to help train SpamAssassin’s Bayesian filter. You will find an overview, including a very nice diagram, of the Arda Network here.

SpamAssassin installations can be divided into two broad categories: per account and site wide. A per account install, as one would expect, invokes SpamAssassin for specific mail accounts and happens at mail delivery time. A site wide install applies to all, or almost all, mail accounts in an email system and is invoked when email arrives at a mail server. SpamAssassin in the Arda Network is installed site wide; it is invoked for all email arriving at Callisto from outside my network.

In addition to the two types of installation, SpamAssassin can be invoked in two different ways. The first way is to run the Perl script spamassassin. The spamassassin script contains all the functionality needed to classify an email passed to it, and it is intended to be run each time an email needs to be classified as ham or spam. The second way is to use spamc and spamd. spamd is also a Perl script but runs as a long-lived daemon process. spamc is a program written in C. It provides an interface between spamd, which does the classifying and tagging of emails, and other programs. The intention here is to avoid invoking the Perl interpreter for each email to be examined by SpamAssassin. spamd can be left to run in the background continuously while only the small and fast spamc need be invoked when an email is to be examined.

Both methods of invoking SpamAssassin have their place. Using spamassassin, however, is really only an option for per account installs. For site wide installs in mail systems that handle anything above a minimal amount of email, you want to use spamd/spamc. And that’s what I use in the Arda Network.
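
As a rough sketch of the two invocation styles, the function below tags a message through spamc when it is available and falls back to the standalone spamassassin script otherwise. Both programs read a message on standard input and write the tagged copy to standard output. The function name and the SPAMC and SPAMASSASSIN variables are hypothetical conveniences, not part of SpamAssassin.

```sh
#!/bin/sh
# Command names can be overridden; the defaults assume both programs
# are on PATH.
SPAMC=${SPAMC:-spamc}
SPAMASSASSIN=${SPAMASSASSIN:-spamassassin}

# tag_message FILE: write the tagged copy of FILE to stdout.
tag_message() {
    if command -v "$SPAMC" >/dev/null 2>&1; then
        "$SPAMC" < "$1"          # hand the message to the running spamd
    else
        "$SPAMASSASSIN" < "$1"   # start Perl and classify in-process
    fi
}
```

Note that the spamc branch only helps if spamd is actually running; the sketch does not check for that.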

Here is a list of the various software packages that I describe in this document.

  • daemontools 0.76
  • maildrop 2.0.1
  • MySQL 5.0.24
  • netqmail 1.05
  • Razor-agents 2.82
  • simscan 1.1
  • SpamAssassin 3.1.5

SpamAssassin Setup

Here is what my site configuration directory (--siteconfigpath option of spamd) looks like.

/usr/local/etc/mail/spamassassin

-rwxr--r--  1 root  wheel   373 Oct  8 23:25 getruleupdate.sh
-rw-r--r--  1 root  wheel  4568 Oct  7 08:43 imageinfo.cf
-rw-r--r--  1 root  wheel  1045 Sep 17 10:42 init.pre
-rw-r-----  1 root  wheel  2733 Sep 17 10:49 local.cf
drwx------  2 root  wheel   512 Oct  8 23:18 sa-update-keys
-rw-------  1 root  wheel   2522 Oct  8 22:03 sa-update-keys/sare.key
-r--------  1 root  wheel    112 Oct  8 22:19 sa-update-keys/update-channels.txt
-rw-r--r--  1 root  wheel  2180 Sep 17 10:44 v310.pre
-rw-r--r--  1 root  wheel   806 Sep 17 10:44 v312.pre

A stock SpamAssassin install will include only the three *.pre files and the local.cf file in this directory. I added the other files and the sa-update-keys directory myself.

The getruleupdate.sh script and sa-update-keys directory are used by sa-update and are explained in the Updating SpamAssassin Rules section. The imageinfo.cf file is used by the ImageInfo plugin. The Perl module associated with this plugin, ImageInfo.pm, went into SpamAssassin’s plugin directory. On my system, that’s here:

    /usr/local/lib/perl5/site_perl/5.8.8/Mail/SpamAssassin/Plugin

I use the ImageInfo plugin to trap stock spam composed mostly of gif, png, or jpeg images. You can find a link to the ImageInfo plugin in the Further Reading section below.

Here is the local.cf file.

# This is the right place to customize your installation of SpamAssassin.
#
# See 'perldoc Mail::SpamAssassin::Conf' for details of what can be
# tweaked.
#
# Only a small subset of options are listed below
#
###########################################################################

#   Add *****SPAM***** to the Subject header of spam e-mails
#
# rewrite_header Subject *****SPAM*****

#   Save spam messages as a message/rfc822 MIME attachment instead of
#   modifying the original message (0: off, 2: use text/plain instead)
#
# report_safe 1

#   Set which networks or hosts are considered 'trusted' by your mail
#   server (i.e. not spammers)
#
trusted_networks 10.42.0.1 192.168.42.

#   Set which networks are considered 'internal' by your mail server
#
internal_networks 192.168.42.

#   Set file-locking method (flock is not safe over NFS, but is faster)
#
# lock_method flock

#   Set the threshold at which a message is considered spam (default: 5.0)
#
# required_score 5.0

#   Skip RBL checks (default: 0)
#
skip_rbl_checks 1

#   When mail is reported as spam, this is the contact listed
#   in the report
#
report_contact postmaster@arda.homeunix.net

#   Use Bayesian classifier (default: 1)
#
# use_bayes 1

#   Bayesian classifier auto-learning (default: 1)
#
# bayes_auto_learn 1

#   Set custom scores for the Bayesian filter
#
score BAYES_95 0.0001 0.0001 4.0 4.0
score BAYES_99 0.0001 0.0001 4.5 4.5

#   Set headers which may provide inappropriate cues to the Bayesian
#   classifier
#
# bayes_ignore_header X-Bogosity
bayes_ignore_header X-Spam-Flag
bayes_ignore_header X-Spam-Status
bayes_ignore_header X-Spam-Level
bayes_ignore_header X-Spam-Checker-Version

# Bayesian database configuration options.
bayes_store_module                 Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn                      DBI:mysql:sabayesfilter:localhost:3306
bayes_sql_username                 sa
bayes_sql_password                 <password>

# Auto-Whitelist database configuration options.
auto_whitelist_factory Mail::SpamAssassin::SQLBasedAddrList
user_awl_dsn                 DBI:mysql:saawl:localhost:3306
user_awl_sql_username        sa
user_awl_sql_password        <password>
user_awl_sql_table           awl

This is the file where you put configuration options. You can also add custom rules and override standard rules here. If you have a lot of custom rules or do lots of overrides, I’d suggest putting them into their own file(s) to avoid clutter.

I’ve modified a number of the options in this file and added a few as well.

trusted_networks: This option tells SpamAssassin that mail relays and MXs on these networks won’t originate spam. The practical upshot is that DNS blacklist checks won’t be performed for servers on listed networks.
skip_rbl_checks: This option tells SpamAssassin whether or not to perform checks against DNS based Realtime Block Lists. I do this with netqmail and rblsmtpd so I tell SpamAssassin to skip these checks.
report_contact: The report contact appears in the report generated by SpamAssassin when it determines that an email is spam and report_safe is set to 1 or 2. SpamAssassin uses some generic text if you don’t specify a report contact. In versions before 3.1.4, a report contact was generated automatically by SpamAssassin but this caused a problem if you were using sa-update.
score: I’ve overridden two rules related to the Bayesian filter. What I’ve done is increase the score for these two rules. I have yet to see these two rules hit on an email I did not consider spam so I felt justified in increasing their scores. You’ll notice that the new scores aren’t quite enough to tag an email as spam all by themselves. I’m not ready to let the Bayesian filter go it alone yet.
bayes_ignore_header: These options tell the Bayesian filter to ignore the listed headers when learning what makes ham and spam. I’ve simply listed the headers that SpamAssassin itself adds to emails it scans.
SQL database options: All the configuration lines after the bayes_ignore_header options deal with telling SpamAssassin to use an SQL database to store Bayesian filter and Auto-Whitelist information. If these options were not present, this information would be stored in dbm database files on Callisto. The volume of mail my system deals with doesn’t really justify running an SQL database but I decided to set one up anyway just to see how well it worked. Conclusion: it works very well. See the SQL Setup section for more details.
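
After editing local.cf, it can be worth checking the configuration for syntax errors before restarting spamd. The sketch below wraps spamassassin’s --lint option for that purpose. The function name and the SITE_CONFIG variable are hypothetical, and the path is simply the site configuration directory shown earlier.

```sh
#!/bin/sh
SPAMASSASSIN=${SPAMASSASSIN:-spamassassin}
SITE_CONFIG=${SITE_CONFIG:-/usr/local/etc/mail/spamassassin}

# check_config: succeed when the site configuration parses cleanly.
check_config() {
    if "$SPAMASSASSIN" --lint --siteconfigpath="$SITE_CONFIG"; then
        echo "configuration OK"
    else
        echo "configuration has errors; rerun with -D --lint for details" >&2
        return 1
    fi
}
```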

Here is the init.pre file.

# This is the right place to customize your installation of SpamAssassin.
#
# See 'perldoc Mail::SpamAssassin::Conf' for details of what can be
# tweaked.
#
# This file contains plugin activation commands for plugins included
# in SpamAssassin 3.0.x releases.  It will not be installed if you
# already have a file in place called "init.pre".
#
###########################################################################

# RelayCountry - add metadata for Bayes learning, marking the countries
# a message was relayed through
#
# Note: This requires the IP::Country::Fast Perl module
#
loadplugin Mail::SpamAssassin::Plugin::RelayCountry

# URIDNSBL - look up URLs found in the message against several DNS
# blocklists.
#
loadplugin Mail::SpamAssassin::Plugin::URIDNSBL

# Hashcash - perform hashcash verification.
#
#loadplugin Mail::SpamAssassin::Plugin::Hashcash

# SPF - perform SPF verification.
#
#loadplugin Mail::SpamAssassin::Plugin::SPF

# ImageInfo - designed to catch image spam
#
loadplugin Mail::SpamAssassin::Plugin::ImageInfo

This file lists a few of the plugins available to SpamAssassin. I turned on RelayCountry and turned off Hashcash and SPF. I found that I just wasn’t getting enough hits from Hashcash or SPF to justify having them on. I added a line in this file to activate the ImageInfo plugin.

And here is the v310.pre file.

# This is the right place to customize your installation of SpamAssassin.
#
# See 'perldoc Mail::SpamAssassin::Conf' for details of what can be
# tweaked.
#
# This file was installed during the installation of SpamAssassin 3.1.0,
# and contains plugin loading commands for the new plugins added in that
# release.  It will not be overwritten during future SpamAssassin installs,
# so you can modify it to enable some disabled-by-default plugins below,
# if you so wish.
#
###########################################################################

# DCC - perform DCC message checks.
#
# DCC is disabled here because it is not open source.  See the DCC
# license for more details.
#
#loadplugin Mail::SpamAssassin::Plugin::DCC

# Pyzor - perform Pyzor message checks.
#
#loadplugin Mail::SpamAssassin::Plugin::Pyzor

# Razor2 - perform Razor2 message checks.
#
loadplugin Mail::SpamAssassin::Plugin::Razor2

# SpamCop - perform SpamCop message reporting
#
#loadplugin Mail::SpamAssassin::Plugin::SpamCop

# AntiVirus - some simple anti-virus checks, this is not a replacement
# for an anti-virus filter like Clam AntiVirus
#
#loadplugin Mail::SpamAssassin::Plugin::AntiVirus

# AWL - do auto-whitelist checks
#
loadplugin Mail::SpamAssassin::Plugin::AWL

# AutoLearnThreshold - threshold-based discriminator for Bayes auto-learning
#
loadplugin Mail::SpamAssassin::Plugin::AutoLearnThreshold

# TextCat - language guesser
#
#loadplugin Mail::SpamAssassin::Plugin::TextCat

# AccessDB - lookup from-addresses in access database
#
#loadplugin Mail::SpamAssassin::Plugin::AccessDB

# WhitelistSubject - Whitelist/Blacklist certain subject regular expressions
#
loadplugin Mail::SpamAssassin::Plugin::WhiteListSubject

###########################################################################
# experimental plugins

# DomainKeys - perform DomainKeys verification
#
# External modules required for use, see INSTALL for more information.
#
#loadplugin Mail::SpamAssassin::Plugin::DomainKeys

# MIMEHeader - apply regexp rules against MIME headers in the message
#
loadplugin Mail::SpamAssassin::Plugin::MIMEHeader

# ReplaceTags
#
loadplugin Mail::SpamAssassin::Plugin::ReplaceTags

Like init.pre, this file lists plugins you can turn on or off. This is where I turn on the Auto-Whitelist plugin among others.

And, finally, here is the v312.pre file.

# This is the right place to customize your installation of SpamAssassin.
#
# See 'perldoc Mail::SpamAssassin::Conf' for details of what can be
# tweaked.
#
# This file was installed during the installation of SpamAssassin 3.1.2,
# and contains plugin loading commands for the new plugins added in that
# release.  It will not be overwritten during future SpamAssassin installs,
# so you can modify it to enable some disabled-by-default plugins below,
# if you so wish.
#
###########################################################################

###########################################################################
# experimental plugins

# DKIM - perform DKIM verification
#
# Mail::DKIM module required for use, see INSTALL for more information.
#
#loadplugin Mail::SpamAssassin::Plugin::DKIM

This is yet another plugin file. I’m in the dark about why SpamAssassin needs a different plugin file for each release. I wonder how many files we can expect to see before they begin to be consolidated.

Here are two examples of the headers added by SpamAssassin to email arriving at Callisto.

Classification headers for a ham message:

X-Spam-Checker-Version: SpamAssassin 3.1.5 (2006-08-29) on 
	lorien.arda.homeunix.net
X-Spam-Level: 
X-Spam-Status: No, score=-2.5 required=5.0 tests=AWL,BAYES_00,
	DK_POLICY_SIGNSOME autolearn=ham version=3.1.5

Classification headers for a spam message:

X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.1.5 (2006-08-29) on 
	lorien.arda.homeunix.net
X-Spam-Level: ***********
X-Spam-Status: Yes, score=11.5 required=5.0 tests=BAYES_99,DK_POLICY_SIGNSOME,
	FROM_LOCAL_NOVOWEL,HTML_MESSAGE,MIME_HTML_ONLY,SARE_GIF_ATTACH,
	TVD_FW_GRAPHIC_ID2,TVD_FW_GRAPHIC_NAME_MID autolearn=no version=3.1.5

Remember that all these headers are listed in bayes_ignore_header options in local.cf. When I feed sa-learn a false-negative, I don’t want it thinking that ‘X-Spam-Status: No’ means the email is spam.
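
Programs further down the delivery chain only need the first line of the X-Spam-Status header to make a decision. As a small sketch, the hypothetical function below pulls the verdict, score, and threshold out of a message’s headers. It assumes the field order seen in the examples above (verdict, then score=, then required=) and stops at the first blank line so that text in the body is never mistaken for a header.

```sh
#!/bin/sh
# spam_status: read a message on stdin and print "VERDICT SCORE REQUIRED".
spam_status() {
    sed -n \
        -e '/^$/q' \
        -e 's/^X-Spam-Status: *\([A-Za-z]*\), score=\([-0-9.]*\) required=\([-0-9.]*\).*/\1 \2 \3/p'
}
```

Fed the spam example above, this prints the three values Yes, 11.5, and 5.0 on one line.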

Vipul’s Razor

One of the SpamAssassin plugins I use is Razor2. Razor2 provides an interface to the Razor distributed spam filtering network. Because I built SpamAssassin using the FreeBSD port, I didn’t need to install Razor separately. I simply told the port that I wanted Razor support. While Razor works without any additional configuration, doing a bit of extra work makes Razor operate much more efficiently.

Razor is structured as a client-server application. A Razor client calculates a digest from a mail message and then contacts a publicly accessible Razor server to see if that digest is in the server’s database. The list of servers to use, how often to update this list, and other parameters are kept in configuration files. The FreeBSD SpamAssassin port, however, doesn’t create these configuration files. This means that the Razor client has to retrieve the list of Razor servers every time it processes a mail message. This can generate a lot of needless network traffic on a busy server.

To create a default set of configuration files for Razor, I used this command.

razor-admin -d -create -home=/var/qmail/simscan/.razor

Notice that I specified the directory in which to put the configuration files. This is the home directory of the user ‘simscan’ on my system. You’ll read more about simscan in the Tagging Email section but the relevant bit of information here is that Razor always runs as simscan so it is in simscan’s home directory that it will look for its configuration files. Here is a listing of that directory after running the above command.

/var/qmail/simscan/

drwxr-x---  2 simscan  simscan  512 Nov 10 05:59 .razor
-rw-r--r--  1 root     simscan  698 Nov  9 23:16 .razor/razor-agent.conf
-rw-r--r--  1 simscan  simscan  566 Nov  8 07:10 .razor/server.folly.cloudmark.com.conf
-rw-r--r--  1 simscan  simscan   95 Nov 10 05:59 .razor/servers.catalogue.lst
-rw-r--r--  1 simscan  simscan   22 Nov  8 07:10 .razor/servers.discovery.lst
-rw-r--r--  1 simscan  simscan   38 Nov 10 05:59 .razor/servers.nomination.lst

I needed to set the ownership of the files after they were created.

Here is what the razor-agent.conf file looks like.

#
# Razor2 config file
#
# Autogenerated by Razor-Agents v2.82
# Wed Nov  8 07:10:40 2006
# Created with all default values
#
# see razor-agent.conf(5) man page
#

debuglevel             = 0
identity               = identity
ignorelist             = 0
listfile_catalogue     = servers.catalogue.lst
listfile_discovery     = servers.discovery.lst
listfile_nomination    = servers.nomination.lst
logfile                = razor-agent.log
logic_method           = 4
min_cf                 = ac
razordiscovery         = discovery.spamnet.com
rediscovery_wait       = 172800
report_headers         = 1
turn_off_discovery     = 0
use_engines            = 4,8
whitelist              = razor-whitelist

The only change I needed to make in this file was to set the debuglevel to 0. Razor doesn’t do any sort of log rotation so unless you take specific measures to prevent it, Razor’s log file will eventually eat up all your disk space. I simply left the default value of 3 in place for a time until I was sure Razor was working properly and then changed it to 0. A value of zero means that no log messages will be generated at all.

With a proper set of configuration files in place, Razor is now able to look up its list of available servers locally each time it is run thus reducing the time needed to process email.
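
A quick way to confirm that the configuration files are in place and readable is to check for the files shown in the listing above. This is a minimal sketch: the function name is hypothetical, and the file list is taken from that listing, so it may differ for other Razor versions.

```sh
#!/bin/sh
# razor_missing_files DIR: print each expected Razor file that is absent
# from DIR or not readable by the current user.
razor_missing_files() {
    for f in razor-agent.conf servers.catalogue.lst \
             servers.discovery.lst servers.nomination.lst; do
        [ -r "$1/$f" ] || echo "$f"
    done
}
```

Running it as the simscan user against /var/qmail/simscan/.razor should print nothing when the setup is complete.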

Tagging EMail – Integration with netqmail and simscan

As outlined in the Preliminaries section, I use SpamAssassin in a site wide configuration. To accomplish this, I’ve integrated SpamAssassin with my MTA, netqmail. When netqmail receives an incoming email, it invokes spamc which passes the email to spamd for tagging. The tagged email is then returned to netqmail for local delivery. I’ve specified local delivery on purpose because in my setup, outgoing email is not scanned by SpamAssassin.

There are different ways to invoke spamc from netqmail. One way would be to use .qmail files that call spamc but that means emails would not be scanned until delivery time. I wanted something that would scan emails earlier in the processing cycle. After looking at a few options, I chose simscan.

simscan is small, easy to configure and install, and surprisingly feature rich. One unusual characteristic is that simscan sets which features to make available at compile time and not through configuration files. I was worried that this would cause me trouble if I needed to change simscan’s behaviour after installation but as it’s turned out, I’ve had no problems with it and I’m very happy with my choice.

simscan is invoked by netqmail before netqmail’s standard qmail-queue program. This means that email is scanned prior to being queued. If I wanted to drop email tagged as spam, this would happen before the mail was queued thus saving time and resources on my server. It also means that I could return an error code to the connecting SMTP client rather than sending a bounce email which is what would happen if I rejected the email at delivery time.

simscan will work with netqmail out of the box. If you’re using vanilla qmail, you’ll need to patch it with Bruce Guenter’s QMAILQUEUE patch. You control when netqmail calls simscan in your tcprules file. Here’s mine.

:deny
10.10.0.1:allow,QMAILQUEUE="/var/qmail/bin/simscan"
192.168.10.:allow
127.0.0.1:allow

The first line will refuse all connections unless a more specific rule overrides it. The third and fourth lines indicate that connections from my local network and from localhost are accepted unconditionally. The second line is the relevant one from the standpoint of SpamAssassin. The IP shown is the virtual IP used by Thebe, my internet gateway, when establishing its VPN connection with the rest of my network. All email coming from outside my network (and from Thebe itself) will arrive at Callisto from 10.10.0.1 and will be scanned by SpamAssassin. My current setup won’t scan email generated by users on my network. If I wanted that, it’s as easy as adding the QMAILQUEUE environment variable to the third line.
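
To illustrate that last point, the sketch below writes a variant of the rules file in which mail from the local network is scanned too, and then compiles it with tcprules from ucspi-tcp. The function names and the TCPRULES variable are hypothetical, and the file locations are left to the caller because they vary between installs.

```sh
#!/bin/sh
TCPRULES=${TCPRULES:-tcprules}

# write_rules FILE: the rules above, with QMAILQUEUE added to the
# local-network line so that local mail is scanned as well.
write_rules() {
    cat > "$1" <<'EOF'
:deny
10.10.0.1:allow,QMAILQUEUE="/var/qmail/bin/simscan"
192.168.10.:allow,QMAILQUEUE="/var/qmail/bin/simscan"
127.0.0.1:allow
EOF
}

# compile_rules RULES CDB: build the cdb file that tcpserver reads.
# tcprules writes to a temporary file first and then renames it.
compile_rules() {
    "$TCPRULES" "$2" "$2.tmp" < "$1"
}
```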

simscan provides a useful summary of configured settings at compile time. Here is what simscan’s summary looks like for Callisto.

Current settings
---------------------------------------
 user                  = nobody
 qmail directory       = /var/qmail
 work directory        = /var/qmail/simscan
 control directory     = /var/qmail/control
 qmail queue program   = /var/qmail/bin/qmail-queue
 clamav scan           = OFF
 trophie scanning      = OFF
 attachement scan      = OFF
 ripmime program       = OFF
 custom smtp reject    = OFF
 drop message          = OFF
 regex scanner         = OFF
 quarantine processing = OFF
 domain based checking = OFF
 add received header   = OFF
 spam scanning         = ON
 spamc program         = /usr/local/bin/spamc
 spamc arguments       =
 spamc user            = OFF
 spam passthru         = ON

The first setting is misleading. It implies that simscan runs as user nobody but it doesn’t; it runs as user simscan. This has to do with the fact that I installed simscan from the FreeBSD port. Installing simscan from a tarball directly does not have this problem.

The rest of the configuration settings show that all I’m using simscan for is to scan incoming mail for spam using spamc. simscan has options that allow you to drop email if SpamAssassin determines it to be spam and you can even tell simscan to drop only email above a specified score. I’ve told simscan to not drop any email regardless of score as indicated by the spam passthru option. I deal with spam at delivery time with the aid of maildrop.

Delivering EMail – Integration with maildrop

As explained in the previous section, simscan doesn’t drop any email regardless of the score assigned to it by SpamAssassin. Instead, at delivery time, I look at the mail headers and if SpamAssassin has determined that an email is spam, I put it in a special folder in the recipient’s mail account called ‘Spam’. That way, the recipient can view the email or ignore it according to his or her wishes.

To accomplish this, I use maildrop. Procmail is another popular choice that I could have used here. I chose maildrop because I found its filter language easier to understand.

Every mail account on Callisto has a .mailfilter file in its home directory. The filter file looks for the relevant SpamAssassin header and if it’s found, delivers the email to the Spam folder. Otherwise, the email is delivered to the recipient’s inbox. Here is an example .mailfilter file.

HOME=`pwd`

#logfile "/home/vmail/maildrop.log"

# If Spamassassin says the mail is spam, put it in the Spam folder.
##
if ( /^X-Spam-Status: *Yes/)
{
        `test -d ./Maildir/.Spam`
        if( $RETURNCODE == 1 )
        {
                `/usr/local/bin/maildirmake -f "Spam" ./Maildir`
                `/usr/local/sbin/subscribeIMAP "Spam" "$HOME"`
        }

        to "Maildir/.Spam"
}

to Maildir

Here is the .qmail file from the same mail account. This is what tells netqmail to use maildrop for mail delivery.

|preline /usr/local/bin/maildrop .mailfilter
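
The decision made by the .mailfilter file above can be tried out on saved messages without involving maildrop at all. The hypothetical function below is a shell approximation of it: it applies the same pattern to the headers only and prints the folder the message would be delivered to. It does not create folders or deliver anything.

```sh
#!/bin/sh
# delivery_target: read a message on stdin and print its destination.
delivery_target() {
    # sed passes through the headers and stops at the first blank line.
    if sed '/^$/q' | grep -q '^X-Spam-Status: *Yes'; then
        echo "Maildir/.Spam"
    else
        echo "Maildir"
    fi
}
```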

You’ll notice that the .mailfilter file checks to make sure the Spam folder exists before trying to deliver mail to it. If the folder doesn’t exist, it is created and the folder added to the list of subscribed folders. You should know that I use Courier IMAP as my IMAP server. That’s important because the subscribeIMAP script called from the .mailfilter file only works for Courier IMAP. Here it is.

#!/bin/sh
#
# $Id: subscribeIMAP.sh,v 1.2 2004/02/18 15:54:44 matt Exp $
#
# This subscribes the folder passed as $1 to courier imap
# so that IMAP clients (including some webmail programs like
# Mailman and Squirrelmail) will recognize the extra folder.
#
# Matt Simerson - 12 June 2003

LIST="$2/Maildir/courierimapsubscribed"

if [ -f "$LIST" ]; then
        # if the file exists, check it for the new folder
        TEST=`cat "$LIST" | grep "INBOX.$1"`

        # if it is not there, add it
        if [ -z "$TEST" ]; then
                echo "INBOX.$1" >> "$LIST"
        fi
else
        # the file does not exist so we define the full list
        # and then create the file.
        FULL="INBOX\nINBOX.Sent\nINBOX.Templates\nINBOX.Trash\nINBOX.Drafts\nINBOX.$1"

        echo -e "$FULL" > "$LIST"
        /usr/sbin/chown vmail:vmail "$LIST"
        /bin/chmod 644 "$LIST"
fi

Training the Bayesian Filter

Spam can be highly variable through space and time. The spam you see hitting your domain may be quite different from the spam I see. SpamAssassin’s Bayesian filter is designed to let mail administrators train their SpamAssassin installs to catch the particular spam their sites encounter. I’ve found using the Bayesian filter a good way to increase SpamAssassin’s hit rate on spam without increasing the false-positive rate.

I should make it clear now that you can train the Bayesian filter to identify ham as well as spam. So if SpamAssassin produces a false-positive, you can train the filter to identify similar mail as ham the next time it is encountered.

You can train the Bayesian filter in two ways. The first is to have it auto-learn from email already identified by SpamAssassin as ham or spam. The second way is to run sa-learn on one or more emails, telling sa-learn whether the emails are ham or spam. I use both methods in my setup.

SpamAssassin’s Bayesian filter auto-learns by default so I didn’t have to do anything to configure it.

Training the Bayesian filter with sa-learn is more involved. The biggest headache I’ve encountered is managing the emails I want to use for training the filter. After much consideration, I decided to forward all training emails to two specific folders in my domain’s abuse email account. Lucky for me, I use SquirrelMail for webmail access on my system and SquirrelMail has a very useful plugin that allows me to easily forward emails to specific accounts with a click of the mouse. For those who are interested, the plugin is called ‘Spam Buttons’.

Because all mail users on domains I host will be using the same SquirrelMail plugin to send email to the abuse account, I can be confident in the format of the incoming mail. Here are the relevant portions of the maildrop .mailfilter file I use to process email destined for the abuse account.

import RECIPIENT

# accept messages to abuse at any domain I host. All such messages
# go to the same abuse account
if ("$RECIPIENT" =~ /abuse@/)
{
  # Put spam in the Spam folder and ham in the Ham folder
  # of the abuse account.
  if ( /^Subject: \[SPAM: /)
  {
    # Pull the attachment out of the message.
    exception {
      xfilter '/usr/local/bin/reformime -s 1.2 -e'
    }

    to "abuse/Maildir/.Spam"
  }
  if ( /^Subject: \[HAM: /)
  {
    # Pull the attachment out of the message.
    exception {
      xfilter '/usr/local/bin/reformime -s 1.2 -e'
    }

    to "abuse/Maildir/.Ham"
  }

  to "abuse/Maildir"
}

The SquirrelMail plugin forwards emails as attachments and marks up the subject of the email with the word ‘SPAM’ or ‘HAM’ depending on what the user says it is. The reformime program I use to unpack the attachment is part of the maildrop package.

Interestingly enough, if I use Mozilla’s ‘Forward As Attachment’ option and replace ‘FWD’ in the subject with either ‘SPAM’ or ‘HAM’, the email is processed correctly by the above .mailfilter file. Very convenient.

Once emails are safely in the appropriate mail folders, I use three files for training the Bayesian filter. One is the actual script that calls sa-learn while the other two contain the directories where the target emails are. Here is a listing of the three files.

-rw-r--r--  1 vmail  vmail   71 Apr 21 19:23 bayes-ham-folders
-rw-r--r--  1 vmail  vmail   73 Apr 21 19:22 bayes-spam-folders
-rwxr--r--  1 vmail  vmail  379 Apr 16 00:58 bayes-teach

And here are the contents of the files.

bayes-ham-folders:

/home/vmail/abuse/Maildir/.Ham/cur
/home/vmail/abuse/Maildir/.Ham/new

bayes-spam-folders:

/home/vmail/abuse/Maildir/.Spam/cur
/home/vmail/abuse/Maildir/.Spam/new

bayes-teach:

#!/bin/sh

/usr/local/bin/sa-learn --spam --username simscan \
--siteconfigpath=/usr/local/etc/mail/spamassassin \
--folders=/home/vmail/bayes-spam-folders

/usr/local/bin/sa-learn --ham --username simscan \
--siteconfigpath=/usr/local/etc/mail/spamassassin \
--folders=/home/vmail/bayes-ham-folders

Notice that the --username option in bayes-teach is set to simscan. Tokens saved in the Bayesian filter database are all associated with a username. This allows individualized Bayesian filters to be maintained per email account. Because I have SpamAssassin set up site-wide, all tokens should appear under the same username. Because simscan runs as user simscan, and because spamc is called from simscan, I want all tokens saved in my Bayesian filter database associated with the user simscan. Setting --username in bayes-teach ensures that this happens. Because spamc runs as the user simscan, this happens automatically for auto-learned Bayes tokens. Emails saved in my Auto-Whitelist database are also associated with the username simscan for the same reason.

Keep in mind that any tokens associated with a username other than simscan won’t be applied to incoming emails passed to spamd by simscan.

I use this cron job to run bayes-teach once a day.

# teach Bayesian filter on spam and ham once a day
43 0 * * *      root    /home/vmail/bayes-teach
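
To see whether training is having any effect, the message counts held in the Bayes database can be read back with sa-learn’s --dump magic option. The sketch below extracts the number of spam and ham messages learned for a given username. The function name and variables are hypothetical, and the parsing assumes the usual dump layout in which the count is the third column of the nspam and nham lines. As far as I know, SpamAssassin by default does not use the Bayesian filter until it has learned at least 200 spam and 200 ham messages, so these are the numbers to watch on a new install.

```sh
#!/bin/sh
SA_LEARN=${SA_LEARN:-/usr/local/bin/sa-learn}
SITE_CONFIG=${SITE_CONFIG:-/usr/local/etc/mail/spamassassin}

# bayes_counts USERNAME: print "spam=N ham=M" for that user's tokens.
bayes_counts() {
    "$SA_LEARN" --username "$1" --siteconfigpath="$SITE_CONFIG" --dump magic |
        awk '$NF == "nspam" { spam = $3 }
             $NF == "nham"  { ham = $3 }
             END { printf "spam=%d ham=%d\n", spam, ham }'
}
```

On the setup described here it would be called as bayes_counts simscan.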

Updating SpamAssassin Rules

Using sa-update to periodically update SpamAssassin rules is a fairly new feature and it isn’t as polished as the rest of SpamAssassin. Having said that, I think updating rules more frequently than when I do release upgrades is a valuable capability to have. It also hasn’t caused me any problems. For these reasons, I include a description of my setup here in case you want to have a go at it yourself.

I’m pulling updates from the default channel, updates.spamassassin.org, and two SARE channels. I’m also using gpg to verify the source and integrity of the downloaded rules. The public key for updates.spamassassin.org is installed along with SpamAssassin. You can find it in SpamAssassin’s configuration directory. On my system, it’s /usr/local/share/spamassassin. The SARE key location is listed in the Further Reading section.

I used these commands to install the public keys used by sa-update. I executed the commands from the /usr/local/etc/mail/spamassassin/sa-update-keys directory to make sure the keys were added to the correct keyring.

sa-update --import /usr/local/share/spamassassin/sa-update-pubkey.txt

sa-update --import /usr/local/etc/mail/spamassassin/sa-update-keys/sare.key

To update SpamAssassin’s rules, I use this script.

/usr/local/etc/mail/spamassassin/getruleupdate.sh

#!/bin/sh

HOMEDIR=/usr/local/etc/mail/spamassassin/sa-update-keys

# Retrieve updates to SpamAssassin's rules.
# Rules go into the default directory of /var/lib/spamassassin/.

/usr/local/bin/sa-update --channelfile $HOMEDIR/update-channels.txt \
    --gpghomedir $HOMEDIR --gpgkey 856AA88A -D

# Restart spamd after an update.
/usr/local/bin/svc -h /var/service/spamd

And here is what the update-channels.txt file looks like.

updates.spamassassin.org
70_sare_stocks.cf.sare.sa-update.dostech.net
70_sare_oem.cf.sare.sa-update.dostech.net

This script first runs sa-update to download rules to the default directory, /var/lib/spamassassin. If the gpg signature on a download can’t be verified, the update will fail. The --gpgkey option tells sa-update to trust the SARE key. Interestingly, the default channel key doesn’t need the --gpgkey option to work. I also have the -D switch set so that a detailed record of what sa-update did is emailed to me. I have FreeBSD set up to email anything a cron job prints. After running sa-update, the script sends a HUP signal to spamd so that it will load the new rules. You’ll recall from the SpamAssassin Setup section that this script is runnable only by root. I don’t want just anyone updating my rulesets.
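
A possible refinement is to send the HUP only when new rules actually arrived. sa-update reports what happened through its exit status. The sketch below assumes the codes described in the sa-update man page: 0 when an update was downloaded and installed, 1 when no fresh updates were available, and anything else an error. Later releases define additional codes, so check the man page for your version. The helper name action_for_status is hypothetical.

```shell
#!/bin/sh
# Sketch only: map sa-update's exit status to an action.
# Assumed codes: 0 = update installed, 1 = no fresh updates,
# anything else = error. action_for_status is a hypothetical helper name.
action_for_status() {
    case "$1" in
        0) echo restart ;;   # new rules were downloaded and installed
        1) echo none ;;      # nothing new, leave spamd alone
        *) echo error ;;     # something went wrong, do not touch spamd
    esac
}

# Possible use inside getruleupdate.sh (commented out so the sketch has
# no side effects):
#   /usr/local/bin/sa-update --channelfile "$HOMEDIR/update-channels.txt" \
#       --gpghomedir "$HOMEDIR" --gpgkey 856AA88A -D
#   status=$?
#   if [ "$(action_for_status "$status")" = restart ]; then
#       /usr/local/bin/svc -h /var/service/spamd
#   fi
```

Treating every unknown status as an error errs on the side of leaving a working spamd alone.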

I use this cron job to update SpamAssassin rules once a week.

# check for SpamAssassin rule updates once a week
20 2 * * 4      root    /usr/local/etc/mail/spamassassin/getruleupdate.sh

A word of warning to those thinking about using sa-update. For SpamAssassin versions prior to 3.1.4, it has been reported on the SpamAssassin mailing list that sa-update can fail in a most ungraceful manner, leaving your SpamAssassin installation non-functional. The problem seems to occur when the rule update directory (/var/lib/spamassassin by default) doesn’t exist prior to the first run of sa-update. sa-update creates the directory but then doesn’t download any rules into it, causing SpamAssassin to ignore the rules in its regular configuration directory. It is also reported that running sa-update a second time fixes the problem, as rules are then downloaded correctly into the waiting directory. So check your rule update directory after running sa-update the first time to make sure everything worked correctly. The Changelog for 3.1.4 indicates that some changes have been made to minimize the occurrence of this problem. Since I have yet to experience this bug, I can’t say whether these changes have been effective or not.
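
The check suggested above can be scripted. This is a minimal sketch that assumes downloaded rules show up as files ending in .cf somewhere beneath the rule update directory. The helper name rules_present is hypothetical.

```shell
#!/bin/sh
# Sketch only: succeed if the given directory exists and holds at least
# one .cf rules file somewhere beneath it.
# rules_present is a hypothetical helper name, not part of SpamAssassin.
rules_present() {
    dir=$1
    [ -d "$dir" ] || return 1
    # find prints every matching file; grep -q . succeeds if anything was printed
    find "$dir" -type f -name '*.cf' | grep -q .
}

# Example check after the first sa-update run (commented out so the
# sketch has no side effects):
#   rules_present /var/lib/spamassassin || echo "no rules found, run sa-update again"
```

An existing but empty directory fails the check, which is exactly the state the reported bug leaves behind.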

Controlling spamd with daemontools

Daemontools is a package that includes programs used to control the startup and shutdown of long-running processes. This package is intended to ensure that processes that are supposed to run all the time actually do.

I’ve set up spamd to be controlled by the supervise program from the daemontools package. I also use svscan as the overseer to ensure all processes controlled by supervise are started during system boot. It’s a simple matter to set up spamd to work with supervise.

There are two processes that supervise will control. One is the spamd process itself and the other is multilog which, if you haven’t already guessed, handles logging.

supervise and multilog require two run scripts that tell these processes what to do. Here is a listing of the directory tree.

/usr/local/supervise/

drwxr-xr-x  4 root  wheel  512 Mar 28 07:06 spamd
drwxr-xr-x  3 root  wheel  512 Mar 28 07:06 spamd/log
-rwxr-x---  1 root  wheel   58 Mar 26 23:29 spamd/log/run
-rwxr-x---  1 root  wheel  309 May 14 13:59 spamd/run

Once you start supervise the first time, a lot more files and directories will appear under the spamd directory. I’ve listed only the ones that I had to put there myself.

Here is what the two run scripts look like.

/usr/local/supervise/spamd/run

#!/bin/sh

exec /usr/local/bin/spamd --siteconfigpath=/usr/local/etc/mail/spamassassin \
--pidfile=/var/run/spamd.pid --syslog=stderr 2>&1

/usr/local/supervise/spamd/log/run

#!/bin/sh

exec /usr/local/bin/multilog t /var/log/spamd

Multilog automatically creates log directories named in its run script when it starts for the first time if they aren’t already there. Here is what my log directory looks like.

drwxr-x---  2 root    wheel       512 May 20 16:18 /var/log/spamd

Once the scripts are the way you want them, the only thing you need to do is create a link from the /usr/local/supervise/spamd directory to wherever your service directory is. On Callisto, mine is /var/service. So all I had to do was issue this command:

ln -s /usr/local/supervise/spamd /var/service

Five seconds later, spamd and multilog were up and running. This worked because I had already installed daemontools and svscan was already running.
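
To confirm that a service really is up, daemontools provides svstat. The sketch below assumes svstat prints a line of the form "/var/service/spamd: up (pid 1234) 5 seconds" for a running service, which is how the daemontools documentation describes it. The helper name is_up is hypothetical.

```shell
#!/bin/sh
# Sketch only: succeed if a line of svstat output reports the service as up.
# Assumes the "<dir>: up (pid N) S seconds" format for a running service.
# is_up is a hypothetical helper name, not part of daemontools.
is_up() {
    case "$1" in
        *": up (pid "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (commented out so the sketch has no side effects):
#   is_up "$(svstat /var/service/spamd)" && echo "spamd is running"
#   is_up "$(svstat /var/service/spamd/log)" && echo "multilog is running"
```

Matching on ": up (pid " rather than the bare word "up" avoids a false positive on a stopped service, whose status line can end in "normally up".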

A useful command to keep in mind is this.

svc -h /var/service/spamd

Use this command to have spamd reread local.cf and any rules files in the site configuration directory without needing to stop and start the process.

SQL Setup

I store Bayesian filter and Auto-Whitelist data in an SQL database. Configuration options in local.cf specify the connection parameters for the SQL database and the credentials to use when logging into the database. I use MySQL because I was already using it for other things on my network.

Here are the SQL commands I used to create the database for the Bayesian filter data and to grant the required permissions to the user SpamAssassin uses to access the database.

CREATE DATABASE sabayesfilter;

CREATE USER 'sa'@'localhost' IDENTIFIED BY '<password>';
CREATE USER 'sa'@'callisto.arda.homeunix.net' IDENTIFIED BY '<password>';

CREATE TABLE sabayesfilter.bayes_expire (
  id int(11) NOT NULL default '0',
  runtime int(11) NOT NULL default '0',
  KEY bayes_expire_idx1 (id)
) ENGINE=MyISAM;

CREATE TABLE sabayesfilter.bayes_global_vars (
  variable varchar(30) NOT NULL default '',
  value varchar(200) NOT NULL default '',
  PRIMARY KEY  (variable)
) ENGINE=MyISAM;

INSERT INTO sabayesfilter.bayes_global_vars VALUES ('VERSION','3');

CREATE TABLE sabayesfilter.bayes_seen (
  id int(11) NOT NULL default '0',
  msgid varchar(200) binary NOT NULL default '',
  flag char(1) NOT NULL default '',
  PRIMARY KEY  (id,msgid)
) ENGINE=MyISAM;

CREATE TABLE sabayesfilter.bayes_token (
  id int(11) NOT NULL default '0',
  token char(5) NOT NULL default '',
  spam_count int(11) NOT NULL default '0',
  ham_count int(11) NOT NULL default '0',
  atime int(11) NOT NULL default '0',
  PRIMARY KEY  (id, token),
  INDEX bayes_token_idx1 (token),
  INDEX bayes_token_idx2 (id, atime)
) ENGINE=MyISAM;

CREATE TABLE sabayesfilter.bayes_vars (
  id int(11) NOT NULL AUTO_INCREMENT,
  username varchar(200) NOT NULL default '',
  spam_count int(11) NOT NULL default '0',
  ham_count int(11) NOT NULL default '0',
  token_count int(11) NOT NULL default '0',
  last_expire int(11) NOT NULL default '0',
  last_atime_delta int(11) NOT NULL default '0',
  last_expire_reduce int(11) NOT NULL default '0',
  oldest_token_age int(11) NOT NULL default '2147483647',
  newest_token_age int(11) NOT NULL default '0',
  PRIMARY KEY  (id),
  UNIQUE bayes_vars_idx1 (username)
) ENGINE=MyISAM;

GRANT SELECT, UPDATE, DELETE, INSERT ON TABLE sabayesfilter.bayes_token TO 'sa'@'localhost';
GRANT SELECT, UPDATE, DELETE, INSERT ON TABLE sabayesfilter.bayes_vars TO 'sa'@'localhost';
GRANT SELECT, DELETE, INSERT ON TABLE sabayesfilter.bayes_seen TO 'sa'@'localhost';
GRANT SELECT, DELETE, INSERT ON TABLE sabayesfilter.bayes_expire TO 'sa'@'localhost';
GRANT SELECT ON TABLE sabayesfilter.bayes_global_vars TO 'sa'@'localhost';

GRANT SELECT, UPDATE, DELETE, INSERT ON TABLE sabayesfilter.bayes_token TO 'sa'@'callisto.arda.homeunix.net';
GRANT SELECT, UPDATE, DELETE, INSERT ON TABLE sabayesfilter.bayes_vars TO 'sa'@'callisto.arda.homeunix.net';
GRANT SELECT, DELETE, INSERT ON TABLE sabayesfilter.bayes_seen TO 'sa'@'callisto.arda.homeunix.net';
GRANT SELECT, DELETE, INSERT ON TABLE sabayesfilter.bayes_expire TO 'sa'@'callisto.arda.homeunix.net';
GRANT SELECT ON TABLE sabayesfilter.bayes_global_vars TO 'sa'@'callisto.arda.homeunix.net';

Here are the SQL commands I used to create the database for the Auto-Whitelist data and to grant the required permissions. SpamAssassin uses the same user to access the Auto-Whitelist that it uses for the Bayesian filter database.

CREATE DATABASE saawl;

CREATE TABLE saawl.awl (
  username varchar(100) NOT NULL default '',
  email varchar(200) NOT NULL default '',
  ip varchar(10) NOT NULL default '',
  count int(11) default '0',
  totscore float default '0',
  PRIMARY KEY  (username,email,ip)
) ENGINE=MyISAM;

GRANT SELECT,INSERT,UPDATE,DELETE ON saawl.* TO 'sa'@'localhost';
GRANT SELECT,INSERT,UPDATE,DELETE ON saawl.* TO 'sa'@'callisto.arda.homeunix.net';

Software Home Sites

daemontools http://cr.yp.to/daemontools.html
maildrop http://www.courier-mta.org/maildrop/
MySQL http://dev.mysql.com/
netqmail http://qmail.org/netqmail/
simscan http://inter7.com/?page=simscan
SpamAssassin http://spamassassin.apache.org/index.html
Vipul’s Razor http://razor.sourceforge.net/

Further Reading

ImageInfo SpamAssassin Plugin http://www.rulesemporium.com/plugins.htm
QMAILQUEUE patch http://www.qmail.org/
SpamAssassin Rules Emporium (SARE) http://www.rulesemporium.com/
SARE sa-update Howto http://daryl.dostech.ca/sa-update/sare/sare-sa-update-howto.txt
SARE sa-update GPG key http://daryl.dostech.ca/sa-update/sare/GPG.KEY