
The Arda.Homeunix.Net SpamAssassin Setup

Table of contents

  1. Introduction
  2. Preliminaries
  3. SpamAssassin Setup
  4. Vipul’s Razor
  5. Tagging EMail - Integration with netqmail and simscan
  6. Delivering EMail - Integration with maildrop
  7. Training the Bayesian Filter
  8. Updating SpamAssassin Rules
  9. Controlling spamd with daemontools
  10. SQL Setup
  11. Software Home Sites
  12. Further Reading

Introduction

This document describes how I use SpamAssassin to keep spam out of my inbox. SpamAssassin is a program for tagging email. It reads in an email message, classifies it as either legitimate (ham) or illegitimate (spam), and writes it out again with specific mail headers added. The added headers tell you whether SpamAssassin thinks the email is ham or spam. SpamAssassin works by applying predefined rules, each of which carries a positive or negative score. All rules that match a particular email contribute to that email’s total score. If the total score reaches a configurable threshold, the email is identified as spam. In addition to its rules, SpamAssassin includes a Bayesian filter, which gives mail administrators a convenient way to tailor SpamAssassin to the specific email processed by their systems. The score assigned to an email can also be influenced by SpamAssassin’s Auto-Whitelist feature. The Auto-Whitelist keeps track of the average score of email on a per-sender basis and pushes the score of each subsequent email towards the sender’s recorded average.
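To make the scoring arithmetic concrete, here is a rough sketch in shell. It is not how SpamAssassin is implemented. The function names (total_score, classify, awl_adjust) are invented for illustration, and the Auto-Whitelist step assumes the adjustment is (mean - score) * factor with a factor of 0.5, which is my understanding of the auto_whitelist_factor default.

```sh
#!/bin/sh
# Illustration only: these helpers are hypothetical and are not part of
# SpamAssassin. They mimic the arithmetic described in the text.

# total_score: read "RULE_NAME score" lines on stdin and print the sum.
total_score() {
    awk '{ sum += $2 } END { printf "%.1f\n", sum }'
}

# classify SCORE REQUIRED: print "spam" when SCORE reaches REQUIRED,
# otherwise print "ham".
classify() {
    awk -v s="$1" -v r="$2" \
        'BEGIN { if (s + 0 >= r + 0) print "spam"; else print "ham" }'
}

# awl_adjust SCORE MEAN [FACTOR]: move SCORE part of the way towards the
# sender's recorded MEAN. FACTOR defaults to 0.5 (an assumption, see above).
awl_adjust() {
    awk -v s="$1" -v m="$2" -v f="${3:-0.5}" \
        'BEGIN { printf "%.1f\n", s + (m - s) * f }'
}
```

Feeding total_score a few "RULE score" lines gives the sum that classify then compares against the required value, 5.0 in the sample headers later in this document. awl_adjust shows why a sender with a long history of low scores pulls a single high-scoring message back down.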

I’ve managed to make use of many of SpamAssassin’s features. Herein you will find descriptions of:

Because SpamAssassin is never used in isolation, included are descriptions of how I’ve integrated SpamAssassin with my MTA, netqmail, and my MDA, maildrop.

Preliminaries

SpamAssassin does no routing of email based on its classification; all it does is the classifying. For this reason, SpamAssassin is always used with other programs that take advantage of the email headers it adds.

In the Arda Network, SpamAssassin is installed on Callisto, my mail server. It is integrated with netqmail, my MTA (Mail Transfer Agent), and maildrop, my MDA (Mail Delivery Agent). I also use SquirrelMail for webmail access and I use one of its many plugins to help train SpamAssassin’s Bayesian filter. You will find an overview, including a very nice diagram, of the Arda Network here.

SpamAssassin installations can be divided into two broad categories: per account and site wide. A per account install, as one would expect, invokes SpamAssassin for specific mail accounts and happens at mail delivery time. A site wide install applies to all, or almost all, mail accounts in an email system and is invoked when email arrives at a mail server. SpamAssassin in the Arda Network is installed site wide; it is invoked for all email arriving at Callisto from outside my network.

In addition to the two types of installation, SpamAssassin can be invoked in two different ways. The first is to run the Perl script spamassassin, which contains all the functionality needed to classify an email passed to it. spamassassin is intended to be run each time an email needs to be classified as ham or spam. The second way is to use spamc and spamd. spamd is also a Perl script but runs as a long-lived daemon process. spamc is a program written in C. It provides an interface between spamd, which does the classifying and tagging of emails, and other programs. The intention here is to avoid having to invoke the Perl interpreter for each email to be examined by SpamAssassin. spamd can be left to run in the background continuously while only the small and fast spamc need be invoked when an email is to be examined.

Both methods of invoking SpamAssassin have their place. Using spamassassin, however, is really only an option for per account installs. For site wide installs in mail systems that handle anything above a minimal amount of email, you want to use spamd/spamc. And that’s what I use in the Arda Network.

Here is a list of the various software packages that I describe in this document.

SpamAssassin Setup

Here is what my site configuration directory (--siteconfigpath option of spamd) looks like.

/usr/local/etc/mail/spamassassin

A stock SpamAssassin install will include only the three *.pre files and the local.cf file in this directory. I added the other files and the sa-update-keys directory myself.

The getruleupdate.sh script and sa-update-keys directory are used by sa-update and are explained in the Updating SpamAssassin Rules section. The imageinfo.cf file is used by the ImageInfo plugin. The Perl module associated with this plugin, ImageInfo.pm, went into SpamAssassin’s plugin directory. On my system, that’s here:

I use the ImageInfo plugin to trap stock spam composed mostly of gif, png, or jpeg images. You can find a link to the ImageInfo plugin in the Further Reading section below.

Here is the local.cf file.

This is the file where you put configuration options. You can also add custom rules and override standard rules here. If you have a lot of custom rules or do lots of overrides, I’d suggest putting them into their own file(s) to avoid clutter.

I’ve modified a number of the options in this file and added a few as well.
trusted_networks: This option tells SpamAssassin that mail relays and MXs on these networks won’t originate spam. The practical upshot is that DNS blacklist checks won’t be performed for servers on listed networks.
skip_rbl_checks: This option tells SpamAssassin whether or not to perform checks against DNS based Realtime Block Lists. I do this with netqmail and rblsmtpd so I tell SpamAssassin to skip these checks.
report_contact: The report contact appears in the report generated by SpamAssassin when it determines that an email is spam and report_safe is set to 1 or 2. SpamAssassin uses some generic text if you don’t specify a report contact. In versions before 3.1.4, a report contact was generated automatically by SpamAssassin but this caused a problem if you were using sa-update.
score: I’ve overridden two rules related to the Bayesian filter. What I’ve done is increase the score for these two rules. I have yet to see these two rules hit on an email I did not consider spam so I felt justified in increasing their scores. You’ll notice that the new scores aren’t quite enough to tag an email as spam all by themselves. I’m not ready to let the Bayesian filter go it alone yet.
bayes_ignore_header: These options tell the Bayesian filter to ignore the listed headers when learning what makes ham and spam. I’ve simply listed the headers that SpamAssassin itself adds to emails it scans.
SQL database options: All the configuration lines after the bayes_ignore_header options deal with telling SpamAssassin to use an SQL database to store Bayesian filter and Auto-Whitelist information. If these options were not present, this information would be stored in dbm database files on Callisto. The volume of mail my system deals with doesn’t really justify running an SQL database but I decided to set one up anyway just to see how well it worked. Conclusion: it works very well. See the SQL Setup section for more details.

Here is the init.pre file.

This file lists a few of the plugins available to SpamAssassin. I turned on RelayCountry and turned off Hashcash and SPF. I found that I just wasn’t getting enough hits from Hashcash or SPF to justify having them on. I added a line in this file to activate the ImageInfo plugin.

And here is the v310.pre file.

Like init.pre, this file lists plugins you can turn on or off. This is where I turn on the Auto-Whitelist plugin among others.

And, finally, here is the v312.pre file.

This is yet another plugin file. I’m in the dark about why SpamAssassin needs a different plugin file for each release. I wonder how many files we can expect to see before they begin to be consolidated.

Here are two examples of the headers added by SpamAssassin to email arriving at Callisto.

Classification Headers
ham
X-Spam-Checker-Version: SpamAssassin 3.1.5 (2006-08-29) on 
	lorien.arda.homeunix.net
X-Spam-Level: 
X-Spam-Status: No, score=-2.5 required=5.0 tests=AWL,BAYES_00,
	DK_POLICY_SIGNSOME autolearn=ham version=3.1.5
	
spam
X-Spam-Flag: YES
X-Spam-Checker-Version: SpamAssassin 3.1.5 (2006-08-29) on 
	lorien.arda.homeunix.net
X-Spam-Level: ***********
X-Spam-Status: Yes, score=11.5 required=5.0 tests=BAYES_99,DK_POLICY_SIGNSOME,
	FROM_LOCAL_NOVOWEL,HTML_MESSAGE,MIME_HTML_ONLY,SARE_GIF_ATTACH,
	TVD_FW_GRAPHIC_ID2,TVD_FW_GRAPHIC_NAME_MID autolearn=no version=3.1.5
	

Remember that all these headers are listed in bayes_ignore_header options in local.cf. When I feed sa-learn a false-negative, I don’t want it thinking that 'X-Spam-Status: No' means the email is spam.
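The headers shown above are easy to inspect from a script. As a small sketch (the function name spam_status_score is made up for this example), the following pulls the score out of the X-Spam-Status header of a message read from standard input. It assumes the score=... field sits on the first line of the header, as it does in both samples above, and it stops at the first empty line so that text in the message body is never mistaken for a header.

```sh
#!/bin/sh
# Hypothetical helper, not part of SpamAssassin.
# spam_status_score: print the score= value from the X-Spam-Status header
# of the message on stdin. Prints nothing if the header is absent.
spam_status_score() {
    sed -n '
        # An empty line ends the header block: stop reading there.
        /^$/q
        # On the X-Spam-Status line, keep only the number after score=.
        /^X-Spam-Status:/{
            s/.*score=\([-0-9.]*\).*/\1/p
            q
        }
    '
}
```

Run against the two sample messages above, this would be expected to print -2.5 for the ham example and 11.5 for the spam example.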

Vipul’s Razor

One of the SpamAssassin plugins I use is Razor2. Razor2 provides an interface to the Razor distributed spam filtering network. Because I built SpamAssassin using the FreeBSD port, I didn’t need to install Razor separately. I simply told the port that I wanted Razor support. While Razor works without any additional configuration, doing a bit of extra work makes Razor operate much more efficiently.

Razor is structured as a client-server application. A Razor client calculates a digest from a mail message and then contacts a publicly accessible Razor server to see if that digest is in the server’s database. The list of servers to use, how often to update this list, and other parameters are kept in configuration files. The FreeBSD SpamAssassin port, however, doesn’t create these configuration files. This means that the Razor client has to retrieve the list of Razor servers every time it processes a mail message. This can generate a lot of needless network traffic on a busy server.

To create a default set of configuration files for Razor, I used this command.

Notice that I specified the directory in which to put the configuration files. This is the home directory of the user 'simscan' on my system. You’ll read more about simscan in the Tagging Email section but the relevant bit of information here is that Razor always runs as simscan so it is in simscan’s home directory that it will look for its configuration files. Here is a listing of that directory after running the above command.

/var/qmail/simscan/

I needed to set the ownership of the files after they were created.

Here is what the razor-agent.conf file looks like.

The only change I needed to make in this file was to set the debuglevel to 0. Razor doesn’t do any sort of log rotation so unless you take specific measures to prevent it, Razor’s log file will eventually eat up all your disk space. I simply left the default value of 3 in place for a time until I was sure Razor was working properly and then changed it to 0. A value of zero means that no log messages will be generated at all.
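Changing the debug level is a one-line edit, but it can also be scripted. The sketch below assumes that razor-agent.conf uses a "debuglevel = N" line, which is what I believe the generated file contains; the function name set_razor_debuglevel is invented for this example. Because the file is replaced by a new copy, ownership and permissions may need to be set again afterwards.

```sh
#!/bin/sh
# Hypothetical helper, not shipped with Razor.
# set_razor_debuglevel FILE LEVEL: rewrite the debuglevel line in FILE.
set_razor_debuglevel() {
    conf=$1
    level=$2
    # Accept only a plain non-negative integer as the level.
    case "$level" in
        ''|*[!0-9]*) echo "level must be a number" >&2; return 2 ;;
    esac
    tmp="$conf.tmp.$$"
    # Write the edited copy first, then move it over the original.
    sed "s/^debuglevel[[:space:]]*=.*/debuglevel = $level/" "$conf" > "$tmp" &&
        mv "$tmp" "$conf"
}
```

With the directory shown earlier, a call such as set_razor_debuglevel /var/qmail/simscan/razor-agent.conf 0 would switch logging off once Razor is known to be working.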

With a proper set of configuration files in place, Razor is now able to look up its list of available servers locally each time it is run, thus reducing the time needed to process email.

Tagging EMail - Integration with netqmail and simscan

As outlined in the Preliminaries section, I use SpamAssassin in a site wide configuration. To accomplish this, I’ve integrated SpamAssassin with my MTA, netqmail. When netqmail receives an incoming email, it invokes spamc which passes the email to spamd for tagging. The tagged email is then returned to netqmail for local delivery. I’ve specified local delivery on purpose because in my setup, outgoing email is not scanned by SpamAssassin.

There are different ways to invoke spamc from netqmail. One way would be to use .qmail files that call spamc but that means emails would not be scanned until delivery time. I wanted something that would scan emails earlier in the processing cycle. After looking at a few options, I chose simscan.

simscan is small, easy to configure and install, and surprisingly feature rich. One unusual characteristic is that simscan sets which features to make available at compile time and not through configuration files. I was worried that this would cause me trouble if I needed to change simscan’s behaviour after installation but as it’s turned out, I’ve had no problems with it and I’m very happy with my choice.

simscan is invoked by netqmail before netqmail’s standard qmail-queue program. This means that email is scanned prior to being queued. If I wanted to drop email tagged as spam, this would happen before the mail was queued thus saving time and resources on my server. It also means that I could return an error code to the connecting SMTP client rather than sending a bounce email which is what would happen if I rejected the email at delivery time.

simscan will work with netqmail out of the box. If you’re using vanilla qmail, you’ll need to patch it with Bruce Guenter’s QMAILQUEUE patch. You control when netqmail calls simscan in your tcprules file. Here’s mine.

The first line will refuse all connections unless a more specific rule overrides it. The third and fourth lines indicate that connections from my local network and from localhost are accepted unconditionally. The second line is the relevant one from the standpoint of SpamAssassin. The IP shown is the virtual IP used by Thebe, my internet gateway, when establishing its VPN connection with the rest of my network. All email coming from outside my network (and from Thebe itself) will arrive at Callisto from 10.10.0.1 and will be scanned by SpamAssassin. My current setup won’t scan email generated by users on my network. If I wanted that, it would be as easy as adding the QMAILQUEUE environment variable to the third line.

simscan provides a useful summary of configured settings at compile time. Here is what simscan’s summary looks like for Callisto.

The first setting is misleading. It implies that simscan runs as user nobody but it doesn’t; it runs as user simscan. This has to do with the fact that I installed simscan from the FreeBSD port. Installing simscan from a tarball directly does not have this problem.

The rest of the configuration settings show that all I’m using simscan for is to scan incoming mail for spam using spamc. simscan has options that allow you to drop email if SpamAssassin determines it to be spam and you can even tell simscan to drop only email above a specified score. I’ve told simscan to not drop any email regardless of score as indicated by the spam passthru option. I deal with spam at delivery time with the aid of maildrop.

Delivering EMail - Integration with maildrop

As explained in the previous section, simscan doesn’t drop any email regardless of the score assigned to it by SpamAssassin. Instead, at delivery time, I look at the mail headers and if SpamAssassin has determined that an email is spam, I put it in a special folder in the recipient’s mail account called 'Spam'. That way, the recipient can view the email or ignore it according to his or her wishes.

To accomplish this, I use maildrop. Procmail is another popular choice that I could have used here. I chose maildrop because I found its filter language easier to understand.

Every mail account on Callisto has a .mailfilter file in its home directory. The filter file looks for the relevant SpamAssassin header and if it’s found, delivers the email to the Spam folder. Otherwise, the email is delivered to the recipient’s inbox. Here is an example .mailfilter file.
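The decision the filter makes can be sketched in shell as well. This is not maildrop syntax and it is not the filter file itself, only an illustration of the logic: look for X-Spam-Flag: YES in the header block and choose the delivery folder accordingly. The function name target_folder and the maildir path in the usage note are hypothetical. The .Spam folder name follows the Maildir layout seen elsewhere in this document.

```sh
#!/bin/sh
# Hypothetical illustration of the delivery decision; real delivery is
# done by maildrop.
# target_folder MAILDIR: read a message on stdin and print the maildir
# that should receive it.
target_folder() {
    maildir=$1
    # Only the header block counts, and it ends at the first empty line.
    if sed '/^$/q' | grep -qi '^X-Spam-Flag:[[:space:]]*YES'; then
        printf '%s/.Spam\n' "$maildir"
    else
        printf '%s\n' "$maildir"
    fi
}
```

For instance, target_folder /home/vmail/user/Maildir < message prints the Spam folder for a flagged message and the plain maildir for anything else.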

Here is the .qmail file from the same mail account. This is what tells netqmail to use maildrop for mail delivery.

You’ll notice that the .mailfilter file checks to make sure the Spam folder exists before trying to deliver mail to it. If the folder doesn’t exist, it is created and the folder added to the list of subscribed folders. You should know that I use Courier IMAP as my IMAP server. That’s important because the subscribeIMAP script called from the .mailfilter file only works for Courier IMAP. Here it is.

Training the Bayesian Filter

Spam can be highly variable through space and time. The spam you see hitting your domain may be quite different from the spam I see. SpamAssassin’s Bayesian filter is designed to let mail administrators train their SpamAssassin installs to catch the particular spam their sites encounter. I’ve found using the Bayesian filter a good way to increase SpamAssassin’s hit rate on spam without increasing the false-positive rate.

I should make it clear now that you can train the Bayesian filter to identify ham as well as spam. So if SpamAssassin produces a false-positive, you can train the filter to identify similar mail as ham the next time it is encountered.

You can train the Bayesian filter in two ways. The first is to have it auto-learn from email already identified by SpamAssassin as ham or spam. The second way is to run sa-learn on one or more emails, telling sa-learn whether the emails are ham or spam. I use both methods in my setup.

SpamAssassin’s Bayesian filter auto-learns by default so I didn’t have to do anything to configure it.

Training the Bayesian filter with sa-learn is more involved. The biggest headache I’ve encountered is managing the emails I want to use for training the filter. After much consideration, I decided to forward all training emails to two specific folders in my domain’s abuse email account. Lucky for me, I use SquirrelMail for webmail access on my system and SquirrelMail has a very useful plugin that allows me to easily forward emails to specific accounts with a click of the mouse. For those who are interested, the plugin is called 'Spam Buttons'.

Because all mail users on domains I host will be using the same SquirrelMail plugin to send email to the abuse account, I can be confident in the format of the incoming mail. Here are the relevant portions of the maildrop .mailfilter file I use to process email destined for the abuse account.

The SquirrelMail plugin forwards emails as attachments and marks up the subject of the email with the word 'SPAM' or 'HAM' depending on what the user says it is. The reformime program I use to unpack the attachment is part of the maildrop package.

Interestingly enough, if I use Mozilla’s 'Forward As Attachment' option and replace 'FWD' in the subject with either 'SPAM' or 'HAM', the email is processed correctly by the above .mailfilter file. Very convenient.

Once emails are safely in the appropriate mail folders, I use three files for training the Bayesian filter. One is the actual script that calls sa-learn while the other two contain the directories where the target emails are. Here is a listing of the three files.

And here are the contents of the files.

File Contents
bayes-ham-folders
/home/vmail/abuse/Maildir/.Ham/cur
/home/vmail/abuse/Maildir/.Ham/new
	
bayes-spam-folders
/home/vmail/abuse/Maildir/.Spam/cur
/home/vmail/abuse/Maildir/.Spam/new
	
bayes-teach
#!/bin/sh

/usr/local/bin/sa-learn --spam --username simscan \
--siteconfigpath=/usr/local/etc/mail/spamassassin \
--folders=/home/vmail/bayes-spam-folders

/usr/local/bin/sa-learn --ham --username simscan \
--siteconfigpath=/usr/local/etc/mail/spamassassin \
--folders=/home/vmail/bayes-ham-folders
	

Notice that the --username option in bayes-teach is set to simscan. Tokens saved in the Bayesian filter database are all associated with a username. This allows individualized Bayesian filters to be maintained per email account. Because I have SpamAssassin set up site-wide, all tokens should appear under the same username. Because simscan runs as user simscan, and because spamc is called from simscan, I want all tokens saved in my Bayesian filter database associated with the user simscan. Setting --username in bayes-teach ensures that this happens. Because spamc runs as the user simscan, this happens automatically for auto-learned Bayes tokens. Emails saved in my Auto-Whitelist database are also associated with the username simscan for the same reason.

Keep in mind that any tokens associated with a username other than simscan won’t be applied to incoming emails passed to spamd by simscan.
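Before a training run it can be useful to know how much mail is waiting in the directories named by bayes-ham-folders and bayes-spam-folders. The helper below is a sketch of one way to check; count_training_mail is an invented name and nothing like it is part of sa-learn. It reads a folders file in the same one-directory-per-line format and counts the files found, skipping any directory that does not exist.

```sh
#!/bin/sh
# Hypothetical helper for checking the training folders.
# count_training_mail FOLDERS_FILE: print the number of message files in
# the directories listed, one per line, in FOLDERS_FILE.
count_training_mail() {
    total=0
    while IFS= read -r dir; do
        # Ignore blank lines and directories that are missing.
        [ -d "$dir" ] || continue
        n=$(find "$dir" -type f | wc -l)
        total=$((total + n))
    done < "$1"
    echo "$total"
}
```

Calling count_training_mail /home/vmail/bayes-spam-folders would report how many messages the next sa-learn --spam run will be given.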

I use this cron job to run bayes-teach once a day.

Updating SpamAssassin Rules

Using sa-update to periodically update SpamAssassin rules is a fairly new feature and it isn’t as polished as the rest of SpamAssassin. Having said that, I think updating rules more frequently than when I do release upgrades is a valuable capability to have. It also hasn’t caused me any problems. For these reasons, I include a description of my setup here in case you want to have a go at it yourself.

I’m pulling updates from the default channel, updates.spamassassin.org, and two SARE channels. I’m also using gpg to verify the source and integrity of the downloaded rules. The public key for updates.spamassassin.org is installed along with SpamAssassin. You can find it in SpamAssassin’s configuration directory. On my system, it’s /usr/local/share/spamassassin. The SARE key location is listed in the Further Reading section.

I used these commands to install the public key used by sa-update. I executed the commands from the /usr/local/etc/mail/spamassassin/sa-update-keys directory to make sure the keys were added to the correct keyring.

To update SpamAssassin’s rules, I use this script.

/usr/local/etc/mail/spamassassin/getruleupdate.sh

And here is what the update-channels.txt file looks like.

This script first runs sa-update to download rules to the default directory, /var/lib/spamassassin. If the gpg key can’t be verified, the update will fail. The gpgkey option tells sa-update to trust the SARE key. Interestingly, the default channel key doesn’t need the gpgkey option to work. I also have the -D switch set so that I get a detailed record of what sa-update did emailed to me. I have FreeBSD set up to email anything a cron job spits out. After running sa-update, the script sends a HUP signal to spamd so that it will load the new rules. You’ll recall from the SpamAssassin Setup section that this script is runnable only by root. I don’t want just anyone updating my rulesets.
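The general shape of such an update script can be sketched as follows. This is not the getruleupdate.sh script itself. The program locations, the variable names, and the update_rules function are assumptions made for the example, and the SARE key ID is deliberately left as an argument rather than filled in. The sketch also makes one choice the description above does not: it signals spamd only when sa-update exits with status 0, which as I understand the sa-update documentation means new rules were installed.

```sh
#!/bin/sh
# Sketch only. All paths below are assumed defaults and can be overridden
# from the environment.
SA_UPDATE=${SA_UPDATE:-/usr/local/bin/sa-update}
SVC=${SVC:-/usr/local/bin/svc}
SA_DIR=${SA_DIR:-/usr/local/etc/mail/spamassassin}
SERVICE=${SERVICE:-/var/service/spamd}

# update_rules KEY_ID: fetch rules for the channels listed in
# update-channels.txt, trusting the extra GPG key KEY_ID, then ask spamd
# to reload if anything was installed.
update_rules() {
    if [ -z "$1" ]; then
        echo "usage: update_rules KEY_ID" >&2
        return 2
    fi
    # -D prints debug output, which cron can then mail to the administrator.
    "$SA_UPDATE" -D \
        --channelfile "$SA_DIR/update-channels.txt" \
        --gpgkey "$1"
    rc=$?
    if [ "$rc" -eq 0 ]; then
        # svc -h sends a HUP signal to the supervised process.
        "$SVC" -h "$SERVICE"
    fi
    return "$rc"
}
```

Skipping the HUP when nothing changed avoids a needless reload of spamd. Sending it unconditionally, as the description above suggests, is simpler and equally valid.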

I use this cron job to update SpamAssassin rules once a week.

A word of warning to those thinking about using sa-update. For SpamAssassin versions prior to 3.1.4, it has been reported on the SpamAssassin mailing list that sa-update can fail in a most ungraceful manner, leaving your SpamAssassin installation non-functional. The problem seems to occur when the rule update directory (/var/lib/spamassassin by default) doesn’t exist prior to the first run of sa-update. sa-update creates the directory but then doesn’t download any rules into it, causing SpamAssassin to ignore the rules in its regular configuration directory. It is also reported that running sa-update a second time fixes the problem as rules are then downloaded correctly into the waiting directory. So make sure you check your rule update directory after running sa-update the first time to make sure everything worked correctly. The Changelog for 3.1.4 indicates that some changes have been made to minimize the occurrence of this problem. Since I have yet to experience this bug, I can’t say whether these changes have been effective or not.
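The check recommended above is simple to automate. As a sketch, the hypothetical function below succeeds only when the update directory exists and contains at least one .cf rules file somewhere beneath it. It assumes that downloaded rules arrive as .cf files, as the stock rules do.

```sh
#!/bin/sh
# Hypothetical sanity check to run after the first sa-update.
# rules_present DIR: return 0 if DIR exists and holds at least one .cf
# file (searched recursively), non-zero otherwise.
rules_present() {
    [ -d "$1" ] || return 1
    [ -n "$(find "$1" -type f -name '*.cf' | head -n 1)" ]
}
```

A line such as rules_present /var/lib/spamassassin || echo "no rules downloaded" could be run by hand after the first update, or added to the update script.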

Controlling spamd with daemontools

Daemontools is a package that includes programs used to control the startup and shutdown of long-running processes. This package is intended to ensure that processes that are supposed to run all the time actually do.

I’ve set up spamd to be controlled by the supervise program from the daemontools package. I also use svscan as the overseer to ensure all processes controlled by supervise are started during system boot. It’s a simple matter to set up spamd to work with supervise.

There are two processes that supervise will control. One is the spamd process itself and the other is multilog which, if you haven’t already guessed, handles logging.

supervise and multilog require two run scripts that tell these processes what to do. Here is a listing of the directory tree.

/usr/local/supervise/

Once you start supervise the first time, a lot more files and directories will appear under the spamd directory. I’ve listed only the ones that I had to put there myself.

Here is what the two run scripts look like.

File Contents
/usr/local/supervise/spamd/run
#!/bin/sh

exec /usr/local/bin/spamd --siteconfigpath=/usr/local/etc/mail/spamassassin \
--pidfile=/var/run/spamd.pid --syslog=stderr 2>&1
	
/usr/local/supervise/spamd/log/run
#!/bin/sh

exec /usr/local/bin/multilog t /var/log/spamd
	

Multilog automatically creates log directories named in its run script when it starts for the first time if they aren’t already there. Here is what my log directory looks like.

Once the scripts are the way you want them, the only thing you need to do is create a link from the /usr/local/supervise/spamd directory to wherever your service directory is. On Callisto, mine is /var/service. So all I had to do was issue this command:

Five seconds later, spamd and multilog were up and running. This worked because I had already installed daemontools and svscan was already running.
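With the paths given above, the link would presumably be a symbolic link from the service directory to /usr/local/supervise/spamd. The sketch below wraps that step in a small function; enable_service is an invented name, and the reconstruction of the link itself is an assumption based on the two directories named in the text.

```sh
#!/bin/sh
# Hypothetical helper illustrating the linking step.
# enable_service SUPERVISE_DIR SERVICE_DIR: link SUPERVISE_DIR into
# SERVICE_DIR under the same name so that svscan will pick it up.
enable_service() {
    ln -s "$1" "$2/$(basename "$1")"
}
```

On a system laid out like Callisto this would be called as enable_service /usr/local/supervise/spamd /var/service. svscan then notices the new entry on its next scan of the service directory.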

A useful command to keep in mind is this.

Use this command to have spamd read local.cf and any rules files in the site configuration directory without needing to stop and start the process.

SQL Setup

I store Bayesian filter and Auto-Whitelist data in an SQL database. Configuration options in local.cf specify the connection parameters for the SQL database and the credentials to use when logging into the database. I use MySQL because I was already using it for other things on my network.

Here are the SQL commands I used to create the database for the Bayesian filter data and for granting required permissions to the user SpamAssassin uses to access the database.

Here are the SQL commands I used to create the database for the Auto-Whitelist data and for granting required permissions. SpamAssassin uses the same user to access the Auto-Whitelist that it uses for the Bayesian filter database.

Software Home Sites

daemontools http://cr.yp.to/daemontools.html
maildrop http://www.courier-mta.org/maildrop/
MySQL http://dev.mysql.com/
netqmail http://qmail.org/netqmail/
simscan http://inter7.com/?page=simscan
SpamAssassin http://spamassassin.apache.org/index.html
Vipul’s Razor http://razor.sourceforge.net/

Further Reading

ImageInfo SpamAssassin Plugin http://www.rulesemporium.com/plugins.htm
QMAILQUEUE patch http://www.qmail.org/
SpamAssassin Rules Emporium (SARE) http://www.rulesemporium.com/
SARE sa-update Howto http://daryl.dostech.ca/sa-update/sare/sare-sa-update-howto.txt
SARE sa-update GPG key http://daryl.dostech.ca/sa-update/sare/GPG.KEY

Copyright © 2006 Andrew St. Jean Last update Nov. 12, 2006