What a nightmare! It has taken me about a month of web searching, playing, testing and configuring to get user-specific spam training with SpamAssassin running on a Horde IMAP email client on a Plesk virtual hosts system.
The problem with the default setup is that in order to train SpamAssassin you need to log into the server's Plesk interface and select which emails are spam and which are ham. From that interface you can't read the emails, so you never know for sure. The interface will also only let you scan the inbox. This whole process does not work well for users who are using Horde with IMAP.
Bet you want to know how? There are several issues which need to be addressed in order to make it work properly.
- Horde application needs to fill in the correct SpamAssassin variables
- Horde needs to know what to do with the spam or ham emails
- Web user is different from the SpamAssassin user so the difference needs to be resolved.
Firstly, Horde needs to fill in the correct variables when parsing the spam / notspam command program. In the file /usr/share/psa-horde/imp/lib/Spam.php make the following changes:
/* If a (not)spam reporting program has been provided, use
 * it. */
if (!empty($GLOBALS['conf'][$action]['program'])) {
    $raw_msg = $imp_contents->fullMessageText();
    /* Use a pipe to write the message contents. This should
     * be secure. */
    $email_address = explode("@", Auth::getAuth());
    $prog = str_replace('%u', escapeshellarg(Auth::getAuth()), $GLOBALS['conf'][$action]['program']);
    $prog = str_replace('%l', escapeshellarg($email_address[0]), $prog);
    $prog = str_replace('%d', escapeshellarg($email_address[1]), $prog);
    $proc = proc_open($prog,
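To make the substitution concrete, here is a minimal shell sketch of what the three placeholders stand for when the Horde login is a full email address. The function name expand_placeholders and the address fred@example.com are made up for illustration. Horde itself does this in PHP as shown above, and additionally quotes each value with escapeshellarg, which this sketch leaves out.

```shell
#!/bin/bash
# Hypothetical helper (not part of Horde): shows what %u, %l and %d
# expand to for a login of the form local@domain.
expand_placeholders() {
    local login="$1" template="$2"
    local localpart="${login%%@*}"   # text before the first "@"
    local domain="${login#*@}"       # text after the first "@"
    template="${template//%u/$login}"
    template="${template//%l/$localpart}"
    template="${template//%d/$domain}"
    printf '%s\n' "$template"
}

# Illustrative call with a made-up mailbox
expand_placeholders "fred@example.com" "saver.sh %l %d spam"
```

For the made-up mailbox this prints "saver.sh fred example.com spam", which is the shape of command line that the program settings further down rely on.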
Next, tell Horde what to do with the spam / notspam emails. Edit the configuration file (/usr/share/psa-horde/imp/config/conf.php) to tell Horde how to do the training, and when to show the "Report as Spam" and "Report as Innocent" links.
Add the following lines:
$conf['spam']['reporting'] = true;
$conf['notspam']['reporting'] = true;
$conf['spam']['program'] = '/var/qmail/popuser/bin/saver.sh %l %d spam > /dev/null 2> /dev/null';
$conf['notspam']['program'] = '/var/qmail/popuser/bin/saver.sh %l %d ham > /dev/null 2> /dev/null';
The reporting variables tell Horde to print the report links on every page, and the program variables tell it what to do.
The popuser's home directory does not exist by default, so we will create it to give us somewhere to house all the required programs.
mkdir /var/qmail/popuser
cd /var/qmail/popuser
mkdir bin train
chown popuser:popuser bin train
chmod 755 bin
chmod 703 train
The training directory must be closed: we are going to store copies of emails there while they wait to be processed, and we don't want prying eyes looking over them.
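As a quick illustration of what mode 703 means, the following sketch applies it to a scratch directory instead of the real /var/qmail/popuser/train. The helper name make_drop_dir is hypothetical, and the stat -c option assumes GNU coreutils.

```shell
#!/bin/bash
# Hypothetical helper, run against a scratch directory rather than the
# real training directory.
make_drop_dir() {
    mkdir -p "$1" && chmod 703 "$1"
}

scratch=$(mktemp -d)
make_drop_dir "$scratch/train"
# 7 = owner may read, write and enter the directory
# 0 = group has no access
# 3 = others may create files and enter, but cannot list the contents
stat -c '%a %A' "$scratch/train"
rm -rf "$scratch"
```

This prints "703 drwx----wx". The web server user falls under "others", so it can drop files into the directory but cannot list what is already there, while popuser as owner can list and read everything.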
Now we must recreate the saver.sh and trainer.sh programs to handle the different user permissions.
saver.sh:
#!/bin/bash
#
# Author: David Newcomb
# Copyright: BigSoft Limited (c) 2007
#
# Handle parameters
USER=$1
DOMAIN=$2
WHAT=$3
TRAIN_DIR="/var/qmail/popuser/train"
DATE=`date +'%Y%m%d%H%M%S%N'`
UNIQFILE="$USER:$DOMAIN:$WHAT:$DATE.$$"
FILE="$TRAIN_DIR/$UNIQFILE"
# Write the message arriving on stdin to the file
cat > "$FILE"
# Mark as ready to pick up
chmod o+r "$FILE"
mv "$FILE" "$FILE.done"
This simple program reads a message from stdin and writes it to a uniquely named file in the training directory, then renames it with a .done suffix to mark it as ready for pickup.
trainer.sh:
#!/bin/bash
#
# Author: David Newcomb
# Copyright: BigSoft Limited (c) 2007
#
TRAINER_DIR=/var/qmail/popuser/train
SPAM="/usr/bin/sa-learn -u %u --dbpath /var/qmail/mailnames/%d/%l/.spamassassin -L --spam"
HAM="/usr/bin/sa-learn -u %u --dbpath /var/qmail/mailnames/%d/%l/.spamassassin -L --ham"
PATH=/bin:$PATH
# Only files that saver.sh has finished writing carry the .done suffix
ls $TRAINER_DIR/*.done | \
while read FILENAME
do
    # Decode user:domain:what from the filename
    `echo "$FILENAME" | sed 's/.*\/\(.*\):\(.*\):\(.*\):\(.*\)/export USER=\1 DOMAIN=\2 WHAT=\3/'`
    if [ "$WHAT" = "spam" ]
    then
        DO=$SPAM
    else
        DO=$HAM
    fi
    # Fill in the placeholders for this mailbox
    PARSED=`echo $DO | sed "s/%u/$USER@$DOMAIN/g"`
    PARSED=`echo $PARSED | sed "s/%l/$USER/g"`
    PARSED=`echo $PARSED | sed "s/%d/$DOMAIN/g"`
    cat "$FILENAME" | $PARSED
    rm -f "$FILENAME"
done
This program reads the training directory and decodes each filename into the user, the domain and the type of its contents (spam or ham). It then runs the sa-learn program, pointing it at that specific user's bayes_toks files.
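The filename decoding can also be sketched without sed, using the shell's own field splitting. This is only an alternative illustration, not what trainer.sh does: the function name decode_name and the sample filename are hypothetical, and it assumes that neither the mailbox name nor the domain contains a colon.

```shell
#!/bin/bash
# Hypothetical alternative to the sed/export line in trainer.sh.
# Assumes names of the form user:domain:what:date.pid.done
decode_name() {
    local base="${1##*/}"          # strip the directory part
    local user domain what rest
    # Split on ":"; everything after the third colon lands in $rest
    IFS=: read -r user domain what rest <<< "$base"
    printf '%s %s %s\n' "$user" "$domain" "$what"
}

# Illustrative call with a made-up queue file
decode_name "/var/qmail/popuser/train/fred:example.com:spam:20070101120000000000000.1234.done"
```

For the made-up file this prints "fred example.com spam", the same three values that the loop stores in USER, DOMAIN and WHAT.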
saver.sh is run by the Apache webmail process and trainer.sh is run by the popuser. One could tell Apache to run webmail.domain under popuser, but I want to avoid touching the web server's configuration.
The question now is how often to run the trainer. This will depend on the resources your system has: how many mail users you have, how much mail they receive and how much free disk space you have. For example, if you have a small number of users then every hour would be enough, whereas if the server is heavily used during the day you may want to run the trainer during one of the quieter times.
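One way to get a feel for a sensible interval is to watch how many messages are queued between runs. The helper below is a hypothetical sketch, not part of the set up above. It assumes a find that supports -maxdepth, and it has to be run as popuser or root, because the training directory cannot be listed by anyone else.

```shell
#!/bin/bash
# Hypothetical helper: count the messages waiting for trainer.sh.
count_pending() {
    local dir="$1"
    # Only completed files carry the .done suffix
    find "$dir" -maxdepth 1 -type f -name '*.done' | wc -l | tr -d ' '
}

# Example (as popuser or root):
#   count_pending /var/qmail/popuser/train
```

If the count stays small between hourly runs, the hourly schedule is probably more than enough.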
To add it to the popuser's crontab, enter the following:
crontab -u popuser -e
When inside the editor, enter the line:
0 * * * * /var/qmail/popuser/bin/trainer.sh > /dev/null 2> /dev/null
This will run the trainer as the popuser every hour.
In all the cases above the output is directed to /dev/null, but for debugging you can redirect it wherever you like.
There is another method that can be used: set up a special user to whom you forward your spam / ham. I think that this is a pain for users because you have to forward the message and then delete it, whereas both of these can be done using the above method.
It seems strange that there is no documentation for this feature as I think effective spam training is essential when administering a mail server. I understand that everyone loves coding, but hates writing documentation!
If anyone has any comments, suggestions or improvements to any of the above then please add to the comments.