Trinity University Computer Science

Use of SpamBayes on Computer Science Unix Systems

SpamBayes has no direct mail client integration on Unix and Linux systems. Instead, your mail is filtered on the server, Mail.CS.Trinity.Edu. SpamBayes divides your mail into three mailboxes: ham (non-spam mail), spam, and unsure (mail that it cannot classify confidently). The names of these three mailboxes are arbitrary. The simplest arrangement is to have the ham placed in a spoolfile called inbox in your default mail directory. The spoolfile is the mailbox in which your mail program is configured to look for your mail. The unsure mail could be placed in a mailbox called unsure, and the spam in a mailbox called something like spamfound. In addition, you might have two mailboxes called ham and spam that will be used for training, as we will see shortly.

The reason for using inbox as the spoolfile of your mail client (such as mutt, pine, or kmail) is that, without something like POP3, your mail client cannot access your real mail spoolfile on the server, which is /var/spool/mail/<your_login_name>. If you want your mail client to fetch mail with POP3, leave your mail in the normal Linux spool file.

As you go through your mail, you may find mail in your spoolfile that is actually spam; transfer it to the spam mailbox. If you find mail in your spamfound mailbox that is really ham, transfer it to your ham mailbox. The unsure mailbox may contain a mix of ham and spam, which should be transferred to the appropriate mailboxes. Initially SpamBayes may make a large number of mistakes, but with time and training the mistakes should shrink to the point that your spoolfile is nearly or even totally spam free. Training should be done with about equal amounts of spam and ham, so you might purposely transfer some messages from your spoolfile (your good mail file) to the ham mailbox to even out the number of spam and ham messages used in training. More about this later. The data SpamBayes gathers from training is kept in a database file called .hammie.db in your home directory.
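
Before training, it can help to check how balanced the ham and spam training mailboxes are. The following is a minimal sketch using only Python's standard mailbox module; it assumes both mailboxes are in mbox format, and the function names count_messages and training_balance are hypothetical helpers, not part of SpamBayes.

```python
import mailbox
import os


def count_messages(mbox_path):
    """Return the number of messages in an mbox file (0 if it does not exist)."""
    if not os.path.exists(mbox_path):
        return 0
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def training_balance(ham_path, spam_path):
    """Return (ham_count, spam_count, ham_count - spam_count)."""
    ham = count_messages(ham_path)
    spam = count_messages(spam_path)
    return ham, spam, ham - spam
```

For example, training_balance with the paths of your ham and spam training mailboxes returns both counts and their difference; a large positive or negative difference suggests moving a few more messages into the smaller mailbox before training.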

The filtering program is called sb_filter.py, and you will normally run it through a .procmailrc script that you place in your home directory.

Procmail

The following assumes that you are reading mail in the CS.Trinity.Edu domain. This mail is processed by Mail.CS.Trinity.Edu (aka Sol). The mail transport agent, sendmail, uses procmail to deliver your mail to your inbox and mail directory files. You will probably find sb_filter.py the easiest application to integrate into your mail environment. The .procmailrc file that you place in your home directory is actually run by the server machine, but since the client machines and the server share your home directory, you can do all the procmail configuration and train .hammie.db on a client, and these files will be used for filtering on the server. The filtering program used in the SpamBayes system, sb_filter.py, is found in the directory /usr/bin, which should be in your default path.

You should perform the following steps to set up your SpamBayes system for mail filtering.

1. Create the database that SpamBayes will use to test your incoming mail:

sb_filter.py -d $HOME/.hammie.db -n

2. Train it on your existing mail using the following command (-g is the flag for known good mail, and -s is for known spam):

/usr/local/bin/sb_mboxtrain.py -d $HOME/.hammie.db -g $HOME/Mail/ham -s $HOME/Mail/spam

3. Add the following recipes to the top of your .procmailrc to get the spam and unsure mail out of the way, allowing everything else to be filtered by your normal procmail recipes. The last clause in the script is what causes the ham to be put in the spoolfile inbox. If you want to use POP3 to receive your ham mail, leave out this last clause.

PATH=$HOME/bin:/usr/bin:/usr/ucb:/bin:/usr/local/bin
SHELL=/bin/sh
MAILDIR =       $HOME/Mail      # You'd better make sure it exists

LOGFILE =       $MAILDIR/procmail.log
LOCKFILE=       $HOME/.lockmail

:0fw:hamlock
| /usr/local/bin/sb_filter.py -d $HOME/.hammie.db

:0
* ^X-Spambayes-Classification: spam
${MAILDIR}/spam

:0
* ^X-Spambayes-Classification: unsure
${MAILDIR}/unsure

:0
* ^X-Spambayes-Classification: ham
${MAILDIR}/inbox
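
The routing that these recipes perform can be sketched in Python to make the logic explicit. This is only an illustration, not how procmail itself works: the mailbox names are taken from the recipes above, the route function is a hypothetical helper, and splitting on ";" is an assumption to tolerate anything that might follow the classification word in the header.

```python
import email

# Header value -> destination mailbox, mirroring the three recipes above.
MAILBOX_FOR = {"spam": "spam", "unsure": "unsure", "ham": "inbox"}


def route(raw_message):
    """Return the destination mailbox name, or None if no recipe matches."""
    msg = email.message_from_string(raw_message)
    value = msg.get("X-Spambayes-Classification", "")
    # Keep only the classification word (assumption: extras follow a ';').
    label = value.split(";")[0].strip().lower()
    return MAILBOX_FOR.get(label)
```

A result of None corresponds to a message that falls through to the rest of your procmail recipes.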

4. Train the database using the following command, which assumes that you have collected ham in a file called ham and spam in a file called spam.

/usr/local/bin/sb_mboxtrain.py -d $HOME/.hammie.db -g $HOME/Mail/ham -s $HOME/Mail/spam

You might automate the process by constructing a shell script called trainsb.

#!/bin/bash
#script: trainsb
/usr/local/bin/sb_mboxtrain.py -d $HOME/.hammie.db -g $HOME/Mail/$1 -s $HOME/Mail/$2

Then trainsb would be used as follows:

trainsb ham spam

It also turns out to be useful to train on the ham and spam that SpamBayes has already identified.
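
If you prefer Python to a shell script, the same training command can be assembled as an argument list. This is a minimal sketch: the script path, flags, and directory layout are copied from the command above, while build_train_command is a hypothetical helper name.

```python
import os

# Path to the training script as given in this document.
SB_MBOXTRAIN = "/usr/local/bin/sb_mboxtrain.py"


def build_train_command(ham_name, spam_name, home=None):
    """Build the sb_mboxtrain.py argument list for two mailboxes in ~/Mail."""
    home = home or os.path.expanduser("~")
    mail_dir = os.path.join(home, "Mail")
    return [
        SB_MBOXTRAIN,
        "-d", os.path.join(home, ".hammie.db"),
        "-g", os.path.join(mail_dir, ham_name),   # known good mail
        "-s", os.path.join(mail_dir, spam_name),  # known spam
    ]
```

The resulting list can be passed to subprocess.run(cmd, check=True) to run the training; building it as a list avoids shell quoting problems with unusual mailbox names.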

SpamBayes configuration file

Your SpamBayes configuration file is called .spambayesrc and is located in your home directory.

Start it out with the following three lines and add lines directly or through the web interface as needed or required.

[Storage]
persistent_use_database=dbm
persistent_storage_file=~/.hammie.db
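
Because the file uses an INI-style layout, you can sanity-check it with Python's standard configparser module. This is a sketch under the assumption that the three lines above parse as ordinary INI; SpamBayes reads the file with its own option code, and read_storage_settings is a hypothetical helper.

```python
import configparser
import os


def read_storage_settings(text):
    """Return (database type, expanded database path) from .spambayesrc text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    storage = parser["Storage"]
    db_type = storage["persistent_use_database"]
    # Expand the leading ~ to the user's home directory.
    db_path = os.path.expanduser(storage["persistent_storage_file"])
    return db_type, db_path
```

Calling it on the contents of your .spambayesrc should give a database type of dbm and a path ending in .hammie.db, matching the database used in the commands earlier in this document.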

If you wish to get a list of all the configuration options, or just some of them, you can use the following commands.

(Note: each of these commands should be entered on a single line.)

python -c "from spambayes.Options import options ; print options.display_full()"

The command above will print out a complete list of the options, including a description of each option and its default value. You can also look up the options for a single section, if you know its name:

python -c "from spambayes.Options import options ; print options.display_full('section_name')"

Or just a single option:

python -c "from spambayes.Options import options ; print options.display_full('section_name', 'option_name')"

If you want a list of all the sections, you can use this command:

python -c "from spambayes.Options import options ; print options.sections()"

POP3

If you do your training as indicated above from a client machine, then strictly speaking you do not need POP3 or IMAP. However, SpamBayes includes a POP3 proxy server that you can use to train on your mail through a web interface. To start it, type:

sb_server.py -b &

and it will respond:

SpamBayes POP3 Proxy Version 1.0rc2 (June 2004)
and engine SpamBayes Engine Version 0.3 (January 2004).
Loading database... User interface url is http://localhost:8880/

Open a browser and go to http://localhost:8880/ and you will see a web page that allows you to configure the SpamBayes POP3 proxy server. Click on Configuration in the upper right, fill in the mail server's address (mail.cs.trinity.edu) and your username and password, and tell the server to listen on port 8110.

Another option deals with the cutoffs for spam and ham. sb_filter.py assigns each mail message a level of spamness. If this level is .2 or less, the message is identified as ham. If it is .9 or greater, it is identified as spam. Between .2 and .9 the mail is identified as unsure. For now, leave the default cutoff values of .9 and .2, as well as the other parameters, unchanged. These can be changed later once you have seen how SpamBayes reacts to your mail stream. Save the configuration and then return to the home page. There you will see that you can identify mail files to be used in training, as well as enter individual messages to be used in training.
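
The cutoff rule described above can be written out as a small function. This is a sketch of the rule as stated in this document, not SpamBayes's own code; the function name classify is hypothetical, and the handling of scores exactly at a cutoff follows the wording here and is an assumption about the real implementation.

```python
def classify(score, ham_cutoff=0.2, spam_cutoff=0.9):
    """Map a spamness score between 0 and 1 to 'ham', 'unsure', or 'spam'."""
    if score <= ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"
```

Raising ham_cutoff or lowering spam_cutoff narrows the unsure band, at the cost of more misfiled messages.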

To exit from sb_server.py just click Save and Exit on the bottom right corner of the main web page.

Further information on SpamBayes can be obtained at their web site: http://www.spambayes.org .


Computer Science Department
Trinity University
One Trinity Place
San Antonio, Texas 78212-7200
voice: (210) 999-7480
fax: (210) 999-7477
