Saturday, August 15, 2015

Unlearning Ham to Learn Ham

The title of this post sounds self-contradictory, doesn't it? Why would spamassassin need to unlearn something in order to learn it? I'm not sure why. However this seems to be the case.

Today I was writing some rules for ham emails that I commonly receive. Of course, I was giving these rules negative points since spamassassin will ideally test negative for ham and positive for spam.

Doggedly I wrote one rule after another. I was desperately trying to trigger autolearn. I never did succeed. This is in spite of the fact that I wrote 7 or more rules worth minus one point each. Finally, I got disgusted and gave up on auto-learn.

I'm still a bit mystified why this did not work. I had at least 2 ham rules for the header and 5 ham rules for the body. Why would an email that tested so overwhelmingly negative for spam not trigger ham autolearning? I never did figure out why.

In any case, I finally gave up and tried something else:

sa-learn --ham email.mbox

This did not work either. So perplexing!

As my last act of desperation, I tried this sequence:

sa-learn --forget email.mbox
sa-learn --ham email.mbox

Voila! It worked! I finally got bayes to learn my email as ham!
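
Here's a rough sketch of the forget-then-learn sequence wrapped in a shell function. The function name relearn_as_ham is made up for illustration, and the sketch assumes sa-learn is on the PATH. The --forget and --ham options are the same ones used above.

```shell
# relearn_as_ham is a hypothetical helper name, not part of spamassassin.
# It forgets a message first, then learns it as ham, mirroring the two
# commands shown above.
relearn_as_ham() {
    msg="$1"
    if [ ! -r "$msg" ]; then
        echo "cannot read message file: $msg" >&2
        return 2
    fi
    if ! command -v sa-learn >/dev/null 2>&1; then
        echo "sa-learn is not installed or not on the PATH" >&2
        return 3
    fi
    # Only learn as ham if the forget step succeeded.
    sa-learn --forget "$msg" && sa-learn --ham "$msg"
}
```

With the function loaded into the shell, typing relearn_as_ham email.mbox would run both steps in order.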

I'm still not sure why the email needed to be entirely forgotten in order to be learned as ham. I'm not a spamassassin professional. I'm simply a guy who does his own email filtering on an amateur basis. Perhaps someday I'll learn why.

Ed Abbott

Saturday, July 7, 2012

Registering a Razor Identity

I just learned from this post
that I need to register a Razor
identity if I wish to report spam
to Razor:

Registering a Razor Identity

Here's an article on reporting spam
generally:

Reporting Spam

Here's the generic command for reporting
spam:

spamassassin -r < message.txt

Note that the above generic command
automatically removes all spamassassin
markup. In other words, the -d option
is automatically on.

Ed Abbott

Wednesday, June 6, 2012

Spamassassin and Bind

I've been using Spamassassin in
local mode only. Here's how I
invoke Spamassassin:

spamassassin -L

It's the -L (dash-capital-L) that
keeps spamassassin local. If you leave
off the -L, spamassassin checks for
spammy IP addresses in incoming emails.

Here's a wonderful article that describes
this much better than I do:

Caching Nameserver

I'm going to try to set up bind9
just like the webpage above suggests. I
feel that spamassassin will be much more
effective if it checks for spammy IP addresses.

Here's how I will start the process of
installing bind9:

aptitude update
aptitude search bind9

The first command above updates the
availability information for Debian
packages. The second command tells
me just what in relation to bind9 is
available for installation.

When I do the search for bind9,
I'm given the following information:

p   bind9                       - Internet Domain Name Server          
p   bind9-doc                   - Documentation for BIND               
i   bind9-host                  - Version of 'host' bundled with BIND 9
p   bind9utils                  - Utilities for BIND                   
p   gforge-dns-bind9            - collaborative development tool - DNS 
i   libbind9-60                 - BIND9 Shared Library used by BIND  

Looks like what I need is on the top line, which
is bind9 itself. I'll go ahead and install the
bind9 package:

aptitude install bind9

I probably should have typed
this command first:

aptitude show bind9

The reason I did not type it is
I was quite confident that
bind9 was not installed. When I
type it now, it shows that indeed,
I have successfully installed bind9:

aptitude show bind9
Package: bind9                         
State: installed

The next step for me is to get
into kmail and turn off
the local switch for spamassassin.
Here are the menus I use to do
this:

Settings > Configure Filters

Next, I highlighted the following:

SpamAssassin Check

Next, I look for the words
Pipe Through. I then
change the invocation from
spamassassin -L to just
spamassassin without the
-L. Next I click apply
and OK.

Update: June 13, 2012

Changing from local rules only to
rules that make use of IP addresses
and blacklists got me into trouble.
I posted to the Spamassassin mailing
list describing this trouble:

False Positive on Domain Name

The folks on the Spamassassin mailing
list were incredibly generous. I got over
50 replies to my post.

I learned that my ISP's DNS servers were
not going to work. My ISP's DNS servers
were giving me a false positive on what
should have been ham emails.

In effect, my DNS servers were causing
Spamassassin to view all unknown domain
names as having been blacklisted. That
was the overall effect.

Therefore if a domain name was mentioned
in an incoming email, and that domain name
had never been either blacklisted or
white-listed, it was assumed to be blacklisted.

To get around the unknown domain name problem,
I took two steps:

1 -- Set up a bind9 server (for DNS)

2 -- Change my /etc/resolv.conf file to
point to localhost (127.0.0.1).

That's it in a nutshell. More detail is
given in the above link.

Update: June 22, 2012

The above link is a mailing list discussion
that taught me how to set up my own Bind
server. The last thing I learned was how
to set my /etc/resolv.conf to localhost.

Ultimately, the way to configure /etc/resolv.conf
is to configure /etc/dhcp/dhclient.conf. The
latter controls the former.

Here's the line I added to dhclient.conf:

supersede domain-name-servers 127.0.0.1;

In this one line, I'm superseding the domain
name servers provided by dhcp. To me,
the logical place to add the above line is
after dhcp information has been retrieved.

Therefore, I've chosen to place the above line
in this context:

request subnet-mask, broadcast-address, time-offset, routers,
 domain-name, domain-name-servers, domain-search, host-name,
 netbios-name-servers, netbios-scope, interface-mtu,
 rfc3442-classless-static-routes, ntp-servers;
supersede domain-name-servers 127.0.0.1;

Whether or not these two operations --- a request followed
by a supersede --- are truly sequential, I do not know.

However, I've decided to treat these two operations as if they
were programming language operations on a serial computer.
I'm treating it as if the supersede only has lasting effect
if it comes last.

In all probability, the order in which the two operations appear
does not matter. However, I've not experimented with reversing
the order of the operations. I'm happy to leave things just as
they are.

Update: June 28, 2012

Here's a summary of what I eventually
did to get bind working as my caching
nameserver:

  1. aptitude install bind9
  2. Add a supersede command to
    the file /etc/dhcp/dhclient.conf

The supersede command is there to set the
nameserver to localhost. Here are the
specifics of the supersede command that I
added.

First I changed this line:

request subnet-mask, broadcast-address, time-offset, routers,
 domain-name, domain-name-servers, domain-search, host-name,
 netbios-name-servers, netbios-scope, interface-mtu,
 rfc3442-classless-static-routes, ntp-servers;

to this line:

request subnet-mask, broadcast-address, time-offset, routers,
 domain-name, domain-search, host-name,
 netbios-name-servers, netbios-scope, interface-mtu,
 rfc3442-classless-static-routes, ntp-servers;

Note that the above dhcp request has had the domain-name-servers
part of the request removed. Next, I added my supersede command.
Here's what the supersede command looks like in context:

request subnet-mask, broadcast-address, time-offset, routers,
 domain-name, domain-search, host-name,
 netbios-name-servers, netbios-scope, interface-mtu,
 rfc3442-classless-static-routes, ntp-servers;
supersede domain-name-servers 127.0.0.1;

In other words, the domain name servers are no longer a product
of the dhcp request. Domain name servers are now set by the
supersede command.
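
Here's a small sketch for checking that the supersede line took effect after the network comes back up. It assumes the resolver file is /etc/resolv.conf and that it holds a plain nameserver line. The helper name uses_local_resolver is made up for this example.

```shell
# uses_local_resolver is a hypothetical helper name.
# It succeeds when the given resolv.conf-style file lists 127.0.0.1
# as a nameserver. The file defaults to /etc/resolv.conf.
uses_local_resolver() {
    conf="${1:-/etc/resolv.conf}"
    grep -Eq '^[[:space:]]*nameserver[[:space:]]+127\.0\.0\.1([[:space:]]|$)' "$conf" 2>/dev/null
}

if uses_local_resolver; then
    echo "resolver is set to localhost"
else
    echo "resolver is NOT set to localhost"
fi
```

If the second message appears, the dhcp client has probably rewritten /etc/resolv.conf without the supersede line being applied.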

Ed Abbott

Wednesday, January 11, 2012

How to Write Spamassassin Rules With No Score

I had the mistaken notion that
Spamassassin rules have to have
a score. Because of this, I was
writing rules with 1/1000th of a
point.

I gave very small scores to rules
that I would later collect together
to form a meta rule.

The mistaken notion came about because
I read that rules that have a score
of zero are not evaluated. Since Perl
uses zero to mean false and since
spamassassin is based on Perl, I figured
I was stuck writing rules with minuscule
scores if I wanted Perl to distinguish
between true and false.

A rule that evaluates to zero is zero
regardless of whether the rule is evaluated
or not. Therefore, rules that have scores
of zero are not evaluated at all by spamassassin
because evaluating them has no meaning.

What I needed, but did not realize I needed,
was a way to evaluate rules without giving
the rule a score.

Here's the article that taught me that
you can make a rule in spamassassin
that is evaluated but that has no score:

Writing your own Add-On Rules for SpamAssassin

No score rules are great! They are great
for several reasons:

  1. No score rules do not show up in your
    spamassassin scoring reports that are
    inserted into each email evaluated by
    Spamassassin. This reduces visual clutter.
  2. No score rules only have consequences
    if they add up to something greater in
    a meta rule.
  3. No score rules allow you to score cumulative
    words and phrases, regardless of what order
    the words and phrases appear in the email

Let me give an example. Let's say, in
your own mind, the term on sale
is spammy, but not so spammy as to trigger
a spam rule that scores points.

Let's also say that the word discount,
in your own mind, is also spammy, but not so
spammy as to warrant a spam trigger via point
assignment.

You now have 2 terms, on sale and
discount, which by themselves are
not spammy enough to do anything about.

After all, a good friend could email you and
say that they got something on sale or
they got it at a discount and it's all
perfectly innocent. Both terms, on sale
and discount are legitimate terms in
normal human discourse.

Now let's say that discount and on sale
in isolation are not worth assigning a
score to. Taken together, however, they
have a much greater meaning than they do
when seen in isolation.

Let's say that, in your mind, any email that mentions
both discount and on sale in the same
email should be scored one point. Here's how you
do this:

First, you need a scoreless way to score the
term on sale. Here's the no score
way to do it:

body           __ON_SALE  m|on.{0,12}sale|i
describe       __ON_SALE  The term 'on sale' is found in the body of the email message

Note the double underscores in the name of the rule.
That's the mechanism that gives you a scoreless rule.
That's the thing I was missing. I did not understand
that spamassassin has a mechanism for assigning no
score.

Note also that I've decided to use the match operator
in a very liberal way. The zero in the {0,12} quantifier
indicates that I don't care whether or not the words
are run together. onsale and on sale
will both trigger this rule equally well. That's
what the zero is all about.

Also, I don't care too much about what appears between
the two words on sale as long as it is 12 characters
or less. That's very liberal and will catch the
words on sale in many different forms.

For example, it will catch this weekend only sale,
because the word only has the word on embedded
inside it. I'm choosing to be very liberal to demonstrate
that spamassassin is a very flexible tool. You may choose
to be more cautious than I'm being in my example.

Also note that I'm using the case insensitive suffix, the
letter i. With the letter i suffix, the
terms On Sale, ON SALE, and on sale
are all caught equally well.
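
Here's a quick way to try the pattern outside of spamassassin. This is only an approximation: it uses grep -E with -i in place of the Perl match operator, and the sample phrases are made up for the example.

```shell
# Try the pattern from __ON_SALE against a few sample phrases.
# grep -E uses extended regular expressions and -i ignores case,
# which approximates the Perl pattern m|on.{0,12}sale|i.
pattern='on.{0,12}sale'

for phrase in "onsale" "On Sale" "ON SALE" "this weekend only sale" "for sale"; do
    if printf '%s\n' "$phrase" | grep -Eiq "$pattern"; then
        echo "match:    $phrase"
    else
        echo "no match: $phrase"
    fi
done
```

The first four phrases match. The last one, for sale, does not, because it contains no on for the pattern to start from.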

Now let's do the same thing to the term discount.
Here's my rule for discount:

body           __DISCOUNT  m|discount|i
describe       __DISCOUNT  The word 'discount' is found in the body of the email message

OK. Now we're ready to put it all together. Now
we're ready to say that anyone who mentions a
discount and something being on sale
in the same email is at least a little bit likely
to be a spammer.

Here's how it all comes together:

meta           ON_SALE_DISCOUNT      (__ON_SALE && __DISCOUNT)
describe       ON_SALE_DISCOUNT      The terms 'on sale' and 'discount' are both found in the body of the email

Here's the thing I love most about this approach. The
rule ON_SALE_DISCOUNT has the following
characteristics:

  1. It does not care whether on sale
    or discount appears first. Any order
    for these 2 terms is acceptable and earns
    the spammer a point
  2. Distance does not matter. These 2 terms
    could be 5 paragraphs apart. Even with huge
    swaths of text separating the 2 terms, the
    2 terms together earn our spammer a point.
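
The two characteristics above can be imitated with plain grep. This is a sketch, not how spamassassin works internally. The helper name both_terms and the sample text are made up. Note that grep works one line at a time, so the on sale pattern itself will not stretch across a line break here.

```shell
# both_terms is a hypothetical helper that mimics the meta rule
# (__ON_SALE && __DISCOUNT) with two independent grep searches.
# Each search scans the whole file, so the order of the two terms
# and the distance between them do not matter.
both_terms() {
    file="$1"
    grep -Eiq 'on.{0,12}sale' "$file" && grep -iq 'discount' "$file"
}

sample=$(mktemp)
printf 'Get a discount today.\n\nMany paragraphs later.\n\nEverything is on sale.\n' > "$sample"
if both_terms "$sample"; then
    echo "ON_SALE_DISCOUNT would fire"
fi
rm -f "$sample"
```

In the sample, discount comes first and on sale comes several lines later, yet both searches succeed.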

One more thing worth mentioning. I could have
assigned the rule ON_SALE_DISCOUNT a
score other than 1 point. However, I'm very
very happy with 1 point.

I pretty much never mess with the default 1
point that spamassassin gives you. Instead, I
write more rules if one rule is not enough
or I eliminate a rule if that rule is not worth
1 point all by itself.

The lesson for me? There's always a better way
to do things. For me, placing a double underscore
in front of rules that are only there for cumulative
effect is a much better way of doing things.

Ed Abbott

Tuesday, December 20, 2011

Why Does Spamassassin
Ignore Emails
with PDF Files Attached?

Why does Spamassassin fail to score emails
that have a PDF file attached? This was
the question I tried to answer for myself
yesterday.

I started doing a little bit of research and
I learned something interesting. It is not
the PDF file that is the problem; it is the
email length.

Spamassassin fails to mark up emails that are
too long. That's the crux of the problem.

There's a reason for this:

The longer an email, the faster the
processing time for that email grows. Thus a 200
kilobyte email is likely to take 4 times as
long to process as a 100 kilobyte email.

Don't take my figures too literally. However,
my computer programming experience tells me
that processing time often grows in proportion
to the square of the file size.

Thus a file that is twice as long will tend
to take four times as long to process. There's
a simple reason for this. Everything in a file
tends to relate to everything else.

Let me give you a super-simple example. Let's
say a file consists of 3 lines. Since everything
relates to everything else (including itself),
3 lines give us 3 X 3 = 9 things to relate to. A
file with 3 lines creates 9 relationships.

In our 3 line file, line 1 relates to line 1. It
also relates to line 2 and 3. So far that's three
things to relate to. If you multiply all these
relationships out, 3 lines form 9 different relationships.

Now here's where quadratic processing time kicks
in. If you double the size of the file from 3 lines
to 6 lines, you now have 36 different relationships.

That is to say, 6 times 6 equals 36. In doubling the
file size, you've quadrupled the number of lines
in the file relating to each other. 6 X 6 = 36
is quadruple 3 X 3 = 9. That is to say 4 X 9 = 36.

This all comes back to processing time. Processing
time quadruples when you double the size of the file
being processed.
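
The arithmetic above can be checked with a few lines of shell. This only illustrates the simplified model, where every line relates to every line. It says nothing about how spamassassin actually spends its time. The function name relationships is made up for the example.

```shell
# relationships N prints N * N, the number of line-to-line
# relationships in the simplified model above.
relationships() {
    echo $(( $1 * $1 ))
}

small=$(relationships 3)   # a file with 3 lines
large=$(relationships 6)   # a file with 6 lines, double the size
echo "3 lines: $small relationships"
echo "6 lines: $large relationships"
echo "ratio:   $(( large / small ))"
```

Doubling the number of lines multiplies the count by four, which is the square of two.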

Of course, this is an absurd oversimplification. Good
explanations are often absurd oversimplifications.

However, this does illustrate why Spamassassin's processing
time is not a linear one. Processing time is not linear
relative to file size.

In the real world, some spamassassin rules probably are linear
and some are not. For example, if you write a spamassassin
rule that only looks for a single spammy word in the file,
such as the word discount, the relationship
is linear. That is to say, the search for the one-word
discount is probably linear and directly proportional
to the file size. Looking for the one-word discount
only takes as long as the file is long. Double the file
size when looking for one word and you double your processing
time. Simple.

Keep in mind, though, that Spamassassin rules are quite
sophisticated at times. The rules are often just as sophisticated
as the spam they process. Therefore, the processing time that
Spamassassin rules require can grow much faster than the file size.

What is the default file size? How big does a spam email have
to be before Spamassassin gives up and decides not to process
that message at all?

This Spamassassin documentation seems to be saying that the
maximum file size for an email is 256 MB:

Spamassassin Options

The way the above documentation reads to me is unclear. It
sounds like you can set the maximum message size to
anything you want. It also sounds like the default message
size is 500K.

However, it also says that the maximum message size is
256 megabytes. Big difference. I'm going to go ahead
and assume that this means that 256 megabytes is an absolute
limit and 500 kilobytes is the default limit. I can't think
of another way to read this, can you?

In any case, 256 megabytes is way too many megabytes to worry
about. A spam message this size would take many many minutes
just to load off the network as of this writing (December 2011).

I'd forget the 256 megabytes and focus on the 500 kilobytes.

The 500 kilobyte figure explains why PDF files are often ignored
by Spamassassin. Sending a PDF is a favorite spammer trick. Since
PDF files are so big, it is a technique for manipulating Spamassassin
into ignoring the spam message.

OK. That explains it. That explains to my satisfaction why
Spamassassin has been ignoring spam messages with PDF files
attached. The PDF files are too big to be examined.
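
Here's a sketch for checking whether a saved message is over the 500 kilobyte figure. It assumes 500 kilobytes means 500 * 1024 bytes, which is a guess about how the limit is counted. The helper name too_big_to_scan is made up for the example.

```shell
# LIMIT_BYTES is an assumption: 500 kilobytes taken as 500 * 1024 bytes.
LIMIT_BYTES=$(( 500 * 1024 ))

# too_big_to_scan is a hypothetical helper name. It succeeds when the
# message file is larger than LIMIT_BYTES.
too_big_to_scan() {
    # The arithmetic expansion strips any padding that wc may add.
    size=$(( $(wc -c < "$1") ))
    [ "$size" -gt "$LIMIT_BYTES" ]
}

# Example (spam.mbox is a placeholder file name):
# too_big_to_scan spam.mbox && echo "spam.mbox is over the default limit"
```

Running this against a spam email with a PDF attached is one way to confirm that size, and not the PDF itself, is what keeps the message from being scored.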

I'll look into this more later.

Ed Abbott

Tuesday, December 13, 2011

Body Versus Rawbody
in Spamassassin Rules

Currently I'm doing some research into the
difference between body and rawbody
when writing spamassassin rules. Here's
one explanation:

Rawbody or Body

It seems that body does a couple of things based
on the above explanation:

  1. It ignores HTML tags
  2. It goes beyond end-of-line
    boundaries

I make these suppositions based on what I read
at the above link.

It would seem that rawbody gives you the
ability to do a couple of things:

  1. Examine beginning and end-of-line relationships
  2. Examine HTML tags

An example of when you might want to examine
an end-of-line relationship is the occasion
when you choose to use an end-of-line anchor,
which is a dollar sign character in
regular expressions.

Here's a more complete explanation of the difference
between body and rawbody:

Rule Definitions and Privileged Settings

I've learned something valuable by experimenting
with these rules. I've learned that sometimes a
body rule has trouble crossing a line boundary
but other times it does not.

I've struggled with this for quite some time! In
doing a little research I think I may have found an
answer. It seems that body rules treat each HTML
paragraph as its own line. Here's where I
first came across this idea:

Add a New Rule Type:
Single-Line Body?


This helps me to understand why this rule crosses
line boundaries:

rawbody ED__FREE_EARN_MONEY_RAWBODY m|free.{0,1200}earn.{0,1200}money|is
describe ED__FREE_EARN_MONEY_RAWBODY The words 'free earn money' in raw body

But this rule does not:

body ED__FREE_EARN_MONEY_BODY  m|free.{0,1200}earn.{0,1200}money|is
describe ED__FREE_EARN_MONEY_BODY The words 'free earn money' in body  

More precisely, the above rawbody rule crosses
paragraph boundaries but the body rule does
not. When I say paragraphs, I mean HTML paragraphs.

I suspect I need to make better use of meta rules
in order to cross paragraph boundaries. Perhaps I need
an alphabetized list of spam words that I can then
use to trigger meta rules.

I might then have collections of spammy words that trigger
actual points. For example, the above rule goes after the
word free followed by the word earn followed
by the word money.

Only when these words are used together are they spammy words.
Each word in isolation is not spam. It's the words used
together to form bigger thoughts that makes them spammy.

Or maybe I should just stick with rawbody rules. I
think I can keep processing time on these rules from
growing too long by putting a 1200 character limit (or
whatever limit) on my regular expression patterns.

I prefer to use rawbody as seldom as possible because
some rawbodies are awfully long. However, in some cases,
rawbody may be the best solution.

The lesson for me in this is to find the right tool to
do the job. One tool will fit one job and another tool
will do a better job on another job.

Ed Abbott

Wednesday, December 7, 2011

Where sa-update Keeps the Latest Rules

When fighting spam, I like to know that
I'm working with the latest rules. The
sa-update command is fairly quiet
about what it is doing. I tend to run
sa-update and then wonder if anything
happened.

Here's where sa-update keeps the most
up-to-date rules on my system:

/var/lib/spamassassin/3.003001/updates_spamassassin_org/

In time, the version number, 3.003001, will be outdated.
This page tells you why:

Rule Updates

The timestamps on all the files in the above directory are the
same. All the timestamps are identical and just a few minutes
old.

As of this writing, it appears that sa-update updates
all the files in the directory by overwriting them. This
makes sense. Since all these files appear to be small text
files, perhaps this is as good an approach as any.

To observe sa-update in action, you might try this
command. The command includes the -D switch which
gives debug information:

sa-update -D

I had assumed that my rules were updated each time I retrieve
email with Kmail. I'm wrong! Does this mean I should
run sa-update just before I retrieve email? Perhaps so.

Also, I notice that I have to run sa-update as root or it does
not work. This makes sense.

Will I get less spam if I run sa-update more often? I'm
going to experiment to see.

Update: January 11, 2012

I'm back trying to fill in the holes in my knowledge
about sa-update. A question I've had all along
is: does sa-update automatically update?

Apparently not. It's apparent to me that in my current
Linux distribution, which is Debian Squeeze, sa-update
is a manual operation. I'm guessing that there is a way
to make it automatic. I just don't personally know how to
do it.

In reading the sa-update man page, I find that
there are 2 fundamental truths regarding the availability
of updates:

  1. If you run sa-update and no update is available,
    sa-update exits with an exit status of 1
  2. When an update does become available, running
    sa-update will give you an exit status of 0

So it all comes down to one or zero. So that's how
this thing works! I've been wondering about this
for quite some time.

In ancient times, when I was still writing Unix
shell scripts regularly, I knew that typing
echo $? at the command line prompt would
give you the exit status of the last command
typed.

Try this command sequence:

ls
echo $?
ls --invalidoption
echo $?

The first command, ls, gives you an exit
status of zero. The second command, ls --invalidoption,
gives you an exit status of something other than
zero.

Zero is OK and non-zero is not so OK.

Apparently this is how you determine whether or not
sa-update has an update for you. You type the following
2 commands:

  1. sa-update --checkonly
  2. echo $?

If the exit status is a zero, an update is available.
If the exit status is one, no update is available.

I think I've finally figured out how to tell whether
or not sa-update actually did something. Run
sa-update --checkonly and note the exit status both
before and after running sa-update itself.

If the exit status is zero before you run sa-update
and it is non-zero after you run sa-update, sa-update
actually did something.
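
Here's a sketch that turns the exit status into words, so there is no need to remember which number means what. The function names are made up for the example. Only the two values from the man page described above are given a meaning, and anything else is reported as is.

```shell
# explain_sa_update_status is a hypothetical helper name. It turns an
# exit status from sa-update into a short message, using the two
# values described above.
explain_sa_update_status() {
    case "$1" in
        0) echo "update available" ;;
        1) echo "no update available" ;;
        *) echo "something else happened (exit status $1)" ;;
    esac
}

# check_for_update is also a hypothetical name. It runs the check and
# explains the result. It assumes sa-update is installed.
check_for_update() {
    status=0
    sa-update --checkonly || status=$?
    explain_sa_update_status "$status"
}
```

Typing check_for_update before and after a real sa-update run would then show the change from update available to no update available.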

Got it!

The lesson here seems to be if you dig deep enough, you
find the answer you are looking for.

Ed Abbott

Thursday, October 6, 2011

How to Filter Out
Foreign Language Email

I just learned something new. Lately
I've been receiving Russian spam of some
kind. Since I do not know Russian, I do
not know what the spam actually says.

This web page describes how foreign
language spam can be filtered out:

Mail::SpamAssassin::Conf -
SpamAssassin configuration file


I placed the following line in my
user_prefs file:

ok_locales en

Once I finished altering user_prefs,
I tested the result using the technique
described in this post:


How to
Test One Single Email
With Spamassassin


According to the above test, the
Russian spam triggered the following
rules:

CHARSET_FARAWAY_HEADER
MIME_CHARSET_FARAWAY

Such a simple thing! Setting
ok_locales triggered a
couple of rules in this case. Both
rules use the word faraway.

I like the word faraway. I
get both Chinese and Russian spam.
This spam is very faraway from my
desires.

The lesson for me in all of this
is that if you want to solve a
problem, look for a simple solution
first.

Filtering out languages that I don't
understand is a simple solution.

Update: November 9, 2011

For some reason, this does not always
work. It works for some foreign
language spam but not all foreign
language spam.

Why are they still getting through?
I'm not sure.

I suspect it has to do with utf-8.
Since utf-8 is a universal character
set, it may not be as easy to identify
utf-8 spam as other spam.

That's my theory as to why ok_locales
does not always seem to work.

Right now, it's just a theory. However,
in coming weeks I'll be looking to see if it
is consistently true that utf-8 foreign
language spam
does not get filtered out.
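
One way to test the utf-8 theory is to look at which character sets a saved message declares. Here's a sketch. The helper name list_charsets is made up, and it only looks at charset= parameters, so it will miss a character set that is named some other way.

```shell
# list_charsets is a hypothetical helper name. It prints each distinct
# charset named in a charset= parameter of a saved message, in lower case.
list_charsets() {
    grep -Eio 'charset="?[^";[:space:]]+' "$1" \
        | tr 'A-Z' 'a-z' \
        | sed -e 's/^charset=//' -e 's/^"//' \
        | sort -u
}
```

If the foreign language spam that gets through keeps showing utf-8 here, while the spam that gets caught shows something else, that would support the theory.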

Ed Abbott

Thursday, September 1, 2011

How to
Test One Single Email
With Spamassassin

Testing one single email with Spamassassin.
It sounds so simple. Why has it taken me
so long to discover how to do this?

I suppose it took me this long for two reasons.

  1. It never occurred to me to pipe the single
    email through spamassassin.
  2. I was slow figuring out how to use the
    man pages to discover spamassassin command
    line options

Of the two discoveries, the first was discovering
command line options. I write about this in a
previous post:

Spamassassin Options

Next, I discovered the delete and
test options. You can read about
these on the man page called
spamassassin-run.

Here's how I put it all together:

cat spam.mbox | spamassassin -dt >testresults.txt

Here are the steps that put it all
together:

  1. Save the spam email that you are
    interested in testing to a file called
    spam.mbox
  2. Run the spam email through
    spamassassin using the above pipe
  3. At the end of the pipe, save the
    results to a file called testresults.txt
  4. View testresults.txt with your
    favorite text editor

How do you save a spam email? My email client
is called kmail. With kmail, the spam
email is saved by using the file menu
in the upper left-hand corner of your screen.
The way you save a spam email to a file may
differ from the way I save a spam email to a
file.

When testing your single spam email, be sure
to include the -dt option. The -d
part of the option deletes spamassassin markup
that is already in the email and that may
confuse the issue.

The -t option says that this is just a
test and is not the real deal. Basically, you
are testing how spamassassin will respond to
a specific email rather than running spamassassin
for its ability to classify and categorize spam.

In other words, -t is theory instead of
actual practice. With -t you can test
your brand new spamassassin rule before
putting it into production.

Of course, you want to be sure the new rule has
correct syntax before doing any of this. The
command for testing a rule for syntax correctness
is:

spamassassin --lint

It's nice to be able to immediately test a new
rule you've written for a specific spam email to
see how many points it will rack up. That's the
name of the game: racking up points.

The lesson? Sometimes it takes a long time to
discover the simplest little thing.

Being able to test a spam email for how many
points it will rack up is the simplest little
thing. Yet, it is very helpful to know how to
do this.

Update: February 7, 2012

I've since learned more about testing a single
email against spamassassin. I've learned that
it is probably better to run local tests only
when testing a single email.

What is a local test? It is a non-network test.
Some tests require network access. To turn
off tests that require network access, you
use the -L option.

If I understand correctly, the spamassassin -L option
will only run tests that are stored on your hard
drive. These tests include tests that you have
written and tests that have been written by others.

The tests that I have written are stored in this
file on my Debian Squeeze system:

~/.spamassassin/user_prefs

The tests that I did not write are stored here:

/var/lib/spamassassin/3.003001/updates_spamassassin_org/

It's when I run sa-update at the command line while
logged in as root that I acquire rules that I did not write
in the above directory. The point? Generally speaking, rules
that are stored on my hard drive are considered local rules.

The rules that are not local are the ones that require a
network access. In my rather limited experience, network
rules are rules that necessitate a lookup in a blocklist of
some kind. Perhaps there are other kinds of rules that
require a network access that I do not know about.

Generally speaking, blocklists are spammy IP addresses
that have been used to send spam in the past. If I
understand right, overuse of blocklist lookups can
get you categorized as a commercial user who is supposed
to pay for these lookups.

Using the spamassassin -L option can help you to
avoid excessive lookups in various blocklists. Therefore,
when I test a single email, I now add the -L option
like this:

cat spam.mbox | spamassassin -dtL >testresults.txt

Note that the above command line is the same as the
one I published up above a few months ago except that
the -L option is now present.
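
Here's a sketch that bundles the lint check and the single-email test into one shell function. The function name test_one_email is made up. The sketch runs the markup removal (-d) and the local test (-t -L) as two separate stages of a pipe. That split is a choice made for this example to keep each step explicit, not something the one-line command above requires.

```shell
# test_one_email is a hypothetical helper name.
# Usage: test_one_email MESSAGE [RESULTS]
# RESULTS defaults to testresults.txt.
test_one_email() {
    msg="$1"
    out="${2:-testresults.txt}"
    if [ ! -r "$msg" ]; then
        echo "cannot read message file: $msg" >&2
        return 2
    fi
    if ! command -v spamassassin >/dev/null 2>&1; then
        echo "spamassassin is not installed or not on the PATH" >&2
        return 3
    fi
    # Check rule syntax first and stop if the rules do not lint.
    spamassassin --lint || return 1
    # Strip old markup (-d), then run a local-only test (-t -L)
    # on the cleaned message.
    spamassassin -d < "$msg" | spamassassin -t -L > "$out"
}
```

With the function loaded, typing test_one_email spam.mbox would leave the report in testresults.txt, ready to view in a text editor.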

The lesson? No matter how much you learn about something,
there's always something else to know.

Ed Abbott

Friday, August 26, 2011

Spamassassin Options

I'm only now figuring out all the
options Spamassassin has available
on the command line. My mistake?
Using the following command to find
options:


man spamassassin

What I should have been typing is this:

man spamassassin-run

More and more I'm finding that other commands
are organized like Spamassassin. As the world
gets more complex, commands are organized into
families.

Spamassassin is a family of commands, not just
one command. OK. Now I know where to look for
Spamassassin command line options.

To find out about the family of spamassassin
commands, type this:

man spamassassin

To find out about spamassassin commandline
options, type this:

man spamassassin-run

The lesson? To get where you are going, it
helps to know how to get there.


Ed Abbott

Tuesday, September 7, 2010

Eliminating False Positives on Ham Emails

 
A hazard of writing rules for
Spamassassin is false positives.
Inadvertently, a ham message can
trigger a false positive.

How do you avoid writing rules
that catch ham when you intended
to catch spam instead?

Here are some of the guidelines I
personally follow to avoid falling
into the false positive trap:

  1. Never write a spam rule that
    would appear in ham emails more
    often than one out of ten thousand times.
  2. Have a separate email folder that
    has been designated ham where
    you store your collection of ham emails.
  3. Periodically do searches on your
    ham emails for rules you've written
    that have been triggered.
  4. To make it easy to find rules
    that have been triggered in your ham
    folder, use a unique set of characters
    to identify spam rules you've written
    yourself.

About the one out of ten thousand
rule: this is strictly a subjective
criterion.

For example, in my own mind, I've decided
that the term "on sale" is probably
going to appear in fewer than one out of ten
thousand of my ham emails.

After I've implemented a spam rule, I test
it for unintended consequences by searching
my ham emails regularly. Here are the steps
I take to search ham emails.

I use kmail as my email client.
While the steps I take may vary slightly from
your steps, you probably can find a way to do
the same thing that I do.

Here are the steps under kmail:

  1. Click on the ham folder
    to make it the present working folder
  2. Click on the tools menu at the top of
    the kmail interface
  3. Click on Find Messages
  4. Search for messages that have your
    very carefully chosen personal rule
    identification string
  5. Wait for all messages in your
    ham folder that have triggered false
    positives to be gathered
  6. Once all the false positives have
    been gathered, click on the date
    column to sort the false positive
    emails in date order
  7. Start clicking on the emails themselves
    in reverse chronological order to find
    out why each email was subject to one
    or more false positives
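
If you don't use kmail, the same search can be roughly sketched in
Python. This is only a sketch under assumptions of mine: each message
is a separate file (as in a maildir cur directory), SpamAssassin has
added an X-Spam-Status header naming the rules that triggered, and
your personal rules share a prefix such as ED_. The function name
find_triggered is made up for illustration.

```python
import email
import re
from email.utils import parsedate_to_datetime
from pathlib import Path


def find_triggered(folder, prefix="ED_"):
    """Return (timestamp, filename, rule names) for messages that hit a personal rule."""
    # Rule names are assumed to be the prefix followed by word characters.
    pattern = re.compile(r"\b" + re.escape(prefix) + r"\w+")
    hits = []
    for path in Path(folder).iterdir():
        if not path.is_file():
            continue
        with open(path, "rb") as handle:
            msg = email.message_from_binary_file(handle)
        rules = pattern.findall(str(msg.get("X-Spam-Status", "")))
        if not rules:
            continue
        try:
            when = parsedate_to_datetime(str(msg.get("Date", ""))).timestamp()
        except (TypeError, ValueError):
            # No usable Date header: fall back to the file's modification time.
            when = path.stat().st_mtime
        hits.append((when, path.name, rules))
    # Newest first, so false positives not yet seen come up at the top.
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits
```

Calling find_triggered on the path of your own ham folder would give
you the same list that steps 4 through 7 produce, newest message first.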

The reason I look at emails in reverse
chronological order is that I'm really
only interested in false positives that
I've not yet seen.

Here's an example of a false positive:

I recently got an email from a friend.
We were planning to attend the annual
church campout hosted by our church.

She wrote to say that she normally is
able to get tiki torches on sale at the
end of summer. Our campout
is in late August.

This year, however, she was unable to find
any on sale. The words "on sale" are
part of my personal spam ruleset. Therefore,
her email triggered my rule.

In spite of the trigger, her email got an
overall score of minus 1.7. A score of minus
1.7 is clearly ham.

Without the one point for the "on sale"
mention in the body of her message, her score
would have been minus 2.7. Not much of a
difference.

However, this did give me the opportunity to
rethink my rule. Are the words "on sale"
likely to appear in ham emails more often than
one out of ten thousand emails? I've decided not.

I feel that the rule is doing a lot of good and
a minuscule amount of harm, so I've decided to keep
it.

The one out of ten thousand rule for
writing rules has served me well. I catch a lot of
spam this way and it becomes almost statistically
impossible for legitimate email to be identified
as illegitimate email.

I'm amazed at how well this rule works. Here's
how the one out of ten thousand rule works in
actual practice.

Let's say that one of these rules will generate
a false positive in one out of ten thousand cases.
Let's further say that requiring two of these rules
to trigger together makes a false positive another
hundred times less likely.

One out of ten thousand times one out of a
hundred is one out of a million. Therefore, two
rules working together are, in theory, likely to
trigger together in a ham email one out of a million
times. The math I'm using is intuitive rather than rigorous.
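
For anyone who wants to check the arithmetic, here is a small sketch
in Python. The one-in-ten-thousand and one-in-a-hundred figures are
the guesses from this post, not measurements. The last calculation is
an extra comparison: what the odds would be if the two rules were
completely independent of each other, which is an assumption and not
something I have tested.

```python
from fractions import Fraction

# Guessed rate at which a single rule misfires on a ham email.
single_rule = Fraction(1, 10_000)

# The figure used in the post: needing a second rule to fire as well
# makes a false positive another 100 times less likely.
cautious_pair = single_rule * Fraction(1, 100)

# Comparison only: if the two rules were fully independent,
# their individual rates would simply multiply.
independent_pair = single_rule * single_rule

print(cautious_pair)     # 1/1000000
print(independent_pair)  # 1/100000000
```

Either way, the point is the same: two one out of ten thousand rules
firing together on ham should be very rare.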

The only thing I've got backing it up is
experience. Using the one out of ten thousand
rule, I've never had two rules trigger
together on a ham email. There is an
exception to this and that is when a friend
of mine forwards a promotional email to me.

I consider a promotional email sent to me by
a friend who thinks I might be interested
to be a ham email. However, it is almost impossible
to write rules for this rare case that make any
sense.

Therefore, I ignore this possibility and hope
for the best. Typically, the ham email, which
has promotional language in it, will get
through my spam filters very easily.

The fact that the email is legitimately from
a friend and the fact that my friend is part
of my auto whitelist seems to provide cover
for what would otherwise be an offending
email.

Other than the rare case when emails
are legitimately written in promotional
language, my spam filters are working
perfectly. In fact, I can't recall a
single missed email in the past few years.

All of the good stuff seems to be getting
through. When bad email gets through, I
start writing more rules. All my rules
are one out of ten thousand rules
and all of them are worth exactly one point.

When 5 points are gathered, the email is
designated spam and automatically goes to
the spam folder.
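
The point counting can be sketched in a few lines of Python. This is
a toy model, not SpamAssassin's actual code: the rule names are
hypothetical, every rule is worth one point unless a different score
is given, and an email counts as spam once the total reaches 5.

```python
REQUIRED_SCORE = 5.0  # the threshold described above


def classify(triggered_rules, scores=None, default_score=1.0):
    """Add up the points for the triggered rules and compare to the threshold."""
    scores = scores or {}
    total = sum(scores.get(rule, default_score) for rule in triggered_rules)
    verdict = "spam" if total >= REQUIRED_SCORE else "ham"
    return total, verdict


# Hypothetical rule names, each worth the default one point.
four_rules = ["ED_ON_SALE", "ED_ROLEX_FROM", "ED_RULE_3", "ED_RULE_4"]
print(classify(four_rules))                  # (4.0, 'ham')
print(classify(four_rules + ["ED_RULE_5"]))  # (5.0, 'spam')
```

In a real setup the total also includes SpamAssassin's own rules,
some of which carry negative scores, so five of my rules alone are
not always enough to tip an email into the spam folder.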

Using the one out of ten thousand rule,
I get so little spam that I sometimes get
suspicious. I go into my spam folder just
to make sure that the spammers have not been
neglecting me.

In the end, I find that I've not been neglected
at all. Currently I have approximately 25,000
spam emails in my spam folder. No need to worry.
The spammers still love me.

Spamassassin, when used in a clever and intelligent
way, is one of the most effective software packages
I've ever run across.

Ed Abbott

Tuesday, March 30, 2010

Spamassassin Plugins Directory

 
I think I've found the spamassassin
plugin directory. Anyone who knows
more than I do, please please correct
me. On my computer, the plugin directory
seems to be here:

/etc/spamassassin

I was looking to see if the MIME-type
plugin was enabled in my version of
spamassassin. Indeed it seems to be.
The following grep command yields
the following result when executed in
the above /etc/spamassassin directory:

$ grep -i mimeheader *
v310.pre:# MIMEHeader - apply regexp rules against MIME headers in the message
v310.pre:loadplugin Mail::SpamAssassin::Plugin::MIMEHeader
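
A rough Python sketch can list every plugin that is switched on,
rather than grepping for one name at a time. It assumes, as on my
system, that plugins are enabled by uncommented loadplugin lines in
the .pre files of the configuration directory. The function name
loaded_plugins is made up for illustration.

```python
from pathlib import Path


def loaded_plugins(config_dir="/etc/spamassassin"):
    """Map each .pre file name to the plugins its uncommented loadplugin lines name."""
    found = {}
    for pre_file in sorted(Path(config_dir).glob("*.pre")):
        plugins = []
        for line in pre_file.read_text(errors="replace").splitlines():
            words = line.split()
            # A commented-out line starts with "#", so its first word
            # is never exactly "loadplugin".
            if len(words) >= 2 and words[0] == "loadplugin":
                plugins.append(words[1])
        if plugins:
            found[pre_file.name] = plugins
    return found
```

On a system laid out like mine, the result for v310.pre should
include Mail::SpamAssassin::Plugin::MIMEHeader.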

OK. It appears that I do have a
plugin called MIMEHeader
which is plugged into Spamassassin.
Here's how I got more information
on this plugin:

perldoc Mail::SpamAssassin::Plugin::MIMEHeader

Reading the documentation, I learned
that there is a raw version
of the MIME type. Learning this
has helped me solve a problem.

I learned that the raw version of
the MIME type does not clean up
newlines while the regular version
does.

This has caused me to use the Perl
match operator a bit differently.
I had wondered why my match operator
was not matching multi-line data.

A little digging in the Perl documentation
and I realized that I need to add an
s suffix to match the raw version,
because the raw version does not replace
newlines but instead leaves them intact.

Here's what I've done. I've gone from
this:

m|matchme.*matchme|i

to this:

m|matchme.*matchme|is

I'm hoping the s suffix
in the above match operator
will allow me to cross newline
boundaries with my .* pattern.
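
The effect of the s suffix is easy to try outside of Spamassassin.
Here is a small sketch in Python, where the re.DOTALL flag plays the
same role as Perl's s suffix and re.IGNORECASE plays the role of the
i suffix. The sample text is made up; it just stands in for a header
value that still contains a newline.

```python
import re

# Made-up value spanning two lines, as a raw, still-folded header might.
raw_value = "matchme first line\n\tsecond line matchme"

# Without DOTALL, "." stops at the newline, so the pattern cannot span lines.
without_s = re.search(r"matchme.*matchme", raw_value, re.IGNORECASE)

# With DOTALL, "." also matches the newline, so the pattern spans both lines.
with_s = re.search(r"matchme.*matchme", raw_value, re.IGNORECASE | re.DOTALL)

print(without_s is not None)  # False
print(with_s is not None)     # True
```

This only shows how the regular expression itself behaves. Whether
the rule inside Spamassassin sees the newlines still depends on
using the raw version.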

The other solution seems to be to
drop the :raw namespace off
the end of the MIME type so that
newlines are cleaned up for me.

All this is supposition. I'll
have to see whether or not my
new Spamassassin rule is now able
to cross newline boundaries unhindered.

Ed Abbott

Monday, December 7, 2009

Examples of Spamassassin Rules

Here's where you can go to find great
examples of spamassassin rules:

cd /usr/share/spamassassin/

In this directory are all kinds of
files with spamassassin rules that
catch spam.

Imitate and learn.

Of course, I'm assuming your system
is set up the way mine is. If not,
I'm sure you can find the same files
on your system somewhere.

My system is Debian Lenny Linux. I
always go with the standard install
options.

Your Linux may be different than mine.
I don't know.

Ed Abbott

Friday, December 4, 2009

Customizing Spamassassin

OK. This is how you customize
spamassassin to eliminate specific
kinds of spam emails.

Say, for example, you don't want
any more emails from a guy named
Joe the Rolex Man, whose From line
always includes the name "Joe the
Rolex Man" in double quotes.

Hopefully, I'm making this up. If
your name is Joe the Rolex Man, I'm
sorry.

In other words, the sender of the email
is a guy named Joe Rolex. How
do you eliminate all email senders who
have Rolex in their name?

OK. Here are the steps:

  1. Find your .spamassassin folder
  2. Find the user_prefs file in the folder
  3. Add rolex to the file

Obviously, the 3 steps above need further
explanation. Especially the word rolex.
Rolex needs to be a perl regular expression.

So here are some more tips. First, some tips on
how to find the .spamassassin folder:

  1. It is a hidden folder because it starts
    with a dot
  2. Type ls -Al to see it
  3. Look for it in your home directory

If you are trying to eliminate Joe Rolex
from your entire system, and for every user
on that system, my advice is no good.

I'm telling you how to eliminate Joe Rolex
as a single spamassassin end-user. System
administrators, note the upcoming link.

Here's a much more comprehensive guide to
Spamassassin that will show you how to do
the same thing system-wide:

Custom rules for spamassassin

OK. Back to being a single user
trying to eliminate Joe Rolex.

Presumably you've now found the
user_prefs file under your
home directory. Here's where you will
find it:

~/.spamassassin/user_prefs

Now we need to add a rule to this file.
A rule is something that kicks in when
we want it to.

The rule kicks in and spam gets kicked
out.

Our rule is that we are going to make
some attempt to eliminate email sent
to us by Joe Rolex.

Here's what our rule looks like in the
user_prefs file:

header ED_ROLEX_FROM        From:name =~ m{rolex}i
describe ED_ROLEX_FROM      From name has rolex in it
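
To see what the rule is checking, here is a rough imitation of it in
Python. This is not how Spamassassin itself does the work. It simply
pulls the name part out of a From line and runs the same
case-insensitive pattern against it. The function name and the
example.com addresses are made up.

```python
import re
from email.utils import parseaddr

# Same pattern as m{rolex}i: the word rolex, upper or lower case.
ROLEX = re.compile(r"rolex", re.IGNORECASE)


def from_name_matches(from_header):
    """True if the name part of a From line contains 'rolex'."""
    display_name, _address = parseaddr(from_header)
    return bool(ROLEX.search(display_name))


print(from_name_matches('"Joe the Rolex Man" <joe@example.com>'))  # True
print(from_name_matches('"Ed" <rolex@example.com>'))               # False
```

Notice that the second example does not match. The rule looks only
at the name, not at the email address itself.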

One thing you want to do
after writing a new rule
is to run spamassassin's
lint program.

As you have probably surmised,
it gets the lint (bugs) out
of your rules.

Here's how to run lint:

spamassassin --lint

How many points does the
new rule shown above assign?

If the email comes from Joe
Rolex, it defaults to 1 point.

You can change the default by
using the score directive like
this:

header ED_ROLEX_FROM        From:name =~ m{rolex}i
describe ED_ROLEX_FROM      From name has rolex in it
score ED_ROLEX_FROM         .5

The above code changes the score from
the default of one point to a specified
half a point.

One more minor detail:

Notice that the name of my rule is
always prefaced by this string:

ED_

That's because my name is Ed and I
want to differentiate between rules
written by myself and rules written
by others.

This way, if a rule is not working right,
I know to go fix it.

In other words, sometimes spam gets through.
When a spam gets through and I see that one
of my rules has not been triggered, I go
investigate.

Likewise, if one of my rules gets triggered
on an innocent ham message, I investigate.

Again, it is the following prepended string
on the rule name that signals to me that I
need to reconsider one of my own rules because
it is not quite working right:

ED_

Ed Abbott

Monday, November 16, 2009

Feeding sa-learn an email folder

This is a new blog.

I'm blogging about SpamAssassin
and the things I find that make
SpamAssassin useful.

One thing I like to do to make SpamAssassin
a better performer is to periodically re-feed
it old emails that I have stored in either my
ham email folder or my spam email folder.

I have one spam folder and one ham folder.

I reprocess these old emails periodically in
the hopes of improving SpamAssassin's accuracy.

First, I'll give some general steps for
feeding SpamAssassin ham:

  1. Find your ham email folder. The
    place to look? Wherever your email
    reader places these files. In my
    case, I use kmail. Therefore, my
    email folder is a kmail email folder.
  2. Feed the folder to sa-learn

OK. Those are the steps, generally
speaking.

Here's the generic command for feeding
ham to sa-learn:

sa-learn --ham your-ham-folder


OK. Now I'll give you the specific steps
I take. Note that the steps I take and the
steps you take are likely to be quite different.

Why? Because I use kmail as my email client
and you likely use something else.

Also, I've used kmail to set up a ham folder
that I can send ham emails to with a single
click. In all likelihood, you have not yet
set up such a folder.

So, read the following steps and translate them
to your own situation.

These are my steps for feeding sa-learn ham on
a Debian Linux system using kmail as my email
client:

  1. cd /home/eds_home_dir/.kde/share/apps/kmail/mail/ham
  2. sa-learn --ham cur

Simple, isn't it?

Of course, prior to running sa-learn, you want to be
sure, in your own mind, that all the messages in your
ham folder really are ham messages. Otherwise, you
might confuse sa-learn.

What about spam messages?

Well, it is pretty much the same thing.

With spam messages, the generic command
is as follows:

sa-learn --spam your-spam-folder


So, basically, you do the same thing
you did for ham:

  1. Find your spam folder
  2. Feed the spam folder to sa-learn
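
If you re-feed your folders regularly, the two commands can be
wrapped in a small script. Here is a minimal sketch in Python. The
function names are made up, and the folder you pass in is whatever
your own email client uses. The script runs the same sa-learn
commands shown above and does nothing if sa-learn is not installed.

```python
import shutil
import subprocess


def sa_learn_command(kind, folder):
    """Build the sa-learn command line for a folder of ham or spam."""
    if kind not in ("ham", "spam"):
        raise ValueError("kind must be 'ham' or 'spam'")
    return ["sa-learn", f"--{kind}", folder]


def feed(kind, folder):
    """Run sa-learn on the folder; return its exit code, or None if sa-learn is missing."""
    if shutil.which("sa-learn") is None:
        return None
    return subprocess.run(sa_learn_command(kind, folder)).returncode


print(sa_learn_command("ham", "cur"))  # ['sa-learn', '--ham', 'cur']
```

As before, make sure the folder really contains only ham (or only
spam) before feeding it.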

Hope this helps!

Ed Abbott