Tuesday, September 7, 2010

Eliminating False Positives on Ham Emails

 
A hazard of writing rules for
SpamAssassin is false positives.
A rule meant to catch spam can
inadvertently trigger on a ham
(legitimate) message.

How do you avoid writing rules
that catch ham when you intended
to catch spam instead?

Here are some of the guidelines I
personally follow to avoid falling
into the false positive trap:

  1. Never write a spam rule that
    would trigger on ham emails more
    often than one time out of ten thousand.
  2. Have a separate email folder that
    has been designated ham where
    you store your collection of ham emails.
  3. Periodically do searches on your
    ham emails for rules you've written
    that have been triggered.
  4. To make it easy to find rules
    that have been triggered in your ham
    folder, use a unique string of characters
    to identify spam rules you've written
    yourself.

About the one out of ten thousand
rule: this is strictly a subjective
criterion.

For example, in my own mind, I've decided
that the term "on sale" is probably
going to appear less than one time out of
ten thousand in my ham emails.

After I've implemented a spam rule, I test
it for unintended consequences by searching
my ham emails regularly. Here are the steps
I take to search ham emails.

I use KMail as my email client.
While the steps I take may vary slightly from
your steps, you probably can find a way to do
the same thing that I do.

Here are the steps under KMail:

  1. Click on the ham folder
    to make it the current folder
  2. Click on the Tools menu at the top of
    the KMail interface
  3. Click on Find Messages
  4. Search for messages that have your
    very carefully chosen personal rule
    identification string
  5. Wait until all messages in your
    ham folder that have triggered false
    positives have been gathered
  6. Once all the false positives have
    been gathered, click on the date
    column to sort the false positive
    emails in date order
  7. Start clicking on the emails themselves
    in reverse chronological order to find
    out why each email was subject to one
    or more false positives

The reason I look at emails in reverse
chronological order is that I'm really
only interested in false positives that
I've not yet seen.
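
For readers who don't use KMail, the same search can be
sketched in a few lines of Python. This is only a rough
sketch under some assumptions: it assumes SpamAssassin has
written its usual X-Spam-Status header (with a tests= list)
into each ham message, that every message has a parseable
Date header, and that your personal rules all start with a
prefix such as EA_. The prefix and the function names are
made up for illustration.

```python
import re
from email import message_from_string
from email.utils import parsedate_to_datetime

PREFIX = "EA_"  # hypothetical personal prefix; use your own unique string


def personal_hits(raw_message, prefix=PREFIX):
    """Return (date, subject, rule names) if a personal rule fired, else None."""
    msg = message_from_string(raw_message)
    # Assumption: SpamAssassin recorded the triggered tests in this header.
    status = msg.get("X-Spam-Status", "")
    # Pick out every rule name that starts with the personal prefix.
    rules = re.findall(r"\b" + re.escape(prefix) + r"\w+", status)
    if not rules:
        return None
    # Assumption: the message carries a parseable Date header.
    return parsedate_to_datetime(msg["Date"]), msg.get("Subject", ""), rules


def false_positives(raw_messages, prefix=PREFIX):
    """List ham messages that triggered personal rules, newest first."""
    hits = [h for h in (personal_hits(m, prefix) for m in raw_messages) if h]
    return sorted(hits, key=lambda h: h[0], reverse=True)
```

Feed false_positives() the raw text of each message in your
ham folder and read the results from the top down, which
gives the same newest-first order described above.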

Here's an example of a false positive:

I recently got an email from a friend.
We were planning to attend our church's
annual campout.

She wrote to say that she normally is
able to get tiki torches on sale at the
end of summer. Our campout
is in late August.

This year, however, she was unable to find
any on sale. The words "on sale" are
part of my personal spam ruleset. Therefore,
her email triggered that rule.

In spite of the trigger, her email got an
overall score of minus 1.7. A score of minus
1.7 clearly marks a ham email.

Without the one point for the "on sale"
mention in the body of her message, her score
would have been minus 2.7. Not much of a
difference.

However, this did give me the opportunity to
rethink my rule. Are the words "on sale"
likely to appear in ham emails more often than
one time out of ten thousand? I've decided not.

I feel that the rule is doing a lot of good and
a minuscule amount of harm, so I've decided to keep
it.

The one out of ten thousand rule for
writing rules has served me well. I catch a lot of
spam this way and it becomes almost statistically
impossible for legitimate email to be identified
as illegitimate email.

I'm amazed at how well this rule works. Here's
how the one out of ten thousand rule works in
actual practice.

Let's say that one of these rules will generate
a false positive in one out of ten thousand cases.
Let's further say that requiring a second rule to
trigger on the same email makes a false positive
another hundred times less likely.

One out of ten thousand times another one out of a
hundred is one out of a million. Therefore, two
rules working together are, in theory, likely to
trigger together in a ham email one out of a million
times. The math I'm using is purely intuitive.
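
Here is that arithmetic as a small Python sketch. The two
odds are the guesses from the paragraphs above, not measured
values, and the names are only for illustration. The last
line shows, for comparison, what the figure would be if two
one out of ten thousand rules were completely independent
of each other, which is an assumption I have no way to check.

```python
from fractions import Fraction

# Assumed odds taken from the text above (guesses, not measurements).
ONE_RULE = Fraction(1, 10_000)  # one rule fires on a ham email
SECOND_RULE = Fraction(1, 100)  # extra factor for a second rule firing too


def two_rule_odds(first=ONE_RULE, second=SECOND_RULE):
    """Chance that two rules fire together on one ham email."""
    return first * second


print(two_rule_odds())                    # 1/1000000
print(two_rule_odds(ONE_RULE, ONE_RULE))  # 1/100000000 if fully independent
```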

The only thing I've got backing it up is
experience. Using the one out of ten thousand
rule, I've never had two rules trigger
together on a ham email. The one
exception is when a friend
of mine forwards a promotional email to me.

I consider a promotional email sent to me by
a friend, who thinks I might be interested,
to be ham. However, it is almost impossible
to write rules for this rare case that make any
sense.

Therefore, I ignore this possibility and hope
for the best. Typically, the ham email, which
has promotional language in it, will get
through my spam filters very easily.

The fact that the email is legitimately from
a friend and the fact that my friend is part
of my auto-whitelist seem to provide cover
for what would otherwise be an offending
email.

Other than the rare case when emails
are legitimately written in promotional
language, my spam filters are working
perfectly. In fact, I can't recall a
single missed email in the past few years.

All of the good stuff seems to be getting
through. When bad email gets through, I
start writing more rules. All my rules
are one out of ten thousand rules
and all of them are worth exactly one point.

When an email accumulates 5 points, it is
designated spam and automatically goes to
the spam folder.
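
The scoring can be pictured with a short Python sketch. It
is not how SpamAssassin is implemented, only an illustration
of the bookkeeping: each personal rule adds one point to
whatever score the other tests produced, and the email is
treated as spam once the total reaches 5. The function names
are invented, and the sample figures come from the tiki
torch email described earlier.

```python
REQUIRED_SCORE = 5.0   # points at which an email is designated spam
POINTS_PER_RULE = 1.0  # every personal rule is worth exactly one point


def final_score(score_from_other_tests, personal_rules_hit):
    """Add one point per triggered personal rule to the rest of the score."""
    return score_from_other_tests + POINTS_PER_RULE * personal_rules_hit


def is_spam(score):
    """An email is designated spam once it reaches the required score."""
    return score >= REQUIRED_SCORE


# The tiki torch email: minus 2.7 before the "on sale" rule, one rule hit.
tiki = final_score(-2.7, 1)
print(round(tiki, 1), is_spam(tiki))  # -1.7 False
```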

Using the one out of ten thousand rule,
I get so little spam in my inbox that I
sometimes get suspicious. I go into my spam
folder just to make sure that the spammers
have not been neglecting me.

In the end, I find that I've not been neglected
at all. Currently I have approximately 25,000
spam emails in my spam folder. No need to worry.
The spammers still love me.

SpamAssassin, when used in a clever and intelligent
way, is one of the most effective software packages
I've ever run across.

Ed Abbott