A hazard of writing rules for
SpamAssassin is the false positive:
a ham message that inadvertently
triggers a rule meant for spam.
How do you avoid writing rules
that catch ham when you intended
to catch spam?
Here are some of the guidelines I
personally follow to avoid falling
into the false positive trap:
- Never write a spam rule whose pattern
would appear in ham emails more
often than one time in ten thousand.
- Have a separate email folder,
designated ham, where
you store your collection of ham emails.
- Periodically search your
ham emails for rules you've written
that have been triggered.
- To make it easy to find rules
that have been triggered in your ham
folder, use a unique set of characters
to identify spam rules you've written
yourself.
About the one out of ten thousand
rule: this is strictly a subjective
criterion.
For example, in my own mind, I've decided
that the term "on sale" is probably
going to appear in fewer than one out of ten
thousand of my ham emails.
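The subjective guess can be checked against a real ham collection. What follows is a minimal sketch, not part of the original method: it assumes the ham message bodies have already been loaded as plain strings, and the names `ham_hit_rate`, `within_limit`, and `LIMIT` are hypothetical.

```python
import re
from fractions import Fraction

# The one-out-of-ten-thousand ceiling described in the text.
LIMIT = Fraction(1, 10_000)


def ham_hit_rate(bodies, phrase):
    """Return the fraction of ham bodies that contain the phrase."""
    bodies = list(bodies)
    if not bodies:
        raise ValueError("need at least one ham message")
    # Whole-phrase, case-insensitive match.
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    hits = sum(1 for body in bodies if pattern.search(body))
    return Fraction(hits, len(bodies))


def within_limit(bodies, phrase, limit=LIMIT):
    """True when the phrase shows up in ham no more often than the limit."""
    return ham_hit_rate(bodies, phrase) <= limit


# Tiny illustrative corpus (made up for this example).
sample = [
    "She could not find any tiki torches on sale this year.",
    "See you at the campout in late August.",
    "Salmon season opens next week.",
]
rate = ham_hit_rate(sample, "on sale")
```

Note that a ham folder with fewer than ten thousand messages can only disprove the guess: a single hit already puts the measured rate above the limit, so the estimate stays a judgment call until the collection grows.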
After I've implemented a spam rule, I test
it for unintended consequences by searching
my ham emails regularly. Here are the steps
I take to search ham emails.
I use KMail as my email client.
While my steps may vary slightly from
yours, you can probably find a way to do
the same thing in your own client.
Here are the steps under KMail:
- Click on the ham folder
to make it the current folder.
- Click on the Tools menu at the top of
the KMail interface.
- Click on Find Messages.
- Search for messages that contain your
very carefully chosen personal rule
identification string.
- Wait until all messages in your
ham folder that have triggered false
positives have been gathered.
- Once all the false positives have
been gathered, click on the date
column to sort the false positive
emails in date order.
- Start clicking on the emails themselves
in reverse chronological order to find
out why each email was subject to one
or more false positives.
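The same search can be scripted outside the mail client. The sketch below is only an illustration under several assumptions: that SpamAssassin has written an `X-Spam-Status` header with a `tests=` list into each message, that personal rules share a prefix (`EA_` here is a hypothetical example), and that the messages are available as Python `email` message objects.

```python
import re
from email import message_from_string
from email.utils import mktime_tz, parsedate_tz

PERSONAL_PREFIX = "EA_"  # hypothetical prefix; use whatever marks your own rules


def personal_rules(msg, prefix=PERSONAL_PREFIX):
    """Return the personal rule names listed in the X-Spam-Status header."""
    status = str(msg.get("X-Spam-Status", ""))
    # Capture everything after "tests=" up to the next "key=" field or the end.
    match = re.search(r"tests=(.*?)(?=\s+\w+=|$)", status, re.DOTALL)
    if not match:
        return []
    names = [name.strip() for name in match.group(1).split(",")]
    return [name for name in names if name.startswith(prefix)]


def timestamp(msg):
    """Return the Date header as seconds since the epoch, or 0.0 if unusable."""
    parsed = parsedate_tz(str(msg.get("Date", "")))
    return float(mktime_tz(parsed)) if parsed else 0.0


def false_positives(messages, prefix=PERSONAL_PREFIX):
    """Return (timestamp, subject, rules) for ham that hit personal rules, newest first."""
    found = []
    for msg in messages:
        rules = personal_rules(msg, prefix)
        if rules:
            found.append((timestamp(msg), str(msg.get("Subject", "")), rules))
    found.sort(key=lambda item: item[0], reverse=True)
    return found


# Two made-up ham messages; only the first lists a personal rule.
older = message_from_string(
    "Subject: Campout\n"
    "Date: Mon, 12 Aug 2013 10:00:00 -0700\n"
    "X-Spam-Status: No, score=-1.7 required=5.0 tests=AWL,EA_ON_SALE,\n"
    "\tBAYES_00 autolearn=ham\n"
    "\n"
    "No tiki torches on sale this year.\n"
)
newer = message_from_string(
    "Subject: Directions\n"
    "Date: Tue, 20 Aug 2013 09:00:00 -0700\n"
    "X-Spam-Status: No, score=-2.7 required=5.0 tests=AWL,BAYES_00\n"
    "\n"
    "Map attached.\n"
)
results = false_positives([older, newer])
```

If the ham folder is stored as a Maildir, iterating over `mailbox.Maildir(path, create=False)` from the standard library should yield message objects that can be passed straight to `false_positives`; the path depends on your own setup.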
The reason I look at emails in reverse
chronological order is that I'm really
only interested in false positives that
I've not yet seen.
Here's an example of a false positive:
I recently got an email from a friend.
We were planning to attend our church's
annual campout.
She wrote to say that she is normally
able to get tiki torches on sale at the
end of summer. Our campout
is in late August.
This year, however, she was unable to find
any on sale. The words "on sale" are
part of my personal spam ruleset. Therefore,
her email triggered the rule.
In spite of the trigger, her email got an
overall score of minus 1.7. A score of minus 1.7
is clearly ham.
Without the one point for the "on sale"
mention in the body of her message, her score
would have been minus 2.7. Not much of a
difference.
However, this did give me the opportunity to
rethink my rule. Are the words "on sale"
likely to appear in ham emails more often than
one out of ten thousand emails? I've decided they are not.
I feel that the rule is doing a lot of good and
a minuscule amount of harm, so I've decided to keep
it.
The one out of ten thousand rule for
writing rules has served me well. I catch a lot of
spam this way, and it becomes statistically
very unlikely for legitimate email to be identified
as spam.
I'm amazed at how well this rule works. Here's
how the one out of ten thousand rule works in
actual practice.
Let's say that one of these rules will generate
a false positive in one out of ten thousand cases.
Let's further say that a second rule, firing on the
same email, multiplies the odds by another factor
of one hundred.
One out of ten thousand times one out of a
hundred is one out of a million. Therefore, two
rules working together are, in theory, likely to
trigger together in a ham email one out of a million
times. The math I'm using is purely intuitive.
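The arithmetic can be written out in a few lines. This is a sketch of the intuitive estimate described here, not a measured result: the one-in-a-hundred factor for the second rule is the guess from the text, and the function name `joint_rate` is hypothetical. For comparison, the sketch also computes what two fully independent one-in-ten-thousand rules would give.

```python
from fractions import Fraction


def joint_rate(first_rule, second_given_first):
    """Chance that two rules both fire on the same ham email."""
    return first_rule * second_given_first


one_rule = Fraction(1, 10_000)        # a single rule's false positive rate
second_factor = Fraction(1, 100)      # guessed factor for a second rule firing too

# The estimate from the text: one in a million.
estimate = joint_rate(one_rule, second_factor)

# If the two rules were fully independent, the second factor would itself
# be one in ten thousand, giving one in a hundred million.
independent = joint_rate(one_rule, one_rule)
```

The one-in-a-hundred factor is therefore the more cautious of the two figures: it allows for rules that tend to fire on the same kind of message.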
The only thing I've got backing it up is
experience. Using the one out of ten thousand
rule, I've never had two rules trigger
together on a ham email. There is one
exception, and that is when a friend
of mine forwards a promotional email to me.
I consider a promotional email sent to me by
a friend, who thinks I might be interested,
to be ham. However, it is almost impossible
to write rules that make any sense for this
rare case.
Therefore, I ignore this possibility and hope
for the best. Typically, a ham email that
has promotional language in it will get
through my spam filters very easily.
The fact that the email is legitimately from
a friend and the fact that my friend is part
of my auto whitelist seem to provide cover
for what would otherwise be an offending
email.
Other than the rare case when emails
are legitimately written in promotional
language, my spam filters are working
perfectly. In fact, I can't recall a
single missed email in the past few years.
All of the good stuff seems to be getting
through. When bad email gets through, I
start writing more rules. All my rules
are one out of ten thousand rules
and all of them are worth exactly one point.
When 5 points are gathered, the email is
designated spam and automatically goes to
the spam folder.
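The scoring scheme can be illustrated with a short sketch. It assumes, as described above, that every personal rule adds exactly one point and that five points mark an email as spam; treating "five or more" as the cutoff is an assumption about how the threshold is compared, and the function names are hypothetical. The minus 2.7 figure is the score the friend's email had from SpamAssassin's other tests.

```python
from decimal import Decimal

REQUIRED_SCORE = Decimal("5.0")   # points needed before an email is called spam
POINTS_PER_RULE = Decimal("1.0")  # every personal rule is worth one point


def total_score(other_tests, personal_hits):
    """Add one point per personal rule hit to the score from all other tests."""
    return Decimal(other_tests) + POINTS_PER_RULE * personal_hits


def is_spam(other_tests, personal_hits):
    """True when the combined score reaches the required score."""
    return total_score(other_tests, personal_hits) >= REQUIRED_SCORE


# The friend's email: minus 2.7 from other tests, plus one "on sale" hit.
friend_total = total_score("-2.7", 1)
friend_is_spam = is_spam("-2.7", 1)
```

With scores like these, a single false positive moves a friendly email by one point, which leaves it far below the threshold.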
Using the one out of ten thousand rule,
I get so little spam that I sometimes get
suspicious. I go into my spam folder just
to make sure that the spammers have not been
neglecting me.
In the end, I find that I've not been neglected
at all. Currently I have approximately 25,000
spam emails in my spam folder. No need to worry.
The spammers still love me.
SpamAssassin, when used in a clever and intelligent
way, is one of the most effective software packages
I've ever run across.
Ed Abbott