Wednesday, January 11, 2012

How to Write SpamAssassin Rules With No Score

I had the mistaken notion that
SpamAssassin rules have to have
a score. Because of this, I was
writing rules worth 1/1000th of a
point.

I gave very small scores to rules
that I would later collect together
to form a meta rule.

The mistaken notion came about because
I read that rules that have a score
of zero are not evaluated. Since Perl
uses zero to mean false and since
SpamAssassin is based on Perl, I figured
I was stuck writing rules with minuscule
scores if I wanted Perl to distinguish
between true and false.

A rule with a score of zero adds nothing
to the total, whether or not the rule
matches. Therefore, rules that have scores
of zero are not evaluated at all by SpamAssassin
because evaluating them would change nothing.

What I needed, but did not realize I needed,
was a way to evaluate rules without giving
the rule a score.

Here's the article that taught me that
you can make a rule in SpamAssassin
that is evaluated but that has no score:

Writing your own Add-On Rules for SpamAssassin

No-score rules are great! They are great
for several reasons:

  1. No-score rules do not show up in the
    SpamAssassin scoring reports that are
    inserted into each email evaluated by
    SpamAssassin. This reduces visual clutter.
  2. No-score rules only have consequences
    if they add up to something greater in
    a meta rule.
  3. No-score rules let you score combinations
    of words and phrases, regardless of what order
    the words and phrases appear in the email.
Let me give an example. Let's say, in
your own mind, the term 'on sale'
is spammy, but not so spammy as to trigger
a spam rule that scores points.

Let's also say that the word 'discount',
in your own mind, is also spammy, but not so
spammy as to warrant a spam trigger via point
assignment.

You now have 2 terms, 'on sale' and
'discount', which by themselves are
not spammy enough to do anything about.

After all, a good friend could email you and
say that they got something 'on sale' or
they got it at a 'discount' and it's all
perfectly innocent. Both terms, 'on sale'
and 'discount', are legitimate terms in
normal human discourse.

Now let's say that 'discount' and 'on sale'
in isolation are not worth assigning a score to
but, taken together, they mean much more than
either term does on its own.

Let's say that, in your mind, any email that mentions
both 'discount' and 'on sale' should be scored
one point. Here's how you do this:

First, you need a scoreless way to detect the
term 'on sale'. Here's the no-score
way to do it:

body           __ON_SALE  m|on.{0,12}sale|i
describe       __ON_SALE  The term 'on sale' is found in the body of the email message

Note the double underscores at the start of the rule name.
That's the mechanism that gives you a scoreless rule.
That's the thing I was missing. I did not understand
that SpamAssassin has a mechanism for assigning no
score.

Note also that I've written the regular expression
in a very liberal way. The zero in the quantifier {0,12}
indicates that I don't care whether or not the words
are run together. 'onsale' and 'on sale'
will both trigger this rule equally well. That's
what the zero is all about.

Also, I don't care too much about what appears between
the two words 'on' and 'sale' as long as it is 12 characters
or fewer. That's very liberal and will catch the
words 'on sale' in many different forms.

For example, it will catch 'this weekend only sale',
because the word 'only' has the word 'on' embedded
inside it. I'm choosing to be very liberal to demonstrate
that SpamAssassin is a very flexible tool. You may choose
to be more cautious than I'm being in my example.

Also note that I'm using the case-insensitive suffix, the
letter i. With the letter i suffix, the
terms 'On Sale', 'ON SALE', and 'on sale'
are all caught equally well.
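
If you want to see how liberal this pattern really is before trusting it with your mail, you can try it outside SpamAssassin. Here is a minimal sketch in Python, assuming Python's re module is an acceptable stand-in for the Perl regex engine on a pattern this simple (it is a stand-in, not what SpamAssassin actually runs). The sample strings are made up, and the stricter pattern using \b and \s+ is only one hypothetical way of being more cautious; it is not part of the rule above.

```python
import re

# Same pattern as the __ON_SALE rule; re.IGNORECASE plays the role of the i suffix.
ON_SALE = re.compile(r"on.{0,12}sale", re.IGNORECASE)

# Hypothetical stricter variant: 'on' and 'sale' must be whole words
# separated only by whitespace.
ON_SALE_STRICT = re.compile(r"\bon\s+sale\b", re.IGNORECASE)

samples = [
    "Everything is onsale today",         # zero characters between the words
    "I got it On Sale",                   # mixed case, one space
    "HUGE ITEMS ON SALE",                 # upper case
    "this weekend only sale",             # 'on' is embedded in 'only'
    "on the second day of the big sale",  # more than 12 characters in between
]

liberal_hits = [bool(ON_SALE.search(s)) for s in samples]
strict_hits = [bool(ON_SALE_STRICT.search(s)) for s in samples]

for text, liberal, strict in zip(samples, liberal_hits, strict_hits):
    print(f"{text!r}: liberal={liberal} strict={strict}")
```

The liberal pattern hits the first four samples and misses only the last one, where more than 12 characters separate the two words. The strict variant hits only the two samples where 'on' and 'sale' appear as separate whole words.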

Now let's do the same thing to the term 'discount'.
Here's my rule for 'discount':

body           __DISCOUNT  m|discount|i
describe       __DISCOUNT  The word 'discount' is found in the body of the email message

OK. Now we're ready to put it all together. Now
we're ready to say that anyone who mentions a
'discount' and something being 'on sale'
in the same email is at least a little bit likely
to be a spammer.

Here's how it all comes together:

meta           ON_SALE_DISCOUNT      (__ON_SALE && __DISCOUNT)
describe       ON_SALE_DISCOUNT      The terms 'on sale' and 'discount' are both found in the body of the email

Here's the thing I love most about this approach. The
rule ON_SALE_DISCOUNT has the following
characteristics:

  1. It does not care whether 'on sale'
    or 'discount' appears first. Any order
    for these 2 terms is acceptable and earns
    the spammer a point.
  2. Distance does not matter. These 2 terms
    could be 5 paragraphs apart. Even with huge
    swaths of text separating the 2 terms, the
    2 terms together earn our spammer a point.
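
The two characteristics above can be sketched as a toy model in Python. This is not how SpamAssassin is implemented. It assumes, as a simplification, that a message is a list of paragraph strings and that a sub-rule counts as hit when it matches any one of them. The function name on_sale_discount and the sample messages are made up for illustration.

```python
import re

# The two double-underscore sub-rules, with re.IGNORECASE standing in
# for the i suffix used in the rules above.
SUB_RULES = {
    "__ON_SALE": re.compile(r"on.{0,12}sale", re.IGNORECASE),
    "__DISCOUNT": re.compile(r"discount", re.IGNORECASE),
}

def on_sale_discount(paragraphs):
    """Toy version of the meta rule (__ON_SALE && __DISCOUNT).

    A sub-rule is hit if it matches any paragraph, so neither the order
    of the terms nor the distance between them affects the result.
    """
    hits = {
        name: any(pattern.search(p) for p in paragraphs)
        for name, pattern in SUB_RULES.items()
    }
    return hits["__ON_SALE"] and hits["__DISCOUNT"]

# Made-up messages, each given as a list of paragraphs.
discount_first = ["Claim your discount today.", "Lots of filler text.", "Everything is on sale!"]
sale_first = ["Everything is on sale!", "Lots of filler text.", "Claim your discount today."]
only_one_term = ["I got my new bike on sale last week."]

print(on_sale_discount(discount_first))
print(on_sale_discount(sale_first))
print(on_sale_discount(only_one_term))
```

The first two messages trigger the toy meta rule whichever term comes first, and the third does not because only one of the two terms is present.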

One more thing worth mentioning. I could have
assigned the rule ON_SALE_DISCOUNT a
score other than 1 point. However, I'm very
very happy with 1 point.

I pretty much never mess with the default 1
point that SpamAssassin gives you. Instead, I
write more rules if one rule is not enough
or I eliminate a rule if that rule is not worth
1 point all by itself.
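
To compare the old 1/1000th-of-a-point habit with the double underscore approach, here is a toy tally in Python. It is a simplified model, not SpamAssassin's code. It assumes that rules whose names start with a double underscore add no score and no report line, and that any other rule without an explicit score gets the default 1 point. The rule names ON_SALE and DISCOUNT (without underscores) and the function tally are hypothetical.

```python
DEFAULT_SCORE = 1.0  # score assumed for a rule that has no explicit score

def tally(hit_rules, explicit_scores=None):
    """Return (total, report) for the rules that hit on one message."""
    explicit_scores = explicit_scores or {}
    total = 0.0
    report = []
    for name in hit_rules:
        if name.startswith("__"):
            # Sub-rule: it was evaluated, but it adds no score and no report line.
            continue
        score = explicit_scores.get(name, DEFAULT_SCORE)
        total += score
        report.append((name, score))
    return total, report

# Old approach: tiny explicit scores on the building-block rules.
old_total, old_report = tally(
    ["ON_SALE", "DISCOUNT", "ON_SALE_DISCOUNT"],
    {"ON_SALE": 0.001, "DISCOUNT": 0.001},
)

# New approach: double-underscore building blocks, default score on the meta rule.
new_total, new_report = tally(["__ON_SALE", "__DISCOUNT", "ON_SALE_DISCOUNT"])

print(len(old_report), round(old_total, 3))
print(len(new_report), new_total)
```

In the old approach three lines land in the report and the total drifts slightly above 1 point. In the new approach only ON_SALE_DISCOUNT is reported, at exactly the default score.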

The lesson for me? There's always a better way
to do things. For me, placing a double underscore
in front of rules that are only there for cumulative
effect is a much better way of doing things.

Ed Abbott