Tuesday, December 20, 2011

Why Does Spamassassin
Ignore Emails
with PDF Files Attached?

Why does Spamassassin fail to score emails
that have a PDF file attached? This was
the question I tried to answer for myself
yesterday.

I started doing a little bit of research and
I learned something interesting. It is not
the PDF file that is the problem; it is the
email length.

Spamassassin fails to mark up emails that are
too long. That's the crux of the problem.

There's a reason for this:

Processing time grows faster than the length
of the email. Thus a 200
kilobyte email is likely to take 4 times as
long to process as a 100 kilobyte email.

Don't take my figures too literally. However,
my computer programming experience tells me
that processing time often grows in proportion
to the square of the file size.

Thus a file that is twice as long will tend
to take four times as long to process. There's
a simple reason for this. Everything in a file
tends to relate to everything else.

Let me give you a super-simple example. Let's
say a file consists of 3 lines. Since everything
relates to everything else (including itself), then
3 lines give us 3 X 3 = 9 things to relate to. A
file with 3 lines creates 9 relationships.

In our 3 line file, line 1 relates to line 1. It
also relates to line 2 and 3. So far that's three
things to relate to. If you multiply all these
relationships out, 3 lines form 9 different relationships.

Now here's where quadratic processing time kicks
in. If you double the size of the file from 3 lines
to 6 lines, you now have 36 different relationships.

That is to say, 6 times 6 equals 36. In doubling the
file size, you've quadrupled the number of relationships
between lines in the file. 6 X 6 = 36
is quadruple 3 X 3 = 9. That is to say 4 X 9 = 36.

This all comes back to processing time. Processing
time quadruples when you double the size of the file
being processed.
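
Here is a minimal sketch of that arithmetic as a shell snippet. The
function name relationships is made up for this illustration, and it
assumes the same simplification as above: every line relates to every
line, including itself.

```shell
#!/bin/sh
# Count the "relationships" among n lines, assuming every line
# relates to every line (including itself).
relationships() {
    echo $(( $1 * $1 ))
}

# Each step doubles the number of lines.
for n in 3 6 12; do
    echo "$n lines -> $(relationships "$n") relationships"
done
# 3 lines -> 9 relationships
# 6 lines -> 36 relationships
# 12 lines -> 144 relationships
```

Each doubling of the line count multiplies the result by four.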

Of course, this is an absurd oversimplification. Good
explanations are often absurd oversimplifications.

However, this does illustrate why Spamassassin's processing
time is not linear relative to file size.

In the real world, some Spamassassin rules probably are linear
and some are not. For example, if you write a Spamassassin
rule that only looks for a single spammy word in the file,
such as the word discount, the relationship
is linear. That is to say, the search for the one word
discount is probably directly proportional
to the file size. Double the file
size when looking for one word and you double your processing
time. Simple.

Keep in mind, though, that Spamassassin rules are quite
sophisticated at times. The rules are often just as sophisticated
as the spam they process. Therefore, the processing time that
Spamassassin rules require can grow much faster than the file size.

What is the default file size? How big does a spam email have
to be before Spamassassin gives up and decides not to process
that message at all?

This Spamassassin documentation seems to be saying that the
maximum file size for an email is 256 megabytes:

Spamassassin Options

The above documentation reads as unclear to me. It
sounds like you can set the maximum message size to
anything you want. It also sounds like the default message
size limit is 500K.

However, it also says that the maximum message size is
256 megabytes. Big difference. I'm going to go ahead
and assume that this means that 256 megabytes is an absolute
limit and 500 kilobytes is the default limit. I can't think
of another way to read this, can you?

In any case, 256 megabytes is way too many megabytes to worry
about. A spam message this size would take many many minutes
just to load off the network as of this writing (December 2011).

I'd forget the 256 megabytes and focus on the 500 kilobytes.

The 500 kilobyte figure explains why PDF files are often ignored
by Spamassassin. Sending a PDF is a favorite spammer trick. Since
PDF files are so big, it is a technique for manipulating Spamassassin
into ignoring the spam message.

OK. That explains it. That explains to my satisfaction why
Spamassassin has been ignoring spam messages with PDF files
attached. The PDF files are too big to be examined.
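
To see whether a saved message would cross that threshold, something
like the sketch below might do. It is only an illustration: the
function name too_big_to_scan and the file name message.eml are made
up, and writing 500 kilobytes as 500000 bytes is an assumption about
how the limit is counted.

```shell
#!/bin/sh
# 500 KB default from the text above, written as 500000 bytes.
# (Assumption: your tools may count the limit slightly differently.)
DEFAULT_LIMIT=500000

# Succeeds (exit status 0) when the file is larger than the limit.
# Usage: too_big_to_scan FILE [LIMIT_IN_BYTES]
too_big_to_scan() {
    # $(( ... )) strips any padding that wc adds on some systems.
    size=$(( $(wc -c < "$1") ))
    [ "$size" -gt "${2:-$DEFAULT_LIMIT}" ]
}

# Example with a hypothetical saved message:
# if too_big_to_scan message.eml; then
#     echo "probably too big to be scanned"
# fi
```

If your mail is piped through spamc, its -s option is the place where
this limit is set, in bytes.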

I'll look into this more later.

Ed Abbott

Tuesday, December 13, 2011

Body Versus Rawbody
in Spamassassin Rules

Currently I'm doing some research into the
difference between body and rawbody
when writing spamassassin rules. Here's
one explanation:

Rawbody or Body

It seems that body does a couple of things based
on the above explanation:

  1. It ignores HTML tags
  2. It goes beyond end-of-line
    boundaries

I make these suppositions based on what I read
at the above link.

It would seem that rawbody gives you the
ability to do a couple of things:

  1. Examine beginning and end-of-line relationships
  2. Examine HTML tags

An example of when you might want to examine
an end-of-line relationship is the occasion
when you choose to use an end-of-line anchor,
which is a dollar sign character in
regular expressions.
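
As a rough illustration of the end-of-line anchor, here is the same
idea tried with grep at the command line. This is grep, not
Spamassassin, so it only shows how the dollar sign behaves in a
regular expression; the sample text is made up.

```shell
#!/bin/sh
# Two sample lines; only the first one ENDS with the word "now".
printf 'click here now\nnow click here\n' | grep -c 'now$'
# prints 1: the dollar sign only matches "now" at the end of a line
```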

Here's a more complete explanation of the difference
between body and rawbody:

Rule Definitions and Privileged Settings

I've learned something valuable by experimenting
with these rules. I've learned that sometimes a
body rule has trouble crossing a line boundary
but other times it does not.

I've struggled with this for quite some time! In
doing a little research I think I may have found an
answer. It seems that body rules treat each HTML
paragraph as a separate line. Here's where I
first came across this idea:

Add a New Rule Type:
Single-Line Body?


This helps me to understand why this rule crosses
line boundaries:

rawbody ED__FREE_EARN_MONEY_RAWBODY m|free.{0,1200}earn.{0,1200}money|is
describe ED__FREE_EARN_MONEY_RAWBODY The words 'free earn money' in body

But this rule does not:

body ED__FREE_EARN_MONEY_BODY  m|free.{0,1200}earn.{0,1200}money|is
describe ED__FREE_EARN_MONEY_BODY The words 'free earn money' in body  

More precisely, the above rawbody rule crosses
paragraph boundaries but the body rule does
not. When I say paragraphs, I mean HTML paragraphs.

I suspect I need to make better use of meta rules
in order to cross paragraph boundaries. Perhaps I need
an alphabetized list of spam words that I can then
use to trigger meta rules.

I might then have collections of spammy words that trigger
actual points. For example, the above rule goes after the
word free followed by the word earn followed
by the word money.

Only when these words are used together are they spammy words.
Each word in isolation is not spam. It's the words used
together to form bigger thoughts that makes them spammy.

Or maybe I should just stick with rawbody rules. I
think I can keep processing time on these rules from growing
out of control by putting a 1200 character limit (or
whatever limit) on my regular expression patterns.

I prefer to use rawbody as seldom as possible because
some rawbodies are awfully long. However, in some cases,
rawbody may be the best solution.

The lesson for me in this is to find the right tool to
do the job. One tool will fit one job and another tool
will do better on another job.

Ed Abbott

Wednesday, December 7, 2011

Where sa-update Keeps the Latest Rules

When fighting spam, I like to know that
I'm working with the latest rules. The
sa-update command is fairly quiet
about what it is doing. I tend to run
sa-update and then wonder if anything
happened.

Here's where sa-update keeps the most
up-to-date rules on my system:

/var/lib/spamassassin/3.003001/updates_spamassassin_org/

In time, the version number, 3.003001, will be outdated.
This page tells you why:

Rule Updates

The timestamps on all the files in the above directory are
identical and just a few minutes old.

As of this writing, it appears that sa-update updates
all the files in the directory by overwriting them. This
makes sense. Since all these files appear to be small text
files, perhaps this is as good an approach as any.

To observe sa-update in action, you might try this
command. The command includes the -D switch which
gives debug information:

sa-update -D

I had assumed that my rules were updated each time I retrieved
email with Kmail. I was wrong! Does this mean I should
run sa-update just before I retrieve email? Perhaps so.

Also, I notice that I have to run sa-update as root or it does
not work. This makes sense.

Will I get less spam if I run sa-update more often? I'm
going to experiment to see.

Update: January 11, 2012

I'm back trying to fill in the holes in my knowledge
about sa-update. A question I've had all along
is "Does sa-update automatically update?"

Apparently not. It's apparent to me that in my current
Linux distribution, which is Debian Squeeze, sa-update
is a manual operation. I'm guessing that there is a way
to make it automatic; I just don't personally know how to
do it.
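
One way to automate this might be a small wrapper script that root
runs on a schedule. The sketch below is a guess at what such a script
could look like, not a tested recipe. It relies on the exit statuses
the sa-update man page describes (0 when an update was installed, 1
when none was available). SA_UPDATE and RELOAD_CMD are made-up
variable names, not Spamassassin settings.

```shell
#!/bin/sh
# Hypothetical overrides so the commands can be swapped out.
SA_UPDATE=${SA_UPDATE:-sa-update}
# Replace "true" with whatever reloads spamd on your system, if you run spamd.
RELOAD_CMD=${RELOAD_CMD:-true}

# Run sa-update once and report what happened.
run_update() {
    $SA_UPDATE
    status=$?
    case $status in
        0) echo "rules updated"; $RELOAD_CMD ;;
        1) echo "no new rules" ;;
        *) echo "sa-update failed with status $status" >&2
           return "$status" ;;
    esac
}
```

To use it, call run_update at the end of the script and add a line to
root's crontab along the lines of
15 3 * * * /usr/local/sbin/sa-update-wrapper
where the path and the schedule are only examples.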

In reading the sa-update man page, I find that
there are 2 fundamental truths regarding the availability
of updates:

  1. If you run sa-update and no update is available,
    sa-update exits with an exit status of 1
  2. When an update does become available, running
    sa-update will give you an exit status of 0

So it all comes down to one or zero. So that's how
this thing works! I've been wondering about this
for quite some time.

In ancient times, when I was still writing Unix
shell scripts regularly, I knew that typing
echo $? at the command line prompt would
give you the exit status of the last command
typed.

Try this command sequence:

ls
echo $?
ls --invalidoption
echo $?

The first command, ls, gives you an exit
status of zero. The second command, ls --invalidoption,
gives you an exit status of something other than
zero.

Zero is OK and non-zero is not so OK.

Apparently this is how you determine whether or not
sa-update has an update for you. You type the following
2 commands:

  1. sa-update --checkonly
  2. echo $?

If the exit status is a zero, an update is available.
If the exit status is one, no update is available.

I think I've finally figured out how to tell whether
or not sa-update actually did something. Run
sa-update --checkonly and note the exit status both
before and after you run sa-update itself.

If the exit status of the check is zero before you run
sa-update and it is non-zero after you run sa-update,
sa-update actually did something.
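
The before-and-after comparison can be sketched as a pair of shell
functions. The function names describe_check and update_was_applied
are made up for this example, and the sketch assumes the statuses are
collected with sa-update --checkonly.

```shell
#!/bin/sh
# Translate the exit status of "sa-update --checkonly" into words.
describe_check() {
    case $1 in
        0) echo "update available" ;;
        1) echo "no update available" ;;
        *) echo "something else happened (status $1)" ;;
    esac
}

# Succeeds when an update was available before (status 0)
# and is no longer available after (non-zero status).
update_was_applied() {
    [ "$1" -eq 0 ] && [ "$2" -ne 0 ]
}

# Intended use, run as root:
#   sa-update --checkonly; before=$?
#   sa-update
#   sa-update --checkonly; after=$?
#   update_was_applied "$before" "$after" && echo "sa-update did something"
```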

Got it!

The lesson here seems to be if you dig deep enough, you
find the answer you are looking for.

Ed Abbott