Tuesday, December 20, 2011

Why Does Spamassassin
Ignore Emails
with PDF Files Attached?

Why does Spamassassin fail to score emails
that have a PDF file attached? This was
the question I tried to answer for myself
yesterday.

I started doing a little bit of research and
I learned something interesting. It is not
the PDF file that is the problem; it is the
email length.

Spamassassin fails to mark up emails that are
too long. That's the crux of the problem.

There's a reason for this:

The longer an email, the faster the processing
time for that email grows. Thus a 200 kilobyte
email is likely to take 4 times as long to
process as a 100 kilobyte email.

Don't take my figures too literally. However,
my computer programming experience tells me
that processing time often grows in proportion
to the square of the file size.

Thus a file that is twice as long will tend
to take four times as long to process. There's
a simple reason for this. Everything in a file
tends to relate to everything else.

Let me give you a super-simple example. Let's
say a file consists of 3 lines. If everything
relates to everything else (including itself), then
3 lines give us 3 X 3 = 9 things to relate to. A
file with 3 lines creates 9 relationships.

In our 3 line file, line 1 relates to line 1. It
also relates to line 2 and 3. So far that's three
things to relate to. If you multiply all these
relationships out, 3 lines form 9 different relationships.

Now here's where the faster-than-linear growth kicks
in. If you double the size of the file from 3 lines
to 6 lines, you now have 36 different relationships.

That is to say, 6 times 6 equals 36. In doubling the
file size, you've quadrupled the number of relationships
between the lines in the file. 6 X 6 = 36
is quadruple 3 X 3 = 9. That is to say 4 X 9 = 36.

This all comes back to processing time. Processing
time quadruples when you double the size of the file
being processed.
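
The 3-line and 6-line arithmetic above can be checked with a
few lines of Python. This is only a sketch of my toy model, not
of anything Spamassassin actually does. The function name
count_relationships is made up for this example.

```python
def count_relationships(num_lines):
    """Count every (line, line) pairing, including a line paired with itself."""
    return sum(1 for i in range(num_lines) for j in range(num_lines))


# Each doubling of the line count quadruples the number of pairings.
for n in (3, 6, 12):
    print(n, "lines ->", count_relationships(n), "relationships")

# 3 lines -> 9 relationships
# 6 lines -> 36 relationships
# 12 lines -> 144 relationships
```

Each time the line count doubles, the count of relationships
goes up by a factor of four.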

Of course, this is an absurd oversimplification. Good
explanations are often absurd oversimplifications.

However, this does illustrate why Spamassassin's processing
time is not a linear one. Processing time does not grow in
direct proportion to file size.

In the real world, some Spamassassin rules probably are linear
and some are not. For example, if you write a Spamassassin
rule that only looks for a single spammy word in the file,
such as the word "discount", the relationship
is linear. That is to say, the search for the single word
"discount" is probably linear and directly proportional
to the file size. Looking for the single word "discount"
only takes as long as the file is long. Double the file
size when looking for one word and you double your processing
time. Simple.
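
Here is a rough Python sketch of the linear case. It is not a
real Spamassassin rule and does not use Spamassassin's rule
syntax. The function name lines_examined and the sample text are
invented for illustration. It simply counts how many lines get
looked at while searching for one word.

```python
def lines_examined(lines, word):
    """Scan every line once for `word`; return (hits, lines examined)."""
    hits = 0
    examined = 0
    for line in lines:
        examined += 1
        if word in line.lower():
            hits += 1
    return hits, examined


body = ["Big discount today", "Hello there", "See attached"]

print(lines_examined(body, "discount"))      # (1, 3)
print(lines_examined(body * 2, "discount"))  # (2, 6)
```

Doubling the number of lines doubles the number of lines
examined, and nothing more.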

Keep in mind, though, that Spamassassin rules are quite
sophisticated at times. The rules are often just as sophisticated
as the spam they process. Therefore, the processing time that
Spamassassin rules require can grow much faster than the file size.

What is the default size limit? How big does a spam email have
to be before Spamassassin gives up and decides not to process
that message at all?

This Spamassassin documentation seems to be saying that the
maximum file size for an email is 256 megabytes:

Spamassassin Options

The above documentation reads as unclear to me. It
sounds like you can set the maximum message size to
anything you want. It also sounds like the default maximum
message size is 500K.

However, it also says that the maximum message size is
256 megabytes. Big difference. I'm going to go ahead
and assume that this means that 256 megabytes is an absolute
limit and 500 kilobytes is the default limit. I can't think
of another way to read this, can you?

In any case, 256 megabytes is way too many megabytes to worry
about. A spam message this size would take many many minutes
just to load off the network as of this writing (December 2011).

I'd forget the 256 megabytes and focus on the 500 kilobytes.

The 500 kilobyte figure explains why PDF files are often ignored
by Spamassassin. Sending a PDF is a favorite spammer trick. Since
PDF files are so big, it is a technique for manipulating Spamassassin
into ignoring the spam message.
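
To see how an attachment pushes a message past a size limit,
here is a small Python sketch using the standard email library.
It does not call Spamassassin at all. The names DEFAULT_LIMIT,
exceeds_limit and build_message are my own, the addresses are
placeholders, and reading 500 kilobytes as 500 * 1024 bytes is
an assumption on my part. The attachment is filler bytes, not a
real PDF.

```python
from email.message import EmailMessage

# Assumption: "500 kilobytes" is read here as 500 * 1024 bytes.
DEFAULT_LIMIT = 500 * 1024


def exceeds_limit(raw_message, limit=DEFAULT_LIMIT):
    """Return True if the raw message is larger than the limit."""
    return len(raw_message) > limit


def build_message(attachment_size):
    """Build a short email with a filler attachment of the given size."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"
    msg["To"] = "recipient@example.com"
    msg["Subject"] = "See attached"
    msg.set_content("Please see the attached file.")
    # Filler bytes stand in for a PDF; the library base64-encodes them,
    # which makes the message larger than the attachment itself.
    msg.add_attachment(
        b"\0" * attachment_size,
        maintype="application",
        subtype="pdf",
        filename="offer.pdf",
    )
    return msg.as_bytes()


small = build_message(10 * 1024)
large = build_message(600 * 1024)

print(len(small), exceeds_limit(small))
print(len(large), exceeds_limit(large))
```

The message text is identical in both cases. Only the size of
the attachment decides whether the message lands over the limit.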

OK. That explains it. That explains to my satisfaction why
Spamassassin has been ignoring spam messages with PDF files
attached. The PDF files are too big to be examined.

I'll look into this more later.

Ed Abbott
