Tuesday, December 20, 2011

Why Does Spamassassin
Ignore Emails
with PDF Files Attached?

Why does Spamassassin fail to score emails
that have a PDF file attached? This was
the question I tried to answer for myself
yesterday.

I started doing a little bit of research and
I learned something interesting. It is not
the PDF file that is the problem; it is the
email length.

Spamassassin fails to markup emails that are
too long. That's the crux of the problem.

There's a reason for this:

The longer an email, the faster the
processing time for that email grows. Thus a 200
kilobyte email is likely to take 4 times as
long to process as a 100 kilobyte email.

Don't take my figures too literally. However,
my computer programming experience tells me
that processing time often grows in proportion
to the square of the file size.

Thus a file that is twice as long will tend
to take four times as long to process. There's
a simple reason for this. Everything in a file
tends to relate to everything else.

Let me give you a super-simple example. Let's
say a file consists of 3 lines. Since everything
relates to everything else (including itself), then
3 lines gives us 3 X 3 = 9 things to relate to. A
file with 3 lines creates 9 relationships.

In our 3 line file, line 1 relates to line 1. It
also relates to line 2 and 3. So far that's three
things to relate to. If you multiply all these
relationships out, 3 lines form 9 different relationships.

Now here's where quadratic processing time kicks
in. If you double the size of the file from 3 lines
to 6 lines, you now have 36 different relationships.

That is to say, 6 times 6 equals 36. In doubling the
file size, you've quadrupled the number of lines
in the file relating to each other. 6 X 6 = 36
is quadruple 3 X 3 = 9. That is to say 4 X 9 = 36.

This all comes back to processing time. Processing
time quadruples when you double the size of the file
being processed.
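
Here is a small shell sketch of the arithmetic above. The
function name relationships is my own invention for
illustration. It only multiplies the line count by itself;
it says nothing about how Spamassassin works inside.

```shell
# Count pairwise "relationships" (n times n) for a file of n lines.
# Hypothetical helper, for illustration only.
relationships() {
    echo $(( $1 * $1 ))
}

# Each doubling of the line count quadruples the result.
for lines in 3 6 12; do
    echo "$lines lines -> $(relationships "$lines") relationships"
done
# Prints:
# 3 lines -> 9 relationships
# 6 lines -> 36 relationships
# 12 lines -> 144 relationships
```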

Of course, this is an absurd oversimplification. Good
explanations are often absurd oversimplifications.

However, this does illustrate why Spamassassin's processing
time is not a linear one. Processing time is not linear
relative to file size.

In the real world, some Spamassassin rules probably are linear
and some are not. For example, if you write a spamassassin
rule that only looks for a single spammy word in the file,
such as the word discount, the relationship
is linear. That is to say, the search for the one-word
discount is probably linear and directly proportional
to the file size. Looking for the one-word discount
only takes as long as the file is long. Double the file
size when looking for one word and you double your processing
time. Simple.
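
To make the linear case concrete, here is a rough sketch
that uses grep rather than Spamassassin itself. The function
name count_lines_with is made up for this example. It reads
the text once from start to finish, which is why the work
grows in step with the size of the input.

```shell
# Count the lines that contain a word, ignoring case.
# Reads standard input in a single pass. Hypothetical helper.
count_lines_with() {
    grep -ci -- "$1"
}

# Two of these three lines mention the word discount.
printf 'Big discount today\nnothing here\nDISCOUNT again\n' |
    count_lines_with discount
# Prints: 2
```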

Keep in mind, though, that Spamassassin rules are quite
sophisticated at times. The rules are often just as sophisticated
as the spam they process. Therefore, the processing time that
Spamassassin rules require can grow much faster than the file size.

What is the default file size? How big does a spam email have
to be before Spamassassin gives up and decides not to process
that message at all?

This Spamassassin documentation seems to be saying that the
maximum file size for an email is 256MB:

Spamassassin Options

The above documentation reads as unclear to me. It
sounds like you can set the maximum message size to
anything you want. It also sounds like the default message
size is 500K.

However, it also says that the maximum message size is
256 megabytes. Big difference. I'm going to go ahead
and assume that this means that 256 megabytes is an absolute
limit and 500 kilobytes is the default limit. I can't think
of another way to read this, can you?

In any case, 256 megabytes is way too many megabytes to worry
about. A spam message this size would take many many minutes
just to load off the network as of this writing (December 2011).

I'd forget the 256 megabytes and focus on the 500 kilobytes.

The 500 kilobyte figure explains why PDF files are often ignored
by Spamassassin. Sending a PDF is a favorite spammer trick. Since
PDF files are so big, it is a technique for manipulating Spamassassin
into ignoring the spam message.

OK. That explains it. That explains to my satisfaction why
Spamassassin has been ignoring spam messages with PDF files
attached. The PDF files are too big to be examined.
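
Here is a rough way to check whether a saved message is over
the limit. This is only a sketch. I am assuming the 500 kilobyte
default applies on your system and that a kilobyte means 1024
bytes here. The names LIMIT_BYTES and too_big_to_scan are my
own and do not come from Spamassassin.

```shell
# Assumption: 500 KB is taken as 500 * 1024 bytes. Adjust to your setup.
LIMIT_BYTES=$((500 * 1024))

# Succeeds (exit status 0) when the file is larger than the limit.
# Hypothetical helper, not part of Spamassassin.
too_big_to_scan() {
    size=$(( $(wc -c < "$1") ))
    [ "$size" -gt "$LIMIT_BYTES" ]
}

# Example use, assuming a saved message called spam.mbox:
#   too_big_to_scan spam.mbox && echo "probably too big to be scanned"
```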

I'll look into this more later.

Ed Abbott

Tuesday, December 13, 2011

Body Versus Rawbody
in Spamassassin Rules

Currently I'm doing some research into the
difference between body and rawbody
when writing spamassassin rules. Here's
one explanation:

Rawbody or Body

It seems that body does a couple of things based
on the above explanation:

  1. It ignores HTML tags
  2. It goes beyond end-of-line
    boundaries

I make these suppositions based on what I read
at the above link.

It would seem that rawbody gives you the
ability to do a couple of things:

  1. Examine beginning and end-of-line relationships
  2. Examine HTML tags

An example of when you might want to examine
an end-of-line relationship is the occasion
when you choose to use an end-of-line anchor,
which is a dollar sign character in
regular expressions.

Here's a more complete explanation of the difference
between body and rawbody:

Rule Definitions and Privileged Settings

I've learned something valuable by experimenting
with these rules. I've learned that sometimes a
body rule has trouble crossing a line boundary
but other times it does not.

I've struggled with this for quite some time! In
doing a little research I think I may have found an
answer. It seems that body rules treat each HTML
paragraph as a line of its own. Here's where I
first came across this idea:

Add a New Rule Type:
Single-Line Body?


This helps me to understand why this rule crosses
line boundaries:

rawbody ED__FREE_EARN_MONEY_RAWBODY m|free.{0,1200}earn.{0,1200}money|is
describe ED__FREE_EARN_MONEY_RAWBODY The words 'free earn money' in body

But this rule does not:

body ED__FREE_EARN_MONEY_BODY  m|free.{0,1200}earn.{0,1200}money|is
describe ED__FREE_EARN_MONEY_BODY The words 'free earn money' in body  

More precisely, the above rawbody rule crosses
paragraph boundaries but the body rule does
not. When I say paragraphs, I mean HTML paragraphs.

I suspect I need to make better use of meta rules
in order to cross paragraph boundaries. Perhaps I need
an alphabetized list of spam words that I can then
use to trigger meta rules.

I might then have collections of spammy words that trigger
actual points. For example, the above rule goes after the
word free followed by the word earn followed
by the word money.

Only when these words are used together are they spammy words.
Each word in isolation is not spam. It's the words used
together to form bigger thoughts that makes them spammy.

Or maybe I should just stick with rawbody rules. I
think I can keep the processing time on these rules from
growing out of control by putting a 1200 character limit (or
whatever limit) on my regular expression patterns.

I prefer to use rawbody as seldom as possible because
some rawbodies are awfully long. However, in some cases,
rawbody may be the best solution.

The lesson for me in this is to find the right tool to
do the job. One tool will fit one job and another tool
will do a better job on another job.

Ed Abbott

Wednesday, December 7, 2011

Where sa-update Keeps the Latest Rules

When fighting spam, I like to know that
I'm working with the latest rules. The
sa-update command is fairly quiet
about what it is doing. I tend to run
sa-update and then wonder if anything
happened.

Here's where sa-update keeps the most
up-to-date rules on my system:

/var/lib/spamassassin/3.003001/updates_spamassassin_org/

In time, the version number, 3.003001, will be outdated.
This page tells you why:

Rule Updates

The timestamps on all the files in the above directory are the
same. All the timestamps are identical and just a few minutes
old.

As of this writing, it appears that sa-update updates
all the files in the directory by overwriting them. This
makes sense. Since all these files appear to be small text
files, perhaps this is as good an approach as any.

To observe sa-update in action, you might try this
command. The command includes the -D switch which
gives debug information:

sa-update -D

I had assumed that my rules were updated each time I retrieve
email with Kmail. I'm wrong! Does this mean I should
run sa-update just before I retrieve email? Perhaps so.

Also, I notice that I have to run sa-update as root or it does
not work. This makes sense.

Will I get less spam if I run sa-update more often? I'm
going to experiment to see.

Update: January 11, 2012

I'm back trying to fill in the holes in my knowledge
about sa-update. A question I've had all along
is "Does sa-update automatically update?"

Apparently not. It's apparent to me that in my current
Linux distribution, which is Debian Squeeze, sa-update
is a manual operation. I'm guessing that there is a way
to make it automatic, I just don't personally know how to
do it.

In reading the sa-update man page, I find that
there are 2 fundamental truths regarding the availability
of updates:

  1. If you run sa-update and no update is available,
    sa-update exits with an exit status of 1
  2. When an update does become available, running
    sa-update will give you an exit status of 0

So it all comes down to one or zero. So that's how
this thing works! I've been wondering about this
for quite some time.

In ancient times, when I was still writing Unix
shell scripts regularly, I knew that typing
echo $? at the command line prompt would
give you the exit status of the last command
typed.

Try this command sequence:

ls
echo $?
ls --invalidoption
echo $?

The first command, ls, gives you an exit
status of zero. The second command, ls --invalidoption,
gives you an exit status of something other than
zero.

Zero is OK and non-zero is not so OK.

Apparently this is how you determine whether or not
sa-update has an update for you. You type the following
2 commands:

  1. sa-update --checkonly
  2. echo $?

If the exit status is a zero, an update is available.
If the exit status is one, no update is available.
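
Here is a sketch that turns the exit status into plain words.
The function name explain_sa_update_status is my own. It
only knows about the two values described above and lumps
every other exit status together, since I have not studied
what the other values mean.

```shell
# Turn an sa-update exit status into a readable message.
# Hypothetical helper. Only 0 and 1 are interpreted here.
explain_sa_update_status() {
    case "$1" in
        0) echo "update available" ;;
        1) echo "no update available" ;;
        *) echo "error or other condition (status $1)" ;;
    esac
}

# Only call sa-update if it is installed on this machine.
if command -v sa-update >/dev/null 2>&1; then
    status=0
    sa-update --checkonly || status=$?
    explain_sa_update_status "$status"
fi
```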

I think I've finally figured out how to tell whether
or not sa-update actually did something. Run
sa-update --checkonly and check the exit status both
before and after you run sa-update itself.

If the exit status is zero before you run sa-update
and it is non-zero after you run sa-update, sa-update
actually did something.

Got it!

The lesson here seems to be if you dig deep enough, you
find the answer you are looking for.

Ed Abbott

Thursday, October 6, 2011

How to Filter Out
Foreign Language Email

I just learned something new. Lately
I've been receiving Russian spam of some
kind. Since I do not know Russian, I do
not know what the spam actually says.

This web page describes how foreign
language spam can be filtered out:

Mail::SpamAssassin::Conf -
SpamAssassin configuration file


I placed the following line in my
user_prefs file:

ok_locales en
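
If you would rather add the line from the command line than
with a text editor, something like the following should work.
This is only a sketch. The helper add_pref_once is a name I
made up, and the path in the usage comment is where
user_prefs lives on my system. Yours may differ.

```shell
# Append a line to a prefs file only if that exact line
# is not already there. Hypothetical helper.
add_pref_once() {
    prefs_file=$1
    line=$2
    touch "$prefs_file"
    grep -qxF -- "$line" "$prefs_file" || printf '%s\n' "$line" >> "$prefs_file"
}

# Example use (path assumed):
#   add_pref_once "$HOME/.spamassassin/user_prefs" "ok_locales en"
```

Running it twice leaves only one copy of the line in the file.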

Once I finished altering user_prefs,
I tested the result using the technique
described in this post:


How to
Test One Single Email
With Spamassassin


According to the above test, the
Russian spam triggered the following
rules:

CHARSET_FARAWAY_HEADER
MIME_CHARSET_FARAWAY

Such a simple thing! Setting
ok_locales triggered a
couple of rules in this case. Both
rules use the word faraway.

I like the word faraway. I
get both Chinese and Russian spam.
This spam is very faraway from my
desires.

The lesson for me in all of this
is that if you want to solve a
problem, look for a simple solution
first.

Filtering out languages that I don't
understand is a simple solution.

Update: November 9, 2011

For some reason, this does not always
work. It works for some foreign
language spam but not all foreign
language spam.

Why are they still getting through?
I'm not sure.

I suspect it has to do with utf-8.
Since utf-8 is a universal character
set, it may not be as easy to identify
utf-8 spam as other spam.

That's my theory as to why ok_locales
does not always seem to work.

Right now, it's just a theory. However,
in coming weeks I'll be looking to see if it
is consistently true that utf-8 foreign
language spam does not get filtered out.

Ed Abbott

Thursday, September 1, 2011

How to
Test One Single Email
With Spamassassin

Testing one single email with Spamassassin.
It sounds so simple. Why has it taken me
so long to discover how to do this?

I suppose it took me this long for two reasons.

  1. It never occurred to me to pipe the single
    email through spamassassin.
  2. I was slow figuring out how to use the
    man pages to discover spamassassin command
    line options

Of the two discoveries, the first was discovering
command line options. I write about this in a
previous post:

Spamassassin Options

Next, I discovered the delete and
test options. You can read about
these on the man page called
spamassassin-run.

Here's how I put it all together:

cat spam.mbox | spamassassin -dt >testresults.txt

Here are the steps that put it all
together:

  1. Save the spam email that you are
    interested in testing to a file called
    spam.mbox
  2. Run the spam email through
    spamassassin using the above pipe
  3. At the end of the pipe, save the
    results to a file called testresults.txt
  4. View testresults.txt with your
    favorite text editor

How do you save a spam email? My email client
is called kmail. With kmail, the spam
email is saved by using the file menu
in the upper left-hand corner of your screen.
The way you save a spam email to a file may
differ from the way I save a spam email to a
file.

When testing your single spam email, be sure
to include the -dt option. The -d
part of the option deletes spamassassin markup
that is already in the email and that may
confuse the issue.

The -t option says that this is just a
test and is not the real deal. Basically, you
are testing how spamassassin will respond to
a specific email rather than running spamassassin
for its ability to classify and categorize spam.

In other words, -t is theory instead of
actual practice. With -t you can test
your brand new spamassassin rule before
putting it into production.

Of course, you want to be sure the new rule has
correct syntax before doing any of this. The
command for testing a rule for syntax correctness
is:

spamassassin --lint
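
The two steps can be chained so that the test only runs when
the syntax check passes. This is a sketch, not a polished
script. The function name lint_then_test and the SA variable
are my own inventions. SA just holds the name of the command
to run and defaults to spamassassin.

```shell
# Name of the command to run. Defaults to spamassassin.
# The SA variable is a hypothetical convenience, not a Spamassassin feature.
SA=${SA:-spamassassin}

# Check rule syntax first, then test one saved email.
# $1 is the saved email, $2 is the file that receives the results.
lint_then_test() {
    "$SA" --lint || {
        echo "rule syntax problem, fix it before testing" >&2
        return 1
    }
    "$SA" -dt < "$1" > "$2"
}

# Example use:
#   lint_then_test spam.mbox testresults.txt
```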

It's nice to be able to immediately test a new
rule you've written for a specific spam email to
see how many points it will rack up. That's the
name of the game: racking up points.
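
If all you want is the number of points, a sketch like this
can pull it out of the results. I am assuming the results
contain a header line of the form
"X-Spam-Status: Yes, score=7.3 required=5.0 ...". If your
version writes the header differently, the pattern will need
adjusting. The function name extract_score is my own.

```shell
# Print the score from the first X-Spam-Status header on standard input.
# Assumption: the header contains "score=" followed by a number.
extract_score() {
    sed -n 's/^X-Spam-Status:.*score=\([-0-9.]*\).*/\1/p' | head -n 1
}

# Example with a made-up header line:
echo 'X-Spam-Status: Yes, score=7.3 required=5.0 tests=EXAMPLE_RULE' |
    extract_score
# Prints: 7.3
```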

The lesson? Sometimes it takes a long time to
discover the simplest little thing.

Being able to test a spam email for how many
points it will rack up is the simplest little
thing. Yet, it is very helpful to know how to
do this.

Update: February 7, 2012

I've since learned more about testing a single
email against spamassassin. I've learned that
it is probably better to run local tests only
when testing a single email.

What is a local test? It is a non-network test.
Some tests require network access. To turn
off tests that require network access, you
use the -L option.

If I understand correctly, the spamassassin -L option
will only run tests that are stored on your hard
drive. These tests include tests that you have
written and tests that have been written by others.

The tests that I have written are stored in this
file on my Debian Squeeze system:

~/.spamassassin/user_prefs

The tests that I did not write are stored here:

/var/lib/spamassassin/3.003001/updates_spamassassin_org/

It's when I run sa-update at the command line while
logged in as root that I acquire rules that I did not write
in the above directory. The point? Generally speaking, rules
that are stored on my hard drive are considered local rules.

The rules that are not local are the ones that require a
network access. In my rather limited experience, network
rules are rules that necessitate a lookup in a blocklist of
some kind. Perhaps there are other kinds of rules that
require a network access that I do not know about.

Generally speaking, blocklists are spammy IP addresses
that have been used to send spam in the past. If I
understand right, overuse of blocklist lookups can
get you categorized as a commercial user who is supposed
to pay for these lookups.

Using the spamassassin -L option can help you to
avoid excessive lookups in various blocklists. Therefore,
when I test a single email, I now add the -L option
like this:

cat spam.mbox | spamassassin -dtL >testresults.txt

Note that the above command line is the same as the
one I published up above a few months ago except that
the -L option is now present.

The lesson? No matter how much you learn about something,
there's always something else to know.

Ed Abbott

Friday, August 26, 2011

Spamassassin Options

I'm only now figuring out all the
options Spamassassin has available
on the command line. My mistake?
Using the following command to find
options:


man spamassassin

What I should have been typing is this:

man spamassassin-run

More and more I'm finding that other commands
are organized like Spamassassin. As the world
gets more complex, commands are organized into
families.

Spamassassin is a family of commands, not just
one command. OK. Now I know where to look for
Spamassassin command line options.

To find out about the family of spamassassin
commands, type this:

man spamassassin

To find out about spamassassin commandline
options, type this:

man spamassassin-run

The lesson? To get where you are going, it
helps to know how to get there.


Ed Abbott