difference between body and rawbody
when writing SpamAssassin rules. Here's
one explanation:
Rawbody or Body
It seems that body does a couple of things based
on the above explanation:
- It ignores HTML tags
- It goes beyond end-of-line boundaries
I make these suppositions based on what I read
at the above link.
It would seem that rawbody gives you the
ability to do a couple of things:
- Examine beginning and end-of-line relationships
- Examine HTML tags
An example of when you might want to examine
an end-of-line relationship is the occasion
when you choose to use an end-of-line anchor,
which is a dollar sign character in
regular expressions.
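As a rough sketch of that idea, a rawbody rule with an end-of-line
anchor might look like the following. The rule name and the word being
matched are made up for illustration. The m modifier is standard Perl:
it lets the dollar sign match just before each line break rather than
only at the very end of the text being tested.

```
# Hypothetical rule: the word 'money' at the end of a line in the raw body.
# /m lets $ match before each line break; /i ignores case.
rawbody  ED__MONEY_AT_EOL_RAWBODY  /money\s*$/im
describe ED__MONEY_AT_EOL_RAWBODY  The word 'money' at the end of a line
```

A body rule looks like a poor fit for this kind of test if, as
supposed above, body ignores end-of-line boundaries.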
Here's a more complete explanation of the difference
between body and rawbody:
Rule Definitions and Privileged Settings
I've learned something valuable by experimenting
with these rules. I've learned that sometimes a
body rule has trouble crossing a line boundary
but other times it does not.
I've struggled with this for quite some time! In
doing a little research I think I may have found an
answer. It seems that body rules treat each HTML
paragraph as its own line. Here's where I
first came across this idea:
Add a New Rule Type:
Single-Line Body?
This helps me to understand why this rule crosses
line boundaries:
rawbody ED__FREE_EARN_MONEY_RAWBODY m|free.{0,1200}earn.{0,1200}money|is
describe ED__FREE_EARN_MONEY_RAWBODY The words 'free earn money' in body
But this rule does not:
body ED__FREE_EARN_MONEY_BODY m|free.{0,1200}earn.{0,1200}money|is
describe ED__FREE_EARN_MONEY_BODY The words 'free earn money' in body
More precisely, the above rawbody rule crosses
paragraph boundaries but the body rule does
not. When I say paragraphs, I mean HTML paragraphs.
I suspect I need to make better use of meta rules
in order to cross paragraph boundaries. Perhaps I need
an alphabetized list of spam words that I can then
use to trigger meta rules.
I might then have collections of spammy words that trigger
actual points. For example, the above rule goes after the
word free followed by the word earn followed
by the word money.
Only when these words are used together are they spammy words.
Each word in isolation is not spam. It's the words used
together to form bigger thoughts that makes them spammy.
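A minimal sketch of that meta rule idea might look like the following.
The rule names and the score are assumptions made up for illustration,
not tested rules. In SpamAssassin, a rule whose name starts with two
underscores is a sub-rule that scores nothing by itself, so only the
meta rule adds points.

```
# Hypothetical sub-rules, one per word. Names starting with __ carry no score.
body     __ED_WORD_FREE   /\bfree\b/i
body     __ED_WORD_EARN   /\bearn\b/i
body     __ED_WORD_MONEY  /\bmoney\b/i

# The meta rule fires only when all three sub-rules have matched.
meta     ED__FREE_EARN_MONEY_META  (__ED_WORD_FREE && __ED_WORD_EARN && __ED_WORD_MONEY)
describe ED__FREE_EARN_MONEY_META  The words 'free', 'earn' and 'money' all in body
# Placeholder score, chosen only for the example.
score    ED__FREE_EARN_MONEY_META  0.5
```

Unlike the single-pattern rules above, this sketch checks neither the
order of the words nor the distance between them. That is the price
of letting each word match in a different paragraph.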
Or maybe I should just stick with rawbody rules. I
think I can keep the processing time of these rules from
growing out of control by putting a 1200-character limit (or
whatever limit) on my regular expression patterns.
I prefer to use rawbody as seldom as possible because
some rawbodies are awfully long. However, in some cases,
rawbody may be the best solution.
The lesson for me in this is to find the right tool to
do the job. One tool will fit one job and another tool
will do a better job on another.
Ed Abbott