Re: [Ifile-discuss] Mailing List Filtering


From: William E. Kempf
Subject: Re: [Ifile-discuss] Mailing List Filtering
Date: Thu, 6 Mar 2003 09:18:37 -0600 (CST)

clemens fischer said:
> Jack Bertram <address@hidden>:
>
>> So you store many different ifile categories in the same folders on
>> disk?
>
> what makes you say that?  i was confused about your usage of folders,
> that's all.  i prefer naming ifiles database "classes" categories.

Take a look at Mr. Bertram's web site and study his scripts
(http://www.jbertram.net/projects/ifile/ifile.html).  Here's a description
of the flow:

1) Mail comes in to the system and is passed to procmail.

2) Procmail calls a script which does an ifile query/learn to categorize
the e-mail, and places the result in an X-Ifile-Hint header in the message.

3) Procmail next looks at the same message for the X-Ifile-Hint header and
stores the message in the "folder" it names.

4) Periodically a cron job runs another script that scans all of the mail
and compares the X-Ifile-Hint contents against the location of the e-mail,
and "relearns" the message if they differ.

The key is in (3) above.  The ifile "category" *IS* the mail system's
"folder".  In my case I use Maildir format, so spam messages get an
X-Ifile-Hint of .Spam, which corresponds with ~/Maildir/.Spam on my HD.
Mr. Bertram uses MBox format, so for him it might be just spam, which
would correspond to the MBox file ~/Mail/spam.  So, again, the "category"
*IS* the "folder".
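To make the category-equals-folder point concrete, here is a minimal Python
sketch of the mapping.  The function name and the "layout" argument are my
own invention for illustration; only the two paths come from the examples
above.

```python
import posixpath


def folder_for_hint(hint, layout, home):
    """Map an X-Ifile-Hint value to the on-disk folder it names.

    `layout` is a hypothetical switch, not something ifile or procmail
    defines: "maildir" for the ~/Maildir setup, "mbox" for ~/Mail.
    """
    if layout == "maildir":
        # e.g. hint ".Spam" -> ~/Maildir/.Spam
        return posixpath.join(home, "Maildir", hint)
    if layout == "mbox":
        # e.g. hint "spam" -> ~/Mail/spam
        return posixpath.join(home, "Mail", hint)
    raise ValueError("unknown layout: %r" % (layout,))
```

Because the hint is used directly as a path component, nothing else is
needed to translate a category into a folder.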

The best idea I've heard so far is to modify step 2 above, as follows:

2) Procmail calls a script which does the following:

2.1) Calls ifile with the message body (and optionally header as well),
which will categorize the message as either Ham or Spam (as well as learn
this information into the ifile database).

2.2) If the category is Spam, we're done, skip to step (3).

2.3) Create a unique string based on various headers (I'll use a list of
possible headers, sorted in descending order of how likely they are to
identify the list, and will likely include the From: header, both to pick
up those few lists that don't use any other headers and to categorize
personal e-mail from some folks). The unique string will be based on the
address found in the chosen header, with all punctuation stripped (and
possibly with some unique characters appended, just to ensure it can't
clash with any other occurrence of the "word").
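A rough sketch of step 2.3 in Python might look like the following.  The
header list, its order, and the suffix are placeholders I made up; the
real list would need tuning against actual list traffic.

```python
import email
import re

# Hypothetical preference order; From: is last so it only acts as a fallback.
CANDIDATE_HEADERS = ["List-Post", "Sender", "From"]
# Hypothetical suffix, appended so the token cannot match an ordinary word.
TOKEN_SUFFIX = "xqzlist"

ADDRESS_RE = re.compile(r"[\w.+-]+@[\w.-]+")


def list_token(raw_message):
    """Return a punctuation-free token for the first usable header, or None."""
    msg = email.message_from_string(raw_message)
    for name in CANDIDATE_HEADERS:
        value = msg.get(name)
        if not value:
            continue
        match = ADDRESS_RE.search(value)
        if match is None:
            continue
        # Lower-case the address and strip everything but letters and digits.
        stripped = re.sub(r"[^a-z0-9]", "", match.group(0).lower())
        if stripped:
            return stripped + TOKEN_SUFFIX
    return None
```

With these placeholder settings, a message carrying
"List-Post: <mailto:ifile-discuss@nongnu.org>" yields the single "word"
ifilediscussnongnuorgxqzlist, which is what step 2.4 would hand to ifile.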

2.4) Call ifile again to query/learn a category for this message based
solely on this unique string.

2.5) Place the final category in X-Ifile-Hint (which will either be Spam,
or the folder in which we want the message to go).

Maybe there should be another step after 2.4 to categorize the e-mail
based on the message body (and possibly header) if 2.4 didn't find a
likely category?
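Put together, steps 2.1 through 2.5 plus the optional fallback could be
sketched like this.  The three classifier arguments are hypothetical
callables that would wrap the actual ifile query/learn calls; I'm passing
them in so the sketch doesn't assume any particular ifile command line.
The "inbox" default is also just an assumption.

```python
def choose_hint(body_text, token, classify_spam, classify_folder,
                classify_body=None, default_folder="inbox"):
    """Return the value to place in the X-Ifile-Hint header.

    classify_spam(text)   -> "Spam" or "Ham"              (step 2.1)
    classify_folder(tok)  -> folder name, or None         (step 2.4)
    classify_body(text)   -> folder name, or None         (optional fallback)
    All three are hypothetical wrappers supplied by the caller.
    """
    # Steps 2.1 and 2.2: spam is final, no folder lookup needed.
    if classify_spam(body_text) == "Spam":
        return "Spam"

    # Step 2.4: categorize solely on the unique header-derived token.
    folder = classify_folder(token) if token else None

    # Optional extra step: fall back to the message body if no folder found.
    if folder is None and classify_body is not None:
        folder = classify_body(body_text)

    # Step 2.5: the caller writes this value into X-Ifile-Hint.
    return folder if folder is not None else default_folder
```

Keeping the spam decision and the folder decision as separate calls
mirrors the two ifile passes described above.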

There will also have to be a similar change to how step 4 works, so that
messages are relearned correctly for both Ham/Spam and the final
category.
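One way the changed step 4 might decide what to relearn is sketched below.
The "spam_db" and "folder_db" labels are names I made up for the two
learning passes; the actual relearn would still be done by calling ifile.

```python
def relearn_actions(hint, actual_folder, spam_name="Spam"):
    """List the relearn steps needed when a message was moved by hand.

    `hint` is the X-Ifile-Hint value, `actual_folder` is where the message
    now lives.  Returns (pass, category) pairs, where the pass labels
    "spam_db" and "folder_db" are hypothetical names for the two passes.
    """
    actions = []
    if hint == actual_folder:
        return actions  # classification was right, nothing to relearn

    hint_is_spam = hint == spam_name
    actual_is_spam = actual_folder == spam_name

    # Relearn Ham/Spam only if the message crossed that boundary.
    if hint_is_spam != actual_is_spam:
        actions.append(("spam_db", "Spam" if actual_is_spam else "Ham"))

    # Relearn the final category whenever the message ended up as ham.
    if not actual_is_spam:
        actions.append(("folder_db", actual_folder))
    return actions
```

For example, a message hinted as Spam but found in .Lists.ifile needs both
a Ham relearn and a folder relearn, while a message moved between two ham
folders only needs the second.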

Anyone spot any flaws in this logic, or further steps that should be
taken?  This shouldn't be too difficult to implement in a Python script (I
dislike Perl), so I'll likely take this route.  If there's interest, I can
share the results when I'm done.

-- 
William E. Kempf





