[Ifile-discuss] Re: Updated ifile writeup


From: Karl Vogel
Subject: [Ifile-discuss] Re: Updated ifile writeup
Date: 19 Feb 2003 15:12:27 -0500

>> On 18 Feb 2003 10:26:13 +0100, 
>> "clemens fischer" <address@hidden> said:

C> a spam-corpus this large definitely deserves special care.  did you
C> think about making it a sourceforge project?  there are several sf
C> projects about bayesian text-classifiers, and some of them have links to
C> spam corpora, but there's no project collecting spam systematically.

   That has potential.  My private collection doesn't grow by leaps and
   bounds, but it does grow.  The net-abuse collection just mirrors the
   "net-abuse" newsgroup, and I haven't been grabbing that lately, so I
   don't know how "systematic" this really is.

   Also, stuff posted to "net-abuse" is often munged by the sender enough
   to throw off ifile, unless I do something to clean up the message first.
   That's why I don't automatically include that collection when generating
   a new idata file.

   Does anyone on this list have a spamtrap running?  I have enough
   disk space on one of my local systems to collect spam, but not at my
   ISP.  Raw messages are best; I don't need any header fields except To:,
   Date:, and Message-ID:, and I can fake those if necessary.
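
   For anyone wondering what faking those three fields might look like,
   here is a minimal sketch in C.  Everything in it is an assumption made
   for illustration: the function name fake_headers(), the serial-number
   scheme for Message-ID:, and the example.invalid addresses are all
   hypothetical, and nothing here is taken from ifile or from any existing
   script.

```c
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper: write minimal To:, Date: and Message-ID: header
 * lines into buf.  The date is formatted in UTC; the day and month names
 * come from strftime(), so this assumes the default "C" locale.
 * Returns the snprintf() result, or -1 if the date cannot be formatted.
 */
int fake_headers(char *buf, size_t bufsz, const char *to,
                 time_t when, unsigned long serial, const char *host)
{
    char date[64];
    struct tm *tm = gmtime(&when);

    if (tm == NULL ||
        strftime(date, sizeof date, "%a, %d %b %Y %H:%M:%S +0000", tm) == 0)
        return -1;

    /* Message-ID is built from the timestamp plus a caller-supplied serial. */
    return snprintf(buf, bufsz,
                    "To: %s\nDate: %s\nMessage-ID: <%lu.%lu@%s>\n",
                    to, date, (unsigned long) when, serial, host);
}
```

   The caller would print the resulting lines ahead of the raw message,
   bumping the serial number for each message so the IDs stay unique.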

C> and btw, what is "gtaylor" spam?

   That's the stuff from Grant Taylor's spam archive at
   http://www2.picante.com:81/~gtaylor/download/spam.tar.gz

   I keep any usable collection with more than a few thousand messages
   separate from the stuff I collect.  "Keep your namespace clean", and all
   that...

   The most recent spam trick that seems to beat ifile is inserting stuff
   that looks like HTML (but often isn't) in the middle of a message,
   presumably so your browser or mailreader will drop it on the floor:

      Prepare for the prof<!--vogelke-->essional advancement you deserve!

   I have a short C program to strip all HTML tags from a file, so my next
   change will be to add that program as a new filtering stage before ifile
   is run.
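
   A tag stripper of this kind can be very small.  The sketch below is not
   the program mentioned above, only an illustration of the idea: it
   assumes that anything from a '<' through the next '>' is a tag and
   drops it.  The name strip_tags() and its buffer-based interface are
   hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: copy `in` to `out`, dropping everything from a
 * '<' through the next '>'.  Writes at most outsz - 1 characters plus a
 * terminating NUL; returns the number of characters written, not
 * counting the NUL.  Output that does not fit is silently discarded.
 */
size_t strip_tags(const char *in, char *out, size_t outsz)
{
    size_t n = 0;
    int in_tag = 0;     /* nonzero while between '<' and '>' */

    assert(in != NULL && out != NULL && outsz > 0);

    for (; *in != '\0'; in++) {
        if (in_tag) {
            if (*in == '>')
                in_tag = 0;         /* tag ends; resume copying */
        } else if (*in == '<') {
            in_tag = 1;             /* tag starts; stop copying */
        } else if (n + 1 < outsz) {
            out[n++] = *in;
        }
    }
    out[n] = '\0';
    return n;
}
```

   Run on the example line above, this removes "<!--vogelke-->" and rejoins
   "professional" into a single word.  A filter version would keep the
   same in_tag flag while looping over standard input one character at a
   time.  The sketch has known limits: a comment containing a '>' ends
   early, and a stray '<' in plain text swallows everything up to the next
   '>'.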

-- 
Karl Vogel                      I don't speak for the USAF or my company
address@hidden                          http://www.pobox.com/~vogelke

That married couples can live together day after day is a
miracle that the Vatican has overlooked.                      --Bill Cosby
