Hi,
On Thu, 10 Dec 2009 20:28:27 +0100, Johannes Bauer wrote:
Eduardo M KALINOWSKI wrote:
On Thu, 10 Dec 2009, Johannes Bauer wrote:
I'm thinking about filtering all such encoded subjects (as there's no reason to encode them as US-ASCII), but suppose it were UTF-8 or something: how can I filter on the actual content, not the encoded subject? Surely someone has solved that problem already?
Yes, such as the guys behind SpamAssassin, or dspam, or any of the many spam filtering programs that exist. Actually, they make much more complicated decisions instead of only looking for bad words in the subject field. I'd suggest you try installing one of them.
I had SpamAssassin running once and was pretty disappointed. All those complicated rules and scoring and "smart" Bayesian filtering did not work very well, although I had trained it on around 50k mails to tell right from wrong. I had both lots of false positives and lots of false negatives, which was kind of annoying.
However, analyzing 274 spam mails I deleted in the last 5 months I can conclude that by using that extremely simple filter list I'd catch 258 of them (that's 94%). So I'd like to stick to KISS in this case.
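On the question above of filtering on the actual content rather than the encoded subject: a minimal sketch in Python, assuming the filter is a plain keyword list. The names BAD_WORDS, decoded_subject and subject_is_spam are made up for illustration, and the keywords are placeholders, since the real filter list is not shown in this thread. The decoding itself uses the standard library's email.header module, which handles RFC 2047 encoded-words regardless of charset.

```python
from email.errors import HeaderParseError
from email.header import decode_header, make_header

# Placeholder keywords (assumption); the real filter list is not in the thread.
BAD_WORDS = ("viagra", "replica watches")


def decoded_subject(raw_subject: str) -> str:
    """Turn RFC 2047 encoded-words (=?charset?B?...?= or ?Q?) into plain text."""
    try:
        return str(make_header(decode_header(raw_subject)))
    except (LookupError, UnicodeError, HeaderParseError):
        # Unknown charset or malformed encoding: fall back to the raw header.
        return raw_subject


def subject_is_spam(raw_subject: str) -> bool:
    """Match the keyword list against the decoded, lower-cased subject."""
    text = decoded_subject(raw_subject).lower()
    return any(word in text for word in BAD_WORDS)
```

With this, the same list matches whether the subject arrives as plain text, as =?US-ASCII?B?...?= or as UTF-8, so there is no need to reject encoded subjects wholesale.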
That must have been a configuration issue - SpamAssassin works pretty well, if configured correctly - but I admit, it's a monster (both in terms of configuration and resource usage).
You could go for bogofilter (purely Bayesian). I've been using it for years on my private mail server with very good results. I like to use the tri-state filtering, where there is not only one threshold value, but two. Mail with a certainty of being spam ("bogosity") of 0.35 or below goes into my inbox, mail with a bogosity between 0.35 and 0.65 goes into Spam/Unsure, and everything above 0.65 goes directly into Spam. That way I have something like 10-20 mails per week in Spam/Unsure that are usually false negatives, rarely false positives (currently around 1000 mails per week end up in Spam). To my knowledge there has never been a false positive in Spam.
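To make the two-threshold idea concrete, here is a small sketch of the decision only, not of how bogofilter implements it. The function name target_folder and the folder name "INBOX" are assumptions for illustration; the cutoff values are the ones mentioned above. In practice bogofilter can apply such cutoffs itself through its configuration, and the delivery agent only needs to act on the resulting classification.

```python
HAM_CUTOFF = 0.35   # at or below this: treated as ham
SPAM_CUTOFF = 0.65  # above this: treated as spam


def target_folder(bogosity: float) -> str:
    """Map a spam probability in [0, 1] to one of three mailboxes."""
    if not 0.0 <= bogosity <= 1.0:
        raise ValueError("bogosity must be between 0 and 1")
    if bogosity <= HAM_CUTOFF:
        return "INBOX"
    if bogosity <= SPAM_CUTOFF:
        return "Spam/Unsure"
    return "Spam"
```

The wide middle band is a deliberate trade-off: a few mails per week need a manual look, in exchange for keeping borderline mail out of both the inbox and the Spam folder.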
Of course initial training is necessary. For ongoing training / feedback I have set up a Spam/Learn-Spam and a Spam/Learn-Ham mailbox into which I move false negatives and false positives, respectively. A cron script then runs the mails found in those (maildir) mailboxes through bogofilter again, with the command-line option for registering the mail as spam or ham, and afterwards moves them to the correct mailbox (Spam or inbox). This works well in all MUAs, because it only requires IMAP functionality to train the filter.
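This is not my actual cron script, just a rough Python sketch of the same idea, assuming maildir folders on the server. The function names and the example paths in the comments are hypothetical. bogofilter's -s and -n options register a message as spam and as non-spam respectively; if the filter runs with auto-update, the correcting forms -Ns and -Sn (which also undo the earlier registration) are probably the better choice, so check the man page for your setup.

```python
import mailbox
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str], bytes], None]


def run_bogofilter(flags: Sequence[str], message: bytes) -> None:
    """Feed one raw message to bogofilter on stdin (needs bogofilter installed)."""
    subprocess.run(["bogofilter", *flags], input=message, check=True)


def relearn(src_dir: str, dest_dir: str, flags: Sequence[str],
            runner: Runner = run_bogofilter) -> int:
    """Train on every mail in src_dir, then move it to dest_dir.

    Returns the number of mails processed.
    """
    src = mailbox.Maildir(src_dir, factory=None, create=False)
    dest = mailbox.Maildir(dest_dir, factory=None, create=True)
    moved = 0
    for key in list(src.keys()):
        raw = src.get_bytes(key)
        runner(flags, raw)   # raises if bogofilter fails, so the mail stays put
        dest.add(raw)        # copy to the destination first...
        src.remove(key)      # ...then delete it from the learn folder
        moved += 1
    return moved


# Example calls from a cron job (paths are hypothetical):
#   relearn("/home/me/Maildir/.Spam.Learn-Spam", "/home/me/Maildir/.Spam", ["-s"])
#   relearn("/home/me/Maildir/.Spam.Learn-Ham", "/home/me/Maildir", ["-n"])
```

Note that copying through dest.add() resets the maildir flags (read, replied and so on); a production script would more likely rename the file within the same filesystem to preserve them.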
The solution was inspired by a Gentoo Wiki article (http://www.gentoo-wiki.info/Bogofilter).
Patrick.
-- STAR Software (Shanghai) Co., Ltd. http://www.star-group.net/ Phone: +86 (21) 3462 7688 x 826 Fax: +86 (21) 3462 7779
PGP key E883A005 https://stshacom1.star-china.net/keys/patrick_nagel.asc Fingerprint: E09A D65E 855F B334 E5C3 5386 EF23 20FC E883 A005