On 12/10/2009 2:28 PM, Johannes Bauer wrote:
Eduardo M KALINOWSKI wrote:
On Thu, 10 Dec 2009, Johannes Bauer wrote:
I'm thinking about filtering all such encoded subjects (as there's no reason to encode them as US-ASCII), but suppose it were UTF-8 or something: how can I filter on the actual content, not the encoded subject? Surely someone has solved that problem already?
Yes, such as the guys behind SpamAssassin, or dspam, or any of the many spam filtering programs that exist. Actually, they make much more complicated decisions instead of only looking for bad words in the subject field. I'd suggest you try installing one of them.
I had SpamAssassin running once and was pretty disappointed. All those complicated rules and scoring and "smart" Bayesian filtering did not work very well, although I trained it on around 50k mails to tell right from wrong. I had both lots of false positives and lots of false negatives, which was kind of annoying.
However, after analyzing the 274 spam mails I deleted in the last 5 months, I can conclude that this extremely simple filter list would have caught 258 of them (that's 94%). So I'd like to stick to KISS in this case.
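For the simple word-list approach, one way to filter on the actual content rather than the encoded subject is to decode the RFC 2047 encoded-words first. The following is only a minimal sketch using Python's standard `email.header` module. The `BLOCKED_WORDS` list and the function names are made up for illustration; the real filter list is not shown in this thread.

```python
from email.errors import HeaderParseError
from email.header import decode_header, make_header

# Hypothetical word list; the actual filter list is not given in the thread.
BLOCKED_WORDS = ("viagra", "replica")


def decoded_subject(raw_subject: str) -> str:
    """Return the subject with RFC 2047 encoded-words decoded to text."""
    try:
        return str(make_header(decode_header(raw_subject)))
    except (HeaderParseError, UnicodeDecodeError, LookupError):
        # Malformed encoding, undecodable bytes, or an unknown charset:
        # fall back to matching against the raw header value.
        return raw_subject


def is_blocked(raw_subject: str) -> bool:
    """Case-insensitive substring match against the decoded subject."""
    subject = decoded_subject(raw_subject).lower()
    return any(word in subject for word in BLOCKED_WORDS)
```

With this, `=?US-ASCII?Q?Cheap_Viagra?=` and `=?UTF-8?Q?G=C3=BCnstige_Replica?=` are both matched on their decoded text, so the charset used for the encoding no longer matters to the filter.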
From what I've seen, SA has been extremely good and accurate for us. We use amavisd-new as the interface to it, but SA is at the end of a long chain of checks.
Between the (3) HELO checks, clamav-milter, and an SPF policy daemon, we're killing ~60% of all connections at SMTP time. (I analyzed that in November: instead of 65/day hitting my inbox, I would've seen 6x that amount if it weren't for those checks. So ~80% of all spam was getting blocked at SMTP time.) If we were to pay for the Spamhaus Zen list, we could probably boost that percentage to 90%.
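The ~80% figure can be checked from the numbers quoted. This is just the arithmetic written out, reading "6x that amount" as 6 times 65 messages per day; the variable names are only for illustration.

```python
# Figures from the post; "6x that amount" is read here as 6 * 65 per day.
delivered_per_day = 65
multiplier_without_checks = 6

would_have_arrived = delivered_per_day * multiplier_without_checks  # 390
blocked_per_day = would_have_arrived - delivered_per_day            # 325
blocked_fraction = blocked_per_day / would_have_arrived

print(f"{blocked_fraction:.0%} blocked at SMTP time")  # prints "83% blocked at SMTP time"
```

That works out to roughly 83%, in line with the ~80% stated. Note that this counts spam messages, while the ~60% figure counts rejected connections, so the two percentages are not directly comparable.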
All of the domains we do business with get a -2 or -4 score using amavisd-new. Specific addresses get a larger negative score. I ran a few thousand spam & ham messages through the SA Bayes filter, then turned it on. We tag messages with a [spam] flag at 5.0 and quarantine at 9.0. Tagged messages go to the user's Inbox; quarantined messages get sieve'd into a sub-folder in the user's mailbox.
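The scoring policy described above can be sketched roughly as follows. This is not amavisd-new configuration, only an illustration of the logic in Python. The 5.0 and 9.0 thresholds and the -2/-4 domain adjustments come from the post; the example domains, the address, its -6.0 value, and the rule that the most specific match wins are all assumptions.

```python
TAG_THRESHOLD = 5.0         # from the post: add the [spam] flag
QUARANTINE_THRESHOLD = 9.0  # from the post: sieve into a sub-folder

# Hypothetical sender adjustments; the post only says known domains get
# -2 or -4 and specific addresses get "a larger negative score".
DOMAIN_ADJUST = {"partner.example": -2.0, "supplier.example": -4.0}
ADDRESS_ADJUST = {"boss@partner.example": -6.0}


def adjusted_score(raw_score: float, sender: str) -> float:
    """Apply the sender adjustment (assumption: most specific match wins)."""
    sender = sender.lower()
    if sender in ADDRESS_ADJUST:
        return raw_score + ADDRESS_ADJUST[sender]
    domain = sender.rpartition("@")[2]
    return raw_score + DOMAIN_ADJUST.get(domain, 0.0)


def disposition(raw_score: float, sender: str) -> str:
    """Return 'quarantine', 'tag', or 'deliver' for a scored message."""
    score = adjusted_score(raw_score, sender)
    if score >= QUARANTINE_THRESHOLD:
        return "quarantine"
    if score >= TAG_THRESHOLD:
        return "tag"
    return "deliver"
```

For example, a message scoring 6.0 from an unknown sender would be tagged, while the same score from `partner.example` drops to 4.0 and is delivered untouched.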
So far (in a month), no false positives. Or at least none that people have complained were quarantined when they should not have been. I'm considering lowering the quarantine threshold next month.
It's been nice to have my Inbox back, without 65 spams/day cluttering it up. Now I might see 2-5 per day that slip through without getting tagged as borderline spam (at 5.0 or higher). Those are mostly zero-day spam that haven't made it to the URIBLs or DNSBLs yet.
I'm still debating grey-listing, Razor, DCC or paying for the Spamhaus Zen list.
Compared to another, commercial, product that we were using a few years ago, SA is very, very good. It's not perfect, but it really does a good job of classifying things with decent accuracy.