Jan Kundrát wrote:
Marc Perkel wrote:
For example, suppose a new message comes in and the sender's address matches mail in 100 people's spam folders and none in any other folder. It can be classified as spam. If, however, the From address matches ham in people's folders and no spam, then you can probably deliver it without spam scanning.
It's called auto-whitelisting and smart spam scanners should do that.
Actually, auto white listing is any one of a number of techniques used to eliminate false positives from "known parties". I use one in camram where anyone you send e-mail to is automatically white listed. To distinguish that from the often confusing auto white listing terminology, I call it a "friends list". It works exceedingly well, and I haven't had any significant problems even when the site has been infected with zombies. With any automatic white listing tool, you need human feedback that says "this is spam". That feedback enables automatic removal of the entry from the auto white list and blacklisting of the IP address the message came from (you did preserve the source IP address as a new header in the message, didn't you?).
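A minimal sketch of the friends-list idea in Python may make the feedback loop concrete. This is not camram's actual code: the class name, method names, and in-memory sets are all hypothetical, and a real system would persist this state and read the source IP from the preserved header.

```python
class FriendsList:
    """Hypothetical sketch of a 'friends list' auto white list."""

    def __init__(self):
        self.friends = set()   # addresses the user has sent mail to
        self.bad_ips = set()   # source IPs of messages reported as spam

    def record_outgoing(self, recipient):
        # Anyone the user sends e-mail to is white listed automatically.
        self.friends.add(recipient.lower())

    def is_friend(self, sender):
        return sender.lower() in self.friends

    def report_spam(self, sender, source_ip):
        # Human feedback: drop the entry and blacklist the source IP.
        self.friends.discard(sender.lower())
        self.bad_ips.add(source_ip)
```

The point of the design is that the white list is only ever as trustworthy as the "this is spam" feedback that prunes it.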
The analysis technique originally suggested is classically naïve. A technique I'm playing with that appears to work much better is to use the output of the content filter to predict whether a message is good or bad. All of the bad messages are placed into a dumpster and expired after five days. If a message is left in the dumpster, the IP address is listed as a "bad source".
Any message that passes the content filter, friends filter, or spam filter is recorded as a "good source". If the ratio of good source to bad source drops below 80%, the site is listed as contaminated and its messages are automatically dumped in the spam trap for human analysis. If the ratio drops below 40%, it's listed as spam and all messages are brown listed.
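The thresholds above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: it reads "ratio" as the share of good messages among all messages seen from a source IP, and the class name, status labels, and per-IP counters are hypothetical.

```python
from collections import defaultdict

CONTAMINATED_BELOW = 0.80  # below this share, route to the spam trap
SPAM_BELOW = 0.40          # below this share, treat the source as spam


class SourceReputation:
    """Hypothetical per-IP tally of good and bad messages."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # ip -> [good, bad]

    def record(self, ip, good):
        self.counts[ip][0 if good else 1] += 1

    def status(self, ip):
        good, bad = self.counts.get(ip, (0, 0))
        total = good + bad
        if total == 0:
            return "unknown"
        share = good / total  # assumed meaning of the good/bad "ratio"
        if share < SPAM_BELOW:
            return "spam"
        if share < CONTAMINATED_BELOW:
            return "contaminated"
        return "clean"
```

A source with seven good messages and three bad ones has a 70% share and would land in "contaminated", so its mail goes to the spam trap for a human to look at.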
The main downside of this technique is that it does increase the workload for the user (more content in the spam trap), and it does seem to work better if you have multiple sources feeding the good/bad ratio analysis.
My two cents' worth.