On Wednesday 15 Dec 2004 4:48 pm, John Peacock wrote:
Hauke Fath wrote:
While this of course depends on your definition of "larger", some people seem to think otherwise:
Not having a Usenix login, I cannot comment on the full paper, but to
The full paper seems to be available there in HTML (despite the 'before November 2005' comment - whoops).
The use of a single wordlist is appropriate only in limited circumstances. Even in a corporate environment like the one I manage, there is a very wide range of definitions of what constitutes spam, and a configuration such as the one described above wouldn't work here. It would work even less well in an ISP environment, with its widely varied userbase.
Oh I don't know - we could probably filter our clients' spam quite easily with a single word list; real pharmacists don't obfuscate drug names very often. It would obviously lose some accuracy, and if tuned to avoid false positives it would let more spam through, but that wouldn't stop it being a very effective spam filter. I think the SpamAssassin approach of statistically weighing several inputs is better here anyway - over-reliance on content will always lead to false positives.
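To make the weighting point concrete, here's a very rough sketch in Python - the rule names, scores and threshold are invented for illustration and have nothing to do with SA's real ruleset:

import re

DRUG_WORDS = {"viagra", "cialis", "xanax"}

def tokens(text):
    # crude tokeniser: lowercase alphabetic runs only
    return set(re.findall(r"[a-z]+", text.lower()))

def wordlist_filter(text):
    # single-wordlist approach: spam as soon as any listed word appears
    return bool(tokens(text) & DRUG_WORDS)

# (name, test, score) - all invented for the example
RULES = [
    ("drug_word",  lambda t: bool(tokens(t) & DRUG_WORDS),             2.5),
    ("shouting",   lambda t: sum(c.isupper() for c in t) > len(t) / 2, 1.0),
    ("free_offer", lambda t: "free" in t.lower(),                      0.5),
]

def weighted_filter(text, threshold=3.0):
    # SA-style approach: every rule contributes a weight, and only the
    # combined score can push a message over the spam threshold
    total = sum(score for _name, test, score in RULES if test(text))
    return total >= threshold

A legitimate "Refill your Viagra prescription today" trips the wordlist filter outright but only scores 2.5 on the weighted version, which is roughly why I'd rather lean on combined evidence than on content alone.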
I'm interested in how much SpamAssassin maintenance was complained about. I used to do some with SA 2, but SA 3 with network tests switched on seems to pretty much just work. Although the damn thing has started autolearning one type of spam as ham (argh) in the last week.
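(For what it's worth, the knob I'm planning to poke at for that is the Bayes auto-learn thresholds in local.cf - something along these lines, though the exact numbers below are just my guesswork, not a recommendation:)

# only auto-learn ham from mail scoring well below zero, rather than the
# stock 0.1, so borderline spam stops being learned as ham
bayes_auto_learn                    1
bayes_auto_learn_threshold_nonspam  -1.0
bayes_auto_learn_threshold_spam     12.0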
However, delegating this to users may create its own form of maintenance :(
I wouldn't have thought that the choice of database backend - DBM versus Postgres - would affect scalability (other than the NFS issue), since presumably, if each user has a unique list, we need to read the relevant words for each message from whichever database is used. I could see the NFS thing being a practical issue, but I dare say there are ways around it. Certainly we had a busy webserver with several GDBM writes happening for every hit, and my predecessors hadn't noticed it was opening the databases every time instead of holding them open between requests.
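To illustrate that webserver mistake, a minimal Python sketch using the stock dbm.gnu wrapper (the path and function names are mine for the example, not anything from the real setup):

import dbm.gnu

DB_PATH = "/var/db/tokens.gdbm"   # made-up path for the example

def lookup_per_request(key):
    # what my predecessors were effectively doing: open and close the
    # database on every single hit
    with dbm.gnu.open(DB_PATH, "r") as db:
        return db.get(key.encode())

_db = None

def lookup_cached(key):
    # holding the handle open between requests amortises the open/close
    # cost; each worker process keeps its own read-only handle
    global _db
    if _db is None:
        _db = dbm.gnu.open(DB_PATH, "r")
    return _db.get(key.encode())

The per-hit open was the real cost there, not the dbm format itself, which is why I'd expect the backend choice on its own to matter less than how the filter holds and reuses its database handles.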