[Dovecot] deploying dspam

Johannes Berg

14 Dec 2004

5:03 p.m.

Hi everyone,

I've been thinking of phasing out my spamassassin installation (it's hardly doing me any good these days) in favour of dspam. Now since I'm working on this anyway, I thought I'd configure it so when users get any spam, it is delivered into a special folder. I figure I need the 1.0 test series for this, so I can make that folder fixed. But, I also want to allow them to move spam in and out of that folder and dspam can learn it.

Anyway, I found a message dating back a while where Timo says this could easily be implemented in a plugin (in the 1.0 test series), but I can't put my finger on how to do it. Does anyone have an example plugin that can run commands when mail is moved in or out of a special folder? Or is there any plugin-writing documentation?

Thanks, johannes

Marcus Rueckert

14 Dec 2004

5:43 p.m.

On 2004-12-14 16:03:50 +0100, Johannes Berg wrote:

...

I've been thinking of phasing out my spamassassin installation (it's hardly doing me any good these days) in favour of dspam. Now since I'm working on this anyway, I thought I'd configure it so when users get any spam, it is delivered into a special folder. I figure I need the 1.0 test series for this, so I can make that folder fixed. But, I also want to allow them to move spam in and out of that folder and dspam can learn it.

Anyway, I found a message dating back a while where Timo says this could easily be implemented in a plugin (in the 1.0 test series), but I can't put my finger on how to do it. Does anyone have an example plugin that can run commands when mail is moved in or out of a special folder? Or is there any plugin-writing documentation?

why not do it at delivery stage from your MTA? imho it is the better place for spamassassin.

just my 2 cents

darix

-- irssi - the client of the smart and beautiful people

          http://www.irssi.de/

Johannes Berg

7:53 p.m.

Marcus Rueckert wrote:

...

why not do it at delivery stage from your MTA?

...

imho it is the better place for spamassassin.

Sure. I want to filter there, but want to have the ability to later re-classify messages by just moving them out or into the spam folder (which will be cleaned up by age)

johannes

Rick Jones

10:06 p.m.

--On 14 December 2004 18:53 +0100 Johannes Berg <johannes@sipsolutions.net> wrote:

| Marcus Rueckert wrote:
|
| > why not do it at delivery stage from your MTA?
| > imho it is the better place for spamassassin.
|
| Sure. I want to filter there, but want to have the ability to later
| re-classify messages by just moving them out or into the spam folder (which
| will be cleaned up by age)

How about using procmail for final delivery, and set up rules there? Then the burden isn't on the MUA. I find it easier to do all my sorting rules this way - apart from anything else it's then not dependent on what MUA you use to read your mail. In my case I use Mulberry locally, but horde/imp webmail remotely.

BTW, I find SpamAssassin very effective, especially with continuous training. What problems are you experiencing?

-- Rick Jones

Marcus Rueckert

15 Dec 2004

12:10 a.m.

On 2004-12-14 18:53:58 +0100, Johannes Berg wrote:

...

Sure. I want to filter there, but want to have the ability to later re-classify messages by just moving them out or into the spam folder (which will be cleaned up by age)

err this is a cron job thingie. imagine you have:

/var/spool/mail/user1/.ham
/var/spool/mail/user1/.spam
/var/spool/mail/user2/.ham
/var/spool/mail/user2/.spam

so your script iterates over all users and check their spam/ham directories and runs sa-learn against the mails it finds in there. there are options so it learns to the user databases.
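The cron approach described above could be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not a script from the thread: the spool layout is the one given in the message, each .ham/.spam folder is assumed to be a directory of message files that sa-learn can read directly, and the function names are hypothetical. The options for learning into per-user databases mentioned above are deliberately left out.

```python
import os

SPOOL_ROOT = "/var/spool/mail"  # layout taken from the message above


def list_users(spool_root=SPOOL_ROOT):
    """Return the names of the user directories under the spool root."""
    return sorted(
        name for name in os.listdir(spool_root)
        if os.path.isdir(os.path.join(spool_root, name))
    )


def learn_commands(users, spool_root=SPOOL_ROOT):
    """Build one sa-learn command line per user and per class.

    Nothing is executed here; the caller decides how to run the
    commands, for example with subprocess.run from a nightly cron job.
    """
    commands = []
    for user in users:
        for flag, folder in (("--ham", ".ham"), ("--spam", ".spam")):
            # Assumption: the folder is a directory of message files.
            path = f"{spool_root}/{user}/{folder}"
            commands.append(["sa-learn", flag, path])
    return commands
```

A cron job would call learn_commands(list_users()) and run each resulting command in turn.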

hope this helps

darix

Johannes Berg

11:23 a.m.

Marcus Rueckert wrote:

...

err this is a cron job thingie.

imagine you have:

/var/spool/mail/user1/.ham
/var/spool/mail/user1/.spam
/var/spool/mail/user2/.ham
/var/spool/mail/user2/.spam

so your script iterates over all users and check their spam/ham directories and runs sa-learn against the mails it finds in there. there are options so it learns to the user databases.

That's unintuitive. If I want to re-classify a mail as "ham" I want to have it in my inbox too. Or any other box for that matter. Timo said it was possible to execute actions when messages are moved, so why not take advantage of that?

johannes

Tom Allison

17 Dec 2004

2:50 p.m.

Johannes Berg wrote:

...

Marcus Rueckert wrote:

...
err this is a cron job thingie.

imagine you have:

/var/spool/mail/user1/.ham
/var/spool/mail/user1/.spam
/var/spool/mail/user2/.ham
/var/spool/mail/user2/.spam

so your script iterates over all users and check their spam/ham directories and runs sa-learn against the mails it finds in there. there are options so it learns to the user databases.

That's unintuitive. If I want to re-classify a mail as "ham" I want to have it in my inbox too. Or any other box for that matter. Timo said it was possible to execute actions when messages are moved, so why not take advantage of that?

johannes

I would not do anything unexpected. If you move an email into a folder, and it's a PIPE, then the user won't be able to go to that folder and see that the email he moved actually shows up. That's unexpected.

That's also going to get the user to copy mail again and again and again.

But most clients only move. If mail disappears because of this...

Gunter Ohrner

15 Dec 2004

1:47 a.m.

On Tuesday, 14 December 2004 18:53, Johannes Berg wrote:

...

...
why not do it at delivery stage from your MTA? imho it is the better place for spamassassin.

...

Sure. I want to filter there, but want to have the ability to later re-classify messages by just moving them out or into the spam folder (which will be cleaned up by age)

Mh, that's just what I am doing... I'm using dspam and every user has a folder which collects emails classified as spam, and which has two subfolders: one where the user can move false negatives into and one for false positives. All messages in these folders are used to retrain dspam on a regular basis.

I run dovecot 0.99.something on this machine but it has absolutely nothing to do with dspam or the re-train mechanism - so, actually I'm not really understanding what you are asking for. ;) Do you want a single global spam collector box for all users? If yes, you should consider what happens if some confidential mail of a user gets wrongly classified as spam...

Greetings,

Gunter

-- 
http://aachen.uni-dsl.de/ - The direct line into the university network!

Fresh air: that which comes out of the air conditioning (Marcus Stögbauer)
http://www.iks-jena.de/mitarb/lutz/usenet/Fachbegriffe.der.Informatik.html#374

PGP-encrypted mail preferred!

Rick Flower

6:33 a.m.

Thanks for mentioning dspam.. I'd never heard of it.. I've been using
SpamAssassin for a while, and while it does work pretty well with my
exim mail server (rejecting spam at SMTP time), the overhead of
the Perl interpreter is killing me. I'm running on an itty-bitty box
with only 64 MB of RAM and SA consumes about 20 MB for each instance..
I'll be switching if possible (it's building right now).

Thanks!

-- Rick

Gunter Ohrner

16 Dec 2004

3:22 a.m.

On Wednesday, 15 December 2004 05:33, Rick Flower wrote:

...

the Perl interpreter is killing me. I'm running on an itty-bitty box with only 64 MB of RAM and SA consumes about 20 MB for each instance..

The same here. dspam kicks ass in this respect, using only about 1.5 MB per instance and being about 5 to 6 times as fast as SpamAssassin was before.

However, you'll have a hard time using dspam_clean; it's a real memory hog. You should use an SQL-based db backend and use the purge.sql script. Even there you can optimize a lot by changing some of the queries so that indexes can take effect.

Greetings,

Gunter

-- 
http://aachen.uni-dsl.de/ - The direct line into the university network!

The druid stiffened. "*Nice?*" he said. "A triumph of the silicon chunk, a miracle of modern masonic technology -- *nice*?"
"Oh, yes," said Twoflower, to whom sarcasm was merely a seven letter word beginning with S.
-- (Terry Pratchett, The Light Fantastic)

PGP-encrypted mail preferred!

tallison@tacocat.net

15 Dec 2004

4:42 p.m.

...

Marcus Rueckert wrote:

...
why not do it at delivery stage from your MTA?

...
imho it is the better place for spamassassin.

Sure. I want to filter there, but want to have the ability to later re-classify messages by just moving them out or into the spam folder (which will be cleaned up by age)

johannes

OK, I'll play dumb here.

How is dspam any different from using something like bogofilter + procmail or bogofilter + maildrop?

bogofilter is a C-based statistical spam filtering ..yada..yada..yada.

(dspam's website screams like a marketing brochure)

John Peacock

5:01 p.m.

tallison@tacocat.net wrote:

...

OK, I'll play dumb here.

And you're so good at it! ;)

...

bogofilter is a C-based statistical spam filtering ..yada..yada..yada.

And it works acceptably for individual users (since it uses BDB for token storage). I also question the author's knowledge of Berkeley DB usage, since he specifically discusses NFS usage which is strictly forbidden for all but a tiny number of NFS implementations (due to the BDB shared memory map requirements).

However, for a larger installation, bogofilter simply won't work well, because it doesn't support a multiuser database (like MySQL or PostgreSQL). dspam also provides several other categorization schemes which bogofilter doesn't have.

I can personally confirm that dspam works great; my personal account stats are:

filtering accuracy is	98.895% since last reset
false positive rate is	0.728% since last reset

and since the spam are quarantined, I can deal with the very rare false positive. Unlike SpamAssassin, as an administrator I do not have to do anything to achieve high accuracy.

John

-- 
John Peacock
Director of Information Research and Technology
Rowman & Littlefield Publishing Group
4501 Forbes Boulevard
Suite H
Lanham, MD 20706
301-459-3366 x.5010
fax 301-429-5748

tallison@tacocat.net

5:18 p.m.

...

tallison@tacocat.net wrote:

...
bogofilter is a C-based statistical spam filtering ..yada..yada..yada.

However, for a larger installation, bogofilter simply won't work well, because it doesn't support a multiuser database (like MySQL or PostgreSQL). dspam also provides several other categorization schemes which bogofilter doesn't have.

So one of the key differences is the lack of a database that you can query by user? bogofilter would probably just give each user their own wordlist or use one wordlist to join them all. But the pros/cons of that decision belong elsewhere.

...

I can personally confirm that dspam works great; my personal account stats are:

filtering accuracy is 98.895% since last reset
false positive rate is 0.728% since last reset

I'm not sure what you mean by a reset.

Bogofilter and SA (when they added Bayesian filtering) both performed rather poorly for the first 100 emails or so. After a bit they began to learn. Given that initial curve... Unless dspam starts with a preloaded wordlist or something else, I can't imagine its success being significantly different at the beginning.

After training a few thousand emails, I think they all start to approach 99.999%. But again, that's a different list.

But am I to understand that dspam is still implemented as a maildrop/procmail add-in? Just like bogofilter and SpamAssassin (minus amavisd)?

John Peacock

5:35 p.m.

tallison@tacocat.net wrote:

...

So one of the key differences is the lack of a database that you can query by user? bogofilter would probably just give each user their own wordlist or use one wordlist to join them all. But the pros/cons of that decision belong elsewhere.

No, the key difference is that bogofilter requires BDB, which is a shared access database with no resident process (as used here). Each time an e-mail is processed, the BDB database must be opened, read, updated, shutdown (even though the BDB libraries themselves remain resident). Consequently, the load on the server for 400 users is much higher than a true database like MySQL. I'm not saying that BDB is bad, but rather that as used here, it doesn't scale well at all. bogofilter also permits _either_ a shared dictionary or individual dictionary. dspam has several ways of sharing or grouping users.

...

I'm not sure what you mean by a reset.

That's just a byproduct of the management CGI; I have never reset the stats, so those numbers are lifetime (4 months).

...

Given that initial curve... Unless dspam starts with a preloaded wordlist or something else, I can't imagine its success being significantly different at the beginning.

All statistical systems require some initial training before they become accurate; dspam is no exception. I ran for about a week with a shared corpus (actually the SpamAssassin public corpus) before reverting the users to personal training dictionaries.

...

After training a few thousand emails, I think they all start to approach 99.999%. But again, that's a different list.

Except that in practice, SA requires more handholding to maintain that accuracy, whereas dspam just works. I cannot speak for bogofilter, but I know that when I was still using SA, 94% accuracy was considered excellent.

...

But am I to understand that dspam is still implemented as a maildrop/procmail add-in? Just like bogofilter and SpamAssassin (minus amavisd)?

The out-of-the-box installation for dspam is a command line client. The actual code is implemented as a library (which is all that the command line client calls), so any proposed integration for Dovecot would be via the library, too.

John

Jethro R Binks

7:07 p.m.

On Wed, 15 Dec 2004 tallison@tacocat.net wrote:

...

Bogofilter and SA (when they added Bayesian filtering) both performed rather poorly for the first 100 emails or so. After a bit they began to learn. Given that initial curve... Unless dspam starts with a preloaded wordlist or something else, I can't imagine its success being significantly different at the beginning.

The SA instructions _specifically_ tell you that you must train the Bayes stuff _before_ using it; you have to feed it at least 500 spams and 500 hams, or something along those lines.

I imagine similar tools suggest the same.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Jethro R Binks
Computing Officer, IT Services
University Of Strathclyde, Glasgow, UK

tallison@tacocat.net

8:47 p.m.

...

On Wed, 15 Dec 2004 tallison@tacocat.net wrote:

...
Bogofilter and SA (when they added Bayesian filtering) both performed rather poorly for the first 100 emails or so. After a bit they began to learn. Given that initial curve... Unless dspam starts with a preloaded wordlist or something else, I can't imagine its success being significantly different at the beginning.

The SA instructions _specifically_ tell you that you must train the Bayes stuff _before_ using it; you have to feed it at least 500 spams and 500 hams, or something along those lines.

bogofilter recommends 1000+ but it's reasonably effective after only 100 of both ham and spam.

Hauke Fath

6:39 p.m.

On 15.12.2004 at 10:01 -0500, John Peacock wrote:

...

...
bogofilter is a C-based statistical spam filtering ..yada..yada..yada.

And it works acceptably for individual users (since it uses BDB for token storage). I also question the author's knowledge of Berkeley DB usage, since he specifically discusses NFS usage which is strictly forbidden for all but a tiny number of NFS implementations (due to the BDB shared memory map requirements).

However, for a larger installation, bogofilter simply won't work well, because it doesn't support a multiuser database (like MySQL or PostgreSQL).

While this of course depends on your definition of "larger", some people seem to think otherwise:

http://www.usenix.org/events/lisa04/tech/blosser.html
http://www.usenix.org/events/lisa04/tech/blosser/blosser.pdf

hauke

-- 
/~\  The ASCII Ribbon Campaign      Hauke Fath
\ /    No HTML/RTF in email         Institut für Nachrichtentechnik
 X     No Word docs in email        TU Darmstadt
/ \  Respect for open standards     Tel. +49-6151-16-3281

John Peacock

6:48 p.m.

Hauke Fath wrote:

...

While this of course depends on your definition of "larger", some people seem to think otherwise:

http://www.usenix.org/events/lisa04/tech/blosser.html

Not having a Usenix login, I cannot comment on the full paper, but to quote this from the abstract:

...

Bayesian classification has been able to solve the spam problem for this user population for the present and observable future, with a _single wordlist_, and with no secondary spam filtering techniques

The use of a single wordlist is appropriate for limited circumstances. Even in a corporate environment like I manage, there is a very wide definition of what constitutes spam, and a configuration such as described above wouldn't work here. It would work even less in an ISP environment, with widely varied userbase.

This is now veering into Off Topic Territory...

John

Hauke Fath

6:58 p.m.

On 15.12.2004 at 11:48 -0500, John Peacock wrote:

...

...
http://www.usenix.org/events/lisa04/tech/blosser.html

Not having a Usenix login, I cannot comment on the full paper, but to quote this from the abstract:

The PDF is available without login from the URL you snipped. A link is in the abstract. 8>

HAND, hauke

Simon Waters

7:11 p.m.

New subject: OT: More on Spam Re: [Dovecot] deploying dspam

On Wednesday 15 Dec 2004 4:48 pm, John Peacock wrote:

...

Hauke Fath wrote:

...
While this of course depends on your definition of "larger", some people seem to think otherwise:

http://www.usenix.org/events/lisa04/tech/blosser.html

Not having a Usenix login, I cannot comment on the full paper, but to

The full paper seems to be there under HTML (despite the 'before November 2005' comment - whoops).

...

The use of a single wordlist is appropriate for limited circumstances. Even in a corporate environment like I manage, there is a very wide definition of what constitutes spam, and a configuration such as described above wouldn't work here. It would work even less in an ISP environment, with widely varied userbase.

Oh I don't know - we could probably easily filter our clients' spam with a single word list - real pharmacists don't obfuscate drug names very often. But it would obviously lose skill, and if tuned right let more spam through. But that wouldn't stop it being a very effective spam filter. But I think the spamassassin approach of weighing several inputs statistically is better here anyway - over-reliance on content will always lead to false positives.

I'm interested in how much SpamAssassin maintenance was complained about. I used to do some with SA 2, but SA 3 with network tests switched on seems to pretty much just work. Although the damn thing has started autolearning one type of spam as ham (argh) in the last week.

However, delegating this to users may create its own form of maintenance :(

I wouldn't have thought that different database backends dbm versus Postgres would affect scalability (other than the NFS issue). As presumably if each user has a unique list we need to read the relevant words for each message from whichever database. I could see the NFS thing being a practical issue, but I dare say there are ways. Certainly we had a busy webserver with several GDBM writes happening on a web server for every hit, and my predecessors hadn't noticed it was opening the databases every time instead of holding them open between requests.

Benjamin J. Weiss

7:26 p.m.

On Wed, 15 Dec 2004, John Peacock wrote:

...

Hauke Fath wrote:

...
While this of course depends on your definition of "larger", some people seem to think otherwise:

http://www.usenix.org/events/lisa04/tech/blosser.html

Not having a Usenix login, I cannot comment on the full paper, but to quote this from the abstract:

...
Bayesian classification has been able to solve the spam problem for this user population for the present and observable future, with a _single wordlist_, and with no secondary spam filtering techniques

The use of a single wordlist is appropriate for limited circumstances. Even in a corporate environment like I manage, there is a very wide definition of what constitutes spam, and a configuration such as described above wouldn't work here. It would work even less in an ISP environment, with widely varied userbase.

This is now veering into Off Topic Territory...

I tried to install dspam from Dag's apt repo for RHEL 3. It seemed rather complex...

Tom Allison

16 Dec 2004

12:51 p.m.

Hauke Fath wrote:

...

While this of course depends on your definition of "larger", some people seem to think otherwise:

http://www.usenix.org/events/lisa04/tech/blosser.html
http://www.usenix.org/events/lisa04/tech/blosser/blosser.pdf

hauke

These are very good articles!

He details not only how to implement this on a large-scale system, but also gives a history of how effective it has been on that system. 10E6 emails isn't bad.

Johannes Berg

15 Dec 2004

6:50 p.m.

tallison@tacocat.net wrote:

...

How is dspam any different from using something like bogofilter + procmail

Oh. Probably not at all really (different approach, yadda yadda). However, I want the SQL backend.

OTOH, I only posted the stuff about dspam here as an explanation of why I was looking for such a plugin :-)

johannes

Curtis Maloney

16 Dec 2004

12:58 a.m.

Johannes Berg wrote:

...

tallison@tacocat.net wrote:

...
How is dspam any different from using something like bogofilter + procmail

Oh. Probably not at all really (different approach, yadda yadda). However, I want the SQL backend.

OTOH, I only posted the stuff about dspam here as an explanation of why I was looking for such a plugin :-)

I was about to chime in and say "I think people missed your point."

It never came across to me that you were wanting something specific with dspam... more that you wanted an explicit trigger for when a user decided something was/wasn't SPAM. And I, personally, love the idea.

Auto-training when (and only when) the user evaluates a message? Sounds great. Count me in.

So, let me see if I fully understand how you want it to go:

1) mail hits your MTA, which hands it off to dspam
2) dspam munges it, decides if it's SPAM or HAM, and delivers it to the user, either in INBOX or SPAM
3) if the USER moves a message into SPAM, dspam is notified it missed a message, and retrains on it.
4) if the USER moves a message out of SPAM, dspam is notified it got a false-positive, and retrains on it.
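Steps 3 and 4 boil down to a small decision on the source and target folder of a move. The following is a minimal Python sketch of that decision only, under stated assumptions: the folder name "SPAM" and the function and action names are hypothetical, and how the resulting action would be passed to dspam (command line client or library) is left open because the thread does not specify it.

```python
SPAM_FOLDER = "SPAM"  # hypothetical name of the fixed spam folder


def retrain_action(source_folder, target_folder, spam_folder=SPAM_FOLDER):
    """Decide what the filter should learn when a user moves a message.

    Returns "learn_spam" for a missed spam (moved into the spam folder),
    "learn_ham" for a false positive (moved out of the spam folder),
    and None when the move says nothing about the classification.
    """
    if source_folder == target_folder:
        return None  # not a real move
    if target_folder == spam_folder:
        return "learn_spam"
    if source_folder == spam_folder:
        return "learn_ham"
    return None
```

Moves between ordinary folders return None, so only explicit reclassifications would ever trigger retraining.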

Seems to me it will possibly lower the overall load, since you will only rescan/retrain messages _explicitly_ changed from/to SPAM/HAM. Now, if only you could get some resident form of dspam, so you didn't have to keep spawning it.... or did I miss something in the docs? Then again, there's libdspam...

-- Curtis Maloney

Mark E. Mallett

1:08 a.m.

On Thu, Dec 16, 2004 at 09:58:53AM +1100, Curtis Maloney wrote:

...

It never came across to me that you were wanting something specific with dspam... more that you wanted an explicit trigger for when a user decided something was/wasn't SPAM. And I, personally, love the idea.

I'd still like to see more general hooks on moving into and out of folders, or ways to "redeliver" email, or folders that could act as pipes, e.g. as mentioned in this thread:

http://www.dovecot.org/list/dovecot/2003-July/001973.html

Timo Sirainen

1:26 a.m.

On 16.12.2004, at 01:08, Mark E. Mallett wrote:

...

I'd still like to see more general hooks on moving into and out of folders,

Easy to use hooks would be nice.. But that needs more thinking.

...

or ways to "redeliver" email, or folders that could act as pipes, e.g. as mentioned in this thread:

Do you mean real pipes? I think mbox code could be pretty easily modified to support pipes so that reading always shows empty mailbox but writing works.

Mark E. Mallett

17 Dec 2004

12:25 a.m.

On Thu, Dec 16, 2004 at 01:26:35AM +0200, Timo Sirainen wrote:

...

On 16.12.2004, at 01:08, Mark E. Mallett wrote:

...
I'd still like to see more general hooks on moving into and out of folders,

Easy to use hooks would be nice.. But that needs more thinking.

...
or ways to "redeliver" email, or folders that could act as pipes, e.g. as mentioned in this thread:

Do you mean real pipes? I think mbox code could be pretty easily modified to support pipes so that reading always shows empty mailbox but writing works.

I meant real pipes, yep. I don't know how well you could map them to IMAP semantics though. A pipe would be something you would file a message into, at which point the message would be fed into a program rather than filed into a folder. Is it possible to have a write-only folder using IMAP? Or does the folder have to present a read interface too?

Others have mentioned cron jobs for the specific situation where one wants to do statistical reclassification. That's always an option, of course, but getting away from that kind of implementation is the (my, anyway) motivation for wanting pipeboxes in the first place. Event-driven operations are better for some things than asynchronous polling-- resourcewise and effectwise. For example if I have updated my delivery rules and want to redeliver some wrongly-filed email according to the new rules, I just move all the message from where they are into the specific pipebox that hooks into the delivery agent.

Timo Sirainen

2:07 a.m.

On 17.12.2004, at 00:25, Mark E. Mallett wrote:

...

...
Do you mean real pipes? I think mbox code could be pretty easily modified to support pipes so that reading always shows empty mailbox but writing works.

I meant real pipes, yep. I don't know how well you could map them to IMAP semantics though. A pipe would be something you would file a message into, at which point the message would be fed into a program rather than filed into a folder. Is it possible to have a write-only folder using IMAP? Or does the folder have to present a read interface too?

There has to be a read interface, but nothing requires that any messages ever exist there. The CVS version now supports named pipes as write-only mboxes.
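On the reading end of such a pipe, something has to split the incoming stream back into messages. A minimal Python sketch, assuming that messages arrive in plain mbox format with each message introduced by a "From " separator line; the thread does not describe exactly what is written to the pipe, and both function names are hypothetical.

```python
def split_mbox_stream(lines):
    """Yield each message in an mbox-format stream as a list of lines.

    A new message starts at every line beginning with "From " (the mbox
    separator line), which is kept as the first line of the message.
    Anything before the first separator is yielded as its own chunk.
    """
    current = []
    for line in lines:
        if line.startswith("From ") and current:
            yield current
            current = []
        current.append(line)
    if current:
        yield current


def read_pipebox(fifo_path, handle_message):
    """Read a named pipe until the writer closes it, one message at a time."""
    # Opening a FIFO for reading blocks until a writer opens it.
    with open(fifo_path, "r", errors="replace") as fifo:
        for message in split_mbox_stream(fifo):
            handle_message(message)
```

The handler is where a retraining or redelivery program would be invoked; a message is only handed over once the next separator or end of input has been seen.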

Tom Allison

16 Dec 2004

1:13 p.m.

Mark E. Mallett wrote:

...

On Thu, Dec 16, 2004 at 09:58:53AM +1100, Curtis Maloney wrote:

...
It never came across to me that you were wanting something specific with dspam... more that you wanted an explicit trigger for when a user decided something was/wasn't SPAM. And I, personally, love the idea.

I'd still like to see more general hooks on moving into and out of folders, or ways to "redeliver" email, or folders that could act as pipes, e.g. as mentioned in this thread:

http://www.dovecot.org/list/dovecot/2003-July/001973.html

mm

Here's how I use training with dovecot. It's hardly related to dovecot, but we've strayed this far, I thought I would attempt something that might become related again.

bogofilter does a test on email, without any database updates. This keeps the database smaller, and since it doesn't change I believe it's cached.

bogofilter sorts mail into three categories: (H)am, (U)nsure, (S)pam.

Ham is copied into a folder, "Ham", and delivered as usual.
Unsure is copied into a folder, "Unsure", and delivered as usual.
Spam is delivered into a folder, "Spam".

The rest is done through crontabs.

crontab: All email in Ham and Spam that is more than 4 days old is automatically moved out of the IMAP system (mbox actually, but it's no longer IMAP accessible).

the human: moves Ham/Spam/Unsure into separate folders, NewHam, NewSpam

crontab: All email in NewHam and NewSpam is checked for learning. If the bogofilter score (H/U/S) doesn't match the folder a message is placed in, the message is used for training. In other words, if the score is Unsure or Ham and the message is in folder NewSpam, then $score != $folder and it's used for retraining.
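The "$score != $folder" rule can be written down as a tiny check. A minimal Python sketch, assuming the H/U/S score letter has already been obtained from bogofilter for the message; the function name and the folder-to-class mapping are illustrative, built only from the folder names in the message.

```python
# Class that a message in each training folder is expected to have.
FOLDER_CLASS = {"NewHam": "H", "NewSpam": "S"}


def needs_training(score, folder):
    """Return True when the bogofilter score disagrees with the folder.

    score is one of "H", "U" or "S"; folder is "NewHam" or "NewSpam".
    An Unsure message never matches, so it is always used for training.
    """
    if folder not in FOLDER_CLASS:
        raise ValueError(f"not a training folder: {folder}")
    return score != FOLDER_CLASS[folder]
```

Messages whose score already agrees with the folder are skipped, which is what keeps the amount of nightly training small.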

I like this method because the crontabs can be run at night when the load is small.

If you trigger training based on a mail copy, what happens when someone dumps 400 emails into a folder all at once? What happens when 30 people do this all at the same time? It might not suit a smaller system at peak hours to have this done.

I would prefer to implement a system where you can queue up the training in large numbers, but the actual training is done in a managed way. Over time, the amount of training that actually occurs is on the order of <1 per week, so it's not time critical that training be done. At first, I ran it hourly. Now I run it at midnight only. But on a large system, I would never deploy something without an initial wordlist to provide some filtering, which would also make hourly jobs unnecessary.
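Queued training with a bounded amount of work per run could look roughly like this. It is a minimal in-memory Python sketch with hypothetical names; a real deployment would need to persist the queue (for instance as the NewHam/NewSpam folders themselves) so that it survives between cron runs.

```python
from collections import deque


class TrainingQueue:
    """Collects retraining requests and hands them out in bounded batches."""

    def __init__(self):
        self._pending = deque()

    def enqueue(self, message_id, label):
        """Record that a message should be learned as "spam" or "ham"."""
        self._pending.append((message_id, label))

    def drain(self, max_items):
        """Remove and return up to max_items requests, oldest first."""
        batch = []
        while self._pending and len(batch) < max_items:
            batch.append(self._pending.popleft())
        return batch

    def __len__(self):
        return len(self._pending)
```

If 30 users each drop 400 messages into a folder, the moves only add entries to the queue; the nightly job then calls drain with whatever batch size the machine can afford.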

So where does dovecot fall into all of this?

I don't know. I really can't make an argument for doing anything to an IMAP server that would help with any of this without also making for potential problems. Dumping mail into pipes would lead to an unrecoverable condition if there was a human error (wrong pipe).

Perhaps the only thing would be to ask if moving email through the file system will really screw up the dovecot indexes. Sometimes dovecot reports some pretty strange number of messages in these folders.

Kenneth Porter

4:17 p.m.

--On Thursday, December 16, 2004 6:13 AM -0500 Tom Allison <tallison@tacocat.net> wrote:

...

Here's how I use training with dovecot. It's hardly related to dovecot, but we've strayed this far, I thought I would attempt something that might become related again.

For comparison, I'm using SpamAssassin. I have lots of mbox folders, but for spam training I have two folders named Uncaught and FalsePositives. Any spam that SA doesn't flag gets dragged to Uncaught. Anything in the Spam folder (put there by a procmail filter because it has SA markup) that's not spam gets dragged to FalsePositives. I have a nightly cron job that runs sa-learn on the two folders. sa-learn knows which messages it's seen before so I don't have to do anything special to flag the new stuff.

In my crontab:

12 0 * * * /home/ken/bin/sa-learn-nightly

The script:

#!/bin/sh
sa-learn --spam --mbox ~/mail/Spam/Uncaught
sa-learn --ham --mbox ~/mail/Spam/FalsePositives

Johannes Berg

2:05 p.m.

Curtis Maloney schrieb:

...

I was about to chime in and say "I think people missed your point."

Thanks. At least someone... ;-)

...

It never came across to me that you were wanting something specific with dspam... more that you wanted an explicit trigger for when a user decided something was/wasn't SPAM. And I, personally, love the idea.

Right. I don't care if it's SA or dspam or bogofilter. Hey, I was just evaluating dspam because SA is killing me with its overhead, so I'll be coding for dspam.

...

So, let me see if I fully understand how you want it to go:
1) mail hits your MTA, which hands it off to dspam
2) dspam munges it, decides if it's SPAM or HAM, and delivers it to the user, either in INBOX or SPAM
3) if the USER moves a message into SPAM, dspam is notified it missed a message, and retrains on it.
4) if the USER moves a message out of SPAM, dspam is notified it got a false-positive, and retrains on it.

Precisely. Add a step to clean up the SPAM folder once in a while.
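Steps 3 and 4 boil down to a decision on the source and destination folder of a move. A minimal sketch, assuming the spam folder is literally named "SPAM"; the function name and return values are made up for illustration and say nothing about how dspam itself is invoked:

```python
def training_action(source, destination, spam_folder="SPAM"):
    """Decide what a user-initiated move should teach the filter.

    Returns "spam" for a missed spam (moved into the spam folder),
    "ham" for a false positive (moved out of it), or None otherwise.
    """
    if source == destination:
        return None
    if destination == spam_folder:
        return "spam"
    if source == spam_folder:
        return "ham"
    return None
```

Moves between two ordinary folders return None, so only explicit corrections would trigger retraining.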

...

Seems to me it will possibly lower the overall load, since you will only rescan/retrain messages _explicitly_ changed from/to SPAM/HAM.
Now, if only you could get some resident form of dspam, so you didn't have to keep spawning it.... or did I miss something in the docs?
Then again, there's libdspam...

Yeah, though both these options kinda suck. Spawning dspam gives you all the benefit of the command line client (it reads config files etc.) while using libdspam makes it in-process. I'm looking at making another dspam library that encapsulates more functionality of the dspam client (ie. the config file reading etc.) and using that in-process with dovecot, I'll kick that idea around the dspam-dev list. Also, I'd link that library into my MTA (exim). The rationale for that idea is to centralize dspam's configuration while still using it from within multiple processes. There's one catch: This system will require that dspam stores the signature in the header (that way I can use dovecot's API to extract it and pass it to libdspam w/o retrieving the whole message).
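To illustrate the "signature in the header" idea: if only the headers are available, the signature can be pulled out without touching the body. The sketch below uses Python's standard email parser rather than dovecot's API, and the header name X-DSPAM-Signature is an assumption about how dspam is configured, not something confirmed in this thread.

```python
from email import message_from_string

# Assumed header name; adjust to whatever the dspam setup actually writes.
SIGNATURE_HEADER = "X-DSPAM-Signature"


def extract_signature(raw_headers):
    """Return the signature from a header block, or None if absent."""
    msg = message_from_string(raw_headers)
    value = msg.get(SIGNATURE_HEADER)
    if value is None:
        return None
    return value.strip()
```

The returned string is what would be handed to the retraining step instead of the full message.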

Also, dspam appears to store the messages in its database, so I was thinking of making a dspam-database dovecot storage plugin as well (or integrate that with the dspam plugin I need to write anyway). That way, those emails are only stored once. I haven't figured out what it stores though, whether all messages, to a certain limit, only spam, or ..... Needs some thinking, probably, and for a start, I'll just deliver the spam-messages to another maildir and make a namespace for it. Oh, and I'll have to prohibit APPENDing to the spam box. If some braindead IMAP clients move by fetch/append/delete then that's their problem, but APPEND is kinda hard to manage I think (as per Timo's message about append).

johannes

Curtis Maloney

17 Dec 17 Dec

12:33 a.m.

Johannes Berg wrote:

...

Curtis Maloney schrieb:

...
Seems to me it will possibly lower the overall load, since you will only rescan/retrain messages _explicitly_ changed from/to SPAM/HAM.
Now, if only you could get some resident form of dspam, so you didn't have to keep spawning it.... or did I miss something in the docs?
Then again, there's libdspam...

Yeah, though both these options kinda suck. Spawning dspam gives you all the benefit of the command line client (it reads config files etc.) while using libdspam makes it in-process. I'm looking at making another dspam library that encapsulates more functionality of the dspam client (ie. the config file reading etc.) and using that in-process with dovecot, I'll kick that idea around the dspam-dev list. Also, I'd link that library into my MTA (exim). The rationale for that idea is to centralize dspam's configuration while still using it from within multiple processes. There's one catch: This system will require that dspam stores the signature in the header (that way I can use dovecot's API to extract it and pass it to libdspam w/o retrieving the whole message).

Sounds to me like a dspam daemon would be a better option in some ways. It would mean a single task could handle work from both the MTA and Dovecot. As I said, I've not looked closely at dspam and its interfaces, but I think I will now...

...

Also, dspam appears to store the messages in its database, so I was thinking of making a dspam-database dovecot storage plugin as well (or integrate that with the dspam plugin I need to write anyway). That way, those emails are only stored once. I haven't figured out what it stores though, whether all messages, to a certain limit, only spam, or ..... Needs some thinking, probably, and for a start, I'll just deliver the spam-messages to another maildir and make a namespace for it.

This could be very interesting... certainly a "unique" feature for Dovecot, afaik. Would be very interesting to see how it turns out, and will be happy to test on my home setup.

-- Curtis Maloney

Timo Sirainen

15 Dec 15 Dec

5:07 p.m.

On 14.12.2004, at 17:03, Johannes Berg wrote:

...

Anyway, I found a message dating back a while where Timo says this could easily be implemented in a plugin (in the 1.0 test series), but I can't put my finger on how to do it. Does anyone have an example plugin that can run commands when mail is moved in or out of a special folder? Or is there any plugin-writing documentation?

Is this enough?

http://dovecot.org/patches/1.0/copy_plugin.c

Dovecot would need a real plugin API some day which would make things easier..

John Peacock

5:19 p.m.

Timo Sirainen wrote:

...

Is this enough?

http://dovecot.org/patches/1.0/copy_plugin.c

Dovecot would need a real plugin API some day which would make things easier..

I might be interested in working on this, but I'd need a little more details. How is this linked in (and what library)? How is copy_plugin_init() called (it looks like this replaces the default COPY command)?

John

-- John Peacock Director of Information Research and Technology Rowman & Littlefield Publishing Group 4501 Forbes Boulevard Suite H Lanham, MD 20706 301-459-3366 x.5010 fax 301-429-5748

Timo Sirainen

5:31 p.m.

On 15.12.2004, at 17:19, John Peacock wrote:

...

Timo Sirainen wrote:

...
Is this enough? http://dovecot.org/patches/1.0/copy_plugin.c Dovecot would need a real plugin API some day which would make things easier..

I might be interested in working on this, but I'd need a little more details. How is this linked in (and what library)? How is copy_plugin_init() called (it looks like this replaces the default COPY command)?

See mail_use_modules and mail_modules setting in config file. Just copy the file as shared library in there.

Oh and the file is supposed to be compiled in src/imap/ directory so the relative include paths work right..

You could compile it into the binary itself if you really wanted, but then you'd have to change the makefile and sources a bit.

Johannes Berg

6:51 p.m.

Timo Sirainen schrieb:

...

Is this enough?

http://dovecot.org/patches/1.0/copy_plugin.c

Yes, it'll serve as a starting point on how to do things. Thanks much.

johannes

Johannes Berg

8:12 p.m.

...

...
http://dovecot.org/patches/1.0/copy_plugin.c

Yes, it'll serve as a starting point on how to do things. Thanks much.

Is copy the only way a message can be moved into/out of a mailbox?

johannes

Timo Sirainen

10:27 p.m.

On 15.12.2004, at 20:12, Johannes Berg wrote:

...

Is copy the only way a message can be moved into/out of a mailbox?

Within server, yes. If client copies a message between servers or from local disk to server, it uses APPEND command. You could do pretty much the same thing for it too if you need it.

Or do you need to know what messages were copied/appended? With COPY that's possible to do because you know the messageset, but with APPEND it's not really possible, unless you just do it for every new message that is seen (ie. could have been created by another IMAP session or MTA).
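For reference, the messageset given to COPY is a comma-separated list of numbers and ranges such as "1:3,5", where "*" stands for the highest message in the mailbox. A rough sketch of expanding one into individual numbers follows; the function name is hypothetical and the caller has to supply the highest number:

```python
def expand_message_set(message_set, highest):
    """Expand an IMAP message set such as "1:3,5" into a list of numbers.

    highest is the largest message number in the mailbox and is used
    wherever the set contains "*".
    """
    numbers = []
    for part in message_set.split(","):
        if ":" in part:
            start, end = part.split(":", 1)
        else:
            start = end = part
        low = highest if start == "*" else int(start)
        high = highest if end == "*" else int(end)
        if low > high:
            # A range may be written in either order.
            low, high = high, low
        numbers.extend(range(low, high + 1))
    return numbers
```

A hook on COPY could walk this list to find each affected message, which is exactly what is not available for APPEND.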

Ben Beuchler

10:28 p.m.

On Wed, Dec 15, 2004 at 07:12:11PM +0100, Johannes Berg wrote:

...

...
Yes, it'll serve as a starting point on how to do things. Thanks much.

Is copy the only way a message can be moved into/out of a mailbox?

My recollection of the IMAP RFCs is that the only way to "move" something is a copy followed by a delete.

-Ben

-- Ben Beuchler There is no spoon. insyte@emt-p.org -- The Matrix
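Seen from the client side, such a "move" is three commands in a row. A small sketch using Python's imaplib, assuming conn is an imaplib.IMAP4 connection with the source mailbox already selected; the helper name is made up:

```python
def move_messages(conn, message_set, destination):
    """Emulate a move: COPY, flag the originals \\Deleted, then EXPUNGE.

    conn is expected to behave like imaplib.IMAP4 with a mailbox selected.
    Returns False without deleting anything if the COPY fails.
    """
    status, _ = conn.copy(message_set, destination)
    if status != "OK":
        return False
    conn.store(message_set, "+FLAGS", "\\Deleted")
    # EXPUNGE removes every message flagged \Deleted in the mailbox,
    # not only the ones flagged above.
    conn.expunge()
    return True
```

A server-side hook therefore only ever sees the COPY; the delete that completes the move arrives separately, if at all.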

Johannes Berg

16 Dec 16 Dec

2:54 p.m.

Johannes Berg wrote:

...

[...]

I've collected my thoughts now: http://johannes.sipsolutions.net/wiki/Projects/dovecot-dspam-integration

johannes
