[Dovecot] plugin problem
I'm trying to do a rewrite of the dspam_plugin for dovecot 1.1b1. There are some API changes that warranted an update of the plugin. Also, I wanted the dspam_plugin to be able to handle pristine mails for dspam retraining, as opposed to the signature-based retraining.
Question: How can I retrieve the full unix path for a specific mail?
The original code uses mail_get_first_header() to retrieve the signature header. I need something like mail_get_mail_file_path(&path) which I then could pass on to dspam. Is there some such function available?
Thanks in advance /Lars
On Sat, 2007-09-29 at 12:44 +0200, Lars Stavholm wrote:
I'm trying to do a rewrite of the dspam_plugin for dovecot 1.1b1.
Cool. I never imagined that the plugin would find such wide-spread use :)
How can I retrieve the full unix path for a specific mail?
The original code uses mail_get_first_header() to retrieve the signature header. I need something like mail_get_mail_file_path(&path) which I then could pass on to dspam. Is there some such function available?
I don't think you can since mail might be stored in any kind of format like mbox, dbox, ... Only in maildir would this be possible. You can probably somehow get the raw text of a message though, but I don't know off-hand, and then write it to a temporary file. In any case, I suggest doing that only when no signature is available, and I still don't see how you would end up with mails w/o signature at all except maybe during conversion to a new dspam installation.
johannes
Johannes Berg wrote:
On Sat, 2007-09-29 at 12:44 +0200, Lars Stavholm wrote:
I'm trying to do a rewrite of the dspam_plugin for dovecot 1.1b1.
Cool. I never imagined that the plugin would find such wide-spread use :)
Well, it's only me, don't know if anyone else uses it. Still, I think it's a brilliant idea. Doesn't get any more user friendly.
How can I retrieve the full unix path for a specific mail?
The original code uses mail_get_first_header() to retrieve the signature header. I need something like mail_get_mail_file_path(&path) which I then could pass on to dspam. Is there some such function available?
I don't think you can since mail might be stored in any kind of format like mbox, dbox, ... Only in maildir would this be possible. You can
OK, obviously I'm using Maildir format, but let's not restrict the functionality to that fact.
probably somehow get the raw text of a message though, but I don't know off-hand, and then write it to a temporary file. In any case, I suggest
There you go, that's my solution then, should work with all storage formats.
doing that only when no signature is available, and I still don't see how you would end up with mails w/o signature at all except maybe during conversion to a new dspam installation.
Who said anything about signatures not being available?
As far as I can tell from my tests, the signatures are picked up nicely by the dspam plugin.
However, I'm used to a dspam setup where TrainPristine=on, and the retraining/reclassification requires pristine mail-sources, without the X-DSPAM-... stuff, including the signature.
So, basically, I would read the mail in error, be it spam or ham, and pipe it to the dspam client for retraining/reclassification. The --user option of dspam is used to point dspam to the correct user (since we don't have a signature).
I saw some mail_get_istream() or similar, that seems to be a way to open up some sort of byte stream reading the mail contents. That might be what I'm looking for.
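The temporary-file idea from above could look roughly like the following C sketch. It deliberately leaves out the Dovecot side: how the raw message bytes are obtained (mail_get_istream() or whatever the real call turns out to be) is not shown, and the function name spool_message and the file name template are hypothetical. It only assumes that the plugin ends up with the message in a memory buffer.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>

/* Write len bytes of data to a freshly created temporary file.
 * path_template must be a writable string ending in "XXXXXX";
 * mkstemp() replaces the Xs and creates the file with mode 0600.
 * Returns 0 on success, -1 on failure (a partial file is removed). */
int spool_message(const char *data, size_t len, char *path_template)
{
    size_t off = 0;
    int fd;

    assert(data != NULL && path_template != NULL);
    fd = mkstemp(path_template);
    if (fd < 0)
        return -1;

    /* write() may be short or interrupted, so loop until done. */
    while (off < len) {
        ssize_t n = write(fd, data + off, len - off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            unlink(path_template);
            return -1;
        }
        off += (size_t)n;
    }
    if (close(fd) < 0) {
        unlink(path_template);
        return -1;
    }
    return 0;
}
```

After a successful call, path_template holds the real file name, which could then be handed to dspam; the caller is responsible for unlinking it afterwards.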
BTW, I would like to keep the previous functionality with the dspam plugin using signatures. In order to do that I need to be able to set dspam-plugin-specific options somehow. Any idea?
Cheers /Lars
On Sat, 2007-09-29 at 13:17 +0200, Lars Stavholm wrote:
Cool. I never imagined that the plugin would find such wide-spread use :)
Well, it's only me, don't know if anyone else uses it. Still, I think it's a brilliant idea. Doesn't get any more user friendly.
Some other people are doing some new stuff with it too, you might want to search the mailing list archives. Somebody put the code into git too and worked on some things but I haven't followed what they did since I don't really have time right now to touch my mail setup.
As far as I can tell from my tests, the signatures are picked up nicely by the dspam plugin.
Right.
However, I'm used to a dspam setup where TrainPristine=on, and the retraining/reclassification requires pristine mail-sources, without the X-DSPAM-... stuff, including the signature.
Aha. Why are you using this? As we've discussed previously on the list, much of the processing time dspam requires per mail is for tokenizing the message which you completely skip by loading a pre-tokenized message from disk when training based on the signature. Look at the "dovecot dspam plugin using libdspam" thread from a few weeks ago.
So, basically, I would read the mail in error, be it spam or ham, and pipe it to the dspam client for retraining/reclassification. The --user option of dspam is used to point dspam to the correct user (since we don't have a signature).
Now you're saying again you don't have a signature?
BTW, I would like to keep the previous functionality with the dspam plugin using signatures. In order to do that I need to be able to set dspam-plugin-specific options somehow. Any idea?
There are plugin options but I have no idea right now how to use them.
johannes
Johannes Berg wrote:
On Sat, 2007-09-29 at 13:17 +0200, Lars Stavholm wrote:
Cool. I never imagined that the plugin would find such wide-spread use :)
Well, it's only me, don't know if anyone else uses it. Still, I think it's a brilliant idea. Doesn't get any more user friendly.
Some other people are doing some new stuff with it too, you might want to search the mailing list archives. Somebody put the code into git too and worked on some things but I haven't followed what they did since I don't really have time right now to touch my mail setup.
As far as I can tell from my tests, the signatures are picked up nicely by the dspam plugin.
Right.
However, I'm used to a dspam setup where TrainPristine=on, and the retraining/reclassification requires pristine mail-sources, without the X-DSPAM-... stuff, including the signature.
Aha. Why are you using this? As we've discussed previously on the list, much of the processing time dspam requires per mail is for tokenizing the message which you completely skip by loading a pre-tokenized message from disk when training based on the signature. Look at the "dovecot dspam plugin using libdspam" thread from a few weeks ago.
True enough, for scalability and high volumes: signatures, no doubt. That's one of the reasons I wanted to do it as an optional feature. Thing is, I never got the signature thing working for me with my dspam setup, see below.
So, basically, I would read the mail in error, be it spam or ham, and pipe it to the dspam client for retraining/reclassification. The --user option of dspam is used to point dspam to the correct user (since we don't have a signature).
Now you're saying again you don't have a signature?
Well, I don't have a signature "in my hand" (when executing the dspam.c code) since I don't use it, I didn't look for it, and didn't retrieve it, so I would need another way of identifying the recipient.
The problem is that with the dspam setup I'm using, i.e. 3.8.0, Hash driver, shared group, etc., I usually get "signature not found". So at an early stage of my dspam experience (a year ago), I got used to setting TrainPristine=on and feeding the entire mail messages to dspam for reclassification, without signatures (which I found to be a tad troublesome, or rather: I never could figure out why they didn't work). In addition, we (at LinAdd.org) build rpm packages for small and medium-sized businesses and private use, so high performance and scalability are not really an issue (and I'm just a tad lazy as well :).
BTW, I would like to keep the previous functionality with the dspam plugin using signatures. In order to do that I need to be able to set dspam-plugin-specific options somehow. Any idea?
There are plugin options but I have no idea right now how to use them.
OK, fair enough, I'll try to see how the other plugins use options.
Cheers, and thanks for your input /Lars
On Sat, 2007-09-29 at 17:03 +0200, Lars Stavholm wrote:
Well, I don't have a signature "in my hand" (when executing the dspam.c code) since I don't use it, I didn't look for it, and didn't retrieve it, so I would need another way of identifying the recipient.
Actually, only setups that use virtual users need the uid-in-signature option; usually you can use the user who logged in to dovecot. I personally rely on the system user being right.
The problem is that with the dspam setup I'm using, i.e. 3.8.0, Hash driver, shared group, etc. I usually get "signature not found",
The message from my plugin? I'm guessing then there's some problem with your setup and you didn't configure dspam to put the signature into the header?
Anyway, I don't recommend training from pristine because of the resource overhead and it being hard to guarantee the message is indeed pristine, but if it suits you I can't stop you from doing it :)
johannes
Johannes Berg wrote:
On Sat, 2007-09-29 at 17:03 +0200, Lars Stavholm wrote:
Well, I don't have a signature "in my hand" (when executing the dspam.c code) since I don't use it, I didn't look for it, and didn't retrieve it, so I would need another way of identifying the recipient.
Actually, only setups that use virtual users need the uid-in-signature option; usually you can use the user who logged in to dovecot. I personally rely on the system user being right.
Sounds about right.
The problem is that with the dspam setup I'm using, i.e. 3.8.0, Hash driver, shared group, etc. I usually get "signature not found",
The message from my plugin? I'm guessing then there's some problem with your setup and you didn't configure dspam to put the signature into the header?
I definitely have a dspam setup problem; I never got the signatures working with the hash driver.
Anyway, I don't recommend training from pristine because of the resource overhead and it being hard to guarantee the message is indeed pristine, but if it suits you I can't stop you from doing it :)
Well, the resource overhead is there, that's for sure, but I don't think it's that significant. In the beginning for a new user there will be some reclassification but in the long run, dspam misses very few spams. I've reached +99% accuracy in a few months (for a single user, myself). But, of course you're right, with signature is better.
However, I'm slowly getting there with the dspam plugin.
Input Options
The plugin input options were easy: it seems that dovecot simply puts the options line into an environment variable that can be read with the getenv() call, e.g.:
dovecot.conf:
  ...
  protocol imap {
    mail_plugins = dspam
  }
  plugin {
    # dspam path ':' spam folder ':' [no]signature ':' ignore
    dspam = /usr/sbin/dspam:Spam:signature:Trash
  }
...and in the dspam plugin code I simply parse the result from getenv("DSPAM") and there's the input options.
In a future version one might add the ability to ignore more than one folder.
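To make the parsing step concrete, here is a minimal C sketch of splitting that positional option string. The struct and function names (dspam_opts, dspam_opts_parse, next_field) and the fixed buffer sizes are made up for illustration and are not taken from the actual plugin; the sketch only assumes the four-field layout shown in the dovecot.conf comment above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical container for the four positional fields. */
struct dspam_opts {
    char binary[256];        /* path to the dspam binary */
    char spam_folder[128];   /* folder whose contents count as spam */
    bool use_signature;      /* "signature" -> true, "nosignature" -> false */
    char ignore_folder[128]; /* folder to ignore, e.g. Trash */
};

/* Copy the next ':'-delimited field at *cursor into dst and advance.
 * Returns 0 on success, -1 if no field is left, it is empty,
 * or it does not fit into dst. */
static int next_field(const char **cursor, char *dst, size_t dst_size)
{
    const char *start = *cursor;
    const char *end;
    size_t len;

    if (start == NULL)
        return -1;
    end = strchr(start, ':');
    len = end != NULL ? (size_t)(end - start) : strlen(start);
    if (len == 0 || len >= dst_size)
        return -1;
    memcpy(dst, start, len);
    dst[len] = '\0';
    *cursor = end != NULL ? end + 1 : NULL;
    return 0;
}

/* Parse "path:spamfolder:[no]signature:ignore". Returns 0 on success. */
int dspam_opts_parse(const char *spec, struct dspam_opts *opts)
{
    char mode[16];
    const char *cursor = spec;

    assert(opts != NULL);
    if (spec == NULL)
        return -1; /* variable not set */
    if (next_field(&cursor, opts->binary, sizeof(opts->binary)) < 0 ||
        next_field(&cursor, opts->spam_folder, sizeof(opts->spam_folder)) < 0 ||
        next_field(&cursor, mode, sizeof(mode)) < 0 ||
        next_field(&cursor, opts->ignore_folder, sizeof(opts->ignore_folder)) < 0)
        return -1;
    if (cursor != NULL)
        return -1; /* unexpected extra fields */

    if (strcmp(mode, "signature") == 0)
        opts->use_signature = true;
    else if (strcmp(mode, "nosignature") == 0)
        opts->use_signature = false;
    else
        return -1;
    return 0;
}
```

In the plugin one would presumably call dspam_opts_parse(getenv("DSPAM"), &opts) once at init time and refuse to load on a non-zero return.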
Processing
I've tried to set up dspam with the hash driver and signatures, to no avail. I just can't get it to work: dspam does not find the signature in the storage area. Don't know why.
Does anyone have a dspam.conf you could share with me? Possibly dspam build options as well?
Another (mildly stupid maybe) question: why the fork() in the original dspam plugin? Seems to me that the fork() - waitpid() doesn't really allow for any advantage over a simple popen() and read the output? I have a sneaky feeling that I'm missing something vital here.
Cheers /Lars
On Sun, 2007-09-30 at 09:30 +0200, Lars Stavholm wrote:
The problem is that with the dspam setup I'm using, i.e. 3.8.0, Hash driver, shared group, etc. I usually get "signature not found",
The message from my plugin? I'm guessing then there's some problem with your setup and you didn't configure dspam to put the signature into the header?
I definitely have a dspam setup problem; I never got the signatures working with the hash driver.
Odd. I definitely have that working just fine.
Input Options
The plugin input options were easy: it seems that dovecot simply puts the options line into an environment variable that can be read with the getenv() call, e.g.:
dovecot.conf:
  ...
  protocol imap {
    mail_plugins = dspam
  }
  plugin {
    # dspam path ':' spam folder ':' [no]signature ':' ignore
    dspam = /usr/sbin/dspam:Spam:signature:Trash
  }
...and in the dspam plugin code I simply parse the result from getenv("DSPAM") and there's the input options.
In a future version one might add the ability to ignore more than one folder.
Hey that looks good.
Processing
I've tried to set up dspam with the hash driver and signatures, to no avail. I just can't get it to work: dspam does not find the signature in the storage area. Don't know why.
Ah, you have a different problem then, ok.
Does anyone have a dspam.conf you could share with me?
I'll send you mine in private mail.
Possibly dspam build options as well?
Using debian's packages.
Another (mildly stupid maybe) question: why the fork() in the original dspam plugin? Seems to me that the fork() - waitpid() doesn't really allow for any advantage over a simple popen() and read the output? I have a sneaky feeling that I'm missing something vital here.
popen() just forks/execs too, no? Coming from a kernel hacking background I'm more familiar with the low level details. Blame it on that.
johannes
Johannes Berg wrote:
On Sun, 2007-09-30 at 09:30 +0200, Lars Stavholm wrote:
The problem is that with the dspam setup I'm using, i.e. 3.8.0, Hash driver, shared group, etc. I usually get "signature not found",
The message from my plugin? I'm guessing then there's some problem with your setup and you didn't configure dspam to put the signature into the header?
I definitely have a dspam setup problem; I never got the signatures working with the hash driver.
Odd. I definitely have that working just fine.
Input Options
The plugin input options were easy: it seems that dovecot simply puts the options line into an environment variable that can be read with the getenv() call, e.g.:
dovecot.conf:
  ...
  protocol imap {
    mail_plugins = dspam
  }
  plugin {
    # dspam path ':' spam folder ':' [no]signature ':' ignore
    dspam = /usr/sbin/dspam:Spam:signature:Trash
  }
...and in the dspam plugin code I simply parse the result from getenv("DSPAM") and there's the input options.
In a future version one might add the ability to ignore more than one folder.
Hey that looks good.
Processing
I've tried to set up dspam with the hash driver and signatures, to no avail. I just can't get it to work: dspam does not find the signature in the storage area. Don't know why.
Ah, you have a different problem then, ok.
Does anyone have a dspam.conf you could share with me?
I'll send you mine in private mail.
Possibly dspam build options as well?
Using debian's packages.
Another (mildly stupid maybe) question: why the fork() in the original dspam plugin? Seems to me that the fork() - waitpid() doesn't really allow for any advantage over a simple popen() and read the output? I have a sneaky feeling that I'm missing something vital here.
popen() just forks/execs too, no? Coming from a kernel hacking background I'm more familiar with the low level details. Blame it on that.
popen() is more like system(), it starts a subprocess and then waits until the subprocess is done. Difference between popen() and system() is that using popen() I can capture the output.
Cheers /Lars
On Sun, 2007-09-30 at 09:30 +0200, Lars Stavholm wrote:
Input Options
The plugin input options were easy: it seems that dovecot simply puts the options line into an environment variable that can be read with the getenv() call, e.g.:
dovecot.conf:
  ...
  protocol imap {
    mail_plugins = dspam
  }
  plugin {
    # dspam path ':' spam folder ':' [no]signature ':' ignore
    dspam = /usr/sbin/dspam:Spam:signature:Trash
  }
...and in the dspam plugin code I simply parse the result from getenv("DSPAM") and there's the input options.
In a future version one might add the ability to ignore more than one folder.
Coming from this, I think there are multiple things we should do. Let me try to remember the feature requests I've seen over the past year :)
- signature logging instead of direct retraining (could use dovecot's dict service)
- port to dovecot 1.1
- give --user option to dspam (when no user in sig)
- ...
To do this, I'd suggest the following. This should work great since AFAIK dovecot allows % expansion in the plugin options.
- change the options like dovecot does with A=B:C=D:...
- introduce options:
  - BINARY=/usr/bin/dspam
  - SPAM=folder1 (parser should allow giving it multiple times, i.e. allow SPAM=folder1:SPAM=folder2:SPAM=folder3...)
  - TRASH=trash1 (similar to SPAM)
  - USER=%u (if given, --user is given to dspam command line)
  - OPTION=--mode=teft (arbitrary options for BINARY, can give multiple)
  - SIGNATURE=X-DSPAM-Signature (header line in which signature is)
  here, * options are unique to the direct-retraining approach.
- split off the "backend" into a separate file with some hooks that can be built as needed, introduce various backends:
  - dict signature logger
  - direct dspam caller
  - ...
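As a rough illustration of the proposed A=B:C=D:... syntax with repeatable keys, the following C sketch splits such a string in place and collects all values for a given key. The names kv_pair, kv_parse and kv_values are hypothetical, and the sketch assumes that values never contain a ':' (nothing in the proposal says how that would be escaped).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One KEY=VALUE pair; both pointers refer into the caller's buffer. */
struct kv_pair {
    const char *key;
    const char *value;
};

/* Split buf in place: items are separated by ':', and each item is cut
 * at its first '=' so that values such as "--mode=teft" stay intact.
 * Returns the number of pairs stored, or -1 if an item has no key,
 * no '=', or there are more than max_pairs items. */
int kv_parse(char *buf, struct kv_pair *pairs, int max_pairs)
{
    int count = 0;
    char *item = buf;

    assert(buf != NULL && pairs != NULL);
    while (item != NULL && *item != '\0') {
        char *next = strchr(item, ':');
        char *eq;

        if (next != NULL)
            *next++ = '\0';
        eq = strchr(item, '=');
        if (eq == NULL || eq == item || count == max_pairs)
            return -1;
        *eq = '\0';
        pairs[count].key = item;
        pairs[count].value = eq + 1;
        count++;
        item = next;
    }
    return count;
}

/* Collect every value given for key (e.g. all SPAM= folders).
 * Returns how many values were written to out. */
int kv_values(const struct kv_pair *pairs, int count, const char *key,
              const char **out, int max_out)
{
    int i, found = 0;

    for (i = 0; i < count && found < max_out; i++) {
        if (strcmp(pairs[i].key, key) == 0)
            out[found++] = pairs[i].value;
    }
    return found;
}
```

With this shape, SPAM=folder1:SPAM=folder2 simply yields two values for the key SPAM, and several OPTION= entries could be appended to the dspam argument vector in order.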
I think I'll start from the git tree somebody else published.
johannes
Johannes Berg wrote:
On Sun, 2007-09-30 at 09:30 +0200, Lars Stavholm wrote:
Input Options
The plugin input options were easy: it seems that dovecot simply puts the options line into an environment variable that can be read with the getenv() call, e.g.:
dovecot.conf:
  ...
  protocol imap {
    mail_plugins = dspam
  }
  plugin {
    # dspam path ':' spam folder ':' [no]signature ':' ignore
    dspam = /usr/sbin/dspam:Spam:signature:Trash
  }
...and in the dspam plugin code I simply parse the result from getenv("DSPAM") and there's the input options.
In a future version one might add the ability to ignore more than one folder.
Coming from this, I think there are multiple things we should do. Let me try to remember the feature requests I've seen over the past year :)
- signature logging instead of direct retraining (could use dovecot's dict service)
Why?
- port to dovecot 1.1
Easy enough, you'll nail this one in minutes.
- give --user option to dspam (when no user in sig)
This is needed for TrainPristine=on as well.
- ...
To do this, I'd suggest the following. This should work great since AFAIK dovecot allows % expansion in the plugin options.
- change the options like dovecot does with A=B:C=D:...
There's a good idea, conformity.
- introduce options:
  - BINARY=/usr/bin/dspam
  - SPAM=folder1 (parser should allow giving it multiple times, i.e. allow SPAM=folder1:SPAM=folder2:SPAM=folder3...)
  - TRASH=trash1 (similar to SPAM)
  - USER=%u (if given, --user is given to dspam command line)
  - OPTION=--mode=teft (arbitrary options for BINARY, can give multiple)
  - SIGNATURE=X-DSPAM-Signature (header line in which signature is)
  here, * options are unique to the direct-retraining approach.
- split off the "backend" into a separate file with some hooks that can be built as needed, introduce various backends:
  - dict signature logger
  - direct dspam caller
  - ...
Didn't quite get that one: what do you mean by "backend"? Part of dspam.c I assume?
I think I'll start from the git tree somebody else published.
Sounds promising.
Cheers /Lars
On Sun, 2007-09-30 at 10:57 +0200, Lars Stavholm wrote:
Coming from this, I think there are multiple things we should do. Let me try to remember the feature requests I've seen over the past year :)
- signature logging instead of direct retraining (could use dovecot's dict service)
Why?
Why what? Why logging at all? This was part of the "scaling better" plan. Why using dict service? Because it has good failover behaviour etc.
- port to dovecot 1.1
Easy enough, you'll nail this one in minutes.
I'm just doing a proper build system etc. I'll get around to it :)
- give --user option to dspam (when no user in sig)
This is needed for TrainPristine=on as well.
Right. Haven't thought about pristine training much yet, I still hope you'll nail your bug and get rid of that requirement ;)
- ...
To do this, I'd suggest the following. This should work great since AFAIK dovecot allows % expansion in the plugin options.
- change the options like dovecot does with A=B:C=D:...
There's a good idea, conformity.
Actually, I just realised that it's possible to give multiple options, maybe that's preferable?
plugin {
  dspam_trashes = trash1,trash2,trash3
  dspam_options = --user=xxx --a --b --c --d ...
}
Didn't quite get that one: what do you mean by "backend"? Part of dspam.c I assume?
Right now I'm thinking to build the plugin from multiple files depending on the backend selection. I'll publish the git tree this afternoon to show you, right now I have to go have a shower :)
I think I'll start from the git tree somebody else published.
Sounds promising.
Decided to do my own because that one had libdspam integrated already. That's another backend then.
johannes
Johannes Berg wrote:
On Sun, 2007-09-30 at 10:57 +0200, Lars Stavholm wrote:
Coming from this, I think there are multiple things we should do. Let me try to remember the feature requests I've seen over the past year :)
- signature logging instead of direct retraining (could use dovecot's dict service)
Why?
Why what? Why logging at all? This was part of the "scaling better" plan. Why using dict service? Because it has good failover behaviour etc.
I see.
- port to dovecot 1.1
Easy enough, you'll nail this one in minutes.
I'm just doing a proper build system etc. I'll get around to it :)
- give --user option to dspam (when no user in sig)
This is needed for TrainPristine=on as well.
Right. Haven't thought about pristine training much yet, I still hope you'll nail your bug and get rid of that requirement ;)
Well, it would seem I've nailed it. However, when using the dspam group feature, there are configurations when the TrainPristine=on is useful, so I'm still thinking about keeping that requirement.
- ...
To do this, I'd suggest the following. This should work great since AFAIK dovecot allows % expansion in the plugin options.
- change the options like dovecot does with A=B:C=D:...
There's a good idea, conformity.
Actually, I just realised that it's possible to give multiple options, maybe that's preferable?
It's possible, the expire plugin uses that mechanism.
plugin {
  dspam_trashes = trash1,trash2,trash3
  dspam_options = --user=xxx --a --b --c --d ...
}
Dunno, it's a matter of taste. Or maybe it's even a bit more user-friendly.
Didn't quite get that one: what do you mean by "backend"? Part of dspam.c I assume?
Right now I'm thinking to build the plugin from multiple files depending on the backend selection. I'll publish the git tree this afternoon to show you, right now I have to go have a shower :)
I see.
I think I'll start from the git tree somebody else published.
Sounds promising.
Decided to do my own because that one had libdspam integrated already. That's another backend then.
Good Luck and thanks for your help /Lars
On Sun, 2007-09-30 at 17:58 +0200, Lars Stavholm wrote:
- port to dovecot 1.1
Easy enough, you'll nail this one in minutes.
I'm just doing a proper build system etc. I'll get around to it :)
- give --user option to dspam (when no user in sig)
This is needed for TrainPristine=on as well.
Right. Haven't thought about pristine training much yet, I still hope you'll nail your bug and get rid of that requirement ;)
Well, it would seem I've nailed it. However, when using the dspam group feature, there are configurations when the TrainPristine=on is useful, so I'm still thinking about keeping that requirement.
Good point. Others have that requirement too for other things, and spamassassin probably requires the full mail too. I'll add it when I get around.
It's possible, the expire plugin uses that mechanism.
Right now I'm thinking to build the plugin from multiple files depending on the backend selection. I'll publish the git tree this afternoon to show you, right now I have to go have a shower :)
I see.
As you can see in the code, I've done that now.
johannes
Johannes Berg wrote:
Decided to do my own because that one had libdspam integrated already. That's another backend then.
Hi,
I'm currently traveling around in the US visiting some conferences etc. So I had no time to create a libdspam backend. I will do this if I find some time.
johannes
Best regards,
-- andreas
-- http://www.cynapses.org/ - cybernetic synapses
On Sun, 2007-09-30 at 09:30 +0200, Lars Stavholm wrote:
Another (mildly stupid maybe) question: why the fork() in the original dspam plugin? Seems to me that the fork() - waitpid() doesn't really allow for any advantage over a simple popen() and read the output? I have a sneaky feeling that I'm missing something vital here.
popen() uses FILE streams, which I at least try to avoid. For example, in some systems (Solaris IIRC) they were limited to the first 256 file descriptors.
It also executes everything through /bin/sh -c, which is pointless if you're not running a script and possibly dangerous if you're not escaping parameters correctly.
Timo Sirainen wrote:
On Sun, 2007-09-30 at 09:30 +0200, Lars Stavholm wrote:
Another (mildly stupid maybe) question: why the fork() in the original dspam plugin? Seems to me that the fork() - waitpid() doesn't really allow for any advantage over a simple popen() and read the output? I have a sneaky feeling that I'm missing something vital here.
popen() uses FILE streams, which I at least try to avoid. For example, in some systems (Solaris IIRC) they were limited to the first 256 file descriptors.
It also executes everything through /bin/sh -c, which is pointless if you're not running a script and possibly dangerous if you're not escaping parameters correctly.
I hear you. What would you suggest instead? pipe() + fork() + execl()? /L
On Sun, 2007-09-30 at 18:00 +0200, Lars Stavholm wrote:
Timo Sirainen wrote:
On Sun, 2007-09-30 at 09:30 +0200, Lars Stavholm wrote:
Another (mildly stupid maybe) question: why the fork() in the original dspam plugin? Seems to me that the fork() - waitpid() doesn't really allow for any advantage over a simple popen() and read the output? I have a sneaky feeling that I'm missing something vital here.
popen() uses FILE streams, which I at least try to avoid. For example, in some systems (Solaris IIRC) they were limited to the first 256 file descriptors.
It also executes everything through /bin/sh -c, which is pointless if you're not running a script and possibly dangerous if you're not escaping parameters correctly.
I hear you. What would you suggest instead? pipe() + fork() + execl()?
Yes. Or execv().
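For reference, a pipe() + fork() + execv() helper along those lines might look like the C sketch below. It is not the plugin's actual code: the function name run_with_stdin is hypothetical, the child's stdout and stderr are simply inherited, and SIGPIPE is ignored in the parent only for the duration of the write so that a child exiting early shows up as a write error instead of killing the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] directly (no /bin/sh involved), write input to its stdin
 * and wait for it. Returns the child's exit status (0-255), or -1 if the
 * child could not be started or did not exit normally. */
int run_with_stdin(char *const argv[], const char *input, size_t len)
{
    struct sigaction ign, old;
    int fds[2], status;
    size_t off = 0;
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL);
    if (pipe(fds) < 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: the read end of the pipe becomes stdin, then exec. */
        close(fds[1]);
        if (dup2(fds[0], STDIN_FILENO) < 0)
            _exit(127);
        close(fds[0]);
        execv(argv[0], argv);
        _exit(127); /* only reached if execv() failed */
    }

    /* Parent: feed the message, then close the pipe to signal EOF. */
    close(fds[0]);
    memset(&ign, 0, sizeof(ign));
    ign.sa_handler = SIG_IGN;
    sigemptyset(&ign.sa_mask);
    sigaction(SIGPIPE, &ign, &old);
    while (off < len) {
        ssize_t n = write(fds[1], input + off, len - off);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            break; /* e.g. EPIPE: the child stopped reading */
        }
        off += (size_t)n;
    }
    close(fds[1]);
    sigaction(SIGPIPE, &old, NULL);

    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because the arguments are passed as an array straight to execv(), folder names or user names containing spaces or shell metacharacters need no escaping, which is the point made above about /bin/sh -c.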
Anyway, I don't recommend training from pristine because of the resource overhead and it being hard to guarantee the message is indeed pristine, but if it suits you I can't stop you from doing it :)
Really? I found the downside of using signatures much worse than not using them:
If the backend is mysql, eventually, we have to clean, and optimize the signature tables... And the optimization locks the table. Needless to say, all inbound mail can't be written to the signatures table.
Now, if you have a lot of users, your table optimization could take hours to finish. Postgres has the same problem if you use a 'vacuum full'.
Cheers Johannes.
PS: your modified plugin worked like a charm for us. thanks a bunch!
On Mon, 2007-10-01 at 12:44 -0700, Tom Bombadil wrote:
Really? I found the downside of using signatures much worse than not using them:
If the backend is mysql, eventually, we have to clean, and optimize the signature tables... And the optimization locks the table. Needless to say, all inbound mail can't be written to the signatures table.
Now, if you have a lot of users, your table optimization could take hours to finish. Postgres has the same problem if you use a 'vacuum full'.
Ah, yeah, that could be a problem. I use css at the moment, so each mail signature is a single file that it simply blows away when no longer needed.
Right now I'm looking into migrating to crm114 for the fun of it ;)
johannes
participants (5)
- Andreas Schneider
- Johannes Berg
- Lars Stavholm
- Timo Sirainen
- Tom Bombadil