Sieve: Saving "pristine" messages for backups and spam training
Hello, I'm trying to work out a way to have my Sieve filter save a "pristine" version of email messages as a backup, primarily to use for training the spam filter. What I would like is to have every message saved into a single, site-wide directory (in the global sieve) before being processed further and delivered. The messages in that directory will be used to train the spam filter without having to worry about removing SpamAssassin headers and so forth.
I thought fileinto :copy might do what I wanted, but this creates a backup directory individually for each user. That's unmanageable for the spam training process I use. redirect *could* work, but that adds a header during the process so the email saved would not be "pristine".
I'm thinking of using the extprograms plugin to pipe to a program that will do a simple copy. That feels very hackish, however, and I'm hoping there is a more elegant solution.
Am I missing something obvious here?
Thanks! Jeff
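The "pipe to a program that does a simple copy" idea could look roughly like the following. This is only a minimal sketch in Python, not something posted in the thread. It assumes the program receives the raw message on standard input, as a program run by a pipe-style action would, and that the spool directory is passed as its first argument. The name save_pristine and the tmp/new directory layout are illustrative choices.

```python
import os
import sys
import time
from pathlib import Path


def save_pristine(data: bytes, spool: Path) -> Path:
    """Write the raw message bytes, unmodified, to spool/new via spool/tmp."""
    tmp_dir = spool / "tmp"
    new_dir = spool / "new"
    tmp_dir.mkdir(parents=True, exist_ok=True)
    new_dir.mkdir(parents=True, exist_ok=True)

    # Unique file name built from a timestamp, the PID and a random suffix.
    name = f"{time.time_ns()}.{os.getpid()}_{os.urandom(4).hex()}.eml"
    tmp_path = tmp_dir / name
    final_path = new_dir / name

    with open(tmp_path, "wb") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())

    # A rename within one filesystem is atomic, so a reader scanning
    # spool/new never sees a partially written message.
    os.rename(tmp_path, final_path)
    return final_path


# Only runs when a spool directory is given on the command line
# (the path itself is up to the site; none is assumed here).
if __name__ == "__main__" and len(sys.argv) == 2:
    save_pristine(sys.stdin.buffer.read(), Path(sys.argv[1]))
```

Reading and writing bytes rather than text keeps the copy byte-for-byte identical to what the program was handed, which is the point of a "pristine" archive.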
On 11.08.2014 at 17:42, Jeff Rice wrote:
Hello, I'm trying to work out a way to have my Sieve filter save a "pristine" version of email messages as a backup, primarily to use for training the spam filter.
Why? Mail already passes through your SMTP server with, e.g., spamass-milter (tagged spam will auto-train Bayes, depending on the setup). The remaining false positives and untagged spam should be sent by users to, e.g., a training script. Spam-tagged mail can be filed automatically into the Junk folder by a global Sieve rule (with POP3, use a virtual Dovecot setup).
What I would like is to have every message saved into a single, site-wide directory (in the global sieve) before being processed further and delivered. The messages in that directory will be used to train the spam filter without having to worry about removing SpamAssassin headers and so forth.
I thought fileinto :copy might do what I wanted, but this creates a backup directory individually for each user. That's unmanageable for the spam training process I use. redirect *could* work, but that adds a header during the process so the email saved would not be "pristine".
I'm thinking of using the extprograms plugin to pipe to a program that will do a simple copy. That feels very hackish, however, and I'm hoping there is a more elegant solution.
Am I missing something obvious here?
Keep things simple.
Thanks! Jeff
Best regards, Robert Schetterer
On 8/11/2014 11:42 AM, Jeff Rice wrote:
Hello, I'm trying to work out a way to have my Sieve filter save a "pristine" version of email messages as a backup, primarily to use for training the spam filter. What I would like is to have every message saved into a single, site-wide directory (in the global sieve) before being processed further and delivered. The messages in that directory will be used to train the spam filter without having to worry about removing SpamAssassin headers and so forth.
Provided I understand you correctly, my first thought is that saving a duplicate copy of every single message that arrives on this system seems wasteful.
Why not save only the messages that would actually be useful for spam training purposes?
I thought fileinto :copy might do what I wanted, but this creates a backup directory individually for each user. That's unmanageable for the spam training process I use. redirect *could* work, but that adds a header during the process so the email saved would not be "pristine".
I'm thinking of using the extprograms plugin to pipe to a program that will do a simple copy. That feels very hackish, however, and I'm hoping there is a more elegant solution.
There is; the Dovecot Antispam plug-in. It does exactly what you describe, and it addresses the problem of storing a duplicate copy of all messages.
In short, when a user drags a message from any folder to "Junk", you'll receive a "pristine" copy of the message at any local address you specify, delivered to any folder you specify (e.g., "Train as SPAM") within that "training user's" mailbox.
Conversely, when a user drags a message from "Junk" to any other folder, you'll receive a copy of the message in your "Train as HAM" folder.
Then, you can point your anti-spam solution's training executable to these two "pristine master corpus" folders.
If you ever need to reclassify messages, or expunge them, doing so is trivial with this master corpus approach.
Am I missing something obvious here?
Thanks! Jeff
Happy to provide a sample script for the antispam plugin's mailtrain back-end, as that's the one I use.
Cheers,
-Ben
On August 11, 2014 at 5:52 PM, Ben Johnson <ben@indietorrent.org> wrote:
On 8/11/2014 11:42 AM, Jeff Rice wrote:
I'm trying to work out a way to have my Sieve filter save a "pristine" version of email messages as a backup, primarily to use for training the spam filter. What I would like is to have every message saved into a single, site-wide directory (in the global sieve) before being processed further and delivered. The messages in that directory will be used to train the spam filter without having to worry about removing SpamAssassin headers and so forth.
Provided I understand you correctly, my first thought is that saving a duplicate copy of every single message that arrives on this system seems wasteful.
A bit wasteful, but disk space is cheap and it's a limited, rolling backup. The value of retraining goes down significantly as time passes, so I'm not planning on keeping messages there for an extended period of time.
Cron will clean out older messages after a set period of time.
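A cleanup job of that kind might look something like this. It is a minimal sketch in Python under stated assumptions: the backup is a flat directory of message files, and age is judged by file modification time. The function name purge_older_than and its parameters are hypothetical, not part of any tool mentioned in this thread.

```python
import time
from pathlib import Path
from typing import List, Optional


def purge_older_than(spool: Path, max_age_days: float,
                     now: Optional[float] = None) -> List[Path]:
    """Delete regular files in spool whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # 86400 seconds per day
    removed = []
    for entry in spool.iterdir():
        # Skip subdirectories; only plain message files are candidates.
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            entry.unlink()
            removed.append(entry)
    return removed
```

Run from cron with the spool path and a retention period, this keeps the backup rolling. Passing now explicitly is only there to make the function easy to test.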
I'm thinking of using the extprograms plugin to pipe to a program that will do a simple copy. That feels very hackish, however, and I'm hoping there is a more elegant solution.
There is; the Dovecot Antispam plug-in. It does exactly what you describe, and it addresses the problem of storing a duplicate copy of all messages.
In short, when a user drags a message from any folder to "Junk", you'll receive a "pristine" copy of the message at any local address you specify, delivered to any folder you specify (e.g., "Train as SPAM") within that "training user's" mailbox.
Hmm. Perhaps I'm just dense, but I don't see this behavior documented in the Antispam plugin docs. I'm happy to be corrected if I've misunderstood. I'd rather use an existing tool if possible.
What I can see is that Antispam will train on the version of the message the user drags into the "Junk" folder. But that message may have had headers added by a Sieve filter or SpamAssassin, for example. By "pristine", I mean "as received" by the LDA.
CRM114's "reaver_cache" is along the lines of what I'm thinking of.
Jeff
On Mon, 11 Aug 2014, Jeff Rice wrote:
On August 11, 2014 at 5:52 PM, Ben Johnson <ben@indietorrent.org> wrote:
On 8/11/2014 11:42 AM, Jeff Rice wrote:
I'm thinking of using the extprograms plugin to pipe to a program that will do a simple copy. That feels very hackish, however, and I'm hoping there is a more elegant solution.
There is; the Dovecot Antispam plug-in. It does exactly what you describe, and it addresses the problem of storing a duplicate copy of all messages.
In short, when a user drags a message from any folder to "Junk", you'll receive a "pristine" copy of the message at any local address you specify, delivered to any folder you specify (e.g., "Train as SPAM") within that "training user's" mailbox.
Hmm. Perhaps I'm just dense, but I don't see this behavior documented in the Antispam plugin docs. I'm happy to be corrected if I've misunderstood. I'd rather use an existing tool if possible.
What I can see is that Antispam will train on the version of the message the user drags into the "Junk" folder. But that message may have had headers added by a Sieve filter or SpamAssassin, for example. By "pristine", I mean "as received" by the LDA.
CRM114's "reaver_cache" is along the lines of what I'm thinking of.
How about this:
Your MTA forwards each message to a central mail account, where the messages are spooled and purged after n days by cron. If you need a "pristine" copy of a message, you take the message from Sieve, e.g. via the antispam plugin, determine from the Message-Id, the Received headers and so on which "pristine" copy is meant, and use the one from the central store. That also sidesteps changes made to the message by delivery itself, because you know that this account has no Sieve script, and you can remove the last Received header and so on.
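The lookup step could be sketched like this in Python. It is a simplified illustration that matches on the Message-ID header only, whereas the suggestion above also mentions Received headers. Message-IDs are not guaranteed to be present or unique, so a real matcher would need further checks. It also assumes the central store is a directory with one message per file. The names message_id, index_store and find_pristine are hypothetical.

```python
from email import policy
from email.parser import BytesHeaderParser
from pathlib import Path
from typing import Dict, Optional


def message_id(raw: bytes) -> Optional[str]:
    """Return the stripped Message-ID header value, or None if absent."""
    headers = BytesHeaderParser(policy=policy.compat32).parsebytes(raw)
    value = headers.get("Message-ID")  # header lookup is case-insensitive
    return str(value).strip() if value else None


def index_store(store: Path) -> Dict[str, Path]:
    """Map each Message-ID found in the central store to its file."""
    index: Dict[str, Path] = {}
    for path in sorted(store.iterdir()):
        if path.is_file():
            mid = message_id(path.read_bytes())
            if mid:
                # Keep the first file seen for a given Message-ID.
                index.setdefault(mid, path)
    return index


def find_pristine(delivered: bytes, index: Dict[str, Path]) -> Optional[Path]:
    """Find the stored copy that shares the delivered message's Message-ID."""
    mid = message_id(delivered)
    return index.get(mid) if mid else None
```

The delivered message may carry extra headers added by SpamAssassin or Sieve, but its Message-ID is normally unchanged, which is why it can serve as the key here.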
I thought fileinto :copy might do what I wanted, but this creates a backup directory individually for each user. That's unmanageable for the spam training process I use. redirect *could* work, but that adds a header during the process so the email saved would not be "pristine".
If you think an early "sieve_before" script will do, try a hidden namespace, add write-only ACLs for everyone on one mailbox, and "fileinto :copy" there.
The hidden namespace should keep it out of sight of most users, who would otherwise ask questions.
Steffen Kaiser
participants (4): Ben Johnson, Jeff Rice, Robert Schetterer, Steffen Kaiser