http://dovecot.org/releases/1.1/rc/dovecot-1.1.rc8.tar.gz http://dovecot.org/releases/1.1/rc/dovecot-1.1.rc8.tar.gz.sig
I then decided to add the deliver -c feature to this release. It seems to work in my tests, but who knows if it breaks something, although most of the code is called only if the -c parameter is given. Anyway, we really should have a comprehensive test suite written some day (yes, help is really wanted for this :).
So let's hope this is the last RC release. If there aren't any major problems I'll release v1.1.0 in a couple of weeks.
I'll also try to merge all my different development trees into a single v1.2 code tree within a few weeks. v1.2.0 will probably be released this summer as well, since it mainly has new features that don't change existing code all that much (CONDSTORE is a bit invasive though).
+ deliver: Added -c parameter to provide path to delivered mail.
This allows maildir to save identical mails to multiple recipients
using hard links.
- rc6/rc7 broke POP3 with non-Maildir formats
- mbox: Saving a message without a body or the end-of-headers line
could have caused an assert-crash later.
- Several dbox fixes
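To illustrate the hard-link idea behind the new deliver parameter, here is a minimal Python sketch. It is not Dovecot's code: the function name `deliver_with_hardlinks` and the directory layout are hypothetical, and it skips the usual maildir tmp/ to new/ step. It only shows that one file on disk can appear in several recipients' directories without its data being copied.

```python
import os
import tempfile


def deliver_with_hardlinks(source_path, recipient_dirs):
    """Link one already-written mail file into each recipient directory.

    Hypothetical helper: every recipient gets a new directory entry that
    points at the same inode, so the message data is stored only once.
    """
    delivered = []
    name = os.path.basename(source_path)
    for directory in recipient_dirs:
        os.makedirs(directory, exist_ok=True)
        target = os.path.join(directory, name)
        os.link(source_path, target)  # same inode, no data copied
        delivered.append(target)
    return delivered


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "incoming.eml")
        with open(src, "wb") as f:
            f.write(b"Subject: hi\r\n\r\nbody\r\n")
        dirs = [os.path.join(root, user, "new") for user in ("alice", "bob")]
        for path in deliver_with_hardlinks(src, dirs):
            print(path, os.stat(path).st_ino)
```

Hard links only work within one filesystem, so this assumes all recipients' mail directories live on the same volume as the source file.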
Hi
- deliver: Added -c parameter to provide path to delivered mail. This allows maildir to save identical mails to multiple recipients using hard links.
Funnily enough it was on my todo list to whip up a small perl program to go and scan my maildirs and figure out if this theoretical idea actually amounted to anything.
Algorithm would be this:
Open each message, scan for first blank line. SHA the rest of the message, store the SHA in a hash (along with the message size) rinse and repeat and see if we end up with any hashes showing count greater than 1...
This would represent the best case that we could achieve assuming body content fixed and we find some way to manage variable headers.
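The scan described above could be sketched roughly as follows, in Python rather than Perl. The function names are made up for illustration, and the sketch assumes a standard maildir layout where messages live in `cur` and `new` subdirectories.

```python
import hashlib
import os
from collections import defaultdict


def body_digest(path):
    """Return (SHA-1 hex digest, size) of everything after the first blank line."""
    h = hashlib.sha1()
    size = 0
    in_body = False
    with open(path, "rb") as f:
        for line in f:
            if in_body:
                h.update(line)
                size += len(line)
            elif line in (b"\n", b"\r\n"):
                in_body = True  # first blank line ends the headers
    return h.hexdigest(), size


def find_duplicate_bodies(root):
    """Map (digest, size) to the list of files sharing that body, duplicates only."""
    groups = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        if os.path.basename(dirpath) not in ("cur", "new"):
            continue
        for name in files:
            path = os.path.join(dirpath, name)
            groups[body_digest(path)].append(path)
    return {key: paths for key, paths in groups.items() if len(paths) > 1}
```

Summing `size * (len(paths) - 1)` over the result gives the bytes that would be saved in the best case. A message with no blank line at all is treated as having an empty body.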
Next up is to use a mime parser and SHA each message part. Same idea, assuming we used some kind of format to store each part individually, how much gain is this really worth in terms of storage (looks tempting up front, condense all those duplicated jokes, etc - however, does it really bear out in practice...).
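A rough way to estimate the per-part gain, again as a hedged Python sketch: it uses the standard library `email` parser and hashes the decoded payload of each leaf part, so identical attachments match even when their transfer encoding differs. The function names are hypothetical, and real storage would hold the encoded bytes, so the number is an estimate only.

```python
import hashlib
from collections import Counter
from email import message_from_bytes


def part_digests(raw_message):
    """Return a list of (SHA-1 hex digest, size) for every non-multipart MIME part."""
    msg = message_from_bytes(raw_message)
    digests = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no content of their own
        payload = part.get_payload(decode=True) or b""
        digests.append((hashlib.sha1(payload).hexdigest(), len(payload)))
    return digests


def duplicate_part_bytes(raw_messages):
    """Bytes that single-instance part storage could save across the given messages."""
    counts = Counter()
    for raw in raw_messages:
        counts.update(part_digests(raw))
    return sum(size * (n - 1) for (_digest, size), n in counts.items() if n > 1)
```

Comparing this figure with the whole-body result shows whether splitting messages into parts is worth the extra complexity on a given mail store.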
I think MS Exchange only does single instance storage like you describe here with delivery time hardlinking of messages? Never analysed what that was worth (back when I had an Exchange system to fiddle with...)
I have a feeling that gzip compression of files would be worth more than this hardlinking (on many but not all mail systems...)
Ed W
On Mon, 2008-06-02 at 23:25 +0100, Ed W wrote:
Hi
- deliver: Added -c parameter to provide path to delivered mail. This allows maildir to save identical mails to multiple recipients using hard links.
Funnily enough it was on my todo list to whip up a small perl program to go and scan my maildirs and figure out if this theoretical idea actually amounted to anything.
Algorithm would be this:
Open each message, scan for first blank line. SHA the rest of the message, store the SHA in a hash (along with the message size) rinse and repeat and see if we end up with any hashes showing count greater than 1...
This would represent the best case that we could achieve assuming body content fixed and we find some way to manage variable headers.
Somewhat faster way would be to get a list of file sizes first and not bother checksumming any files which have a unique size.
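A minimal sketch of that shortcut in Python, with hypothetical function names: files are grouped by size first, and only groups with more than one member are ever read and checksummed. Note that this compares whole files, so it only finds messages that are byte-for-byte identical.

```python
import hashlib
import os
from collections import defaultdict


def sha1_of_file(path):
    """SHA-1 of a whole file, read in chunks to keep memory use flat."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def duplicates_by_size_then_hash(paths):
    """Return lists of paths whose contents are identical."""
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for size, group in by_size.items():
        if len(group) < 2:
            continue  # unique size: cannot have a duplicate, skip the checksum
        for path in group:
            by_hash[(size, sha1_of_file(path))].append(path)
    return [group for group in by_hash.values() if len(group) > 1]
```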
Next up is to use a mime parser and SHA each message part. Same idea, assuming we used some kind of format to store each part individually, how much gain is this really worth in terms of storage (looks tempting up front, condense all those duplicated jokes, etc - however, does it really bear out in practice...).
This is in my dbox TODO list (not near future though).
I think MS Exchange only does single instance storage like you describe here with delivery time hardlinking of messages? Never analysed what that was worth (back when I had an Exchange system to fiddle with...)
No idea about Exchange, but dbmail 2.3 does single instance MIME part storing.
I have a feeling that gzip compression of files would be worth more than this hardlinking (on many but not all mail systems...)
Or you could use both. zlib plugin already supports this with maildir.
Timo Sirainen wrote:
On Mon, 2008-06-02 at 23:25 +0100, Ed W wrote:
Hi
- deliver: Added -c parameter to provide path to delivered mail. This allows maildir to save identical mails to multiple recipients using hard links.
Funnily enough it was on my todo list to whip up a small perl program to go and scan my maildirs and figure out if this theoretical idea actually amounted to anything.
Algorithm would be this:
Open each message, scan for first blank line. SHA the rest of the message, store the SHA in a hash (along with the message size) rinse and repeat and see if we end up with any hashes showing count greater than 1...
This would represent the best case that we could achieve assuming body content fixed and we find some way to manage variable headers.
Somewhat faster way would be to get a list of file sizes first and not bother checksumming any files which have a unique size.
Could do, but I was trying to expand to the case where the headers are different but the body is the same (e.g. I suspect that mailing list managers might deliver emails one by one (VERP), but the body is not customised). Anyway, I just wanted to checksum the body of the message, not the whole message.
Actually the motivation for this was I was wondering about the benefit of a storage backend where the body was stored per file and the headers were stored separately (perhaps in a maildir type format). I haven't looked to see if this is what dbox does already...
I have been looking at git and brackup for backing up maildirs and it's got me thinking a bit more about mail storage algorithms
Ed W
On Tue, Jun 03, 2008 at 07:11:33AM +0100, Ed W wrote:
Could do, but I was trying to expand to the case that the headers were different, but the body was the same (eg I suspect that mailing list managers might deliver emails one by one (verp), but the body is not customised. Anyway, just wanted to checksum the body of the message not the whole message
That could lead to slight problems, like hardlinking totally unrelated messages, e.g. empty messages. Some Headers like From:, To:, Date:, Subject: should probably be identical.
For some consistency, just removing *locally* generated trace headers before fingerprinting might lead to better results.
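One possible reading of this suggestion, sketched in Python. Everything specific here is an assumption: the function name, the `local_host` parameter, the list of headers treated as always local, and the simple "by <host>" substring test for Received lines. The fingerprint covers the remaining headers plus the body, so unrelated messages with the same (for example empty) body do not collide.

```python
import hashlib
import re
from email import message_from_bytes

# Assumption: headers that the local delivery chain typically adds.
ALWAYS_LOCAL = {"delivered-to", "return-path", "x-original-to"}


def fingerprint_without_local_trace(raw_message, local_host):
    """SHA-1 over the body and all headers except locally generated trace headers."""
    m = re.search(rb"\r?\n\r?\n", raw_message)
    head = raw_message[:m.start()] if m else raw_message
    body = raw_message[m.end():] if m else b""

    h = hashlib.sha1()
    for name, value in message_from_bytes(head + b"\n").items():
        lname = name.lower()
        if lname in ALWAYS_LOCAL:
            continue
        # Naive check: a Received line written by our own host names it after "by".
        if lname == "received" and ("by " + local_host) in str(value):
            continue
        h.update(f"{lname}:{value}\n".encode("utf-8", "surrogateescape"))
    h.update(b"\n")
    h.update(body)
    return h.hexdigest()
```

Received headers added by remote hosts are kept, so two copies that took different routes still get different fingerprints under this scheme.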
Jost
| Help eradicate spam! HTML in mail is impolite. | | Postmaster, JAPH, sometimes soothsayer at the RUB computing centre | | True words are not pleasing, pleasing words are not true. | | Lao Tse, Tao Te King 81 |
On Tue, Jun 03, 2008 at 10:27:32AM +0200, Jost Krieger wrote:
On Tue, Jun 03, 2008 at 07:11:33AM +0100, Ed W wrote:
Could do, but I was trying to expand to the case that the headers were different, but the body was the same (eg I suspect that mailing list managers might deliver emails one by one (verp), but the body is not customised. Anyway, just wanted to checksum the body of the message not the whole message
That could lead to slight problems, like hardlinking totally unrelated messages, e.g. empty messages. Some Headers like From:, To:, Date:, Subject: should probably be identical.
Message-ID perhaps? :-)
For some consistency, just removing *locally* generated trace headers before fingerprinting might lead to better results.
That may still leave identical messages not hard-linked thus wasting space. Eg. if they come from MTA's that do recipient splitting, or messages that are routed via different systems. The Received headers will be different but the body generally identical.
I think a better solution is what was suggested here before, i.e. to keep the (unique) message headers in a Maildir-like format, containing links to (single-instance stored) message bodies in a separate location.
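A toy version of such a layout, as a Python sketch under stated assumptions: the `X-Body-Ref:` marker line, the function names and the directory arguments are all invented for illustration and are not an existing Dovecot or maildir convention. Bodies are stored once, named by their SHA-1, and each per-user header file starts with a line that points at its body.

```python
import hashlib
import os
import re

REF_PREFIX = b"X-Body-Ref: "  # hypothetical marker line


def split_message(raw):
    """Split raw message bytes into (header block, body) at the first blank line."""
    m = re.search(rb"\r?\n\r?\n", raw)
    if m is None:
        return raw, b""
    return raw[:m.start()], raw[m.end():]


def store_message(raw, user_dir, body_store, name):
    """Write headers to user_dir/name and the body once to body_store/<sha1>."""
    head, body = split_message(raw)
    digest = hashlib.sha1(body).hexdigest()
    os.makedirs(body_store, exist_ok=True)
    os.makedirs(user_dir, exist_ok=True)
    body_path = os.path.join(body_store, digest)
    if not os.path.exists(body_path):  # single instance: skip if already stored
        with open(body_path, "wb") as f:
            f.write(body)
    with open(os.path.join(user_dir, name), "wb") as f:
        f.write(REF_PREFIX + digest.encode("ascii") + b"\n" + head)
    return digest


def load_message(user_dir, body_store, name):
    """Return (header block, body) for a message written by store_message."""
    with open(os.path.join(user_dir, name), "rb") as f:
        ref_line, _, head = f.read().partition(b"\n")
    digest = ref_line[len(REF_PREFIX):].decode("ascii")
    with open(os.path.join(body_store, digest), "rb") as f:
        return head, f.read()
```

The sketch leaves out what a real implementation would need most: reference counting so a body is removed only when the last header file pointing at it is deleted, and safe concurrent writes.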
Geert
On Tue, Jun 03, 2008 at 10:45:20AM +0200, Geert Hendrickx wrote:
On Tue, Jun 03, 2008 at 10:27:32AM +0200, Jost Krieger wrote: ...
That could lead to slight problems, like hardlinking totally unrelated messages, e.g. empty messages. Some Headers like From:, To:, Date:, Subject: should probably be identical.
Message-ID perhaps? :-)
Yep, add that ...
For some consistency, just removing *locally* generated trace headers before fingerprinting might lead to better results.
That may still leave identical messages not hard-linked thus wasting space. Eg. if they come from MTA's that do recipient splitting, or messages that are routed via different systems. The Received headers will be different but the body generally identical.
True, but these headers are quite important sometimes.
I think a better solution is what was suggested here before, i.e. to keep the (unique) message headers in a Maildir-like format, containing links to (single-instance stored) message bodies in a separate location.
Probably better, but to make this transparent for the users, it would need quite a bit of work in dovecot.
Jost
On Tue, 2008-06-03 at 07:11 +0100, Ed W wrote:
Actually the motivation for this was I was wondering about the benefit of a storage backend where the body was stored per file and the headers were stored separately (perhaps in a maildir type format). I haven't looked to see if this is what dbox does already...
dbox is half-designed to support this. It supports arbitrary metadata (unlike maildir) and I've already written 3 lines of code to get this implemented ;)
/* Pointer to external message data. Format is:
1*(<start offset> <byte count> <ref>) */
DBOX_METADATA_EXT_REF = 'X',
There's no code to actually read/write such metadata though. Also I'm not exactly sure what the <ref> is. Maybe just a filename used to store the data.
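Since no reader exists yet, the following Python sketch is purely speculative. It assumes the value is a whitespace-separated run of one or more `<start offset> <byte count> <ref>` triples, that the numbers are decimal, and that `<ref>` is a filename without spaces. None of that is settled by the comment above, and the names `ExtRef` and `parse_ext_ref` are made up.

```python
from typing import List, NamedTuple


class ExtRef(NamedTuple):
    start_offset: int  # where the external data begins within the message
    byte_count: int    # how many bytes the external data covers
    ref: str           # assumed here to be a filename holding the data


def parse_ext_ref(value: str) -> List[ExtRef]:
    """Parse 1*(<start offset> <byte count> <ref>) under the assumptions above."""
    fields = value.split()
    if not fields or len(fields) % 3 != 0:
        raise ValueError("expected one or more <offset> <count> <ref> triples")
    return [
        ExtRef(int(fields[i]), int(fields[i + 1]), fields[i + 2])
        for i in range(0, len(fields), 3)
    ]
```

For example, `parse_ext_ref("120 4096 body-1 4300 512 body-2")` yields two `ExtRef` entries, one per externally stored range.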
--On Tuesday, June 03, 2008 3:49 PM +0300 Timo Sirainen tss@iki.fi wrote:
dbox is half-designed to support this. It supports arbitrary metadata (unlike maildir) and I've already written 3 lines of code to get this implemented ;)
/* Pointer to external message data. Format is:
1*(<start offset> <byte count> <ref>) */
DBOX_METADATA_EXT_REF = 'X',
There's no code to actually read/write such metadata though. Also I'm not exactly sure what the <ref> is. Maybe just a filename used to store the data.
LOL, so I'm not the only one who designs like that. I think of it like sculpting: Throw some clay on the table and then scrape away anything that's not part of my objective. It's the right-brain side of programming. (And the hardest part.)
Timo Sirainen wrote:
http://dovecot.org/releases/1.1/rc/dovecot-1.1.rc8.tar.gz http://dovecot.org/releases/1.1/rc/dovecot-1.1.rc8.tar.gz.sig
I refreshed the ManageSieve patch for the new Dovecot v1.1 release:
http://www.rename-it.nl/dovecot/1.1/dovecot-1.1.rc8-managesieve-0.10.2.diff.... http://www.rename-it.nl/dovecot/1.1/dovecot-1.1.rc8-managesieve-0.10.2.diff....
Regards,
Stephan.
Timo Sirainen wrote:
- deliver: Added -c parameter to provide path to delivered mail. This allows maildir to save identical mails to multiple recipients using hard links.
Now I tried this, with some trouble.
I had to set "maildir_copy_with_hardlinks = yes" for "deliver" to pick it up, even though this is supposed to be the default.
Also, the W=nnnn size thing is not added to filenames when using -p.
Cheers, Anders.
On Wed, 2008-06-04 at 11:37 +0200, Anders wrote:
Timo Sirainen wrote:
- deliver: Added -c parameter to provide path to delivered mail. This allows maildir to save identical mails to multiple recipients using hard links.
Now I tried this, with some trouble.
I had to set "maildir_copy_with_hardlinks = yes" for "deliver" to pick it up, even though this is supposed to be the default.
deliver uses separate config parsing code. Looks like all boolean settings which have "yes" default are "no" as default in deliver. I'll fix it today by adding more kludges.. Hopefully v2.0 will come soon with its unified config parsing code. :)
Also, the W=nnnn size thing is not added to filenames when using -p.
And cache isn't updated either. These are because hard linking can be done without actually reading the mail contents. I won't fix this for v1.1, but I updated the documentation.
I'm not sure what the best final solution to this is though. There could of course be a special deliver-check when the reading is done, but the COPY command has the same problem. Should it read the files or not? If the file contents are already in memory it would be a good idea to read them and update the cache, but otherwise not. I guess mincore() is the only potential way to check that, but mmaping the file only to check that is probably more trouble than it's worth.
Timo Sirainen wrote:
On Wed, 2008-06-04 at 11:37 +0200, Anders wrote:
Timo Sirainen wrote:
- deliver: Added -c parameter to provide path to delivered mail. This allows maildir to save identical mails to multiple recipients using hard links.
[...]
Also, the W=nnnn size thing is not added to filenames when using -p.
And cache isn't updated either. These are because hard linking can be done without actually reading the mail contents. I won't fix this for v1.1, but I updated the documentation.
I'm not sure what the best final solution to this is though. There could of course be a special deliver-check when the reading is done, but COPY command has the same problem. Should it read the files or not? If the file contents are already in memory it would be a good idea to read them and update cache, but otherwise not. I guess mincore() is the only potential way to check that, but mmaping the file only to check that is probably more trouble than worth.
For the delivery case, the mail will obviously be in memory, as we have just written it to a temporary file. Is it more than a few lines of code to add an index update to deliver.c after the hardlink? I might want to have that as a local patch.
As an alternative, can I call something from my wrapper script to have the index updated after delivery? I guess this will be impossible in the general case, as there is no way to know where Sieve decided to put the mail.
I am not sure how important the update is in our case, anyway. Most people have the MUA open all day, and I guess a client in IDLE will fetch the headers and have the index updated immediately, right?
Regards, Anders.
On Wed, 2008-06-04 at 12:09 +0200, Anders wrote:
Also, the W=nnnn size thing is not added to filenames when using -p.
And cache isn't updated either. These are because hard linking can be done without actually reading the mail contents. I won't fix this for v1.1, but I updated the documentation.
I'm not sure what the best final solution to this is though. There could of course be a special deliver-check when the reading is done, but COPY command has the same problem. Should it read the files or not? If the file contents are already in memory it would be a good idea to read them and update cache, but otherwise not. I guess mincore() is the only potential way to check that, but mmaping the file only to check that is probably more trouble than worth.
For the delivery case, the mail will obviously be in memory, as we have just written it to a temporary file. Is it more than a few lines of code to add an index update to deliver.c after the hardlink? I might want to have that as a local patch.
I'm not sure. Maybe it wouldn't be too difficult.
As an alternative, can I call something from my wrapper script to have the index updated after delivery? I guess this will be impossible in the general case, as there is no way to know where Sieve decided to put the mail.
You could do e.g.:
1 fetch * bodystructure
But this assumes that the client is interested in bodystructure, otherwise it pollutes the cache with an unnecessary field. You could also check if:
1 search * body asdf
updates cache. I vaguely remember that it does, but I'm not sure. idxview anyway shows what exists in cache.
I am not sure how important the update is in our case, anyway. Most people have the MUA open all day, and I guess a client in IDLE will fetch the headers and have the index updated immediately, right?
With most clients, yes.
On Wed, 4 Jun 2008, Timo Sirainen wrote:
Also, the W=nnnn size thing is not added to filenames when using -p.
And cache isn't updated either. These are because hard linking can be done without actually reading the mail contents. I won't fix this for v1.1, but I updated the documentation.
Could you transfer existing attributes from the source file(name) to the destination filename?
Bye,
Steffen Kaiser
On Wed, 2008-06-04 at 14:38 +0200, Steffen Kaiser wrote:
Also, the W=nnnn size thing is not added to filenames when using -p.
And cache isn't updated either. These are because hard linking can be done without actually reading the mail contents. I won't fix this for v1.1, but I updated the documentation.
Could you transfer existing attributes from the source file(name) to the destination filename?
Looks like S=n is copied and W=n could also be copied with minimal trouble. I'll add to my TODO.
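Carrying the size fields over to a hard-linked copy could look something like this Python sketch. It relies on the Dovecot maildir filename convention where `,S=<n>` (file size) and `,W=<n>` (size with CRLF line endings) sit in the base name before the `:2,` flags. The function names are hypothetical and this is not the code Dovecot uses.

```python
import re

_SIZE_FIELD = re.compile(r",([SW])=(\d+)")


def size_fields(filename):
    """Return {'S': n, 'W': n} for the size fields present in a maildir filename."""
    base = filename.split(":", 1)[0]  # ignore the ":2,<flags>" suffix
    return {key: int(value) for key, value in _SIZE_FIELD.findall(base)}


def copy_size_fields(src_name, dst_name):
    """Add the S= and W= fields of src_name to dst_name where they are missing."""
    wanted = size_fields(src_name)
    present = size_fields(dst_name)
    base, sep, flags = dst_name.partition(":")
    for key in ("S", "W"):
        if key in wanted and key not in present:
            base += f",{key}={wanted[key]}"
    return base + sep + flags
```

Because a hard link shares its contents with the source, both sizes are known to be valid for the destination without reading the file.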
Noticed an occasional strange problem that appeared after upgrading to RC6. Using Thunderbird on XP I empty the trash and everything in trash folder vanishes. But then it reappears in the message totals. But if I actually click on the folder there's nothing there. Not sure if this is an RC6 bug but I thought I'd mention it just in case.
participants (9): Anders, Ed W, Geert Hendrickx, Jost Krieger, Kenneth Porter, Marc Perkel, Steffen Kaiser, Stephan Bosch, Timo Sirainen