[Dovecot] dsync, hard-links and refcounts
Hi,
When creating a copy of a mail, Dovecot can store its contents only once. In maildir this is done with hard links, while mdbox uses a reference-counting mechanism.
My question is, how can we convert mails from maildir to mdbox without duplicating these copies? It seems that dsync does not detect the hard links. Even if the hard-linked mails have the same GUID listed in dovecot-uidlist, dsync creates multiple instances of the text.
- Is there some way to make dsync notice the hard links? (I used dovecot 2.0.17)
- Alternatively, is there some tool to deduplicate the mdbox after it has been converted from maildir?
- NB: I am not talking about single-instance storage for attachments.
Thank you very much, Christoph
-- Christoph Bußenius Rechnerbetriebsgruppe der Fakultäten Informatik und Mathematik Technische Universität München +49 89-289-18519 <> Raum 00.05.055 <> Boltzmannstr. 3 <> Garching
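To check how much a maildir actually relies on hard links before converting it, the message files can be grouped by inode. The following is only a minimal sketch, not part of Dovecot or dsync: the function name `find_hardlinked_messages` is hypothetical, and it assumes the usual maildir layout where messages sit in `cur/` and `new/` subdirectories.

```python
import os
from collections import defaultdict


def find_hardlinked_messages(maildir_root):
    """Group message files under maildir_root by inode.

    Returns a list of path lists; each inner list holds two or more
    paths that are hard links to the same file on disk.
    """
    by_inode = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(maildir_root):
        # Assumption: messages live in the cur/ and new/ subdirectories.
        if os.path.basename(dirpath) not in ("cur", "new"):
            continue
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            # A link count above 1 means another name points at this file.
            if st.st_nlink > 1:
                by_inode[(st.st_dev, st.st_ino)].append(path)
    # Keep only inodes seen more than once inside this maildir tree.
    return [sorted(paths) for paths in by_inode.values() if len(paths) > 1]
```

Calling `find_hardlinked_messages("/path/to/Maildir")` (the path is a placeholder) returns one list per shared message, which gives a rough idea of how many copies a conversion would duplicate if the hard links are not detected.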
On 13.2.2012, at 16.16, Christoph Bußenius wrote:
> when creating a copy of a mail, dovecot provides a feature that will store its contents only once. In maildir, this is done by means of hard links, while mdbox has some special refcounting mechanism.
> My question is, how can we convert mails from maildir to mdbox without duplicating these copies? It seems that dsync does not detect the hard links. Even if the hard-linked mails have the same GUID listed in dovecot-uidlist, dsync creates multiple instances of the text.
> - Is there some way to make dsync notice the hard links? (I used dovecot 2.0.17)
It should deduplicate when GUIDs are the same. I guess I'll have to look into why it's not working.
On 02/13/2012 03:40 PM, Timo Sirainen wrote:
> It should deduplicate when GUIDs are the same.. I guess I'll have to look into why it's not working.
I’d very much appreciate that. We will have to migrate many large mailboxes, and it would be a pity to needlessly waste space.
I got the same result with several different configurations, using "mirror", "backup", or "-R backup" (though I have not tried 2.1 yet), so it should be easy to reproduce. However, let me know if you need my configuration or anything else.
Cheers, Christoph
On 13.2.2012, at 16.40, Timo Sirainen wrote:
> On 13.2.2012, at 16.16, Christoph Bußenius wrote:
> > when creating a copy of a mail, dovecot provides a feature that will store its contents only once. In maildir, this is done by means of hard links, while mdbox has some special refcounting mechanism.
> > My question is, how can we convert mails from maildir to mdbox without duplicating these copies? It seems that dsync does not detect the hard links. Even if the hard-linked mails have the same GUID listed in dovecot-uidlist, dsync creates multiple instances of the text.
> > - Is there some way to make dsync notice the hard links? (I used dovecot 2.0.17)
> It should deduplicate when GUIDs are the same.. I guess I'll have to look into why it's not working.
It worked when the GUID already existed somewhere in the destination, but not if it was added only during the same session. The attached patch fixes it. I'll commit it to v2.1 hg after I release v2.1.0.
On 15.02.2012 04:46, Timo Sirainen wrote:
> The attached patch fixes it. I'll commit it to v2.1 hg after I'll release v2.1.0..
Thanks. I guess it would be wise to upgrade our new Dovecot mail store to 2.1 before we migrate all our Courier maildir users to it...
Cheers, Christoph
On 15.02.2012 04:46, Timo Sirainen wrote:
> On 13.2.2012, at 16.40, Timo Sirainen wrote:
> > It should deduplicate when GUIDs are the same.. I guess I'll have to look into why it's not working.
> The attached patch fixes it. I'll commit it to v2.1 hg after I'll release v2.1.0..
After replacing "doveadm/dsync" with "dsync", the patch applied in 2.0.18 and works fine. (Is there any chance this will be in a 2.0 bugfix release?)
Cheers, Christoph
Hi,
On 15.02.2012 04:46, Timo Sirainen wrote:
> It worked when the GUID already existed somewhere in destination, but not if it was added only during the same session. The attached patch fixes it. I'll commit it to v2.1 hg after I'll release v2.1.0..
Sorry to bother you again, but I think there is a problem with this patch:
If a maildir contains several copies of the same message all in the same folder, dsync will not deduplicate them.
While IMAP cannot directly create copies of a message in the same folder, it does still happen if you copy (or move) a message back and forth between two folders.
Cheers, Christoph
On Tue, 2012-02-21 at 11:23 +0100, Christoph Bußenius wrote:
> Hi,
> On 15.02.2012 04:46, Timo Sirainen wrote:
> > It worked when the GUID already existed somewhere in destination, but not if it was added only during the same session. The attached patch fixes it. I'll commit it to v2.1 hg after I'll release v2.1.0..
> sorry to bother you again, but I think there is a problem with this patch:
> If a maildir contains several copies of the same message all in the same folder, dsync will not deduplicate them.
Correct. I nearly finished implementing this also, but then I thought it just makes the code unnecessarily complex for no good reason.
> While IMAP cannot directly create copies of a message in the same folder,
It can: SELECT INBOX, COPY 1 INBOX
> it does still happen if you copy (or move) a message back and forth between two folders.
Is it common enough to be an actual problem?
On 21.02.2012 12:04, Timo Sirainen wrote:
> On Tue, 2012-02-21 at 11:23 +0100, Christoph Bußenius wrote:
> > Hi,
> > On 15.02.2012 04:46, Timo Sirainen wrote:
> > > It worked when the GUID already existed somewhere in destination, but not if it was added only during the same session. The attached patch fixes it. I'll commit it to v2.1 hg after I'll release v2.1.0..
> > sorry to bother you again, but I think there is a problem with this patch:
> > If a maildir contains several copies of the same message all in the same folder, dsync will not deduplicate them.
> Correct. I nearly finished implementing this also, but then I thought it just makes the code unnecessarily complex for no good reason.
> > While IMAP cannot directly create copies of a message in the same folder,
> It can: SELECT INBOX, COPY 1 INBOX
Oh, mea culpa :)
> > it does still happen if you copy (or move) a message back and forth between two folders.
> Is it common enough to be an actual problem?
Actually we have some mailboxes with massively duplicated messages in the same folder. Of course I cannot tell how common it is in general. I could imagine that some people routinely copy all INBOX messages into archive folders and do not check whether the archive already contains these messages.
Apart from the waste of space, I was wondering: Is it okay for an mdbox to have several duplicate instances of a message with the same GUID? Might some kind of corruption arise from this?
Cheers, Christoph
On 21.2.2012, at 13.55, Christoph Bußenius wrote:
> Apart from the waste of space, I was wondering: Is it okay for an mdbox to have several duplicate instances of a message with the same GUID? Might some kind of corruption arise from this?
No corruption. And they might even become deduplicated if you do doveadm force-resync + purge.
On 21.02.2012 13:15, Timo Sirainen wrote:
> And they might even become deduplicated if you do doveadm force-resync + purge.
I hadn't tried that yet. Thanks for the hint, this is probably all we need.
Cheers, Christoph
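For reference, the "force-resync + purge" suggestion could be scripted roughly as below. This is a hedged sketch only: the helper names `dedup_commands` and `run_dedup` are hypothetical, the user name is a placeholder, and the argument layout (`doveadm force-resync -u <user> <mailbox>`, `doveadm purge -u <user>`) is assumed from the doveadm man pages, so check it against your installed version. As noted above, duplicates only "might" become deduplicated this way.

```python
import subprocess


def dedup_commands(user, mailbox="INBOX"):
    """Build the two doveadm invocations suggested in the thread.

    Assumption: naming a single mailbox such as INBOX is sufficient
    for force-resync on mdbox; adjust if your setup differs.
    """
    return [
        ["doveadm", "force-resync", "-u", user, mailbox],
        ["doveadm", "purge", "-u", user],
    ]


def run_dedup(user, dry_run=True):
    """Print the commands, or execute them when dry_run is False."""
    for cmd in dedup_commands(user):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if doveadm fails.
            subprocess.run(cmd, check=True)
```

Running `run_dedup("someuser")` only prints the commands; pass `dry_run=False` to execute them on a host where doveadm is installed. Comparing the mdbox storage size before and after shows whether anything was reclaimed.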
participants (2)
- Christoph Bußenius
- Timo Sirainen