Dovecot and data migration
Hi,
Our legacy data store keeps a single copy of a message regardless of the number of mailboxes in which that message resides. It does this by creating hard links to the message in each mailbox that contains it.
When we migrate the data to the target server (Dovecot) with imapsync, each mailbox receives its own separate copy of the message. We use the maildir storage format.
With a small message store, this means that a lot of messages are duplicated unnecessarily. How can we reduce the size of the message store when identical messages are stored more than once?
Does a relinking function exist, and can it be run in real time? How can we configure Dovecot to deduplicate for all users, using a hash to determine whether the file already exists?
Thanks in advance.
Memo
On Wed, 6 May 2015, Alain BERNARD wrote:
> Our legacy data store retains a single copy of a message regardless of the number of mailboxes in which that message resides. It does this by creating hard links to that message in the mailboxes containing that message.
> Thus, when we perform data migration to the server target (Dovecot), the copies of the same message are copied over with the migration process (imapsync). We use the storage format maildir.
> With a small message store, this means that a lot of messages are duplicated unnecessarily. How to reduce message store size due to duplicate storage of identical messages ?
Dovecot has no function that does this. For the synchronisation, you can easily come up with a filesystem-level script that does it.
> Does a relinking function exist and can be run in real-time mode ? how can we configure Dovecot to deduplicate for all users using a hash to determine whether the file could be already exist ?
In Dovecot v1 I did this with an external script that hard linked identical files in the cur and new directories when they resided in more than 10 or so mailboxes.
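The original script is not shown in this thread, so the following is only a minimal sketch of the idea, under these assumptions: all maildirs live under one root directory on one filesystem, the files are owned by a single mail user, and nothing is delivering into the tree while it runs. The function names and the `.relink-tmp` suffix are made up for illustration.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's full contents."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def iter_message_files(root):
    """Yield regular files located directly in a directory named cur or new."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if os.path.basename(dirpath) in ("cur", "new"):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    yield path


def relink_duplicates(root, dry_run=True):
    """Hard link byte-identical messages; return how many files were (or would be) relinked."""
    groups = defaultdict(list)
    for path in iter_message_files(root):
        st = os.stat(path)
        # Group by device as well, because hard links cannot cross filesystems.
        groups[(st.st_dev, st.st_size, file_digest(path))].append(path)

    relinked = 0
    for paths in groups.values():
        keeper = paths[0]
        keeper_ino = os.stat(keeper).st_ino
        for dup in paths[1:]:
            if os.stat(dup).st_ino == keeper_ino:
                continue  # already a link to the same inode
            if not dry_run:
                tmp = dup + ".relink-tmp"  # hypothetical temporary name
                os.link(keeper, tmp)       # new link next to the duplicate
                os.replace(tmp, dup)       # atomically swap it into place
            relinked += 1
    return relinked
```

Calling `relink_duplicates("/store/vmail")` only counts candidates; pass `dry_run=False` to actually relink. Keep in mind that hard-linked files share one inode, so modification time, owner and permissions become common to every mailbox that holds the message.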
But in the production phase with Dovecot v2 you will face a catch: with LMTP all messages are now different, because of the per-user Delivered-To header and the final Received header. With deliver this does not happen, but your MTA may then add different headers, because LDAs are usually called once per recipient. Dovecot deliver has the "-p" option to optionally hard link the delivered message file to the file given as the argument of -p. But then you need some scripting so that your MTA calls that script for all final recipients. You should also check whether Sieve is compatible with -p, because I remember some bug reports.
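As a rough illustration of the "some scripting" part, here is a sketch of a wrapper that calls deliver once per final recipient with the same message file. It uses only the `-p` and `-d` options and the binary path that appear in this thread. How the MTA hands the message file and the recipient list to such a wrapper is not covered here, and the function names are hypothetical.

```python
import subprocess

# Path as used in this thread; adjust it for your installation.
DELIVER = "/usr/libexec/dovecot/deliver"


def build_commands(message_path, recipients, deliver=DELIVER):
    """Return one argv list per recipient, all pointing at the same message file."""
    return [[deliver, "-p", message_path, "-d", user] for user in recipients]


def deliver_to_all(message_path, recipients, run=subprocess.run):
    """Run the delivery agent once per recipient; return the recipients that failed."""
    recipients = list(recipients)
    failed = []
    for user, argv in zip(recipients, build_commands(message_path, recipients)):
        result = run(argv)
        if result.returncode != 0:
            failed.append(user)
    return failed
```

A call such as `deliver_to_all("/path/to/message", ["fredb", "gregk"])` returns an empty list when every delivery exits with status 0. The `run` parameter exists only so the wrapper can be exercised without a real Dovecot installation.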
Steffen Kaiser
Thank you.
Indeed, Postfix adds an individual Delivered-To: header line with the final envelope recipient address, in order to stop mail forwarding loops as early as possible. This is a real problem when a message has multiple recipients and we try to find exact duplicates by comparing the hash values of the messages.
To run a test, I used the -p parameter:
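One way to still recognise such copies is to hash the message while skipping the headers that differ per recipient. The sketch below is only an illustration: it ignores Delivered-To and, optionally, the topmost Received header, which are the two headers named in this thread. The `ignored` tuple is an assumption and may need more entries depending on what your MTA adds.

```python
import hashlib


def normalized_digest(raw, ignored=(b"delivered-to:",), drop_first_received=True):
    """SHA-256 of a raw message (bytes), skipping per-recipient headers."""
    lines = raw.splitlines(keepends=True)
    headers, body_start = [], len(lines)
    for i, line in enumerate(lines):
        if line in (b"\n", b"\r\n"):      # blank line ends the header section
            body_start = i
            break
        if line[:1] in (b" ", b"\t") and headers:
            headers[-1] += line           # folded continuation of the previous header
        else:
            headers.append(line)

    h = hashlib.sha256()
    received_dropped = not drop_first_received
    for block in headers:
        lower = block.lower()
        if lower.startswith(ignored):
            continue                      # per-recipient header, not hashed
        if not received_dropped and lower.startswith(b"received:"):
            received_dropped = True       # skip only the topmost Received header
            continue
        h.update(block)
    for line in lines[body_start:]:
        h.update(line)
    return h.hexdigest()
```

Two copies with the same normalized digest are not byte-identical on disk. Hard linking them would make every recipient see the headers of whichever copy is kept, so this digest is better suited to reporting how much duplication exists than to blind relinking.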
# /usr/libexec/dovecot/deliver -p tempfile -d fredb
# /usr/libexec/dovecot/deliver -p tempfile -d gregk

# ll /usr/libexec/dovecot/deliver
lrwxrwxrwx 1 root mail 11 16 févr. 09:15 /usr/libexec/dovecot/deliver -> dovecot-lda

# ll /store/vmail/gam/fredb/Maildir/cur/1430986604.M985408P7547.mail6.domain.org\,S\=1037\,W\=1059\:2\,a
-rw------- 1 vmail vmail 1037 7 mai 10:16 /store/vmail/gam/fredb/Maildir/cur/1430986604.M985408P7547.mail6.domain.org,S=1037,W=1059:2,a
However, the file isn't hard linked. fredb and gregk each have the same message, but the link count of the file is 1 rather than 2, and the two files have different inode numbers.
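The same check can be scripted. This small sketch compares two paths and also reports whether they sit on the same filesystem, since a hard link can never span two filesystems. The function name is made up for illustration.

```python
import os


def link_report(path_a, path_b):
    """Compare two files by device, inode and link count."""
    a, b = os.stat(path_a), os.stat(path_b)
    same_fs = a.st_dev == b.st_dev
    return {
        "same_filesystem": same_fs,
        "same_inode": same_fs and a.st_ino == b.st_ino,
        "link_counts": (a.st_nlink, b.st_nlink),
    }
```

If `same_filesystem` is false for the temporary file and the target maildir, no hard link between them is possible, whatever the delivery agent does.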
Regards,
2015-05-07 8:12 GMT+02:00 Steffen Kaiser <skdovecot@smail.inf.fh-brs.de>: