[Dovecot] dovecot 2.2.0 corrupts mailboxes?
Hi
On April 17th, I upgraded from dovecot 2.1.13 to 2.2.0. Since then, two different users have reported a total of three incidents of messages disappearing from their mailboxes.
The mailbox format is mbox on a local FFS filesystem (no NFS), and I use filesystem quotas (but both users are far from filling their quotas). When messages disappeared, it was always a whole range of dates. In the last reported incident, the user also saw some messages duplicated many times.
There is something interesting in the logs:
May 4 20:16:30 volanges dovecot: imap(jdoe): Error: Cached message size smaller than expected (2000 < 8063)
May 4 20:16:30 volanges dovecot: imap(jdoe): Error: Corrupted index cache file /mail/indexes/jdoe/.imap/INBOX/dovecot.index.cache: Broken physical size for mail UID 141869
May 4 20:19:48 volanges dovecot: imap(jdoe): Error: Cached message size smaller than expected (9711 < 16248)
May 4 20:19:48 volanges dovecot: imap(jdoe): Error: Corrupted index cache file /mail/indexes/jdoe/mail/.imap/Arxiv/dovecot.index.cache: Broken physical size for mail UID 4383
May 4 21:14:35 volanges dovecot: imap(jdoe): Error: Cached message size smaller than expected (1878 < 8066)
May 4 21:14:35 volanges dovecot: imap(jdoe): Error: Corrupted index cache file /mail/indexes/jdoe/mail/.imap/CNRS/dovecot.index.cache: Broken physical size for mail UID 290
May 4 21:15:17 volanges dovecot: imap(jdoe): Error: Cached message size smaller than expected (17285 < 24440)
May 4 21:15:17 volanges dovecot: imap(jdoe): Error: Corrupted index cache file /mail/indexes/jdoe/mail/.imap/Commandes/dovecot.index.cache: Broken physical size for mail UID 680
Does that ring a bell? I am tempted to downgrade to 2.1.13. Does it make sense? Is it safe to do so?
-- Emmanuel Dreyfus http://hcpnet.free.fr/pubz manu@netbsd.org
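In the log excerpt above, each "Cached message size smaller than expected" error is immediately followed by a "Corrupted index cache file" error for the same user. To list which cache files and UIDs were affected, and by how many bytes the cached size fell short, a small Python sketch like this one could help. It assumes the two messages always appear back to back per user, as they do in this excerpt; concurrent sessions could interleave them. `parse` and `CacheError` are hypothetical names, not part of Dovecot.

```python
import re
from dataclasses import dataclass
from typing import Iterable, List

# Patterns for the two error messages quoted in the log excerpt above.
SIZE_RE = re.compile(
    r"imap\((?P<user>[^)]+)\): Error: Cached message size smaller than expected "
    r"\((?P<cached>\d+) < (?P<expected>\d+)\)"
)
CACHE_RE = re.compile(
    r"imap\((?P<user>[^)]+)\): Error: Corrupted index cache file (?P<path>\S+): "
    r"Broken physical size for mail UID (?P<uid>\d+)"
)


@dataclass
class CacheError:
    user: str
    path: str
    uid: int
    cached: int
    expected: int

    @property
    def missing_bytes(self) -> int:
        # How much smaller the cached size is than the size Dovecot expected.
        return self.expected - self.cached


def parse(lines: Iterable[str]) -> List[CacheError]:
    # Assumption: a size error is followed by the cache error for the same user.
    pending = {}  # user -> (cached, expected) from the latest size error
    errors = []
    for line in lines:
        m = SIZE_RE.search(line)
        if m:
            pending[m["user"]] = (int(m["cached"]), int(m["expected"]))
            continue
        m = CACHE_RE.search(line)
        if m and m["user"] in pending:
            cached, expected = pending.pop(m["user"])
            errors.append(
                CacheError(m["user"], m["path"], int(m["uid"]), cached, expected)
            )
    return errors


LOG = """\
May 4 20:16:30 volanges dovecot: imap(jdoe): Error: Cached message size smaller than expected (2000 < 8063)
May 4 20:16:30 volanges dovecot: imap(jdoe): Error: Corrupted index cache file /mail/indexes/jdoe/.imap/INBOX/dovecot.index.cache: Broken physical size for mail UID 141869
May 4 20:19:48 volanges dovecot: imap(jdoe): Error: Cached message size smaller than expected (9711 < 16248)
May 4 20:19:48 volanges dovecot: imap(jdoe): Error: Corrupted index cache file /mail/indexes/jdoe/mail/.imap/Arxiv/dovecot.index.cache: Broken physical size for mail UID 4383
"""

for err in parse(LOG.splitlines()):
    # Print the cache file, the affected UID and the size shortfall in bytes.
    print(err.path, err.uid, err.missing_bytes)
```

Pairing by user rather than by adjacent line keeps the sketch tolerant of other users' log lines appearing in between, though it still cannot disambiguate two concurrent sessions of the same user.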
On 05.05.2013 02:56, Emmanuel Dreyfus wrote:
Does that ring a bell? I am tempted to downgrade to 2.1.13. Does it make sense? Is it safe to do so?
This bug has been fixed with dovecot 2.1.14.
Please check:
http://hg.dovecot.org/dovecot-2.1/rev/0b0399f1b6aa
http://dovecot.org/list/dovecot/2013-February/088313.html
Best regards,
Morten
On Mon, May 06, 2013 at 04:20:52PM +0200, Morten Stevens wrote:
May 4 21:15:17 volanges dovecot: imap(jdoe): Error: Corrupted index cache file /mail/indexes/jdoe/mail/.imap/Commandes/dovecot.index.cache: Broken physical size for mail UID 680
Does that ring a bell? I am tempted to downgrade to 2.1.13. Does it make sense? Is it safe to do so?
This bug has been fixed with dovecot 2.1.14.
But I am running 2.2.0 ...
-- Emmanuel Dreyfus manu@netbsd.org
On 05/ 6/13 11:55 AM, Emmanuel Dreyfus wrote:
On Mon, May 06, 2013 at 04:20:52PM +0200, Morten Stevens wrote:
May 4 21:15:17 volanges dovecot: imap(jdoe): Error: Corrupted index cache file /mail/indexes/jdoe/mail/.imap/Commandes/dovecot.index.cache: Broken physical size for mail UID 680
Does that ring a bell? I am tempted to downgrade to 2.1.13. Does it make sense? Is it safe to do so?
This bug has been fixed with dovecot 2.1.14.
But I am running 2.2.0 ...
Have you tried 2.2.1?
On Mon, May 06, 2013 at 01:52:55PM -0400, Oscar del Rio wrote:
Have you tried 2.2.1?
Will do, but since the problem cannot be reliably reproduced, I have no way of knowing whether it is fixed. Is there anything in the 2.2.1 changelog that hints it could be fixed?
-- Emmanuel Dreyfus manu@netbsd.org
On 5.5.2013, at 3.56, Emmanuel Dreyfus manu@netbsd.org wrote:
On April 17th, I upgraded from dovecot 2.1.13 to 2.2.0. Since then, two different users have reported a total of three incidents of messages disappearing from their mailboxes.
The mailbox format is mbox on a local FFS filesystem (no NFS), and I use filesystem quotas (but both users are far from filling their quotas). When messages disappeared, it was always a whole range of dates. In the last reported incident, the user also saw some messages duplicated many times.
There are some locking code changes between v2.1 and v2.2, which I guess might be buggy. But I can't reproduce any corruption with stress testing. What's your doveconf -n output? Are you delivering mails via dovecot-lda or something external?
Does that ring a bell? I am tempted to downgrade to 2.1.13. Does it make sense? Is it safe to do so?
If you downgrade, I recommend the latest v2.1.
On Wed, May 15, 2013 at 02:50:55PM +0300, Timo Sirainen wrote:
There are some locking code changes between v2.1 and v2.2, which I guess might be buggy. But I can't reproduce any corruption with stress testing. What's your doveconf -n output? Are you delivering mails via dovecot-lda or something external?
dovecot -n output is below. Dovecot handles delivery itself, through LMTP.
Additional thoughts on possible problems:
- one of the users was using mutt locally and accessed their mailbox directly, without going through Dovecot.
- I experimented with dsync replication from another machine that was not accessible through POP/IMAP/SMTP; perhaps this is what caused the chaos?
auth_mechanisms = plain login
disable_plaintext_auth = no
first_valid_uid = 400
mail_location = mbox:~/mail:INBOX=/var/mail/%u:INDEX=/mail/indexes/%u:SUBSCRIPTIONS=../.mailboxlist
mbox_very_dirty_syncs = yes
passdb {
  args = max_requests=1 cache_key=%u dovecot
  driver = pam
}
passdb {
  args = /etc/dovecot-ldap.conf
  driver = ldap
}
plugin {
  autosubscribe = INBOX
  quota = fs:User quota
  quota_warning = storage=95%% quota-warning %u
}
quota_full_tempfail = yes
service anvil {
  client_limit = 1639
}
service auth {
  client_limit = 1736
  user = root
}
service imap-login {
  chroot = login
  process_limit = 1024
}
service imap {
  process_limit = 680
}
service lmtp {
  process_min_avail = 5
  unix_listener lmtp {
    group = smmsp
    mode = 0660
  }
}
service pop3-login {
  chroot = login
  process_limit = 512
}
service pop3 {
  process_limit = 680
}
service quota-warning {
  executable = script /usr/local/sbin/morts
  unix_listener quota-warning {
    mode = 0666
  }
  user = root
}
ssl_ca =
-- Emmanuel Dreyfus manu@netbsd.org
On 15.5.2013, at 20.33, Emmanuel Dreyfus manu@netbsd.org wrote:
On Wed, May 15, 2013 at 02:50:55PM +0300, Timo Sirainen wrote:
There are some locking code changes between v2.1 and v2.2, which I guess might be buggy. But I can't reproduce any corruption with stress testing. What's your doveconf -n output? Are you delivering mails via dovecot-lda or something external?
dovecot -n is below. dovecot takes care of delivery, through LMTP.
Additional thoughts on possible problems:
- one of the users was using mutt locally and accessed its mailbox directly without going through dovecot.
That shouldn't cause problems if locking was configured the same.
- I experimented with dsync replication from another machine that was not accessible through POP/IMAP/SMTP; perhaps this is what caused the chaos?
That might cause trouble. I tested today and dsync was doing some strange things with mbox.
On Wed, May 15, 2013 at 09:36:54PM +0300, Timo Sirainen wrote:
- one of the users was using mutt locally and accessed its mailbox directly without going through dovecot. That shouldn't cause problems if locking was configured the same.
I never checked, but I assume they both use flock or fcntl since this is local storage. It worked fine for a while, so there is no hint that it could be wrong.
- I experimented with dsync replication from another machine that was not accessible through POP/IMAP/SMTP; perhaps this is what caused the chaos? That might cause trouble. I tested today and dsync was doing some strange things with mbox.
What is the recommended setup? Here is the additional config I tried on the inaccessible host:
mail_plugins = $mail_plugins notify replication
service replicator {
  process_min_avail = 1
}
dsync_remote_cmd = ssh -lroot %{host} doveadm dsync-server -u%u
plugin {
  mail_replica = remote:root@server1.example.net
}
service aggregator {
  fifo_listener replication-notify-fifo {
    user = dovecot
  }
  unix_listener replication-notify {
    user = dovecot
  }
}
service replicator {
  unix_listener replicator-doveadm {
    mode = 0600
  }
}
service replicator {
  unix_listener replicator-doveadm {
    mode = 0600
  }
}
service doveadm {
  inet_listener {
    port = 12345
    ssl = yes
  }
}
doveadm_port = 12345
ssl_client_ca_file = /etc/openssl/certs/tcs-chain.crt
doveadm_proxy_port = 0
-- Emmanuel Dreyfus manu@netbsd.org
On 2013-05-15 9:01 PM, Emmanuel Dreyfus manu@netbsd.org wrote:
On Wed, May 15, 2013 at 09:36:54PM +0300, Timo Sirainen wrote:
- one of the users was using mutt locally and accessed its mailbox directly without going through dovecot. That shouldn't cause problems if locking was configured the same.
I never looked at it, but I assume they both use flock or fcntl
Can't help with your actual problem, but...
What was it that 'assumption' is supposedly the mother of?
;)
--
Best regards,
Charles
On Thu, May 16, 2013 at 06:37:45AM -0400, Charles Marcus wrote:
I never looked at it, but I assume they both use flock or fcntl
Can't help with your actual problem, but... What was it that 'assumption' is supposedly the mother of?
I don't buy that explanation: everything worked fine for years.
-- Emmanuel Dreyfus manu@netbsd.org
On 2013-05-16 2:25 PM, Emmanuel Dreyfus manu@netbsd.org wrote:
On Thu, May 16, 2013 at 06:37:45AM -0400, Charles Marcus wrote:
I never looked at it, but I assume they both use flock or fcntl Can't help with your actual problem, but... What was it that 'assumption' is supposedly the mother of? I don't buy that explanation: everything worked fine for years.
You miss the point entirely.
--
Best regards,
Charles
On 16.5.2013, at 4.01, Emmanuel Dreyfus manu@netbsd.org wrote:
- I experimented with dsync replication from another machine that was not accessible through POP/IMAP/SMTP; perhaps this is what caused the chaos? That might cause trouble. I tested today and dsync was doing some strange things with mbox.
What is the advised setup?
Not using mbox, at least with dsync, at least for now.
participants (6)
- Charles Marcus
- Emmanuel Dreyfus
- manu@netbsd.org
- Morten Stevens
- Oscar del Rio
- Timo Sirainen