Corrupted sizes in cache once again
Hi,
I run a fairly large Dovecot Installation (around 100k mailboxes) on several servers.
gzip compression is on.
Every once in a while I get the dreaded "cache corruption" messages in the log:
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3868: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674129792.M797543P21755.node2,S=8099,W=8276:2,)) failed: Cached message size smaller than expected (2877 < 8099, box=INBOX, UID=3868)
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3875: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674212201.M985809P29112.node2,S=13907,W=14121:2,)) failed: Cached message size smaller than expected (5533 < 8192, box=INBOX, UID=3875)
The first entry shows 2877 (size on disk) vs. 8099 (real size unzipped, also in the filename: S=8099).
The second entry shows 5533 (size on disk) vs. 8192 - this is not correct in any way. Size on disk is 13907 as noted in the filename.
Both mails were delivered trough LMTP and retrieved by the POP3 service.
Anyone with an idea what might be happening here? I've read all available info in the doc and in the previous discussions / bug reports, but nothing seems to match my case. And where does that 8192 come from - it looks suspicious?
Version is 2.3.7.2 (Ubuntu 20.04)
doveconf -n excerpt:
protocol lmtp { mail_plugins = " old_stats mail_log notify zlib sieve quota" } protocol imap { mail_max_userip_connections = 50 mail_plugins = " old_stats mail_log notify zlib imap_old_stats quota imap_quota imap_sieve" } protocol pop3 { mail_max_userip_connections = 50 mail_plugins = " old_stats mail_log notify zlib" }
Tim
I would like to add some more observations:
I hav another mailbox with one mail's size being saved wrong in the cache file:
doveadm dump:
RECORD: seq=40, uid=1202, flags=0x00 - ext 5 cache : 8504 (38210000) - ext 6 vsize : 145625 (d9380200) : vsize = 5166845802713 : highest_uid = 0 : message_count = 8716 - cache offset=8504 size=212, prev_offset = 0 - hdr.From: 42: From: [redacted] - hdr.Message-ID: 45: Message-ID: [redacted] - hdr.Subject: 44: Subject: [redacted] - flags: (20000000)
RECORD: seq=41, uid=1203, flags=0x00 - ext 5 cache : 8716 (0c220000) - ext 6 vsize : 105141 (b59a0100) : vsize = 105141 : highest_uid = 0 : message_count = 0 - cache offset=8716 size=280, prev_offset = 0 - hdr.From: 104: From: [redacted] - hdr.Message-ID: 108: Message-ID: [redacted] - hdr.Subject: 105: Subject: [redacted] - flags: (20000000)
The message with seq=40 is the mail dovecot complains about:
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 1202: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/new/1675176154.M829333P14495.node2,S=232816,W=236045)) failed: Cached message size smaller than expected (145082 < 146788, box=INBOX, UID=1202)
Size on disk is 145082
Size after decompression is 232816 (as expected)
Message with seq=41 is correct with both vsize in cache and decompressed size being equal.
Before the POP3 access that produced the error both mails were only ever touched by the dovecot lmtp delivery process.
Am 01.02.23 um 14:13 schrieb Tim Evers:
Hi,
I run a fairly large Dovecot Installation (around 100k mailboxes) on several servers.
gzip compression is on.
Every once in a while I get the dreaded "cache corruption" messages in the log:
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3868: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674129792.M797543P21755.node2,S=8099,W=8276:2,)) failed: Cached message size smaller than expected (2877 < 8099, box=INBOX, UID=3868)
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3875: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674212201.M985809P29112.node2,S=13907,W=14121:2,)) failed: Cached message size smaller than expected (5533 < 8192, box=INBOX, UID=3875)
The first entry shows 2877 (size on disk) vs. 8099 (real size unzipped, also in the filename: S=8099).
The second entry shows 5533 (size on disk) vs. 8192 - this is not correct in any way. Size on disk is 13907 as noted in the filename.
Both mails were delivered trough LMTP and retrieved by the POP3 service.
Anyone with an idea what might be happening here? I've read all available info in the doc and in the previous discussions / bug reports, but nothing seems to match my case. And where does that 8192 come from - it looks suspicious?
Version is 2.3.7.2 (Ubuntu 20.04)
doveconf -n excerpt:
protocol lmtp { mail_plugins = " old_stats mail_log notify zlib sieve quota" } protocol imap { mail_max_userip_connections = 50 mail_plugins = " old_stats mail_log notify zlib imap_old_stats quota imap_quota imap_sieve" } protocol pop3 { mail_max_userip_connections = 50 mail_plugins = " old_stats mail_log notify zlib" }
Tim
On 2023-02-01, Tim Evers <te-ml-ext@artfiles.de> wrote:
I run a fairly large Dovecot Installation (around 100k mailboxes) on several servers.
gzip compression is on.
Every once in a while I get the dreaded "cache corruption" messages in the log:
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3868: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674129792.M797543P21755.node2,S=8099,W=8276:2,)) failed: Cached message size smaller than expected (2877 < 8099, box=INBOX, UID=3868)
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3875: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674212201.M985809P29112.node2,S=13907,W=14121:2,)) failed: Cached message size smaller than expected (5533 < 8192, box=INBOX, UID=3875)
The first entry shows 2877 (size on disk) vs. 8099 (real size unzipped, also in the filename: S=8099).
The second entry shows 5533 (size on disk) vs. 8192 - this is not correct in any way. Size on disk is 13907 as noted in the filename.
Both mails were delivered trough LMTP and retrieved by the POP3 service.
Anyone with an idea what might be happening here? I've read all available info in the doc and in the previous discussions / bug reports, but nothing seems to match my case. And where does that 8192 come from - it looks suspicious?
Version is 2.3.7.2 (Ubuntu 20.04)
2.3.7.2 is rather old now. There were definitely fixes regarding compression around the 2.3.10-2.3.12 timeframe or thereabouts (I forget all the details but it took a release or two before some remaining issues were sorted out after changes in the area). I'd be looking to get it updated to a current version first.
On 02/02/2023 17:19 EET Stuart Henderson <stu@spacehopper.org> wrote:
On 2023-02-01, Tim Evers <te-ml-ext@artfiles.de> wrote:
I run a fairly large Dovecot Installation (around 100k mailboxes) on several servers.
gzip compression is on.
Every once in a while I get the dreaded "cache corruption" messages in the log:
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3868: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674129792.M797543P21755.node2,S=8099,W=8276:2,)) failed: Cached message size smaller than expected (2877 < 8099, box=INBOX, UID=3868)
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3875: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674212201.M985809P29112.node2,S=13907,W=14121:2,)) failed: Cached message size smaller than expected (5533 < 8192, box=INBOX, UID=3875)
The first entry shows 2877 (size on disk) vs. 8099 (real size unzipped, also in the filename: S=8099).
The second entry shows 5533 (size on disk) vs. 8192 - this is not correct in any way. Size on disk is 13907 as noted in the filename.
Both mails were delivered trough LMTP and retrieved by the POP3 service.
Anyone with an idea what might be happening here? I've read all available info in the doc and in the previous discussions / bug reports, but nothing seems to match my case. And where does that 8192 come from - it looks suspicious?
Version is 2.3.7.2 (Ubuntu 20.04)
2.3.7.2 is rather old now. There were definitely fixes regarding compression around the 2.3.10-2.3.12 timeframe or thereabouts (I forget all the details but it took a release or two before some remaining issues were sorted out after changes in the area). I'd be looking to get it updated to a current version first.
For bug reports, we do ask that you try to reproduce it with 2.3.20 (current latest), you can get packages from https://repo.dovecot.org/ and would be nice if you can provide steps to reproduce this issue.
Also please see https://www.dovecot.org/bugreport-mail/ if you have not yet.
Aki
Am 02.02.23 um 16:23 schrieb Aki Tuomi:
On 02/02/2023 17:19 EET Stuart Henderson <stu@spacehopper.org> wrote:
On 2023-02-01, Tim Evers <te-ml-ext@artfiles.de> wrote:
I run a fairly large Dovecot Installation (around 100k mailboxes) on several servers.
gzip compression is on.
Every once in a while I get the dreaded "cache corruption" messages in the log:
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3868: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674129792.M797543P21755.node2,S=8099,W=8276:2,)) failed: Cached message size smaller than expected (2877 < 8099, box=INBOX, UID=3868)
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3875: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674212201.M985809P29112.node2,S=13907,W=14121:2,)) failed: Cached message size smaller than expected (5533 < 8192, box=INBOX, UID=3875)
The first entry shows 2877 (size on disk) vs. 8099 (real size unzipped, also in the filename: S=8099).
The second entry shows 5533 (size on disk) vs. 8192 - this is not correct in any way. Size on disk is 13907 as noted in the filename.
Both mails were delivered trough LMTP and retrieved by the POP3 service.
Anyone with an idea what might be happening here? I've read all available info in the doc and in the previous discussions / bug reports, but nothing seems to match my case. And where does that 8192 come from - it looks suspicious?
Version is 2.3.7.2 (Ubuntu 20.04) 2.3.7.2 is rather old now. There were definitely fixes regarding compression around the 2.3.10-2.3.12 timeframe or thereabouts (I forget all the details but it took a release or two before some remaining issues were sorted out after changes in the area). I'd be looking to get it updated to a current version first. For bug reports, we do ask that you try to reproduce it with 2.3.20 (current latest), you can get packages from https://repo.dovecot.org/ and would be nice if you can provide steps to reproduce this issue.
Also please see https://www.dovecot.org/bugreport-mail/ if you have not yet.
Aki
This is not a bug report (yet) - I am asking for ideas to narrow down the issue myself. I would not expect something this obvious and prevalent being a bug in a 10+ years old subsystem (zlib).
Tim
On 2023-02-02, Tim Evers <te-ml-ext@artfiles.de> wrote:
Am 02.02.23 um 16:23 schrieb Aki Tuomi:
For bug reports, we do ask that you try to reproduce it with 2.3.20 (current latest), you can get packages from https://repo.dovecot.org/ and would be nice if you can provide steps to reproduce this issue.
Also please see https://www.dovecot.org/bugreport-mail/ if you have not yet.
Aki
This is not a bug report (yet) - I am asking for ideas to narrow down the issue myself. I would not expect something this obvious and prevalent being a bug in a 10+ years old subsystem (zlib).
Some of the errors seen that were fixed in the period that I mentioned looked *exactly* like this.
Hi again,
so this is the actual bug report :)
I installed 2.3.20 from repo yet the errors persist.
I made the following observations:
For a mailbox producing the "broken physical size" messages the culprit seems not to be the index.cache file but the dovecot-uidlist file. Removing the cache file does nothing regarding the bug while removing the dovecot-uidlist file immediately cleared the errors and all subsequent access went through without error.
An example of an uidlist entry *with* errors:
70755 G1674471014.M447254P8688.node2 S937 :1674471014.M447254P8688.node2,S=1693,W=1730
Note: the S937 entry. This is definitely wrong.
Same mail *without* the errors (dovecot-uidlist auto-recreated after deletion):
76040 :1674471014.M447254P8688.node2,S=1693,W=1730
Errors in log (domain anonymized):
Feb 15 21:02:57 node2 dovecot: imap(local_user@domain.com)<36929><Ee47fcL0UM3UNbQr>: Error: Mailbox INBOX.Trash: Deleting corrupted cache record uid=70755: UID 70755: Broken physical size in mailbox INBOX.Trash: read(compress(/home/mail/domains/v/a/domain.com/local_user/Maildir/.Trash/cur/1674471014.M447254P8688.node2,S=1693,W=1730:2,)) failed: Cached message size smaller than expected (937 < 1693, box=INBOX.Trash, UID=70755) Feb 15 21:02:57 node2 dovecot: imap(local_user@domain.com)<36929><Ee47fcL0UM3UNbQr>: Error: Mailbox INBOX.Trash: UID=70755: read(compress(/home/mail/domains/v/a/domain.com/local_user/Maildir/.Trash/cur/1674471014.M447254P8688.node2,S=1693,W=1730:2,)) failed: Cached message size smaller than expected (937 < 1693, box=INBOX.Trash, UID=70755) (read reason=)
Note: reported physical size from "cache", 937 is the compressed size on disk. Decompressed size is 1693.
Excerpt from strace (full strace attached):
36929 openat(AT_FDCWD, "/home/mail/domains/v/a/domain.com/local_user/Maildir/.Trash/dovecot.index.cache", O_RDWR) = -1 ENOENT (No such file or directory) 36929 openat(AT_FDCWD, "/home/mail/domains/v/a/domain.com/local_user/Maildir/.Trash/cur/1674471014.M447254P8688.node2,S=1693,W=1730:2,", O_RDONLY) = 20 36929 fstat(20, {st_mode=S_IFREG|0600, st_size=937, ...}) = 0 36929 pread64(20, "\37\213\10\0\0\0\0\0\0\3\255\224]O\3438\30\205\257\311\257\260\346\252hI\326\316g\223\231\2I\232\360!\272Ci\27\255\26\241\310M\334\326C\342tl\247|\374\372y\303\262\2\26\0207\233H\215\353\274\366y\217\237\243\\\260\222\361-\253\"\264\224m\203\32\312\353\266\323\266E\245^\362\232)\253bhpeS\214#\262\f\206\21]b;\212\354\360z\327\330Y\334?\326saclU\345F1\271e\262_q\313\365\0321\325\350\215B\203\371\331\214XN\224\245\343\343\254\230e\351\271\355\371\27\244(.fqq>\233\365O\230?\216a\272(\342lV\364\203\243t\22\301\23T\6\331\35o\220k\205\256e\303_^!\262"..., 8192, 0) = 937 36929 openat(AT_FDCWD, "/home/mail/domains/v/a/domain.com/local_user/Maildir/.Trash/dovecot.index.cache", O_RDWR) = -1 ENOENT (No such file or directory) 36929 openat(AT_FDCWD, "/home/mail/domains/v/a/domain.com/local_user/Maildir/.Trash/dovecot.index.cache", O_RDWR) = -1 ENOENT (No such file or directory) 36929 alarm(180) = 0 36929 fcntl(16, F_SETLKW, {l_type=F_WRLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0 36929 alarm(0) = 180 36929 stat("/home/mail/domains/v/a/domain.com/local_user/Maildir/.Trash/dovecot.index.log", {st_mode=S_IFREG|0600, st_size=144, ...}) = 0 36929 fstat(16, {st_mode=S_IFREG|0600, st_size=144, ...}) = 0 36929 write(16, "\200\200\200\203\0\0\10\0208\0\0\0\200\200\200\207@\0\0\20\2\0\0\0002\315\321\\\0\0\0\0\4\0\4\0\1\0\0\0\200\200\200\204\0\2\0\20c\24\1\0\0\0\0\0", 56) = 56 36929 fcntl(16, F_SETLK, {l_type=F_UNLCK, l_whence=SEEK_SET, l_start=0, l_len=0}) = 0 36929 stat("/home/mail/domains/v/a/domain.com/local_user/Maildir/.Trash/dovecot.index.log", {st_mode=S_IFREG|0600, st_size=200, ...}) = 0 36929 fstat(16, {st_mode=S_IFREG|0600, st_size=200, ...}) = 0 36929 write(2, "\1\30436929 47 imap(local_user@domain.com)<36929><Ee47fcL0UM3UNbQr>: Mailbox INBOX.Trash: Deleting corrupted cache record uid=70755: UID 70755: Broken physical size in mailbox INBOX.Trash: read(compress(/home/ma"..., 389) = 389
No crash, no core, no backtrace.
That's what I found so far. I have no idea how the S937 made it into the uidlist file. The email itself seems to be uploaded (not imap copied) through Outlook, which mangled the received headers in the process. Don't know if that is of any significance, but they are no longer in chronologically descending order.
Please not that in the config zlib is off for lmtp to reduce the impact of this issue. It was on when this mail was delivered.
I will be more than happy to provide further information.
Thanks
Tim
Am 02.02.23 um 16:23 schrieb Aki Tuomi:
On 02/02/2023 17:19 EET Stuart Henderson <stu@spacehopper.org> wrote:
On 2023-02-01, Tim Evers <te-ml-ext@artfiles.de> wrote:
I run a fairly large Dovecot Installation (around 100k mailboxes) on several servers.
gzip compression is on.
Every once in a while I get the dreaded "cache corruption" messages in the log:
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3868: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674129792.M797543P21755.node2,S=8099,W=8276:2,)) failed: Cached message size smaller than expected (2877 < 8099, box=INBOX, UID=3868)
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3875: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674212201.M985809P29112.node2,S=13907,W=14121:2,)) failed: Cached message size smaller than expected (5533 < 8192, box=INBOX, UID=3875)
The first entry shows 2877 (size on disk) vs. 8099 (real size unzipped, also in the filename: S=8099).
The second entry shows 5533 (size on disk) vs. 8192 - this is not correct in any way. Size on disk is 13907 as noted in the filename.
Both mails were delivered trough LMTP and retrieved by the POP3 service.
Anyone with an idea what might be happening here? I've read all available info in the doc and in the previous discussions / bug reports, but nothing seems to match my case. And where does that 8192 come from - it looks suspicious?
Version is 2.3.7.2 (Ubuntu 20.04) 2.3.7.2 is rather old now. There were definitely fixes regarding compression around the 2.3.10-2.3.12 timeframe or thereabouts (I forget all the details but it took a release or two before some remaining issues were sorted out after changes in the area). I'd be looking to get it updated to a current version first. For bug reports, we do ask that you try to reproduce it with 2.3.20 (current latest), you can get packages from https://repo.dovecot.org/ and would be nice if you can provide steps to reproduce this issue.
Also please see https://www.dovecot.org/bugreport-mail/ if you have not yet.
Aki
Something to try, this all could be happening because of underlying disk failure on the array it is running on. If this is a VM, can you move the operation to another host or data store to rule out hardware issues?
On 2/2/2023 9:19 AM, Stuart Henderson wrote:
On 2023-02-01, Tim Evers <te-ml-ext@artfiles.de> wrote:
I run a fairly large Dovecot Installation (around 100k mailboxes) on several servers.
gzip compression is on.
Every once in a while I get the dreaded "cache corruption" messages in the log:
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3868: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674129792.M797543P21755.node2,S=8099,W=8276:2,)) failed: Cached message size smaller than expected (2877 < 8099, box=INBOX, UID=3868)
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3875: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674212201.M985809P29112.node2,S=13907,W=14121:2,)) failed: Cached message size smaller than expected (5533 < 8192, box=INBOX, UID=3875)
The first entry shows 2877 (size on disk) vs. 8099 (real size unzipped, also in the filename: S=8099).
The second entry shows 5533 (size on disk) vs. 8192 - this is not correct in any way. Size on disk is 13907 as noted in the filename.
Both mails were delivered trough LMTP and retrieved by the POP3 service.
Anyone with an idea what might be happening here? I've read all available info in the doc and in the previous discussions / bug reports, but nothing seems to match my case. And where does that 8192 come from - it looks suspicious?
Version is 2.3.7.2 (Ubuntu 20.04) 2.3.7.2 is rather old now. There were definitely fixes regarding compression around the 2.3.10-2.3.12 timeframe or thereabouts (I forget all the details but it took a release or two before some remaining issues were sorted out after changes in the area). I'd be looking to get it updated to a current version first.
-- Christopher Wensink IS Administrator Five Star Plastics, Inc 1339 Continental Drive Eau Claire, WI 54701 Office: 715-831-1682 Mobile: 715-563-3112 Fax: 715-831-6075 cwensink@five-star-plastics.com www.five-star-plastics.com
Could even be memory. I had once on an office machine a faulty memory module (without ecc), and it caused the md5sum from files on truecrypt usb backup drives to change constantly. Removed the module, and no more issues.
Something to try, this all could be happening because of underlying disk failure on the array it is running on. If this is a VM, can you move the operation to another host or data store to rule out hardware issues?
On 2/2/2023 9:19 AM, Stuart Henderson wrote:
On 2023-02-01, Tim Evers <te-ml-ext@artfiles.de> wrote:
I run a fairly large Dovecot Installation (around 100k mailboxes) on several servers.
gzip compression is on.
Every once in a while I get the dreaded "cache corruption" messages in the log:
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3868: Broken physical size in mailbox INBOX:
read(zlib(/[redacted]/Maildir/cur/1674129792.M797543P21755.node2,S=8099,W=8276 :2,))
failed: Cached message size smaller than expected (2877 < 8099, box=INBOX, UID=3868)
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3875: Broken physical size in mailbox INBOX:
read(zlib(/[redacted]/Maildir/cur/1674212201.M985809P29112.node2,S=13907,W=141 21:2,))
failed: Cached message size smaller than expected (5533 < 8192, box=INBOX, UID=3875)
The first entry shows 2877 (size on disk) vs. 8099 (real size unzipped, also in the filename: S=8099).
The second entry shows 5533 (size on disk) vs. 8192 - this is not correct in any way. Size on disk is 13907 as noted in the filename.
Both mails were delivered trough LMTP and retrieved by the POP3 service.
Anyone with an idea what might be happening here? I've read all available info in the doc and in the previous discussions / bug reports, but nothing seems to match my case. And where does that 8192 come from - it looks suspicious?
Version is 2.3.7.2 (Ubuntu 20.04) 2.3.7.2 is rather old now. There were definitely fixes regarding compression around the 2.3.10-2.3.12 timeframe or thereabouts (I forget all the details but it took a release or two before some remaining issues were sorted out after changes in the area). I'd be looking to get it updated to a current version first.
Good point - these are 8 diferrent DRBD clusters. I failed over one testing this theory. Problem persists.
So I would rule out underlying issues.
Especially since the "wrong" value is suspiciously often the on-disk size rather than a random value one would expect if there is corruption underneath.
Tim
Am 02.02.23 um 16:43 schrieb Christopher Wensink:
Something to try, this all could be happening because of underlying disk failure on the array it is running on. If this is a VM, can you move the operation to another host or data store to rule out hardware issues?
On 2/2/2023 9:19 AM, Stuart Henderson wrote:
I run a fairly large Dovecot Installation (around 100k mailboxes) on several servers.
gzip compression is on.
Every once in a while I get the dreaded "cache corruption" messages in the log:
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3868: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674129792.M797543P21755.node2,S=8099,W=8276:2,))
failed: Cached message size smaller than expected (2877 < 8099, box=INBOX, UID=3868)
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3875: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674212201.M985809P29112.node2,S=13907,W=14121:2,))
failed: Cached message size smaller than expected (5533 < 8192, box=INBOX, UID=3875)
The first entry shows 2877 (size on disk) vs. 8099 (real size unzipped, also in the filename: S=8099).
The second entry shows 5533 (size on disk) vs. 8192 - this is not correct in any way. Size on disk is 13907 as noted in the filename.
Both mails were delivered trough LMTP and retrieved by the POP3 service.
Anyone with an idea what might be happening here? I've read all available info in the doc and in the previous discussions / bug reports, but nothing seems to match my case. And where does that 8192 come from - it looks suspicious?
Version is 2.3.7.2 (Ubuntu 20.04) 2.3.7.2 is rather old now. There were definitely fixes regarding compression around the 2.3.10-2.3.12 timeframe or thereabouts (I forget all the
On 2023-02-01, Tim Evers <te-ml-ext@artfiles.de> wrote: details but it took a release or two before some remaining issues were sorted out after changes in the area). I'd be looking to get it updated to a current version first.
Can you isolate the problem account on a separate VM to see if the problem follows the account or the original vm?
Chris
On 2/2/2023 9:58 AM, Tim Evers wrote:
Good point - these are 8 diferrent DRBD clusters. I failed over one testing this theory. Problem persists.
So I would rule out underlying issues.
Especially since the "wrong" value is suspiciously often the on-disk size rather than a random value one would expect if there is corruption underneath.
Tim
Am 02.02.23 um 16:43 schrieb Christopher Wensink:
Something to try, this all could be happening because of underlying disk failure on the array it is running on. If this is a VM, can you move the operation to another host or data store to rule out hardware issues?
On 2/2/2023 9:19 AM, Stuart Henderson wrote:
I run a fairly large Dovecot Installation (around 100k mailboxes) on several servers.
gzip compression is on.
Every once in a while I get the dreaded "cache corruption" messages in the log:
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3868: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674129792.M797543P21755.node2,S=8099,W=8276:2,))
failed: Cached message size smaller than expected (2877 < 8099, box=INBOX, UID=3868)
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3875: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674212201.M985809P29112.node2,S=13907,W=14121:2,))
failed: Cached message size smaller than expected (5533 < 8192, box=INBOX, UID=3875)
The first entry shows 2877 (size on disk) vs. 8099 (real size unzipped, also in the filename: S=8099).
The second entry shows 5533 (size on disk) vs. 8192 - this is not correct in any way. Size on disk is 13907 as noted in the filename.
Both mails were delivered trough LMTP and retrieved by the POP3 service.
Anyone with an idea what might be happening here? I've read all available info in the doc and in the previous discussions / bug reports, but nothing seems to match my case. And where does that 8192 come from - it looks suspicious?
Version is 2.3.7.2 (Ubuntu 20.04) 2.3.7.2 is rather old now. There were definitely fixes regarding compression around the 2.3.10-2.3.12 timeframe or thereabouts (I forget all the
On 2023-02-01, Tim Evers <te-ml-ext@artfiles.de> wrote: details but it took a release or two before some remaining issues were sorted out after changes in the area). I'd be looking to get it updated to a current version first.
-- Christopher Wensink IS Administrator Five Star Plastics, Inc 1339 Continental Drive Eau Claire, WI 54701 Office: 715-831-1682 Mobile: 715-563-3112 Fax: 715-831-6075 cwensink@five-star-plastics.com www.five-star-plastics.com
Maybe I was a bit unclear: I have about 1000 error messages per day from random accounts (about 500 in total so far) on all clusters. These are transparent to the user, so it's more like background noise at the moment.
No VM involved. All machines are baremetal DRBD two-node clusters.
As far as I see it I can not nail it down to specific accounts, POP3 vs. IMAP, LMTP delivery vs. IMAP store or Sieve vs. non-Sieve etc.
Tim
Am 02.02.23 um 17:55 schrieb Christopher Wensink:
Can you isolate the problem account on a separate VM to see if the problem follows the account or the original vm?
Chris
On 2/2/2023 9:58 AM, Tim Evers wrote:
Good point - these are 8 diferrent DRBD clusters. I failed over one testing this theory. Problem persists.
So I would rule out underlying issues.
Especially since the "wrong" value is suspiciously often the on-disk size rather than a random value one would expect if there is corruption underneath.
Tim
Am 02.02.23 um 16:43 schrieb Christopher Wensink:
Something to try, this all could be happening because of underlying disk failure on the array it is running on. If this is a VM, can you move the operation to another host or data store to rule out hardware issues?
On 2/2/2023 9:19 AM, Stuart Henderson wrote:
I run a fairly large Dovecot Installation (around 100k mailboxes) on several servers.
gzip compression is on.
Every once in a while I get the dreaded "cache corruption" messages in the log:
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3868: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674129792.M797543P21755.node2,S=8099,W=8276:2,))
failed: Cached message size smaller than expected (2877 < 8099, box=INBOX, UID=3868)
Error: Corrupted record in index cache file /[redacted]/Maildir/dovecot.index.cache: UID 3875: Broken physical size in mailbox INBOX: read(zlib(/[redacted]/Maildir/cur/1674212201.M985809P29112.node2,S=13907,W=14121:2,))
failed: Cached message size smaller than expected (5533 < 8192, box=INBOX, UID=3875)
The first entry shows 2877 (size on disk) vs. 8099 (real size unzipped, also in the filename: S=8099).
The second entry shows 5533 (size on disk) vs. 8192 - this is not correct in any way. Size on disk is 13907 as noted in the filename.
Both mails were delivered trough LMTP and retrieved by the POP3 service.
Anyone with an idea what might be happening here? I've read all available info in the doc and in the previous discussions / bug reports, but nothing seems to match my case. And where does that 8192 come from - it looks suspicious?
Version is 2.3.7.2 (Ubuntu 20.04) 2.3.7.2 is rather old now. There were definitely fixes regarding compression around the 2.3.10-2.3.12 timeframe or thereabouts (I forget all the
On 2023-02-01, Tim Evers <te-ml-ext@artfiles.de> wrote: details but it took a release or two before some remaining issues were sorted out after changes in the area). I'd be looking to get it updated to a current version first.
Maybe I was a bit unclear: I have about 1000 error messages per day from random accounts (about 500 in total so far) on all clusters. These are transparent to the user, so it's more like background noise at the moment.
Do you have ecc memory?
No VM involved. All machines are baremetal DRBD two-node clusters.
How old are your drives? Do you scrub the raids? How reliable is your drbd setup? Does drbd even sync raid fixes? Do you have networking issues?
Connect a hdd directly and use that for a few accounts, do you still have the problem? I will bet not.
Why do you have this bare metal? Do you need performance? Otherwise switch to reliable storage that is a bit less performing. You will get headaches from these multiple drbd setups.
As far as I see it I can not nail it down to specific accounts, POP3 vs. IMAP, LMTP delivery vs. IMAP store or Sieve vs. non-Sieve etc.
It is not dovecot, otherwise it would be here more often listed.
participants (5)
-
Aki Tuomi
-
Christopher Wensink
-
Marc
-
Stuart Henderson
-
Tim Evers