mdbox / selecting folder accesses hundrets of others
Dovecot: 2.3.4
Hi, i am debugging a problem where after a force-resync removed all cache files (All files are 44 byte). Trying to open a folder takes literally minutes (Huge mailbox - 220K folder, 45GByte compressed storage)
After i straced the imap process i can see the imap process accesses hundrets of other completely unrelated folders:
Thats from the logfile - You can see mutt selecting the mailbox "zzzzzzt":
May 07 18:05:19 imap(inbox@wasteland)<20381><hlkrmL/B5qwgAwDvnxMdAGcc/5MSpYxQ>: Debug: fts: No fts setting - plugin disabled May 07 18:05:19 imap(inbox@wasteland)<20381><hlkrmL/B5qwgAwDvnxMdAGcc/5MSpYxQ>: Debug: quota: quota_over_flag check: quota_over_script unset - skipping May 07 18:05:19 imap(inbox@wasteland)<20381><hlkrmL/B5qwgAwDvnxMdAGcc/5MSpYxQ>: Debug: Mailbox zzzzzzt: Mailbox opened because: SELECT
From strace -p 20381 - This goes on for minutes opening thousands of unrelated folders. It is noteworthy that it mostly opens nested folders which are rare here:
[...] open("/var/vmail/wasteland/inbox/mdbox/mailboxes/gerries/are/hawt", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 15 fstat(15, {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0 getdents(15, /* 3 entries */, 32768) = 72 stat("/var/vmail/wasteland/inbox/mdbox/mailboxes/gerries/are/hawt/yeah", {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0 stat("/var/vmail/wasteland/inbox/mdbox/mailboxes/gerries/are/hawt/yeah/dbox-Mails", {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0 getdents(15, /* 0 entries */, 32768) = 0 close(15) = 0 open("/var/vmail/wasteland/inbox/mdbox/mailboxes/karl588594", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 15 fstat(15, {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0 getdents(15, /* 3 entries */, 32768) = 80 stat("/var/vmail/wasteland/inbox/mdbox/mailboxes/karl588594/b3914", {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0 stat("/var/vmail/wasteland/inbox/mdbox/mailboxes/karl588594/b3914/dbox-Mails", {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0 getdents(15, /* 0 entries */, 32768) = 0 close(15) = 0 open("/var/vmail/wasteland/inbox/mdbox/mailboxes/ben", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 15 fstat(15, {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0 getdents(15, /* 4 entries */, 32768) = 104 stat("/var/vmail/wasteland/inbox/mdbox/mailboxes/ben/gang", {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0 stat("/var/vmail/wasteland/inbox/mdbox/mailboxes/ben/gang/dbox-Mails", {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0
Reopening the same folder again is quick and seamless. So it seems this is some cache recreation code looking into other folders.
I am bit puzzled on why it does this. mail_debug=yes did now show anything. Any more debug options for mdbox i might set to find out about its decisions?
Flo
Florian Lohoff f@zz.de Any sufficiently advanced technology is indistinguishable from magic.
On Fri, May 07, 2021 at 06:15:55PM +0200, Florian Lohoff wrote:
Dovecot: 2.3.4
Hi, i am debugging a problem where after a force-resync removed all cache files (All files are 44 byte). Trying to open a folder takes literally minutes (Huge mailbox - 220K folder, 45GByte compressed storage)
As i am running out of options i upgraded to 2.3.14-5+debian9 though no success. Between Debian/Stretch 2.2.27 and 2.3.4 something broke horribly for huge mdboxes. Opening the mailbox takes (on an otherwise idle machine, 32GByte RAM, RAID1 spinning rust)
root@pax:~# time doveadm mailbox list -u inbox@wasteland.rfc822.org >/var/tmp/list doveadm(inbox@wasteland.rfc822.org): Warning: Transaction log file /var/vmail/wasteland.rfc822.org/inbox/mdbox/dovecot.list.index.log was locked for 107 seconds (Mailbox was synchronized)
real 1m47.508s user 1m46.856s sys 0m0.480s
We see the ~220K folders in this mailbox:
root@pax:~# wc -l /var/tmp/list 213922 /var/tmp/list
Looking at an strace on this command we see:
root@pax:~# grep stat /var/tmp/strace.doveadm.list | wc -l 428696
So every folder will at least get 2 stat calls - Okay - may be because of the mailbox list command - lets try deleting a folder.
root@pax:~# strace -f -o /var/tmp/strace.doveadm.mailbox.delete doveadm mailbox delete -e -u inbox@wasteland.rfc822.org bernd.bottich doveadm(inbox@wasteland.rfc822.org): Warning: Transaction log file /var/vmail/wasteland.rfc822.org/inbox/mdbox/dovecot.list.index.log was locked for 115 seconds (Mailbox was synchronized) doveadm(inbox@wasteland.rfc822.org): Error: Can't delete mailbox bernd.bottich: Mailbox doesn't exist: bernd.bottich
root@pax:~# grep stat /var/tmp/strace.doveadm.mailbox.delete | wc -l 428702
Okay - so we opened the mailbox, stattet all folders to tell me my mailbox is not existent. Took again around 90 seconds. And this happens every, single, time.
root@pax:~# time doveadm mailbox delete -e -u inbox@wasteland.rfc822.org bernd.bottich doveadm(inbox@wasteland.rfc822.org): Warning: Transaction log file /var/vmail/wasteland.rfc822.org/inbox/mdbox/dovecot.list.index.log was locked for 108 seconds (Mailbox was synchronized) doveadm(inbox@wasteland.rfc822.org): Error: Can't delete mailbox bernd.bottich: Mailbox doesn't exist: bernd.bottich
real 1m48.232s user 1m47.472s sys 0m0.580s
So logging is broken (Yes i had a look at the code. It complains about its own lock been taken to long on unlocking)
So i am in a deadend.
2.2.27 eats mdboxes for breakfast and cant repair them 2.3.4+ is unusuable with this mailbox.
All these operations were in the sub-second range with 2.2.27 ... Guess i need to find something else to store mails ...
Flo
Florian Lohoff f@zz.de Any sufficiently advanced technology is indistinguishable from magic.
On May 8, 2021, at 7:08 AM, Florian Lohoff <f@zz.de> wrote:
On Fri, May 07, 2021 at 06:15:55PM +0200, Florian Lohoff wrote:
Dovecot: 2.3.4
Hi, i am debugging a problem where after a force-resync removed all cache files (All files are 44 byte). Trying to open a folder takes literally minutes (Huge mailbox - 220K folder, 45GByte compressed storage)
I don’t have an answer for you, unfortunately, as you all exercise your mail system far more thanI do.
Taking a wild guess, I’d think that having a large number of folders has potential to be more of an issue than having a large amount of mail in fewer folders. Operating under that assumption, is it possible to bound the problem? (Hypothetically say 128k folders were good, more than that and something went out to lunch or something similar.)
I am, however, watching this with interest.
-- Coy Hile coy.hile@coyhile.com
On Sat, May 08, 2021 at 08:57:07AM -0400, Coy Hile wrote:
Dovecot: 2.3.4 Hi, i am debugging a problem where after a force-resync removed all cache files (All files are 44 byte). Trying to open a folder takes literally minutes (Huge mailbox - 220K folder, 45GByte compressed storage)
I don’t have an answer for you, unfortunately, as you all exercise your mail system far more thanI do.
Taking a wild guess, I’d think that having a large number of folders has potential to be more of an issue than having a large amount of mail in fewer folders. Operating under that assumption, is it possible to bound the problem? (Hypothetically say 128k folders were good, more than that and something went out to lunch or something similar.)
I am, however, watching this with interest.
The point is: If this is not a bug/misbehaviour caused by this specific mailbox and you have for larger installation an average of 10 folders per mailbox you have an up to 10 fold iop amplification for certain operations.
So it could be relevant for all sorts of larger installation. It may not be these extremes with 220k folders. This just puts the issue right in front of your face because its that extreme thats its noticeable.
Flo
Florian Lohoff f@zz.de Any sufficiently advanced technology is indistinguishable from magic.
On 8. May 2021, at 13.08, Florian Lohoff <f@zz.de> wrote:
On Fri, May 07, 2021 at 06:15:55PM +0200, Florian Lohoff wrote:
Dovecot: 2.3.4
Hi, i am debugging a problem where after a force-resync removed all cache files (All files are 44 byte). Trying to open a folder takes literally minutes (Huge mailbox - 220K folder, 45GByte compressed storage)
As i am running out of options i upgraded to 2.3.14-5+debian9 though no success. Between Debian/Stretch 2.2.27 and 2.3.4 something broke horribly for huge mdboxes. Opening the mailbox takes (on an otherwise idle machine, 32GByte RAM, RAID1 spinning rust)
<root@pax:~# time doveadm mailbox list -u inbox@wasteland.rfc822.org >/var/tmp/list doveadm(inbox@wasteland.rfc822.org): Warning: Transaction log file /var/vmail/wasteland.rfc822.org/inbox/mdbox/dovecot.list.index.log was locked for 107 seconds (Mailbox was synchronized)
real 1m47.508s user 1m46.856s sys 0m0.480s
We see the ~220K folders in this mailbox:
root@pax:~# wc -l /var/tmp/list 213922 /var/tmp/list
Looking at an strace on this command we see:
root@pax:~# grep stat /var/tmp/strace.doveadm.list | wc -l 428696
So every folder will at least get 2 stat calls - Okay - may be because of the mailbox list command - lets try deleting a folder.
Sounds like dovecot.list.index.
root@pax:~# strace -f -o /var/tmp/strace.doveadm.mailbox.delete doveadm mailbox delete -e -u inbox@wasteland.rfc822.org bernd.bottich doveadm(inbox@wasteland.rfc822.org): Warning: Transaction log file /var/vmail/wasteland.rfc822.org/inbox/mdbox/dovecot.list.index.log was locked for 115 seconds (Mailbox was synchronized) doveadm(inbox@wasteland.rfc822.org): Error: Can't delete mailbox bernd.bottich: Mailbox doesn't exist: bernd.bottich
root@pax:~# grep stat /var/tmp/strace.doveadm.mailbox.delete | wc -l 428702
Okay - so we opened the mailbox, stattet all folders to tell me my mailbox is not existent. Took again around 90 seconds. And this happens every, single, time.
root@pax:~# time doveadm mailbox delete -e -u inbox@wasteland.rfc822.org bernd.bottich doveadm(inbox@wasteland.rfc822.org): Warning: Transaction log file /var/vmail/wasteland.rfc822.org/inbox/mdbox/dovecot.list.index.log was locked for 108 seconds (Mailbox was synchronized) doveadm(inbox@wasteland.rfc822.org): Error: Can't delete mailbox bernd.bottich: Mailbox doesn't exist: bernd.bottich
real 1m48.232s user 1m47.472s sys 0m0.580s
So logging is broken (Yes i had a look at the code. It complains about its own lock been taken to long on unlocking)
It's intentional. It warns that it kept the lock much longer than it was expected to. This helps figuring out why other processes could be seeing lock timeouts.
So i am in a deadend.
2.2.27 eats mdboxes for breakfast and cant repair them 2.3.4+ is unusuable with this mailbox.
All these operations were in the sub-second range with 2.2.27 ... Guess i need to find something else to store mails ...
v2.2 has mailbox_list_index=no by default, while v2.3 has changed it to "yes". So you could try changing it back to "no". Or maybe mailbox_list_index_very_dirty_syncs=yes would help also.
Reopening the same folder again is quick and seamless. So it seems this is some cache recreation code looking into other folders.
Do you mean that whenever any of the 220k folders is opened, it's first slow but then fast afterwards (even if other folders are opened)? So it's only when a new folder is opened it's being slow? That does sound like some kind of a bug.
BTW. doveconf -n output could also be helpful. This behavior might also be due to some setting.
On Mon, May 10, 2021 at 11:29:05AM +0200, Timo Sirainen wrote:
So every folder will at least get 2 stat calls - Okay - may be because of the mailbox list command - lets try deleting a folder.
Sounds like dovecot.list.index.
[ ... ]
So i am in a deadend.
2.2.27 eats mdboxes for breakfast and cant repair them 2.3.4+ is unusuable with this mailbox.
All these operations were in the sub-second range with 2.2.27 ... Guess i need to find something else to store mails ...
v2.2 has mailbox_list_index=no by default, while v2.3 has changed it to "yes". So you could try changing it back to "no". Or maybe mailbox_list_index_very_dirty_syncs=yes would help also.
Reopening the same folder again is quick and seamless. So it seems this is some cache recreation code looking into other folders.
Do you mean that whenever any of the 220k folders is opened, it's first slow but then fast afterwards (even if other folders are opened)? So it's only when a new folder is opened it's being slow? That does sound like some kind of a bug.
BTW. doveconf -n output could also be helpful. This behavior might also be due to some setting.
Gotcha - Thats it - I disabled mailbox_list_index (and mailbox_list_index_very_dirty_syncs) and now it churns through the half a million mails queued up since the last week and even web/imap accesses are now happy as they were.
So somthing is pretty broken here. No operation on this mailbox should ever request a mailbox list. All accesses are to explicitly named folders. Be it deliver or imap list commands.
It seems even when simply opening a folder there are corner cases where imapd/doveadm trys to rebuild the dovecot.list.index. It does not say so in the log for what reason it does so.
Doing that on large mailboxes like this takes ages and locks the mailbox for writing. So this machine was rebuilding the dovecot.list.index over and over again. Delivery and other accesses were halted to a crawl as rebuilding the index took multiple minutes.
Flo
Florian Lohoff f@zz.de Any sufficiently advanced technology is indistinguishable from magic.
participants (3)
-
Coy Hile
-
Florian Lohoff
-
Timo Sirainen