warning: NFS hangs with dovecot 2.3.8 on Debian buster
A warning to those considering an upgrade to Debian 10 (buster): we have seen occasional NFS hangs with dovecot when using the stock Debian buster kernel (4.19.67-2+deb10u1).
When we downgrade to the Debian stretch kernel (4.9.189-3+deb9u1), the issue does not occur. Note that we *only* downgraded the kernel; the rest of the OS is still Debian buster, running Dovecot 2.3.8.
A little more info: we have a dovecot cluster, using mdbox for storage, on an NFS server (netapp Cmode version 9.6P2). We use a dovecot director layer, so a user is always connected to the same back-end dovecot server. The NFS hang occurs on the back-end server.
Once the process hangs, other processes trying to write to the same mailbox will get an error like this:
Timeout (180s) while waiting for lock for transaction log file /var/mail/.../index/storage/dovecot.map.index.log (WRITE lock held by pid XXXX)
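To find which back-end process to look at, the pid of the lock holder can be pulled out of that error line. A minimal Python sketch, assuming the message format is exactly as quoted above; the names LOCK_TIMEOUT_RE and parse_lock_timeout are made up for illustration and are not part of dovecot:

```python
import re

# Pattern modelled on the single error line quoted above (an assumption:
# other dovecot versions may word the message differently).
LOCK_TIMEOUT_RE = re.compile(
    r"Timeout \((?P<seconds>\d+)s\) while waiting for lock for "
    r"transaction log file (?P<path>\S+) "
    r"\((?P<mode>\w+) lock held by pid (?P<pid>\d+)\)"
)


def parse_lock_timeout(line):
    """Return timeout, path, lock mode and holder pid, or None if no match."""
    match = LOCK_TIMEOUT_RE.search(line)
    if match is None:
        return None
    return {
        "seconds": int(match["seconds"]),
        "path": match["path"],
        "mode": match["mode"],
        "pid": int(match["pid"]),
    }
```

Feeding the mail log through parse_lock_timeout and collecting the "pid" values should point at the process that holds the WRITE lock.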
The stuck process itself doesn't seem to do anything: it sits in "D" (uninterruptible disk sleep) state, and "strace" doesn't show anything (after attaching, strace itself needs a kill -KILL to stop). The only way to unwedge the process seems to be a kill -KILL of the stuck process. Reading from the mailbox is still possible.
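Processes in "D" state can be listed without attaching to them by reading /proc/<pid>/stat. A rough Python sketch, assuming a Linux /proc layout; the helper names parse_stat and find_uninterruptible are hypothetical, and a process that is only briefly in "D" during normal I/O will show up as well, so a hit only means something if it persists across runs:

```python
import os


def parse_stat(text):
    """Parse 'pid (comm) state ...' from /proc/<pid>/stat.

    comm may itself contain spaces or parentheses, so the name is taken
    as everything between the first '(' and the last ')'.
    """
    lpar = text.index("(")
    rpar = text.rindex(")")
    pid = int(text[:lpar])
    comm = text[lpar + 1:rpar]
    state = text[rpar + 1:].split()[0]
    return pid, comm, state


def find_uninterruptible(proc_root="/proc"):
    """Return a sorted list of (pid, comm) for processes in state 'D'."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited in the meantime, or unreadable entry
        if state == "D":
            stuck.append((pid, comm))
    return sorted(stuck)


if __name__ == "__main__":
    for pid, comm in find_uninterruptible():
        print(pid, comm)
```

Running this a few times, some seconds apart, and keeping only the pids that appear every time gives a list of candidates to compare with the pid in the lock timeout error.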
We are in the process of contacting the linux-nfs folks about this, but I'm trying to reproduce this on a test-cluster first, to be able to present a well-documented case. Since this hang doesn't happen immediately, but takes a few hours to a day to occur in the wild, or a few thousand writes to the same mailbox, it's a bit hard to debug.
-- Jan-Pieter Cornet <johnpc@xs4all.net> Systeembeheer XS4ALL Internet bv www.xs4all.nl
On 25-10-19 19:41, Jan-Pieter Cornet via dovecot wrote:
> We are in the process of contacting the linux-nfs folks about this, but I'm trying to reproduce this on a test-cluster first, to be able to present a well-documented case. Since this hang doesn't happen immediately, but takes a few hours to a day to occur in the wild, or a few thousand writes to the same mailbox, it's a bit hard to debug.
Just FTR, I finally sent mail to the linux-nfs list about this. See e.g. https://marc.info/?l=linux-nfs&m=157260601632323&w=2
No replies yet - if^H^Hwhen this gets resolved I'll report back to this list.
-- Jan-Pieter Cornet <johnpc@xs4all.net> Systeembeheer XS4ALL Internet bv www.xs4all.nl