Broken uidlist when using NFS on newer kernels
I know this has been reported in the past, but I think I have some useful new information on the problem. After an OS upgrade from Ubuntu Xenial (4.4.0 kernel) to Ubuntu Focal (5.4.0 kernel) and a corresponding upgrade from Dovecot 2.2.27 to 2.3.7.2, we've started seeing broken uidlist files to an extent that's making larger mailboxes nearly unusable, because the file is constantly being regenerated. I've also tried the 2.3.16-2+ubuntu20.04 version distributed from dovecot.org and the behavior is unchanged. The environment consists of NFS mounts from a NetApp device, with a couple dozen MX servers receiving mail and about a hundred IMAP/POP servers.
This is the exact error (note the blank after "invalid data"):
Error: Mailbox INBOX: Broken file /mnt/morty/morty2/gravest/x15775549/Maildir/dovecot-uidlist line 373: Invalid data:
I've been able to trigger the problem rather easily by piping an email to dovecot-lda in a loop and reading the resulting dovecot-uidlist file on a different server. What it shows is that occasionally the last line of the file is prepended with a number of null bytes equal to the length of the line that's being written (for example, if the entry is "35322 :1633719038.M516419P3623238.pdx1-sub0-mail-mx202,S=2777,W=2832", it will be prepended by 69 null bytes). This then breaks the IMAP process's ability to read the file. My first thought was to extend the retry functionality so the IMAP process makes more attempts to read the file when it detects a problem like this, but I would love input from someone more familiar with the codebase.
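For anyone who wants to check their own files for the same symptom, here is a minimal Python sketch of the detection and of the "retry the read" idea described above. It is not Dovecot code: `find_nul_runs`, `read_with_retry`, and the `attempts`/`delay` parameters are hypothetical names chosen for illustration. It also checks the arithmetic in the example: the quoted entry is 68 characters, so with its trailing newline it is 69 bytes, which matches the 69 null bytes reported.

```python
from __future__ import annotations

import re
import time
from typing import Callable

NUL_RUN = re.compile(rb"\x00+")


def find_nul_runs(data: bytes) -> list[tuple[int, int]]:
    """Return (offset, length) for every run of NUL bytes in data."""
    return [(m.start(), len(m.group())) for m in NUL_RUN.finditer(data)]


def read_with_retry(read_file: Callable[[], bytes],
                    attempts: int = 5, delay: float = 0.05) -> bytes:
    """Re-read until a snapshot contains no NUL bytes or attempts run out.

    read_file is any zero-argument callable returning the file contents;
    how it opens the file (and bypasses caches) is left to the caller.
    """
    for attempt in range(attempts):
        data = read_file()
        if not find_nul_runs(data):
            return data
        if attempt < attempts - 1:
            time.sleep(delay)
    raise OSError(f"NUL bytes still present after {attempts} reads")


# The entry from the report, with the newline that terminates the line.
entry = b"35322 :1633719038.M516419P3623238.pdx1-sub0-mail-mx202,S=2777,W=2832\n"
print(len(entry))  # 69
```

A retry like this only helps if the null bytes are a transient view of the file on the reading host; if they are really stored in the file, every attempt will fail the same way.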
Hi Jeremy,
I had the same problem as you.
We run an email hosting service with Maildir on NetApp NFS, Dovecot Director and Backend servers for POP/IMAP, and messages delivered via dovecot-lda by the MXs. After the upgrade from CentOS 6 to CentOS 7 I found the same issue as you (on dovecot-uidlist).
After many tests we decided to switch from LDA to LMTP, which was already on our roadmap, so that reading and delivery of messages always happen on the same backend. That solved the problem.
I haven't found any other workarounds.
Switching from LDA to LMTP was not so simple for us, since our MXs weren't able to talk LMTP, but we wrote some custom C++ code and got it done. You should also consider adding some directors, since incoming email will also pass through them.
If you would like to talk about how we solved this on the MX side, I will be happy to talk with you.
Ciao
On 08/10/21 21:01, Jeremy Hanmer wrote:
[...]
-- Alessio Cecchi Postmaster @ http://www.qboxmail.it https://www.linkedin.com/in/alessice
I looked into LMTP, but reconfiguring our 1.5 million mailboxes just to work around what seems like an obvious bug in the code is a hard sell. I patched maildir-uidlist.c to strip out the leading null bytes and things seem to behave just fine, but it feels wrong and I was hoping to get input from someone more familiar with the codebase.
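The patch itself is not shown in this thread. As a rough illustration of the idea only, a reader-side workaround of this kind amounts to something like the following Python sketch; `strip_leading_nuls` is a hypothetical name and this is not the change made to maildir-uidlist.c.

```python
def strip_leading_nuls(line: bytes) -> bytes:
    """Drop NUL bytes at the start of a line; leave everything else alone."""
    return line.lstrip(b"\x00")


# Example: a uidlist entry preceded by a run of NUL bytes.
entry = b"35322 :1633719038.M516419P3623238.pdx1-sub0-mail-mx202,S=2777,W=2832\n"
damaged = b"\x00" * len(entry) + entry
print(strip_leading_nuls(damaged) == entry)  # True
```

Note that this only masks the symptom on the reading side; it says nothing about where the null bytes come from.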
On Tue, Oct 12, 2021 at 8:39 AM Alessio Cecchi alessio@skye.it wrote:
[...]
Hi!
LDA should work just fine as long as you follow the same rules as with LMTP: a user must only be accessed on one backend at a time. The problem usually comes when you accidentally access the user from another backend while the user is active on a different backend.
The fix you made might seemingly work, but it's going to break something in the future. The \0 bytes are not introduced by Dovecot.
Aki
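To make the "one backend per user" rule concrete, here is a minimal Python sketch of a deterministic user-to-backend mapping that both the delivery path and the IMAP/POP path could consult. It is an illustration under stated assumptions, not a description of how Dovecot Director works: `backend_for` and the backend names are hypothetical.

```python
import hashlib


def backend_for(user: str, backends: list[str]) -> str:
    """Pick the same backend for a given user every time.

    The choice depends only on the user name and the backend list, so
    any host that applies this function reaches the same answer.
    """
    digest = hashlib.sha256(user.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


backends = ["backend1", "backend2", "backend3"]  # hypothetical host names
print(backend_for("x15775549", backends))
```

A plain modulo mapping like this remaps many users whenever the backend list changes, so a real deployment would need something more careful for adding or removing backends.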
On 12/10/2021 21:45 Jeremy Hanmer jhanmer@gmail.com wrote:
[...]
I understand switching to Director is suggested, but that's not really feasible (in a reasonable timeframe, anyway) for a mail cluster like ours that requires procmail and processes over 1.5 million emails/day. It's probably worth mentioning that the bug is easily triggered by writing an email with dovecot-lda, and that the test user is isolated from any IMAP/POP3 access.
I'm happy to debug further if need be, but even if these null bytes aren't written by Dovecot, it seems that this NFS behavior has existed long enough that it would make sense for Dovecot to handle the situation more gracefully than it does. The patch I wrote is little more than an adaptation of what's already in place for handling extension fields, so it doesn't seem to me like a crazy change to make.
This snippet of an strace seems to catch the problem the first time it's encountered, fwiw. The number of null bytes is *always* equal to the number of bytes in the string LDA is about to write to the uidlist, which implies to me that it's independent from what any other host might be doing with the file at the time.
stat("/mnt/dale/dale2/buttersnap/x9508664/Maildir", {st_mode=S_IFDIR|S_ISGID|0710, st_size=4096, ...}) = 0
chown("/mnt/dale/dale2/buttersnap/x9508664/Maildir", 9508664, -1) = 0
openat(AT_FDCWD, "/mnt/dale/dale2/buttersnap/x9508664/Maildir/dovecot-uidlist", O_RDONLY) = 13
close(13) = 0
stat("/mnt/dale/dale2/buttersnap/x9508664/Maildir/dovecot-uidlist", {st_mode=S_IFREG|0600, st_size=10573, ...}) = 0
fstat(11, {st_mode=S_IFREG|0600, st_size=10573, ...}) = 0
lseek(11, 0, SEEK_SET) = 0
fstat(11, {st_mode=S_IFREG|0600, st_size=10573, ...}) = 0
fstat(11, {st_mode=S_IFREG|0600, st_size=10573, ...}) = 0
pread64(11, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 8192, 10510) = 63
pread64(11, "", 8129, 10573) = 0
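Reading the numbers in the trace: the pread64 at offset 10510 returns 63 bytes, and 10510 + 63 = 10573, which is the st_size that stat and fstat report. In other words, the run of null bytes extends exactly to the end of the file as this host sees it. A minimal Python sketch of that check follows; `tail_is_zero_filled` is a hypothetical helper, and it assumes the reader remembers the file size it saw on its previous read.

```python
def tail_is_zero_filled(data: bytes, previous_size: int) -> bool:
    """True if the file grew and every byte past previous_size is NUL."""
    tail = data[previous_size:]
    return len(tail) > 0 and tail == b"\x00" * len(tail)


# Figures taken from the trace: 10510 bytes already read, 63 new bytes.
previous_size = 10510
snapshot = b"x" * previous_size + b"\x00" * 63
print(len(snapshot))                                 # 10573
print(tail_is_zero_filled(snapshot, previous_size))  # True
```

A reader that sees this pattern knows the size has been extended but the appended data is not yet visible to it, which is a different case from null bytes appearing in the middle of the file.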
On Tue, Oct 12, 2021 at 9:20 PM Aki Tuomi aki.tuomi@open-xchange.com wrote:
[...]
participants (3)
- Aki Tuomi
- Alessio Cecchi
- Jeremy Hanmer