Broken uidlist when using NFS on newer kernels

Fri Oct 15 00:01:48 EEST 2021

I understand switching to Director is suggested, but that's not really
feasible (in a reasonable timeframe, anyway) for a mail cluster like ours
that requires procmail and processes over 1.5 million emails/day. It's
probably worth mentioning that the bug is easily triggered by delivering an
email with dovecot-lda, and that the test user is isolated from any
imap/pop3 access.

I'm happy to debug further if need be, but even if these null bytes aren't
written by dovecot, it seems that this NFS behavior has existed long enough
that it would make sense for dovecot to handle the situation more
gracefully than it does. The patch I wrote is little more than an
adaptation of what's already in place for handling extension fields, so it
doesn't seem to me like a crazy change to make.
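
To illustrate the idea (this is not the actual patch to maildir-uidlist.c,
which isn't shown here), a minimal Python sketch of stripping a leading run
of null bytes from a uidlist line before parsing it might look like this.
The function name strip_leading_nuls is made up for the example, and the
sample entry is the one from my original report.

```python
from typing import Tuple


def strip_leading_nuls(line: bytes) -> Tuple[bytes, int]:
    """Return (line without its leading NUL bytes, number of NULs removed).

    Hypothetical helper for illustration only; it is not Dovecot code.
    """
    stripped = line.lstrip(b"\x00")
    return stripped, len(line) - len(stripped)


if __name__ == "__main__":
    # Sample uidlist entry, preceded by as many NULs as the entry is long.
    entry = b"35322 :1633719038.M516419P3623238.pdx1-sub0-mail-mx202,S=2777,W=2832\n"
    damaged = b"\x00" * len(entry) + entry
    cleaned, removed = strip_leading_nuls(damaged)
    print(removed, cleaned == entry)
```

Only leading null bytes are dropped; nulls anywhere else in the line are
left alone so the parser can still reject them as invalid data.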

This snippet of an strace seems to catch the problem the first time it's
encountered, fwiw. The number of null bytes is *always* equal to the number
of bytes in the string LDA is about to write to the uidlist, which implies
to me that it's independent from what any other host might be doing with
the file at the time.

stat("/mnt/dale/dale2/buttersnap/x9508664/Maildir", {st_mode=S_IFDIR|S_ISGID|0710, st_size=4096, ...}) = 0
chown("/mnt/dale/dale2/buttersnap/x9508664/Maildir", 9508664, -1) = 0
openat(AT_FDCWD, "/mnt/dale/dale2/buttersnap/x9508664/Maildir/dovecot-uidlist", O_RDONLY) = 13
close(13)                               = 0
stat("/mnt/dale/dale2/buttersnap/x9508664/Maildir/dovecot-uidlist", {st_mode=S_IFREG|0600, st_size=10573, ...}) = 0
fstat(11, {st_mode=S_IFREG|0600, st_size=10573, ...}) = 0
lseek(11, 0, SEEK_SET)                  = 0
fstat(11, {st_mode=S_IFREG|0600, st_size=10573, ...}) = 0
fstat(11, {st_mode=S_IFREG|0600, st_size=10573, ...}) = 0
pread64(11,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0",
8192, 10510) = 63
pread64(11, "", 8129, 10573)            = 0
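
For anyone who wants to check their own files for the same pattern, here is
a rough Python sketch that reports every run of null bytes in a file along
with its offset and length. The function name find_nul_runs and the
command-line usage are my own invention for this example, not part of any
Dovecot tooling. A run like the one in the strace above would show up as 63
null bytes at offset 10510.

```python
import re
import sys
from typing import List, Tuple


def find_nul_runs(data: bytes) -> List[Tuple[int, int]]:
    """Return (offset, length) for every run of consecutive NUL bytes."""
    return [(m.start(), m.end() - m.start())
            for m in re.finditer(rb"\x00+", data)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python3 find_nuls.py /path/to/Maildir/dovecot-uidlist
    path = sys.argv[1]
    with open(path, "rb") as f:
        contents = f.read()
    for offset, length in find_nul_runs(contents):
        print(f"{path}: {length} null bytes at offset {offset}")
```

Comparing the reported length with the length of the entry that eventually
lands at that offset (including its trailing newline) is a quick way to
confirm the "same number of bytes" observation.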

On Tue, Oct 12, 2021 at 9:20 PM Aki Tuomi <aki.tuomi at open-xchange.com>
wrote:

> Hi!
>
> LDA should work just fine, as long as you follow the same rules as with
> LMTP: you must only access a given user from one backend at a time. The
> problem usually comes when you accidentally access the user from another
> backend while the user is still active on the first one.
>
> The fix you made might seemingly work, but it's going to break something
> in the future. The \0 are not introduced by dovecot.
>
> Aki
>
> > On 12/10/2021 21:45 Jeremy Hanmer <jhanmer at gmail.com> wrote:
> >
> >
> > I looked into LMTP, but reconfiguring our 1.5 million mailboxes just to
> > work around what seems like an obvious bug in the code is a hard sell. I
> > patched maildir-uidlist.c to strip out the leading null bytes and things
> > seem to behave just fine, but it feels wrong and I was hoping to get input
> > from someone more familiar with the codebase.
> >
> >
> > On Tue, Oct 12, 2021 at 8:39 AM Alessio Cecchi <alessio at skye.it> wrote:
> > > Hi Jeremy,
> > > I had the same problem as you.
> > > We run an email hosting service with Maildir on NetApp NFS, Dovecot
> > > Director and Backend servers for POP/IMAP, and messages delivered via
> > > dovecot-lda by the MXs. After the upgrade from CentOS 6 to CentOS 7 I
> > > found the same issue as you (on dovecot-uidlist).
> > > After many tests we decided to switch from LDA to LMTP, which was
> > > already on our roadmap, so that reading and delivery of messages always
> > > happen on the same backend. That solved the problem.
> > > I haven't found any other workarounds.
> > > Switching from LDA to LMTP was not so simple for us, since our MX
> > > wasn't able to talk LMTP, but we wrote some custom C++ code and it was
> > > done. You should also consider adding some directors, since incoming
> > > emails will also transit through them.
> > >
> > > If you would like to talk about how we solved it on the MX side, I will
> > > be happy to talk with you.
> > > Ciao
> > >
> > > On 08/10/21 21:01, Jeremy Hanmer wrote:
> > >
> > > > I know this has been reported in the past, but I think I have some
> > > > useful new information on the problem. After an OS upgrade from Ubuntu
> > > > Xenial (4.4.0 kernel) to Ubuntu Focal (5.4.0 kernel) and corresponding
> > > > upgrade from Dovecot 2.2.27 to 2.3.7.2, we've started seeing broken
> > > > uidlist files to an extent that's making larger mailboxes nearly
> > > > unusable because the file is constantly being regenerated. I've also
> > > > used the 2.3.16-2+ubuntu20.04 version distributed from dovecot.org
> > > > (http://dovecot.org) and the behavior is unchanged. The environment
> > > > consists of NFS mounts from a NetApp device, with a couple dozen MX
> > > > servers receiving mail and about a hundred IMAP/POP servers.
> > > >
> > > >
> > > >
> > > > This is the exact error (note the blank after "invalid data"):
> > > >
> > > > Error: Mailbox INBOX: Broken file /mnt/morty/morty2/gravest/x15775549/Maildir/dovecot-uidlist line 373: Invalid data:
> > > >
> > > >
> > > >
> > > > I've been able to trigger the problem rather easily by piping an
> > > > email to dovecot-lda in a loop and reading the resulting
> > > > dovecot-uidlist file on a different server. What it shows is that
> > > > occasionally the last line of the file is prepended with a number of
> > > > null bytes equal to the length of the line that's being written (for
> > > > example, if the entry is
> > > > "35322 :1633719038.M516419P3623238.pdx1-sub0-mail-mx202,S=2777,W=2832",
> > > > we'll have it prepended by 69 null bytes). This then breaks the IMAP
> > > > process' ability to read the file. My first thought was to extend the
> > > > retry functionality so the imap process makes more attempts to read
> > > > the file when it detects a problem like this, but I would love input
> > > > from someone more familiar with the codebase.
> > > >
> > > --
> > > Alessio Cecchi
> > > Postmaster @ http://www.qboxmail.it
> > > https://www.linkedin.com/in/alessice
>