I am running Dovecot IMAP on Linux, on a LizardFS storage cluster with Maildir storage. This has worked well for most of the accounts for several months.
However in the last couple of weeks we are seeing increasing errors regarding corrupted index files. Some of the accounts affected are unable to retrieve messages due to timeouts.
It appeared the problems were due to the accounts being accessed from multiple servers simultaneously, so I forced them all to access one server, but the errors remained. It looks like it has something to do with file locking, but LizardFS supports advisory file locking and I do have it enabled.
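For readers debugging a similar setup, advisory locking behavior on a mount can be probed directly. A minimal sketch, assuming flock(2)-style BSD locks (Dovecot's own locking defaults to fcntl(), so this is only a rough smoke test, and the helper name is made up):

```python
# Hypothetical probe of advisory locking on a given filesystem.
# Uses BSD flock(2) semantics; Dovecot itself mostly uses fcntl()
# locks and dotlocks, so treat this as a rough smoke test only.
import fcntl
import os
import tempfile

def flock_conflict_detected(path):
    """Return True if a second non-blocking exclusive flock fails
    while the first is held, i.e. locking is actually enforced."""
    fd1 = os.open(path, os.O_RDWR | os.O_CREAT)
    fd2 = os.open(path, os.O_RDWR)   # separate open file description
    try:
        fcntl.flock(fd1, fcntl.LOCK_EX)   # hold an exclusive lock
        try:
            fcntl.flock(fd2, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True                   # second lock blocked: enforced
        return False                      # second lock succeeded: not enforced
    finally:
        os.close(fd1)
        os.close(fd2)

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        print("flock enforced:", flock_conflict_detected(tmp.name))
```

Running this against a file on the shared mount (rather than a local tmpfile) would show whether the cluster filesystem enforces the lock at all on a single node.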
Deleting the corrupted indexes fixes the problem for a while, but it eventually returns, particularly for some accounts.
Here are some errors I'm seeing (just a random grab). Actual home directories are munged for confidentiality.
imap[25157]: (clientes.standby) Error: Failed to fix view for HOME/clientes:standby/dovecot.index: Missing middle file seq=1 (between 1..1, we have seqs 8): File is already open
imap[5565]: (stadiumchair) Error: Transaction log file HOME/stadiumchair/.Drafts/dovecot.index.log: marked corrupted
imap[5005]: (stadiumchair) Error: Corrupted transaction log file HOME/stadiumchair/.Drafts/dovecot.index.log seq 2: indexid changed 1418941056 -> 1500658549 (sync_offset=0)
imap[20243]: (martha) Error: Transaction log HOME/martha/dovecot.index.log: duplicate transaction log sequence (539)
imap[4665]: (emsspam) Error: Index file HOME/emsspam/dovecot.index: indexid changed: 1500658479 -> 1297175382
imap[4665]: (emsspam) Error: Corrupted transaction log file HOME/emsspam/dovecot.index.log seq 3: indexid changed: 1500658479 -> 1297175382 (sync_offset=316)
imap[22985]: (emsspam) Error: Corrupted transaction log file HOME/emsspam/dovecot.index.log seq 10742: Invalid transaction log size (9296 vs 9296): HOME/emsspam/dovecot.index.log (sync_offset=9296)
imap[3267]: (emsspam) Error: Failed to map view for HOME/emsspam/dovecot.index: Failed to map file seq=10742 offset=9052..18446744073709551615 (ret=0): corrupted, indexid=0
imap[3267]: (emsspam) Error: HOME/emsspam/dovecot.index view is inconsistent: uid=3062271 inserted in the middle of mailbox
The output of dovecot -n is pasted in below. Note that some of the boxes are running 4.9, some running 4.4, all have the same problems. Also note that I am using a custom authentication front end for our virtual mailboxes, but it just sets up the minimal environment variables and runs imap.
Is there anything I can change to eliminate these problems? Are there any other diagnostics I can provide to shed light on this?
# 2.2.31 (65cde28): /etc/dovecot/dovecot.conf
# OS: Linux 4.4.66 x86_64 Gentoo Base System release 2.3
log_path = /dev/stderr
mail_debug = yes
mail_fsync = always
mail_location = maildir:~/.maildir
mail_log_prefix = "%s[%p]: (%u) "
mmap_disable = yes
namespace inbox {
  inbox = yes
  location =
  mailbox Drafts {
    special_use = \Drafts
  }
  mailbox Junk {
    special_use = \Junk
  }
  mailbox Sent {
    special_use = \Sent
  }
  mailbox "Sent Messages" {
    special_use = \Sent
  }
  mailbox Trash {
    special_use = \Trash
  }
  prefix = INBOX
  separator =
  type = private
}
passdb {
  args = *
  driver = pam
}
passdb {
  args = /etc/dovecot/dovecot-sql.conf.ext
  driver = sql
}
plugin {
  mail_log_events = delete undelete expunge copy mailbox_delete mailbox_rename
}
ssl_cert =
-- Bruce Guenter bruce@untroubled.org http://untroubled.org/
On 21.07.2017 at 19:47, Bruce Guenter wrote:
I am running Dovecot IMAP on Linux, on a LizardFS storage cluster with Maildir storage. This has worked well for most of the accounts for several months.
However in the last couple of weeks we are seeing increasing errors regarding corrupted index files.
You should avoid this. One solution is to use load balancers with persistence, and/or e.g. the Dovecot Director:
https://wiki2.dovecot.org/Director
I don't know LizardFS, but the problems are roughly the same with all storage clusters, and there are different solutions for handling them, so I don't know what would be best in your case. I would read up and ask here about settings for storage clusters; a good start could be:
https://wiki2.dovecot.org/NFS
https://wiki2.dovecot.org/SharedMailboxes/ClusterSetup
https://wiki2.dovecot.org/MailLocation/SharedDisk
Some of the accounts affected are unable to retrieve messages due to timeouts.
Index settings and the mailbox format have an impact on this. Maildir is mostly self-healing, but that may sometimes fail on a cluster.
It appeared the problems were due to the accounts being accessed from multiple servers simultaneously, so I forced them all to access one server, but the errors remained. It looks like it has something to do with file locking, but LizardFS supports advisory file locking and I do have it enabled.
Deleting the corrupted indexes fixes the problem for a while, but it eventually returns, particularly for some accounts.
Yeah, that is perhaps by design.
Here are some errors I'm seeing (just a random grab). Actual home directories are munged for confidentiality.
[quoted error log snipped]
The output of dovecot -n is pasted in below. Note that some of the boxes are running 4.9, some running 4.4, all have the same problems. Also note that I am using a custom authentication front end for our virtual mailboxes, but it just sets up the minimal environment variables and runs imap.
Is there anything I can change to eliminate these problems? Are there any other diagnostics I can provide to shed light on this?
[quoted dovecot -n output snipped]
I think you could make the corruption rarer with optimized settings, e.g.:
mail_fsync = always
mail_nfs_storage = yes
mail_nfs_index = yes
mmap_disable = yes
etc., but to fix it completely you may have to rethink your whole setup. The Dovecot gurus may help; also search the list archive for cluster setups.
Best Regards, Robert Schetterer
-- [*] sys4 AG
http://sys4.de, +49 (89) 30 90 46 64 Schleißheimer Straße 26/MG, 80333 München
Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263 Vorstand: Patrick Ben Koetter, Marc Schiffbauer Aufsichtsratsvorsitzender: Florian Kirstein
On Fri, Jul 21, 2017 at 08:50:16PM +0200, Robert Schetterer wrote:
You should avoid this. One solution is to use load balancers with persistence
We had been using a loadbalancer with persistence to reduce the problems, and today I switched to everything running on a single box to avoid any cross-node contention. Unfortunately, the problem still happens, even when they were all running imap on a single box.
We are moving to a director type setup instead of a persistent load balancer to eliminate the last source of cross-node access.
I think you could make the corruption rarer with optimized settings, e.g.
mail_fsync = always
mmap_disable = yes
I have those, but...
mail_nfs_storage = yes
mail_nfs_index = yes
I missed seeing those.
Thanks
On Fri, Jul 21, 2017 at 03:25:39PM -0600, Bruce Guenter wrote:
We had been using a loadbalancer with persistence to reduce the problems, and today I switched to everything running on a single box to avoid any cross-node contention. Unfortunately, the problem still happens, even when they were all running imap on a single box.
I just confirmed this. One of the mailboxes was deleted and recreated from scratch, and since recreation it has only been accessed on a single box. It *still* is having corrupt index problems.
This is not just caused by accessing the mailboxes on different servers.
On 21.07.2017 at 23:58, Bruce Guenter wrote:
On Fri, Jul 21, 2017 at 03:25:39PM -0600, Bruce Guenter wrote:
We had been using a loadbalancer with persistence to reduce the problems, and today I switched to everything running on a single box to avoid any cross-node contention. Unfortunately, the problem still happens, even when they were all running imap on a single box.
I just confirmed this. One of the mailboxes was deleted and recreated from scratch, and since recreation it has only been accessed on a single box. It *still* is having corrupt index problems.
This is not just caused by accessing the mailboxes on different servers.
There may be additional problems, but did you also move away from the cluster filesystem and switch the related parameters back? On a single box with local storage you shouldn't have a problem, unless there are hardware failures or other broken config settings. Again, rethink your whole setup; if you are in production and afraid of breaking something entirely, you should call in paid guru support.
Best Regards, Robert Schetterer
On 21.07.2017 20:47, Bruce Guenter wrote:
I am running Dovecot IMAP on Linux, on a LizardFS storage cluster with Maildir storage. This has worked well for most of the accounts for several months.
However in the last couple of weeks we are seeing increasing errors regarding corrupted index files. Some of the accounts affected are unable to retrieve messages due to timeouts.
It appeared the problems were due to the accounts being accessed from multiple servers simultaneously, so I forced them all to access one server, but the errors remained. It looks like it has something to do with file locking, but LizardFS supports advisory file locking and I do have it enabled.
Deleting the corrupted indexes fixes the problem for a while, but it eventually returns, particularly for some accounts.
Here are some errors I'm seeing (just a random grab). Actual home directories are munged for confidentiality.
[quoted error log snipped]
The output of dovecot -n is pasted in below. Note that some of the boxes are running 4.9, some running 4.4, all have the same problems. Also note that I am using a custom authentication front end for our virtual mailboxes, but it just sets up the minimal environment variables and runs imap.
Is there anything I can change to eliminate these problems? Are there any other diagnostics I can provide to shed light on this?
[quoted dovecot -n output snipped]
Do you have users accessing the files concurrently from more than one dovecot instance at a time?
Aki
On Mon, Jul 24, 2017 at 08:39:36AM +0300, Aki Tuomi wrote:
Do you have users accessing the files concurrently from more than one dovecot instance at a time?
Yes. Apparently it is fairly common behavior for some IMAP clients to open up multiple connections to the same mailbox. Sometimes the multiple accesses came from different servers (a standalone IMAP client and a webmail system), but there is corruption even when all the accesses are going through the same server.
(Yes, we need a director. I am working on integrating that into our network.)
On July 24, 2017 at 7:54 PM Bruce Guenter bruce@untroubled.org wrote:
On Mon, Jul 24, 2017 at 08:39:36AM +0300, Aki Tuomi wrote:
Do you have users accessing the files concurrently from more than one dovecot instance at a time?
Yes. Apparently it is fairly common behavior for some IMAP clients to open up multiple connections to the same mailbox. Sometimes the multiple accesses came from different servers (a standalone IMAP client and a webmail system), but there is corruption even when all the accesses are going through the same server.
(Yes, we need a director. I am working on integrating that into our network.)
Well, dovecot does not really guarantee access concurrency safety if you access indexes using more than one instance of dovecot at the same time.
Nevertheless, did you try without LizardFS, just to rule out any bugs or incompatibilities?
Aki
On Mon, Jul 24, 2017 at 07:56:23PM +0300, Aki Tuomi wrote:
Well, dovecot does not really guarantee access concurrency safety if you access indexes using more than one instance of dovecot at the same time.
Pardon my ignorance, but how does Dovecot handle when an IMAP client connects multiple times concurrently? Does it not launch multiple instances?
Nevertheless, did you try w/o LizardFS, just to rule out any bugs or incompabilities?
Moving everybody off of LizardFS is not an option, and this has affected many separate mailboxes. Now that we have implemented a director front end (instead of just a semi-persistent load balancer), the occurrences have been reduced, but it is still happening on accounts that are not being accessed across multiple servers. I will see if I can pin these down to a single server and move them onto non-shared storage there.
-- Bruce Guenter bruce@untroubled.org http://untroubled.org/
On Mon, 31 Jul 2017, Bruce Guenter wrote:
On Mon, Jul 24, 2017 at 07:56:23PM +0300, Aki Tuomi wrote:
Well, dovecot does not really guarantee access concurrency safety if you access indexes using more than one instance of dovecot at the same time.
Pardon my ignorance, but how does Dovecot handle when an IMAP client connects multiple times concurrently? Does it not launch multiple instances?
Aki means that multiple physical instances of Dovecot may not access the index files cleanly.
If there is one instance only and you connect to this single instance multiple times, there is no problem.
Steffen Kaiser
On 1 Aug 2017, at 6.23, Bruce Guenter bruce@untroubled.org wrote:
On Mon, Jul 24, 2017 at 07:56:23PM +0300, Aki Tuomi wrote:
Well, dovecot does not really guarantee access concurrency safety if you access indexes using more than one instance of dovecot at the same time.
Pardon my ignorance, but how does Dovecot handle when an IMAP client connects multiple times concurrently? Does it not launch multiple instances?
Each IMAP connection is one imap process. So if an (unlimited) Thunderbird connects to Dovecot IMAP and sees 30 folders, it will open 30 IMAP connections, and Dovecot will launch an imap process to manage each connection. On top of that, LMTP delivery is yet another process accessing the FS.
If these processes get different views of the filesystem, you will face corruption. Especially when you are dealing with multiple servers, syncing metadata across servers is a challenge for cluster filesystems. I'm a bit surprised that the corruption happens even with one server.
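The failure mode described above can be illustrated with a toy example (not Dovecot code): two writers that both believe an append-only log ends at the same stale offset will clobber each other's records.

```python
# Toy illustration of append-only log corruption from a stale EOF view.
# Not Dovecot code: just shows why divergent filesystem views break
# transaction-log appends.
import os
import tempfile

def append_at(fd, offset, record):
    # Write at the offset this writer *believes* is the end of the log.
    os.pwrite(fd, record, offset)
    return offset + len(record)

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    path = tmp.name

fd = os.open(path, os.O_RDWR)
end = append_at(fd, 0, b"rec1;")   # writer A appends at the real EOF (0)
# Writer B still has the old EOF (0) cached, e.g. stale metadata on a
# cluster filesystem, so it "appends" at the wrong place:
append_at(fd, 0, b"rec2;")         # overwrites rec1 instead of appending
with open(path, "rb") as f:
    data = f.read()
print(data)                        # b'rec2;' -- rec1 is gone
os.close(fd)
os.unlink(path)
```

With a coherent view, writer B would have seen EOF at 5 and the log would contain both records; with the stale view, the first record is silently lost, which is exactly the kind of damage the transaction-log errors above report.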
Moving everybody off of LizardFS is not an option, and this has affected many separate mailboxes. Now that we have implemented a director front end (instead of just a semi-persistent load balancer), the instances have been reduced, but it is still happening on accounts that are not being accessed across multiple servers. I will see if I can pin these down to a single server and move them onto non-shared storage there.
Without testing, I would still say that LizardFS is to blame. There were similar problems with another cluster filesystem, GlusterFS, too.
Sami
On Tue, Aug 01, 2017 at 01:14:10PM +0300, Sami Ketola wrote:
Each IMAP connection is one imap process. So if an (unlimited) Thunderbird connects to Dovecot IMAP and sees 30 folders, it will open 30 IMAP connections, and Dovecot will launch an imap process to manage each connection. On top of that, LMTP delivery is yet another process accessing the FS.
That is what I expected, yes.
If these processes get different views of the filesystem, you will face corruption. Especially when you are dealing with multiple servers, syncing metadata across servers is a challenge for cluster filesystems.
Of course. It can be managed, but it is still best avoided by using a director. We had been using a load balancer that persistently mapped remote IPs to one server or another, but discovered that was not sufficient (clients using both webmail and a desktop client came in from different remote IPs, oops), so we implemented the Dovecot director on our load balancer.
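As a sketch of what such a director front end looks like in dovecot.conf (2.2-era syntax along the lines of the Director wiki; the addresses here are placeholders, so verify the details against your version):

```
# Hypothetical minimal director setup (placeholder addresses)
director_servers = 10.0.0.1 10.0.0.2         # the director ring
director_mail_servers = 10.0.1.1 10.0.1.2    # backend Dovecot servers

service director {
  unix_listener login/director {
    mode = 0666
  }
  fifo_listener login/proxy-notify {
    mode = 0666
  }
  inet_listener {
    port = 9090
  }
}
service imap-login {
  executable = imap-login director
}
```

The point of the director is that every connection for a given user is consistently routed to the same backend, regardless of which remote IP the client comes from.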
I’m a bit surprised that the corruption does happen even with one server.
That is what I am seeing. Even when the account is accessing the maildir from a single server, there is still corruption.
Without testing, I would still say that LizardFS is to blame. There were similar problems with another cluster filesystem, GlusterFS, too.
Is there anything I could test to diagnose this further? I have moved the account with the most frequent problems to physical storage, but this won't tell me why LizardFS is causing problems.
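One property worth testing is whether the filesystem provides close-to-open consistency, which Dovecot's index handling depends on. A minimal single-host sketch (hypothetical helper, assuming a POSIX system); for a meaningful cross-node test, the writer and reader halves would need to run on two different client nodes against the shared mount:

```python
# Hypothetical close-to-open consistency probe (single host).
# One process writes, fsyncs, and closes a file; another process then
# opens it fresh and must see the written data.
import os
import tempfile

def coherency_ok(path, payload=b"dovecot-index-probe"):
    pid = os.fork()
    if pid == 0:
        # Child: write, fsync, close, exit.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
        os.write(fd, payload)
        os.fsync(fd)
        os.close(fd)
        os._exit(0)
    os.waitpid(pid, 0)           # wait until the writer has fully closed
    with open(path, "rb") as f:  # fresh open in the parent
        return f.read() == payload

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print("coherent:", coherency_ok(os.path.join(d, "probe")))
```

On local storage this should always report coherent; on the shared mount, and especially with writer and reader on different nodes, a failure would point at the filesystem rather than Dovecot.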
participants (5):
- Aki Tuomi
- Bruce Guenter
- Robert Schetterer
- Sami Ketola
- Steffen Kaiser