Replication - I/O has stalled

29 Mar 2021


Hi!
I'm running Dovecot 2.3.14 from the Dovecot repo on Debian 9. I've
configured replication and often see the following log messages:
Mar 29 09:23:13 atlantia dovecot: doveadm: Error: Couldn't lock /var/spool/vmail/stm/.dovecot-sync.lock: fcntl(/var/spool/vmail/stm/.dovecot-sync.lock, write-lock, F_SETLKW) locking failed: Timed out after 30 seconds (WRITE lock held by pid 30810)
Mar 29 09:27:43 atlantia dovecot: dsync-local(stm)<d79ZNRZ/YWBaeAAAr9pkTg>: Error: dsync(pacifica.moeding.net): I/O has stalled, no activity for 600 seconds (last sent=mailbox, last recv=mailbox_state)
Mar 29 09:27:43 atlantia dovecot: dsync-local(stm)<d79ZNRZ/YWBaeAAAr9pkTg>: Error: Timeout during state=sync_mails (send=mailbox recv=mailbox)
Process 30810 was a doveadm-server when this happened:
PID TTY      STAT   TIME COMMAND
1080 ?        Ss     0:07 /usr/sbin/dovecot -F
1091 ?        S      0:01  \_ dovecot/replicator
1094 ?        S      0:01  \_ dovecot/anvil [2 connections]
1095 ?        S      0:02  \_ dovecot/log
1096 ?        S      0:06  \_ dovecot/stats [6 connections]
1098 ?        S      0:14  \_ dovecot/config
1101 ?        S      0:07  \_ dovecot/auth [0 wait, 0 passdb, 0 userdb]
4728 ?        S      0:00  \_ dovecot/aggregator
30668 ?        S      0:00  \_ dovecot/imap-login
30670 ?        S      0:00  \_ dovecot/imap
30810 ?        S      0:00  \_ dovecot/doveadm-server [stm System send:mailbox recv:mailbox]
Sometimes these errors occur once every hour. I have
replication_full_sync_interval = 1 hours set, so I strongly suspect
that this is the cause.
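
One way to check the hourly pattern is to pull the timestamps of the "I/O has stalled" errors out of the log and look at the gaps between them. The following is only a minimal sketch: the function names are made up for illustration, the year has to be supplied because syslog timestamps omit it, and it assumes an English locale for the month abbreviations.

```python
import re
from datetime import datetime

# Matches the syslog timestamp at the start of a line, e.g. "Mar 29 09:27:43".
TS_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) ")


def stall_times(lines, year=2021):
    """Return the timestamps of all 'I/O has stalled' errors in the given lines.

    Hypothetical helper for illustration; not part of Dovecot.
    """
    times = []
    for line in lines:
        if "I/O has stalled" not in line:
            continue
        m = TS_RE.match(line)
        if m:
            # Syslog omits the year, so the caller supplies it (assumption).
            stamp = f"{year} {m.group(1)}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def gaps_in_minutes(times):
    """Return the minutes between consecutive stall errors, rounded."""
    return [round((b - a).total_seconds() / 60) for a, b in zip(times, times[1:])]
```

Passing the lines of the mail log to stall_times() and the result to gaps_in_minutes() gives a list of gaps. If most of them are close to 60, that would support the suspicion that the full sync interval is involved, although it would not prove it.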
Maybe there is a race condition when full syncs are started concurrently
on both sides?
Is anybody else observing this?
--
Stefan

Stefan Möding