no full syncs after upgrading to dovecot 2.3.18
Reuben Farrelly
reuben-dovecot at reub.net
Tue Apr 26 13:03:53 UTC 2022
I ran into this back in February and documented a reproducible test case
(and sent it to this list). In short - I was able to reproduce this by
having a valid and consistent mailbox on the source/local, creating a
very standard empty Maildir/(new|cur|tmp) folder on the remote replica,
and then initiating replication from the source. This consistently
caused dsync to fail with the error "dovecot.index reset, view is now
inconsistent", after which the sync aborted, leaving the replica
mailbox in an inconsistent state. Client connections on the source
server were also dropped when this error occurred. You can see the
error by enabling debug level logging if you initiate dsync manually on
a test mailbox.
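For example, something like this (a sketch - the user and replica host
here are placeholders):

doveadm -D sync -u testuser@example.com tcp:replica.example.com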
The only workaround I found was to remove the remote Maildir and let
Dovecot create the whole thing from scratch. Dovecot did not like any
existing folders on the destination replica even if they were the same
names as the source and completely empty. I was able to reproduce this
with the bare minimum of folders - just an INBOX!
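Roughly this (a sketch - the paths are placeholders, and Dovecot should
be stopped on the replica first):

# on the replica:
mv /home/vmail/testuser/Maildir /home/vmail/testuser/Maildir.old
# then on the source, queue a forced full sync:
doveadm replicator replicate -f 'testuser@example.com'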
I have no idea if any of the developers saw my post or if the bug has
been fixed for the next release. But it seemed to be quite a common
problem over time (saw a few posts from people going back a long way
with the same problem) and it is seriously disruptive to clients. The
error message is not helpful in tracking down the problem either.
Secondly, I have also had a longstanding problem using
tcps: for replication. For some reason using tcps: (with no other
changes at all to the config) results in a lot of timeout messages:
"Error: dsync I/O has stalled, no activity for 600 seconds". This goes
away if I revert to tcp: instead of tcps: - with tcp: I very rarely
get timeouts. No idea why; I guess this is a bug of some sort as well.
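For reference, the only thing I change between the two setups is the
scheme in mail_replica (the hostname is a placeholder):

# reliable:
mail_replica = tcp:replica.example.com
# regularly stalls with "dsync I/O has stalled, no activity for 600
# seconds":
#mail_replica = tcps:replica.example.com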
It's disappointing that there appears to be no way to have these sorts
of problems addressed like there once was. I am not using Dovecot for
commercial purposes, so paying a fortune for a support contract for a
high-end installation just isn't going to happen, and this list seems to
be quite ordinary for getting support and reporting bugs nowadays....
Reuben
On 26/04/2022 7:21 pm, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:
>
> side issue
>
> if you are getting inconsistent dsyncs there is no real way to fix
> this in the long run.
>
> i know it's a pain (i already had to do this myself)
>
> i needed to do a full sync: take one server offline, delete the user
> dir (with dovecot offline) and then rsync (or somehow duplicate the
> main server's user data) over to the remote again.
>
> then bring the remote back up, and it kind of worked
>
> best suggestion is to bring the main server down at night so the copy
> is clean?
>
> if using postfix you can enable the soft_bounce option and the mail
> will queue up until everything comes back online
>
> (needs to be enabled on both servers)
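>
> for example (a sketch, assuming a stock postfix):
>
> # on both servers:
> postconf -e soft_bounce=yes
> postfix reload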
>
> replication was still an issue on accounts with 300+ folders in them,
> still working on a fix for that.
>
>
> Happy Tuesday !!!
> Thanks - paul
>
> Paul Kudla
>
>
> Scom.ca Internet Services <http://www.scom.ca>
> 004-1009 Byron Street South
> Whitby, Ontario - Canada
> L1N 4S3
>
> Toronto 416.642.7266
> Main 1.866.411.7266
> Fax 1.888.892.7266
>
> On 4/25/2022 10:01 AM, Arnaud Abélard wrote:
>> Ah, I'm now getting errors in the logs, which would explain the
>> increasing number of failed sync requests:
>>
>> dovecot: imap(xxxxx)<2961235><Bs6w43rdQPAqAcsFiXEmAInUhhA3Rfqh>:
>> Error: Mailbox INBOX: /vmail/l/i/xxxxx/dovecot.index reset, view is
>> now inconsistent
>>
>>
>> And sure enough:
>>
>> # doveadm replicator status xxxxx
>>
>> xxxxx none 00:02:54 07:11:28 - y
>>
>>
>> What could explain that error?
>>
>> Arnaud
>>
>>
>>
>> On 25/04/2022 15:13, Arnaud Abélard wrote:
>>> Hello,
>>>
>>> On my side we are running Linux (Debian Buster).
>>>
>>> I'm not sure my problem is actually the same as Paul's or yours,
>>> Sebastian, since I have a lot of mailboxes but they are actually
>>> small (110MB quota), so I doubt any of them have more than a dozen
>>> IMAP folders.
>>>
>>> The main symptom is that I have tons of full sync requests waiting,
>>> but even though no other syncs are pending, the replicator just
>>> waits for something to trigger those full syncs.
>>>
>>> Today, with users back I can see that normal and incremental syncs
>>> are being done on the 15 connections, with an occasional full sync
>>> here or there and lots of "Waiting 'failed' requests":
>>>
>>> Queued 'sync' requests         0
>>> Queued 'high' requests         0
>>> Queued 'low' requests          0
>>> Queued 'failed' requests       122
>>> Queued 'full resync' requests  28785
>>> Waiting 'failed' requests      4294
>>> Total number of known users    42512
>>>
>>>
>>>
>>> So, why didn't the replicator take advantage of the weekend to
>>> replicate the mailboxes while no users were using them?
>>>
>>> Arnaud
>>>
>>>
>>>
>>>
>>> On 25/04/2022 13:54, Sebastian Marske wrote:
>>>> Hi there,
>>>>
>>>> thanks for your insights and for diving deeper into this Paul!
>>>>
>>>> For me, the users ending up in 'Waiting for dsync to finish' all have
>>>> more than 256 IMAP folders as well (ranging from 288 up to >5500, as
>>>> per 'doveadm mailbox list -u <username> | wc -l'). For more details
>>>> on my setup please see my post from February [1].
>>>>
>>>> @Arnaud: What OS are you running on?
>>>>
>>>>
>>>> Best
>>>> Sebastian
>>>>
>>>>
>>>> [1] https://dovecot.org/pipermail/dovecot/2022-February/124168.html
>>>>
>>>>
>>>> On 4/24/22 19:36, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:
>>>>>
>>>>> Question: having similar replication issues
>>>>>
>>>>> please read everything below, and advise the folder counts for the
>>>>> non-replicated users?
>>>>>
>>>>> i find the total number of folders per account seems to be a
>>>>> factor, and NOT the size of the mailbox
>>>>>
>>>>> ie i have customers with 40G of email spread over 40 or so folders
>>>>> with no problem, and it works ok
>>>>>
>>>>> 300+ folders seems to be the issue
>>>>>
>>>>> i have been going through the replication code
>>>>>
>>>>> no errors being logged
>>>>>
>>>>> i am assuming that the replicator --> dsync client --> other server
>>>>> path is timing out or not reading the folder lists correctly (ie it
>>>>> dies after X folders read)
>>>>>
>>>>> thus i am going through the code, patching in log entries etc, to
>>>>> find the issue.
>>>>>
>>>>> see
>>>>>
>>>>> [13:33:57] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot
>>>>> # ll
>>>>> total 86
>>>>> drwxr-xr-x 2 root wheel uarch 4B Apr 24 11:11 .
>>>>> drwxr-xr-x 4 root wheel uarch 4B Mar 8 2021 ..
>>>>> -rw-r--r-- 1 root wheel uarch 73B Apr 24 11:11 instances
>>>>> -rw-r--r-- 1 root wheel uarch 160K Apr 24 13:33 replicator.db
>>>>>
>>>>> [13:33:58] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot
>>>>> #
>>>>>
>>>>> replicator.db seems to get updated ok but never processed properly.
>>>>>
>>>>> # sync.users
>>>>> nick at elirpa.com       high  00:09:41  463:47:01  -          y
>>>>> keith at elirpa.com      high  00:09:23  463:45:43  -          y
>>>>> paul at scom.ca          high  00:09:41  463:46:51  -          y
>>>>> ed at scom.ca            high  00:09:43  463:47:01  -          y
>>>>> ed.hanna at dssmgmt.com  high  00:09:42  463:46:58  -          y
>>>>> paul at paulkudla.net    high  00:09:44  463:47:03  580:35:07  y
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> so ....
>>>>>
>>>>>
>>>>>
>>>>> two things :
>>>>>
>>>>> first, to get the production stuff to work i had to write a script
>>>>> that would find the bad syncs and then force a dsync between the
>>>>> servers
>>>>>
>>>>> i run this every ten minutes on each server, in crontab:
>>>>>
>>>>> */10 * * * * root /usr/bin/nohup /programs/common/sync.recover > /dev/null
>>>>>
>>>>>
>>>>> python script to sort things out
>>>>>
>>>>> # cat /programs/common/sync.recover
>>>>> #!/usr/local/bin/python3
>>>>>
>>>>> # Force a dsync between servers for mailboxes that are reporting bad
>>>>>
>>>>> import os, sys, socket, subprocess
>>>>> from optparse import OptionParser
>>>>>
>>>>> # commands() and sendmail() are helpers from our own local lib
>>>>> from lib import *
>>>>>
>>>>> # Sample re-index of a mailbox:
>>>>> # doveadm -D force-resync -u paul@scom.ca -f INBOX*
>>>>>
>>>>> USAGE_TEXT = '''\
>>>>> usage: %%prog %s[options]
>>>>> '''
>>>>>
>>>>> parser = OptionParser(usage=USAGE_TEXT % '', version='0.4')
>>>>>
>>>>> parser.add_option("-m", "--send_to", dest="send_to",
>>>>>                   help="Send Email To")
>>>>> parser.add_option("-e", "--email", dest="email_box",
>>>>>                   help="Box to Index")
>>>>> parser.add_option("-d", "--detail", action='store_true',
>>>>>                   dest="detail", default=False, help="Detailed report")
>>>>> parser.add_option("-i", "--index", action='store_true',
>>>>>                   dest="index", default=False, help="Index")
>>>>>
>>>>> options, args = parser.parse_args()
>>>>>
>>>>> print(options.email_box)
>>>>> print(options.send_to)
>>>>> print(options.detail)
>>>>>
>>>>> print('Getting Current User Sync Status')
>>>>> command = commands("/usr/local/bin/doveadm replicator status '*'")
>>>>>
>>>>> sync_user_status = command.output.split('\n')
>>>>>
>>>>> synced = []
>>>>>
>>>>> # skip the header line, then walk every user in the status output
>>>>> for n in range(1, len(sync_user_status)):
>>>>>     user = sync_user_status[n]
>>>>>     print('Processing User : %s' % user.split(' ')[0])
>>>>>     # if a single mailbox was requested, skip everyone else
>>>>>     if user.split(' ')[0] != options.email_box:
>>>>>         if options.email_box != None:
>>>>>             continue
>>>>>
>>>>>     if options.index == True:
>>>>>         command = '/usr/local/bin/doveadm -D force-resync ' \
>>>>>                   '-u %s -f INBOX*' % user.split(' ')[0]
>>>>>         command = commands(command)
>>>>>         command = command.output
>>>>>
>>>>>     # scan the status line backwards: '-' in the failed column means
>>>>>     # the user is ok, 'y' means a failed sync that needs a forced dsync
>>>>>     for nn in range(len(user) - 1, 0, -1):
>>>>>         if user[nn] == '-':
>>>>>             # nothing failed, skip this user
>>>>>             break
>>>>>
>>>>>         if user[nn] == 'y':  # found a bad mailbox
>>>>>             print('syncing ... %s' % user.split(' ')[0])
>>>>>
>>>>>             if options.detail == True:
>>>>>                 command = '/usr/local/bin/doveadm -D sync ' \
>>>>>                           '-u %s -d -N -l 30 -U' % user.split(' ')[0]
>>>>>                 print(command)
>>>>>                 command = commands(command)
>>>>>                 command = command.output.split('\n')
>>>>>                 print(command)
>>>>>                 print('Processed Mailbox for ... %s'
>>>>>                       % user.split(' ')[0])
>>>>>                 synced.append('Processed Mailbox for ... %s'
>>>>>                               % user.split(' ')[0])
>>>>>                 for nnn in range(len(command)):
>>>>>                     synced.append(command[nnn] + '\n')
>>>>>                 break
>>>>>
>>>>>             if options.detail == False:
>>>>>                 # fire and forget; don't wait for the dsync to finish
>>>>>                 command = subprocess.Popen(
>>>>>                     ["/usr/local/bin/doveadm sync -u %s -d -N -l 30 -U"
>>>>>                      % user.split(' ')[0]],
>>>>>                     shell=True, stdin=None, stdout=None,
>>>>>                     stderr=None, close_fds=True)
>>>>>
>>>>>                 print('Processed Mailbox for ... %s'
>>>>>                       % user.split(' ')[0])
>>>>>                 synced.append('Processed Mailbox for ... %s'
>>>>>                               % user.split(' ')[0])
>>>>>                 break
>>>>>
>>>>> if len(synced) != 0:
>>>>>     # send an email listing the boxes that had to be re-synced
>>>>>     if options.send_to != None:
>>>>>         send_from = 'monitor@scom.ca'
>>>>>         send_to = ['%s' % options.send_to]
>>>>>         send_subject = 'Dovecot Bad Sync Report for : %s' \
>>>>>                        % (socket.gethostname())
>>>>>         send_text = '\n\n'
>>>>>         for n in range(len(synced)):
>>>>>             send_text = send_text + synced[n] + '\n'
>>>>>
>>>>>         send_files = []
>>>>>         sendmail(send_from, send_to, send_subject, send_text,
>>>>>                  send_files)
>>>>>
>>>>> sys.exit()
>>>>>
>>>>> second :
>>>>>
>>>>> i posted this a month ago - no response
>>>>>
>>>>> please appreciate that i am trying to help ....
>>>>>
>>>>> after much testing i can now reproduce the replication issue at hand
>>>>>
>>>>> I am running on freebsd 12 & 13 stable (both test and production
>>>>> servers)
>>>>>
>>>>> sdram drives etc ...
>>>>>
>>>>> Basically replication works fine until reaching a folder quantity
>>>>> of ~
>>>>> 256 or more
>>>>>
>>>>> to reproduce using doveadm i created folders like
>>>>>
>>>>> INBOX/folder-0
>>>>> INBOX/folder-1
>>>>> INBOX/folder-2
>>>>> INBOX/folder-3
>>>>> and so forth ......
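>>>>>
>>>>> eg with a loop like this (a sketch - the full python script i used
>>>>> is at the end of this mail) :
>>>>>
>>>>> for i in $(seq 0 399); do
>>>>>     doveadm mailbox create -s -u paul@scom.ca "INBOX/folder-$i"
>>>>> done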
>>>>>
>>>>> I created 200 folders and they replicated ok on both servers
>>>>>
>>>>> I created another 200 (400 total) and the replicator got stuck; it
>>>>> would no longer update the mailbox on the alternate server, and it
>>>>> is still "updating" 4 days later ?
>>>>>
>>>>> basically the replicator goes so far and then either hangs, or more
>>>>> likely bails on an error that is not reported even in the debug
>>>>> output ?
>>>>>
>>>>> however dsync will sync the two servers but only when run manually
>>>>> (ie
>>>>> all the folders will sync)
>>>>>
>>>>> I have two test servers available if you need any kind of access -
>>>>> again, here to help.
>>>>>
>>>>> [07:28:42] mail18.scom.ca [root:0] ~
>>>>> # sync.status
>>>>> Queued 'sync' requests 0
>>>>> Queued 'high' requests 6
>>>>> Queued 'low' requests 0
>>>>> Queued 'failed' requests 0
>>>>> Queued 'full resync' requests 0
>>>>> Waiting 'failed' requests 0
>>>>> Total number of known users 255
>>>>>
>>>>> username                type         status
>>>>> paul at scom.ca          normal       Waiting for dsync to finish
>>>>> keith at elirpa.com      incremental  Waiting for dsync to finish
>>>>> ed.hanna at dssmgmt.com  incremental  Waiting for dsync to finish
>>>>> ed at scom.ca            incremental  Waiting for dsync to finish
>>>>> nick at elirpa.com       incremental  Waiting for dsync to finish
>>>>> paul at paulkudla.net    incremental  Waiting for dsync to finish
>>>>>
>>>>>
>>>>> i have been going through the c code and it seems the replication
>>>>> gets
>>>>> requested ok
>>>>>
>>>>> replicator.db does get updated ok with the replication request for
>>>>> the mbox in question.
>>>>>
>>>>> however i am still looking for the actual replicator functions in
>>>>> the libs that perform the replication requests
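>>>>>
>>>>> fyi for anyone else looking: in the 2.3 source tree the replicator
>>>>> service code seems to live under src/replication/replicator/
>>>>> (replicator-queue.c, replicator-brain.c, dsync-client.c) and the
>>>>> dsync engine under src/doveadm/dsync/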
>>>>>
>>>>> the number of folders & subfolders is definitely the issue - not
>>>>> the mbox physical size as originally thought.
>>>>>
>>>>> if someone can point me in the right direction, it seems either the
>>>>> replicator is not picking up on the number of folders to replicate
>>>>> properly, or it has a hard-set limit like 256 / 512 / 65535 etc and
>>>>> stops the replication request thereafter.
>>>>>
>>>>> I am mainly a machine code programmer from the 80's and have
>>>>> concentrated on python as of late; i am just starting to go through
>>>>> the 'c', to give you a background on my talents.
>>>>>
>>>>> It took 2 months to figure this out.
>>>>>
>>>>> this issue also seems to be indirectly causing the duplicate
>>>>> message suppression not to work as well.
>>>>>
>>>>> python script to reproduce the issue (the loop ranges are from the
>>>>> last run, which started @ 200 - fyi) :
>>>>>
>>>>> # cat mbox.gen
>>>>> #!/usr/local/bin/python2
>>>>>
>>>>> import os, sys
>>>>> import commands  # python2 stdlib, provides commands.getoutput()
>>>>>
>>>>> from lib import *
>>>>>
>>>>> user = 'paul@paulkudla.net'
>>>>>
>>>>> # first run: create the top-level test folders
>>>>> """
>>>>> for count in range(0, 600):
>>>>>     box = 'INBOX/folder-%s' % count
>>>>>     print count
>>>>>     command = '/usr/local/bin/doveadm mailbox create -s -u %s %s' \
>>>>>               % (user, box)
>>>>>     print command
>>>>>     a = commands.getoutput(command)
>>>>>     print a
>>>>> """
>>>>>
>>>>> # second run: create subfolders under the first test folder
>>>>> for count in range(0, 600):
>>>>>     box = 'INBOX/folder-0/sub-%s' % count
>>>>>     print count
>>>>>     command = '/usr/local/bin/doveadm mailbox create -s -u %s %s' \
>>>>>               % (user, box)
>>>>>     print command
>>>>>     a = commands.getoutput(command)
>>>>>     print a
>>>>>
>>>>> #sys.exit()
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Happy Sunday !!!
>>>>> Thanks - paul
>>>>>
>>>>> Paul Kudla
>>>>>
>>>>>
>>>>> Scom.ca Internet Services <http://www.scom.ca>
>>>>> 004-1009 Byron Street South
>>>>> Whitby, Ontario - Canada
>>>>> L1N 4S3
>>>>>
>>>>> Toronto 416.642.7266
>>>>> Main 1.866.411.7266
>>>>> Fax 1.888.892.7266
>>>>>
>>>>> On 4/24/2022 10:22 AM, Arnaud Abélard wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I am working on replicating a server (and adding compression on
>>>>>> the other side), and since I was getting "Error: dsync I/O has
>>>>>> stalled, no activity for 600 seconds (version not received)" errors
>>>>>> I upgraded both the source and destination servers to the latest
>>>>>> 2.3 version (2.3.18). Before the upgrade all 15 replication
>>>>>> connections were busy; after upgrading, dovecot replicator
>>>>>> dsync-status shows that most of the time nothing is being
>>>>>> replicated at all. I can see the occasional brief replication, but
>>>>>> 99.9% of the time nothing is happening at all.
>>>>>>
>>>>>> I have a replication_full_sync_interval of 12 hours but I have
>>>>>> thousands of users with their last full sync over 90 hours ago.
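>>>>>>
>>>>>> (for reference, that is set in my plugin section, ie something
>>>>>> like:)
>>>>>>
>>>>>> plugin {
>>>>>>   replication_full_sync_interval = 12h
>>>>>> }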
>>>>>>
>>>>>> "doveadm replicator status" also shows that i have over 35,000
>>>>>> queued
>>>>>> full resync requests, but no sync, high or low queued requests so
>>>>>> why
>>>>>> aren't the full requests occuring?
>>>>>>
>>>>>> There are no errors in the logs.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Arnaud