no full syncs after upgrading to dovecot 2.3.18
Arnaud Abélard
arnaud.abelard at univ-nantes.fr
Mon Apr 25 14:01:24 UTC 2022
Ah, I'm now getting errors in the logs, which would explain the
increasing number of failed sync requests:
dovecot: imap(xxxxx)<2961235><Bs6w43rdQPAqAcsFiXEmAInUhhA3Rfqh>: Error:
Mailbox INBOX: /vmail/l/i/xxxxx/dovecot.index reset, view is now
inconsistent
And sure enough:
# dovecot replicator status xxxxx
xxxxx none 00:02:54 07:11:28 - y
What could explain that error?
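The "view is now inconsistent" error generally means the index needs a rebuild; the usual remedy is a force-resync for the affected user, which is also what is used later in this thread. A minimal sketch (the wrapper function names are mine, and the `-f INBOX*` form mirrors the invocation quoted below):

```python
import subprocess

def force_resync_command(user, mailbox_mask="INBOX*"):
    """Build the doveadm invocation that rebuilds a user's index files.

    force-resync discards the broken dovecot.index view and rebuilds it
    from the mailbox contents, which clears the inconsistent-view state.
    """
    return ["doveadm", "force-resync", "-u", user, "-f", mailbox_mask]

def force_resync(user, mailbox_mask="INBOX*"):
    # Run it; check=True raises CalledProcessError if doveadm fails.
    return subprocess.run(force_resync_command(user, mailbox_mask), check=True)
```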
Arnaud
On 25/04/2022 15:13, Arnaud Abélard wrote:
> Hello,
>
> On my side we are running Linux (Debian Buster).
>
> I'm not sure my problem is actually the same as Paul's or yours,
> Sebastian, since I have a lot of mailboxes but they are actually small
> (quota of 110MB), so I doubt any of them have more than a dozen IMAP
> folders.
>
> The main symptom is that I have tons of full sync requests waiting, but
> even though no other syncs are pending, the replicator just waits for
> something to trigger them.
>
> Today, with users back, I can see that normal and incremental syncs are
> being done on the 15 connections, with an occasional full sync here or
> there and lots of "Waiting 'failed' requests":
>
> Queued 'sync' requests         0
> Queued 'high' requests         0
> Queued 'low' requests          0
> Queued 'failed' requests       122
> Queued 'full resync' requests  28785
> Waiting 'failed' requests      4294
> Total number of known users    42512
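Counters like those above can be tracked over time by parsing the summary block of `doveadm replicator status`; a small sketch, assuming only the two-column "label count" layout shown here:

```python
import re

def parse_replicator_summary(text):
    """Parse the summary block of `doveadm replicator status` into a dict.

    Each line looks like:  Queued 'full resync' requests   28785
    (a label, whitespace, then an integer count).
    """
    counts = {}
    for line in text.splitlines():
        m = re.match(r"\s*(.+?)\s+(\d+)\s*$", line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    return counts
```

This makes it easy to graph or alert on the 'full resync' and 'failed' queues instead of eyeballing the output.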
>
>
>
> So, why didn't the replicator take advantage of the weekend to replicate
> the mailboxes while no users were using them?
>
> Arnaud
>
>
>
>
> On 25/04/2022 13:54, Sebastian Marske wrote:
>> Hi there,
>>
>> thanks for your insights and for diving deeper into this Paul!
>>
>> For me, the users ending up in 'Waiting for dsync to finish' all have
>> more than 256 Imap folders as well (ranging from 288 up to >5500; as per
>> 'doveadm mailbox list -u <username> | wc -l'). For more details on my
>> setup please see my post from February [1].
>>
>> @Arnaud: What OS are you running on?
>>
>>
>> Best
>> Sebastian
>>
>>
>> [1] https://dovecot.org/pipermail/dovecot/2022-February/124168.html
>>
>>
>> On 4/24/22 19:36, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:
>>>
>>> Question - having similar replication issues
>>>
>>> pls read everything below and advise the folder counts on the
>>> non-replicated users?
>>>
>>> i find the total number of folders per account seems to be the factor,
>>> and NOT the size of the mailbox
>>>
>>> ie i have customers with 40G of email over 40 or so folders with no
>>> problem - it works ok
>>>
>>> 300+ folders seems to be the issue
>>>
>>> i have been going through the replication code
>>>
>>> no errors being logged
>>>
>>> i am assuming that the replication --> dsync --> other server path is
>>> timing out or not reading the folder lists correctly (ie it dies after
>>> X folders read)
>>>
>>> thus i am going through the code patching in log entries etc to find
>>> the issue.
>>>
>>> see
>>>
>>> [13:33:57] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot
>>> # ll
>>> total 86
>>> drwxr-xr-x 2 root wheel uarch 4B Apr 24 11:11 .
>>> drwxr-xr-x 4 root wheel uarch 4B Mar 8 2021 ..
>>> -rw-r--r-- 1 root wheel uarch 73B Apr 24 11:11 instances
>>> -rw-r--r-- 1 root wheel uarch 160K Apr 24 13:33 replicator.db
>>>
>>> [13:33:58] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot
>>> #
>>>
>>> replicator.db seems to get updated ok but never processed properly.
>>>
>>> # sync.users
>>> nick at elirpa.com       high  00:09:41  463:47:01  -          y
>>> keith at elirpa.com      high  00:09:23  463:45:43  -          y
>>> paul at scom.ca          high  00:09:41  463:46:51  -          y
>>> ed at scom.ca            high  00:09:43  463:47:01  -          y
>>> ed.hanna at dssmgmt.com  high  00:09:42  463:46:58  -          y
>>> paul at paulkudla.net    high  00:09:44  463:47:03  580:35:07  y
>>>
>>>
>>>
>>>
>>> so ....
>>>
>>> two things:
>>>
>>> first, to get the production stuff to work i had to write a script that
>>> would find the bad syncs and then force a dsync between the servers
>>>
>>> i run this every ten minutes on each server.
>>>
>>> in crontab:
>>>
>>> */10 * * * * root /usr/bin/nohup /programs/common/sync.recover > /dev/null
>>>
>>>
>>> python script to sort things out
>>>
>>> # cat /programs/common/sync.recover
>>> #!/usr/local/bin/python3
>>>
>>> # Force a sync between servers for users whose status reports bad
>>>
>>> import os, sys, socket, subprocess
>>> from optparse import OptionParser
>>>
>>> from lib import *   # provides commands() and sendmail()
>>>
>>> # Sample re-index of a mailbox:
>>> # doveadm -D force-resync -u paul at scom.ca -f INBOX*
>>>
>>> USAGE_TEXT = '''\
>>> usage: %%prog %s[options]
>>> '''
>>>
>>> parser = OptionParser(usage=USAGE_TEXT % '', version='0.4')
>>> parser.add_option("-m", "--send_to", dest="send_to", help="Send Email To")
>>> parser.add_option("-e", "--email", dest="email_box", help="Box to Index")
>>> parser.add_option("-d", "--detail", action='store_true', dest="detail", default=False, help="Detailed report")
>>> parser.add_option("-i", "--index", action='store_true', dest="index", default=False, help="Index")
>>>
>>> options, args = parser.parse_args()
>>>
>>> print(options.email_box)
>>> print(options.send_to)
>>> print(options.detail)
>>>
>>> print('Getting Current User Sync Status')
>>> command = commands("/usr/local/bin/doveadm replicator status '*'")
>>>
>>> sync_user_status = command.output.split('\n')
>>>
>>> synced = []
>>>
>>> for n in range(1, len(sync_user_status)):
>>>     user = sync_user_status[n]
>>>     print('Processing User : %s' % user.split(' ')[0])
>>>     # if one mailbox was requested, skip everyone else
>>>     if user.split(' ')[0] != options.email_box:
>>>         if options.email_box != None:
>>>             continue
>>>
>>>     if options.index == True:
>>>         command = '/usr/local/bin/doveadm -D force-resync -u %s -f INBOX*' % user.split(' ')[0]
>>>         command = commands(command)
>>>         command = command.output
>>>
>>>     # walk the status line backwards: '-' means no failed sync, 'y' flags a bad one
>>>     for nn in range(len(user) - 1, 0, -1):
>>>         if user[nn] == '-':
>>>             # nothing failed - skip this user
>>>             break
>>>
>>>         if user[nn] == 'y':  # found a bad mailbox
>>>             print('syncing ... %s' % user.split(' ')[0])
>>>
>>>             if options.detail == True:
>>>                 command = '/usr/local/bin/doveadm -D sync -u %s -d -N -l 30 -U' % user.split(' ')[0]
>>>                 print(command)
>>>                 command = commands(command)
>>>                 command = command.output.split('\n')
>>>                 print(command)
>>>                 print('Processed Mailbox for ... %s' % user.split(' ')[0])
>>>                 synced.append('Processed Mailbox for ... %s' % user.split(' ')[0])
>>>                 for nnn in range(len(command)):
>>>                     synced.append(command[nnn] + '\n')
>>>                 break
>>>
>>>             if options.detail == False:
>>>                 command = subprocess.Popen(
>>>                     ["/usr/local/bin/doveadm sync -u %s -d -N -l 30 -U" % user.split(' ')[0]],
>>>                     shell=True, stdin=None, stdout=None, stderr=None, close_fds=True)
>>>                 print('Processed Mailbox for ... %s' % user.split(' ')[0])
>>>                 synced.append('Processed Mailbox for ... %s' % user.split(' ')[0])
>>>                 break
>>>
>>> if len(synced) != 0:
>>>     # email the list of bad synced boxes, if a recipient was given
>>>     if options.send_to != None:
>>>         send_from = 'monitor at scom.ca'
>>>         send_to = ['%s' % options.send_to]
>>>         send_subject = 'Dovecot Bad Sync Report for : %s' % (socket.gethostname())
>>>         send_text = '\n\n'
>>>         for n in range(len(synced)):
>>>             send_text = send_text + synced[n] + '\n'
>>>
>>>         send_files = []
>>>         sendmail(send_from, send_to, send_subject, send_text, send_files)
>>>
>>> sys.exit()
>>>
>>> second:
>>>
>>> i posted this a month ago - no response
>>>
>>> please appreciate that i am trying to help ....
>>>
>>> after much testing i can now reproduce the replication issues at hand
>>>
>>> I am running on freebsd 12 & 13 stable (both test and production
>>> servers)
>>>
>>> sdram drives etc ...
>>>
>>> Basically replication works fine until reaching a folder quantity of
>>> ~256 or more
>>>
>>> to reproduce using doveadm i created folders like
>>>
>>> INBOX/folder-0
>>> INBOX/folder-1
>>> INBOX/folder-2
>>> INBOX/folder-3
>>> and so forth ......
>>>
>>> I created 200 folders and they replicated ok on both servers
>>>
>>> I created another 200 (400 total) and the replicator got stuck and
>>> would no longer update the mailbox on the alternate server - it is
>>> still "updating" 4 days later
>>>
>>> basically the replicator gets so far and either hangs or, more likely,
>>> bails on an error that is not reported in the debug output
>>>
>>> however dsync will sync the two servers, but only when run manually
>>> (ie all the folders will sync)
>>>
>>> I have two test servers available if you need any kind of access -
>>> again, here to help.
>>>
>>> [07:28:42] mail18.scom.ca [root:0] ~
>>> # sync.status
>>> Queued 'sync' requests 0
>>> Queued 'high' requests 6
>>> Queued 'low' requests 0
>>> Queued 'failed' requests 0
>>> Queued 'full resync' requests 0
>>> Waiting 'failed' requests 0
>>> Total number of known users 255
>>>
>>> username type status
>>> paul at scom.ca normal Waiting for dsync to finish
>>> keith at elirpa.com incremental Waiting for dsync to finish
>>> ed.hanna at dssmgmt.com incremental Waiting for dsync to finish
>>> ed at scom.ca incremental Waiting for dsync to finish
>>> nick at elirpa.com incremental Waiting for dsync to finish
>>> paul at paulkudla.net incremental Waiting for dsync to finish
>>>
>>>
>>> i have been going through the c code and it seems the replication gets
>>> requested ok
>>>
>>> replicator.db does get updated ok with the replication request for the
>>> mbox in question.
>>>
>>> however i am still looking for the actual replicator function in the
>>> libs that does the actual replication requests
>>>
>>> the number of folders & subfolders is definitely the issue - not the
>>> mbox physical size as originally thought.
>>>
>>> if someone can point me in the right direction: it seems either the
>>> replicator is not picking up on the number of folders to replicate
>>> properly, or it has a hard-set limit like 256 / 512 / 65535 etc and
>>> stops the replication requests thereafter.
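Whatever the underlying limit is, the per-user check mentioned earlier in the thread (`doveadm mailbox list -u <username> | wc -l`) can be scripted to find all accounts over the suspected threshold. A rough sketch - the 256 cutoff is this thread's observation, not a documented constant:

```python
import subprocess

FOLDER_LIMIT = 256  # suspected threshold from this thread; an assumption

def count_folders(listing):
    # doveadm mailbox list prints one mailbox per line; count non-empty lines.
    return sum(1 for line in listing.splitlines() if line.strip())

def folder_count(user):
    """Count mailboxes for one user via `doveadm mailbox list -u <user>`."""
    out = subprocess.run(["doveadm", "mailbox", "list", "-u", user],
                         capture_output=True, text=True, check=True).stdout
    return count_folders(out)

def users_over_limit(users, counts):
    # counts: mapping user -> folder count (e.g. built with folder_count)
    return [u for u in users if counts[u] > FOLDER_LIMIT]
```

Running `folder_count` over the replicator's user list and comparing against the stuck users would confirm or refute the folder-count theory.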
>>>
>>> I am mainly a machine code programmer from the 80's and have
>>> concentrated on python as of late; 'c' i am just starting to go
>>> through - just to give you a background on my talents.
>>>
>>> It took 2 months to figure this out.
>>>
>>> this issue also seems to be indirectly causing the duplicate message
>>> suppression not to work as well.
>>>
>>> python programming to reproduce the issue (the loops are for the last
>>> run, which started @ 200 - fyi):
>>>
>>> # cat mbox.gen
>>> #!/usr/local/bin/python2
>>>
>>> import os, sys, commands
>>>
>>> from lib import *
>>>
>>> user = 'paul at paulkudla.net'
>>>
>>> """
>>> for count in range(0, 600):
>>>     box = 'INBOX/folder-%s' % count
>>>     print count
>>>     command = '/usr/local/bin/doveadm mailbox create -s -u %s %s' % (user, box)
>>>     print command
>>>     a = commands.getoutput(command)
>>>     print a
>>> """
>>>
>>> for count in range(0, 600):
>>>     box = 'INBOX/folder-0/sub-%s' % count
>>>     print count
>>>     command = '/usr/local/bin/doveadm mailbox create -s -u %s %s' % (user, box)
>>>     print command
>>>     a = commands.getoutput(command)
>>>     print a
>>>
>>> #sys.exit()
>>>
>>>
>>>
>>>
>>>
>>> Happy Sunday !!!
>>> Thanks - paul
>>>
>>> Paul Kudla
>>>
>>>
>>> Scom.ca Internet Services <http://www.scom.ca>
>>> 004-1009 Byron Street South
>>> Whitby, Ontario - Canada
>>> L1N 4S3
>>>
>>> Toronto 416.642.7266
>>> Main 1.866.411.7266
>>> Fax 1.888.892.7266
>>>
>>> On 4/24/2022 10:22 AM, Arnaud Abélard wrote:
>>>> Hello,
>>>>
>>>> I am working on replicating a server (and adding compression on the
>>>> other side), and since I was getting "Error: dsync I/O has stalled, no
>>>> activity for 600 seconds (version not received)" errors I upgraded both
>>>> source and destination servers to the latest 2.3 version (2.3.18).
>>>> While before the upgrade all 15 replication connections were busy,
>>>> after upgrading, "doveadm replicator dsync-status" shows that most of
>>>> the time nothing is being replicated at all. I can see some brief
>>>> replications that last, but 99.9% of the time nothing is happening.
>>>>
>>>> I have a replication_full_sync_interval of 12 hours, but I have
>>>> thousands of users whose last full sync was over 90 hours ago.
>>>>
>>>> "doveadm replicator status" also shows that I have over 35,000 queued
>>>> full resync requests, but no sync, high or low queued requests - so
>>>> why aren't the full syncs occurring?
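This does not explain why the queue is stalled, but a full sync can at least be re-requested per user (or for a mask) with `doveadm replicator replicate -f`. A minimal sketch, with wrapper names of my own; note it only (re)queues the request, so if the replicator is not draining its 'full resync' queue the request will still just sit there:

```python
import subprocess

def replicate_command(user_mask, full=True):
    """Build a `doveadm replicator replicate` invocation.

    With -f the replicator is asked for a full sync of every user
    matching the mask (e.g. '*' for everyone).
    """
    cmd = ["doveadm", "replicator", "replicate"]
    if full:
        cmd.append("-f")
    cmd.append(user_mask)
    return cmd

def replicate(user_mask, full=True):
    # Fire the request; requires a running replicator service.
    return subprocess.run(replicate_command(user_mask, full), check=True)
```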
>>>>
>>>> There are no errors in the logs.
>>>>
>>>> Thanks,
>>>>
>>>> Arnaud
>>>>
>>>>
>>>>
>>>>
>>>>
>
--
Arnaud Abélard
Responsable pôle Système et Stockage
Service Infrastructures
DSIN Université de Nantes
-