no full syncs after upgrading to dovecot 2.3.18

Arnaud Abélard arnaud.abelard at univ-nantes.fr
Mon Apr 25 14:01:24 UTC 2022


Ah, I'm now getting errors in the logs, which would explain the 
increasing number of failed sync requests:

dovecot: imap(xxxxx)<2961235><Bs6w43rdQPAqAcsFiXEmAInUhhA3Rfqh>: Error: 
Mailbox INBOX: /vmail/l/i/xxxxx/dovecot.index reset, view is now 
inconsistent


And sure enough:

# doveadm replicator status xxxxx

xxxxx         none     00:02:54  07:11:28  -            y
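Reading that line: if I have the usual column order of "doveadm replicator status" right (username, priority, fast sync, full sync, success sync, failed), the trailing "y" means the last sync attempt failed. A minimal sketch of how I read those fields (the helper name is mine):

```python
def parse_replicator_status(line):
    """Split one 'doveadm replicator status <mask>' line into named fields.

    Assumes the usual column order: username, priority, fast sync,
    full sync, success sync, failed.
    """
    fields = line.split()  # columns are space-padded, so split on any whitespace
    return {
        "username": fields[0],
        "priority": fields[1],       # none / low / high
        "fast_sync": fields[2],      # time since last incremental sync
        "full_sync": fields[3],      # time since last full sync
        "success_sync": fields[4],   # '-' if none recorded
        "failed": fields[5] == "y",  # 'y' = last sync attempt failed
    }

status = parse_replicator_status("xxxxx         none     00:02:54  07:11:28  -            y")
print(status["full_sync"], status["failed"])  # -> 07:11:28 True
```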


What could explain that error?
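As far as I know, the usual recovery for an inconsistent index view is to force-resync the affected mailbox and then re-queue the user; a sketch, with the user name as a placeholder:

```shell
# Rebuild dovecot.index for the affected mailbox ("xxxxx" is a placeholder user).
doveadm force-resync -u xxxxx INBOX

# Then ask the replicator to re-queue replication for that user.
doveadm replicator replicate xxxxx
```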

Arnaud



On 25/04/2022 15:13, Arnaud Abélard wrote:
> Hello,
> 
> On my side we are running Linux (Debian Buster).
> 
> I'm not sure my problem is actually the same as Paul's or yours, 
> Sebastian, since I have a lot of mailboxes, but those are actually small 
> (110MB quota), so I doubt any of them has more than a dozen IMAP folders.
> 
> The main symptom is that I have tons of full sync requests waiting, but 
> even though no other syncs are pending, the replicator never triggers 
> them.
> 
> Today, with users back I can see that normal and incremental syncs are 
> being done on the 15 connections, with an occasional full sync here or 
> there and lots of "Waiting 'failed' requests":
> 
> Queued 'sync' requests        0
> Queued 'high' requests        0
> Queued 'low' requests         0
> Queued 'failed' requests      122
> Queued 'full resync' requests 28785
> Waiting 'failed' requests     4294
> Total number of known users   42512
> 
> 
> 
> So, why didn't the replicator take advantage of the weekend to replicate 
> the mailboxes while no users were using them?
> 
> Arnaud
> 
> 
> 
> 
> On 25/04/2022 13:54, Sebastian Marske wrote:
>> Hi there,
>>
>> thanks for your insights and for diving deeper into this Paul!
>>
>> For me, the users ending up in 'Waiting for dsync to finish' all have
>> more than 256 IMAP folders as well (ranging from 288 up to >5500, as per
>> 'doveadm mailbox list -u <username> | wc -l'). For more details on my
>> setup please see my post from February [1].
>>
>> @Arnaud: What OS are you running on?
>>
>>
>> Best
>> Sebastian
>>
>>
>> [1] https://dovecot.org/pipermail/dovecot/2022-February/124168.html
>>
>>
>> On 4/24/22 19:36, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:
>>>
>>> Question - i am having similar replication issues
>>>
>>> please read everything below and advise the folder counts of the
>>> non-replicated users?
>>>
>>> i find the total number of folders per account seems to be a factor and
>>> NOT the size of the mailbox
>>>
>>> ie i have customers with 40G of email spread over 40 or so folders with
>>> no problem and it works ok
>>>
>>> 300+ folders seems to be the issue
>>>
>>> i have been going through the replication code
>>>
>>> no errors being logged
>>>
>>> i am assuming that the replication --> dsync --> other server path is
>>> timing out or not reading the folder lists correctly (ie dies after X
>>> folders read)
>>>
>>> thus i am going through the code patching for log entries etc to find
>>> the issues.
>>>
>>> see
>>>
>>> [13:33:57] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot
>>> # ll
>>> total 86
>>> drwxr-xr-x  2 root  wheel  uarch    4B Apr 24 11:11 .
>>> drwxr-xr-x  4 root  wheel  uarch    4B Mar  8  2021 ..
>>> -rw-r--r--  1 root  wheel  uarch   73B Apr 24 11:11 instances
>>> -rw-r--r--  1 root  wheel  uarch  160K Apr 24 13:33 replicator.db
>>>
>>> [13:33:58] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot
>>> #
>>>
>>> replicator.db seems to get updated ok but never processed properly.
>>>
>>> # sync.users
>>> nick at elirpa.com                   high     00:09:41  463:47:01 -     y
>>> keith at elirpa.com                  high     00:09:23  463:45:43 -     y
>>> paul at scom.ca                      high     00:09:41  463:46:51 -     y
>>> ed at scom.ca                        high     00:09:43  463:47:01 -     y
>>> ed.hanna at dssmgmt.com              high     00:09:42  463:46:58 -     y
>>> paul at paulkudla.net                high     00:09:44  463:47:03 580:35:07 y
>>>
>>>
>>>
>>>
>>> so ....
>>>
>>>
>>>
>>> two things :
>>>
>>> first, to get the production stuff to work, i had to write a script that
>>> would find the bad syncs and then force a dsync between the servers.
>>>
>>> i run this every ten minutes on each server.
>>>
>>> in crontab
>>>
>>> */10    *                *    *    *    root            /usr/bin/nohup
>>> /programs/common/sync.recover > /dev/null
>>>
>>>
>>> python script to sort things out
>>>
>>> # cat /programs/common/sync.recover
>>> #!/usr/local/bin/python3
>>>
>>> # Force a dsync for users whose replicator status reports a failed sync
>>>
>>> import os,sys,django,socket,subprocess
>>> from optparse import OptionParser
>>>
>>> from lib import *
>>>
>>> # Sample re-index of a mailbox:
>>> # doveadm -D force-resync -u paul at scom.ca -f INBOX*
>>>
>>> USAGE_TEXT = '''\
>>> usage: %%prog %s[options]
>>> '''
>>>
>>> parser = OptionParser(usage=USAGE_TEXT % '', version='0.4')
>>>
>>> parser.add_option("-m", "--send_to", dest="send_to", help="Send Email To")
>>> parser.add_option("-e", "--email", dest="email_box", help="Box to Index")
>>> parser.add_option("-d", "--detail", action='store_true', dest="detail", default=False, help="Detailed report")
>>> parser.add_option("-i", "--index", action='store_true', dest="index", default=False, help="Index")
>>>
>>> options, args = parser.parse_args()
>>>
>>> print (options.email_box)
>>> print (options.send_to)
>>> print (options.detail)
>>>
>>> print ('Getting Current User Sync Status')
>>> command = commands("/usr/local/bin/doveadm replicator status '*'")
>>>
>>> sync_user_status = command.output.split('\n')
>>>
>>> synced = []
>>>
>>> for n in range(1, len(sync_user_status)):
>>>         user = sync_user_status[n]
>>>         print ('Processing User : %s' % user.split(' ')[0])
>>>         if user.split(' ')[0] != options.email_box:
>>>                 if options.email_box != None:
>>>                         continue
>>>
>>>         if options.index == True:
>>>                 command = '/usr/local/bin/doveadm -D force-resync -u %s -f INBOX*' % user.split(' ')[0]
>>>                 command = commands(command)
>>>                 command = command.output
>>>
>>>         # Scan the status line backwards: '-' means no failure recorded,
>>>         # 'y' means the last sync failed.
>>>         for nn in range(len(user) - 1, 0, -1):
>>>                 if user[nn] == '-':
>>>                         break
>>>
>>>                 if user[nn] == 'y':  # Found a bad mailbox
>>>                         print ('syncing ... %s' % user.split(' ')[0])
>>>
>>>                         if options.detail == True:
>>>                                 command = '/usr/local/bin/doveadm -D sync -u %s -d -N -l 30 -U' % user.split(' ')[0]
>>>                                 print (command)
>>>                                 command = commands(command)
>>>                                 command = command.output.split('\n')
>>>                                 print (command)
>>>                                 print ('Processed Mailbox for ... %s' % user.split(' ')[0])
>>>                                 synced.append('Processed Mailbox for ... %s' % user.split(' ')[0])
>>>                                 for nnn in range(len(command)):
>>>                                         synced.append(command[nnn] + '\n')
>>>                                 break
>>>
>>>                         if options.detail == False:
>>>                                 command = subprocess.Popen(
>>>                                         ["/usr/local/bin/doveadm sync -u %s -d -N -l 30 -U" % user.split(' ')[0]],
>>>                                         shell=True, stdin=None, stdout=None, stderr=None, close_fds=True)
>>>
>>>                                 print ('Processed Mailbox for ... %s' % user.split(' ')[0])
>>>                                 synced.append('Processed Mailbox for ... %s' % user.split(' ')[0])
>>>                                 break
>>>
>>> if len(synced) != 0:
>>>         # Email the list of badly synced boxes
>>>         if options.send_to != None:
>>>                 send_from = 'monitor at scom.ca'
>>>                 send_to = ['%s' % options.send_to]
>>>                 send_subject = 'Dovecot Bad Sync Report for : %s' % (socket.gethostname())
>>>                 send_text = '\n\n'
>>>                 for n in range(len(synced)):
>>>                         send_text = send_text + synced[n] + '\n'
>>>
>>>                 send_files = []
>>>                 sendmail(send_from, send_to, send_subject, send_text, send_files)
>>>
>>> sys.exit()
>>>
>>> second :
>>>
>>> i posted this a month ago - no response
>>>
>>> please appreciate that i am trying to help ....
>>>
>>> after much testing i can now reproduce the replication issues at hand
>>>
>>> I am running on freebsd 12 & 13 stable (both test and production 
>>> servers)
>>>
>>> sdram drives etc ...
>>>
>>> Basically replication works fine until reaching a folder quantity of ~
>>> 256 or more
>>>
>>> to reproduce using doveadm i created folders like
>>>
>>> INBOX/folder-0
>>> INBOX/folder-1
>>> INBOX/folder-2
>>> INBOX/folder-3
>>> and so forth ......
>>>
>>> I created 200 folders and they replicated ok on both servers
>>>
>>> I created another 200 (400 total) and the replicator got stuck and would
>>> not update the mbox on the alternate server anymore - it was still
>>> "updating" 4 days later.
>>>
>>> basically the replicator goes so far and either hangs or, more likely,
>>> bails on an error that is not reported even by the debug logging.
>>>
>>> however dsync will sync the two servers but only when run manually (ie
>>> all the folders will sync)
>>>
>>> I have two test servers available if you need any kind of access - again,
>>> here to help.
>>>
>>> [07:28:42] mail18.scom.ca [root:0] ~
>>> # sync.status
>>> Queued 'sync' requests        0
>>> Queued 'high' requests        6
>>> Queued 'low' requests         0
>>> Queued 'failed' requests      0
>>> Queued 'full resync' requests 0
>>> Waiting 'failed' requests     0
>>> Total number of known users   255
>>>
>>> username                       type        status
>>> paul at scom.ca                   normal      Waiting for dsync to finish
>>> keith at elirpa.com               incremental Waiting for dsync to finish
>>> ed.hanna at dssmgmt.com           incremental Waiting for dsync to finish
>>> ed at scom.ca                     incremental Waiting for dsync to finish
>>> nick at elirpa.com                incremental Waiting for dsync to finish
>>> paul at paulkudla.net             incremental Waiting for dsync to finish
>>>
>>>
>>> i have been going through the c code and it seems the replication gets
>>> requested ok
>>>
>>> replicator.db does get updated ok with the replicated request for the
>>> mbox in question.
>>>
>>> however i am still looking for the actual replicator function in the
>>> libs that does the actual replication requests
>>>
>>> the number of folders & subfolders is definitely the issue - not the
>>> mbox physical size as originally thought.
>>>
>>> if someone can point me in the right direction: it seems either the
>>> replicator is not picking up on the number of folders to replicate
>>> properly, or it has a hard-set limit like 256 / 512 / 65535 etc and stops
>>> the replication request thereafter.
>>>
>>> I am mainly a machine code programmer from the 80's and have
>>> concentrated on python as of late; 'c' i am just starting to go through -
>>> just to give you a background on my talents.
>>>
>>> It took 2 months to figure this out.
>>>
>>> this issue also seems to be indirectly causing the duplicate message
>>> suppression not to work as well.
>>>
>>> python program to reproduce the issue (the loops for the last run
>>> started @ 200 - fyi):
>>>
>>> # cat mbox.gen
>>> #!/usr/local/bin/python2
>>>
>>> import os,sys
>>> import commands
>>>
>>> from lib import *
>>>
>>> user = 'paul at paulkudla.net'
>>>
>>> """
>>> for count in range (0,600) :
>>>         box = 'INBOX/folder-%s' %count
>>>         print count
>>>         command = '/usr/local/bin/doveadm mailbox create -s -u %s %s' %(user,box)
>>>         print command
>>>         a = commands.getoutput(command)
>>>         print a
>>> """
>>>
>>> for count in range (0,600) :
>>>         box = 'INBOX/folder-0/sub-%s' %count
>>>         print count
>>>         command = '/usr/local/bin/doveadm mailbox create -s -u %s %s' %(user,box)
>>>         print command
>>>         a = commands.getoutput(command)
>>>         print a
>>>
>>>         #sys.exit()
>>>
>>>
>>>
>>>
>>>
>>> Happy Sunday !!!
>>> Thanks - paul
>>>
>>> Paul Kudla
>>>
>>>
>>> Scom.ca Internet Services <http://www.scom.ca>
>>> 004-1009 Byron Street South
>>> Whitby, Ontario - Canada
>>> L1N 4S3
>>>
>>> Toronto 416.642.7266
>>> Main 1.866.411.7266
>>> Fax 1.888.892.7266
>>>
>>> On 4/24/2022 10:22 AM, Arnaud Abélard wrote:
>>>> Hello,
>>>>
>>>> I am working on replicating a server (and adding compression on the
>>>> other side), and since I had "Error: dsync I/O has stalled, no activity
>>>> for 600 seconds (version not received)" errors, I upgraded both source
>>>> and destination servers to the latest 2.3 version (2.3.18). While
>>>> before the upgrade all 15 replication connections were busy, after
>>>> upgrading "doveadm replicator dsync-status" shows that most of the time
>>>> nothing is being replicated at all. I can see some brief replications
>>>> here and there, but 99.9% of the time nothing is happening.
>>>>
>>>> I have a replication_full_sync_interval of 12 hours, but I have
>>>> thousands of users whose last full sync was over 90 hours ago.
>>>>
>>>> "doveadm replicator status" also shows that I have over 35,000 queued
>>>> full resync requests, but no sync, high or low queued requests, so why
>>>> aren't the full resyncs occurring?
>>>>
>>>> There are no errors in the logs.
>>>>
>>>> Thanks,
>>>>
>>>> Arnaud
>>>>
>>>>
>>>>
>>>>
>>>>
> 

-- 
Arnaud Abélard
Responsable pôle Système et Stockage
Service Infrastructures
DSIN Université de Nantes
-
