no full syncs after upgrading to dovecot 2.3.18

Arnaud Abélard

24 Apr 2022

5:22 p.m.

Hello,

I am working on replicating a server (and adding compression on the other side). Since I was getting "Error: dsync I/O has stalled, no activity for 600 seconds (version not received)" errors, I upgraded both the source and destination servers to the latest 2.3 release (2.3.18). Before the upgrade, all 15 replication connections were busy; since the upgrade, "doveadm replicator dsync-status" shows that most of the time nothing is being replicated at all. I can see some brief replications, but 99.9% of the time nothing is happening.

I have a replication_full_sync_interval of 12 hours but I have thousands of users with their last full sync over 90 hours ago.

"doveadm replicator status" also shows that I have over 35,000 queued full resync requests but no queued sync, high or low requests, so why aren't the full resyncs occurring?

There are no errors in the logs.
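To put a number on how many users are past the 12-hour replication_full_sync_interval, each user's full-sync age can be compared against the interval. The sketch below assumes the age is printed as hours:minutes:seconds (with "-" meaning never), as in the per-user rows of "doveadm replicator status"; the helper names are hypothetical and not part of Dovecot.

```python
def age_to_hours(age):
    """Convert an 'H:MM:SS' age string to hours; '-' (never synced) gives None."""
    if age == "-":
        return None
    hours, minutes, seconds = (int(part) for part in age.split(":"))
    return hours + minutes / 60 + seconds / 3600


def full_sync_overdue(age, interval_hours=12):
    """True if the last full sync is older than the interval, or never happened."""
    hours = age_to_hours(age)
    return hours is None or hours > interval_hours


print(full_sync_overdue("90:15:00"))   # True: well past a 12-hour interval
print(full_sync_overdue("07:11:28"))   # False: still inside the interval
```

Counting the users for which full_sync_overdue() is true gives a figure that can be tracked over time to see whether the backlog is shrinking.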

Thanks,

Arnaud

-- Arnaud Abélard Responsable pôle Système et Stockage Service Infrastructures DSIN Université de Nantes

Paul Kudla (SCOM.CA Internet Services Inc.)

24 Apr

8:36 p.m.

Question: I am having similar replication issues.

Please read everything below and advise on the folder counts of the non-replicated users.

I find that the total number of folders per account seems to be a factor, and NOT the size of the mailbox.

For example, I have customers with 40G of email spread over 40 or so folders, and that works OK.

300+ folders seems to be the issue.

I have been going through the replication code.

No errors are being logged.

I am assuming that the replication --> dhclient --> other server path is timing out or not reading the folder lists correctly (i.e. it dies after X folders read).

So I am going through the code, patching in log entries etc., to find the issue.

see

[13:33:57] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot

ll

total 86
drwxr-xr-x 2 root wheel uarch    4B Apr 24 11:11 .
drwxr-xr-x 4 root wheel uarch    4B Mar  8  2021 ..
-rw-r--r-- 1 root wheel uarch   73B Apr 24 11:11 instances
-rw-r--r-- 1 root wheel uarch  160K Apr 24 13:33 replicator.db

[13:33:58] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot

replicator.db seems to get updated ok but never processed properly.

sync.users

nick@elirpa.com                   high     00:09:41 463:47:01 -         y
keith@elirpa.com                  high     00:09:23 463:45:43 -         y
paul@scom.ca                      high     00:09:41 463:46:51 -         y
ed@scom.ca                        high     00:09:43 463:47:01 -         y
ed.hanna@dssmgmt.com              high     00:09:42 463:46:58 -         y
paul@paulkudla.net                high     00:09:44 463:47:03 580:35:07 y
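Rows like these can be split into fields instead of being scanned character by character. This is a minimal sketch that assumes six whitespace-separated columns (username, priority, fast sync age, full sync age, last successful sync age, failed flag), which matches the rows above; the UserStatus and parse_status_line names are made up for illustration.

```python
from typing import NamedTuple


class UserStatus(NamedTuple):
    username: str
    priority: str
    fast_sync: str
    full_sync: str
    success_sync: str
    failed: bool


def parse_status_line(line):
    """Parse one per-user row; return None for blank, header or malformed rows."""
    fields = line.split()
    if len(fields) != 6:
        return None
    username, priority, fast, full, success, failed = fields
    return UserStatus(username, priority, fast, full, success, failed == "y")


sample = """\
nick@elirpa.com     high 00:09:41 463:47:01 -         y
paul@paulkudla.net  high 00:09:44 463:47:03 580:35:07 y
"""

# Collect the users whose last sync is flagged as failed.
failed_users = [
    status.username
    for status in map(parse_status_line, sample.splitlines())
    if status is not None and status.failed
]
print(failed_users)
```

Because rows are matched on field count, a header row with multi-word column names (more than six tokens) is skipped rather than mistaken for a user.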

So, two things.

First: to get the production systems working, I had to write a script that finds the bad syncs and then forces a dsync between the servers.

I run it from cron on each server (every ten minutes in the entry below).

In crontab:

*/10 * * * * root /usr/bin/nohup /programs/common/sync.recover > /dev/null

python script to sort things out

cat /programs/common/sync.recover

#!/usr/local/bin/python3

# Force a sync between servers for users that are reporting a failed sync.

import os,sys,django,socket
import subprocess                   # used by the background Popen() call below
from optparse import OptionParser

# Local helper library; presumably the source of commands() and sendmail() used below.
from lib import *

# Sample re-index of a mailbox:
# doveadm -D force-resync -u paul@scom.ca -f INBOX*

USAGE_TEXT = '''
usage: %%prog %s[options] '''

parser = OptionParser(usage=USAGE_TEXT % '', version='0.4')

parser.add_option("-m", "--send_to", dest="send_to", help="Send Email To")
parser.add_option("-e", "--email", dest="email_box", help="Box to Index")
parser.add_option("-d", "--detail", action='store_true', dest="detail", default=False, help="Detailed report")
parser.add_option("-i", "--index", action='store_true', dest="index", default=False, help="Index")

options, args = parser.parse_args()

print (options.email_box)
print (options.send_to)
print (options.detail)

print ('Getting Current User Sync Status')
command = commands("/usr/local/bin/doveadm replicator status '*'")

sync_user_status = command.output.split('\n')

synced = []

# Skip the header row (index 0) and walk the per-user rows.
for n in range(1, len(sync_user_status)):
    user = sync_user_status[n]
    username = user.split(' ')[0]
    print ('Processing User : %s' % username)

    # With -e, only process the requested mailbox.
    if options.email_box != None and username != options.email_box:
        continue

    if options.index:
        command = '/usr/local/bin/doveadm -D force-resync -u %s -f INBOX*' % username
        command = commands(command)
        command = command.output

    # Scan the row from the right: the last column is the "failed" flag,
    # '-' for a healthy user and 'y' for a failed sync.
    for nn in range(len(user)-1, 0, -1):
        if user[nn] == '-':
            break

        if user[nn] == 'y':    # Found a bad mailbox
            print ('syncing ... %s' % username)

            if options.detail:
                # Run dsync in the foreground and keep its debug output for the report.
                command = '/usr/local/bin/doveadm -D sync -u %s -d -N -l 30 -U' % username
                print (command)
                command = commands(command)
                command = command.output.split('\n')
                print (command)
                print ('Processed Mailbox for ... %s' % username)
                synced.append('Processed Mailbox for ... %s' % username)
                for nnn in range(len(command)):
                    synced.append(command[nnn] + '\n')
            else:
                # Start dsync in the background and move on to the next user.
                command = subprocess.Popen(
                    "/usr/local/bin/doveadm sync -u %s -d -N -l 30 -U" % username,
                    shell=True, stdin=None, stdout=None, stderr=None, close_fds=True)
                print ('Processed Mailbox for ... %s' % username)
                synced.append('Processed Mailbox for ... %s' % username)
            break

if len(synced) != 0:
    # Send an email listing the mailboxes that needed a forced sync.
    if options.send_to != None:
        send_from = 'monitor@scom.ca'
        send_to = ['%s' % options.send_to]
        send_subject = 'Dovecot Bad Sync Report for : %s' % (socket.gethostname())
        send_text = '\n\n'
        for n in range(len(synced)):
            send_text = send_text + synced[n] + '\n'

        send_files = []
        sendmail (send_from, send_to, send_subject, send_text, send_files)

sys.exit()

Second:

I posted this a month ago and got no response.

Please appreciate that I am trying to help.

After much testing I can now reproduce the replication issue at hand.

I am running FreeBSD 12 and 13 stable (both test and production servers),

sdram drives etc ...

Basically, replication works fine until an account reaches roughly 256 folders or more.

To reproduce it with doveadm, I created folders like

INBOX/folder-0
INBOX/folder-1
INBOX/folder-2
INBOX/folder-3
and so forth.

I created 200 folders and they replicated OK on both servers.

I created another 200 (400 total) and the replicator got stuck: it would not update the mbox on the alternate server anymore, and it is still "updating" 4 days later.

Basically, the replicator goes so far and then either hangs or, more likely, bails out on an error that is not reported in the debug output.

However, dsync will sync the two servers, but only when run manually (i.e. all the folders will sync).

I have two test servers available if you need any kind of access. Again, I am here to help.

[07:28:42] mail18.scom.ca [root:0] ~

sync.status

Queued 'sync' requests 0

Queued 'high' requests 6

Queued 'low' requests 0

Queued 'failed' requests 0

Queued 'full resync' requests 0

Waiting 'failed' requests 0

Total number of known users 255

username                       type        status
paul@scom.ca                   normal      Waiting for dsync to finish
keith@elirpa.com               incremental Waiting for dsync to finish
ed.hanna@dssmgmt.com           incremental Waiting for dsync to finish
ed@scom.ca                     incremental Waiting for dsync to finish
nick@elirpa.com                incremental Waiting for dsync to finish
paul@paulkudla.net             incremental Waiting for dsync to finish

I have been going through the C code, and it seems the replication gets requested OK.

replicator.db does get updated OK with the replication request for the mbox in question.

However, I am still looking for the actual replicator function in the libs that performs the replication requests.

The number of folders and subfolders is definitely the issue, not the physical size of the mbox as originally thought.

If someone can point me in the right direction: it seems that either the replicator is not picking up the number of folders to replicate properly, or it has a hard-set limit like 256 / 512 / 65535 etc. and stops the replication request thereafter.

To give you some background on my skills: I am mainly a machine code programmer from the 80s and have concentrated on Python as of late; C I am only starting to go through.

It took 2 months to figure this out.

This issue also seems to be indirectly causing the duplicate message suppression not to work.

Python program to reproduce the issue (loops are for the last run, started @ 200 - fyi):

cat mbox.gen

#!/usr/local/bin/python3

import sys
import subprocess

user = 'paul@paulkudla.net'

# First run (now disabled): top-level folders INBOX/folder-0 ... INBOX/folder-599
"""
for count in range(0, 600):
    box = 'INBOX/folder-%s' % count
    print(count)
    command = '/usr/local/bin/doveadm mailbox create -s -u %s %s' % (user, box)
    print(command)
    print(subprocess.getoutput(command))
"""

# Last run: subfolders INBOX/folder-0/sub-0 ... INBOX/folder-0/sub-599
for count in range(0, 600):
    box = 'INBOX/folder-0/sub-%s' % count
    print(count)
    command = '/usr/local/bin/doveadm mailbox create -s -u %s %s' % (user, box)
    print(command)
    print(subprocess.getoutput(command))
    #sys.exit()

Happy Sunday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

Sebastian Marske

25 Apr

2:54 p.m.

Hi there,

thanks for your insights and for diving deeper into this Paul!

For me, the users ending up in 'Waiting for dsync to finish' all have more than 256 IMAP folders as well (ranging from 288 up to >5500, as per 'doveadm mailbox list -u <username> | wc -l'). For more details on my setup, please see my post from February [1].
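The same folder count can be collected for many users at once and compared against the suspected limit. This is a rough sketch, not a tested tool: the 256 threshold is only the working hypothesis from this thread, the function names are hypothetical, and the doveadm path is an assumption that differs between systems (/usr/local/bin on FreeBSD, usually /usr/bin on Linux).

```python
import subprocess


def count_folders(listing):
    """Count non-empty lines in 'doveadm mailbox list' output (one folder per line)."""
    return sum(1 for line in listing.splitlines() if line.strip())


def list_folders(username, doveadm="/usr/local/bin/doveadm"):
    """Run 'doveadm mailbox list -u <username>' and return its output."""
    result = subprocess.run(
        [doveadm, "mailbox", "list", "-u", username],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def over_threshold(counts, threshold=256):
    """Return (user, count) pairs above the threshold, largest first."""
    flagged = {user: n for user, n in counts.items() if n > threshold}
    return sorted(flagged.items(), key=lambda item: item[1], reverse=True)


# Example with made-up counts; on a real server the dict would be built with
# {user: count_folders(list_folders(user)) for user in users}.
print(over_threshold({"a@example.com": 40, "b@example.com": 288, "c@example.com": 5500}))
```

Comparing the flagged users with those stuck in 'Waiting for dsync to finish' would show whether the correlation holds on a given installation.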

@Arnaud: What OS are you running on?

Best Sebastian

[1] https://dovecot.org/pipermail/dovecot/2022-February/124168.html

Arnaud Abélard

4:13 p.m.

Hello,

On my side we are running Linux (Debian Buster).

I'm not sure my problem is actually the same as Paul's or yours, Sebastian: I have a lot of mailboxes, but they are small (quota of 110MB), so I doubt any of them has more than a dozen IMAP folders.

The main symptom is that I have tons of full sync requests waiting, but even though no other sync is pending, the replicator just waits for something to trigger those syncs.

Today, with users back, I can see that normal and incremental syncs are being done on the 15 connections, with an occasional full sync here or there and lots of "Waiting 'failed' requests":

Queued 'sync' requests 0

Queued 'high' requests 0

Queued 'low' requests 0

Queued 'failed' requests 122

Queued 'full resync' requests 28785

Waiting 'failed' requests 4294

Total number of known users 42512
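For tracking these counters over time, the summary can be turned into a dictionary. The sketch below assumes each summary line is a label followed by an integer, as in the output above; parse_summary is a hypothetical helper, not a doveadm feature.

```python
import re

# A label, some whitespace, then an integer at the end of the line.
SUMMARY_LINE = re.compile(r"^(?P<label>.+?)\s+(?P<count>\d+)\s*$")


def parse_summary(text):
    """Map each summary label to its integer count."""
    counts = {}
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match:
            counts[match.group("label")] = int(match.group("count"))
    return counts


sample = """\
Queued 'sync' requests        0
Queued 'high' requests        0
Queued 'low' requests         0
Queued 'failed' requests      122
Queued 'full resync' requests 28785
Waiting 'failed' requests     4294
Total number of known users   42512
"""

counts = parse_summary(sample)
share = counts["Queued 'full resync' requests"] / counts["Total number of known users"]
print(f"{share:.1%} of known users are waiting for a full resync")
```

Logging these values at regular intervals would show whether the full resync queue drains at all during quiet periods.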

So, why didn't the replicator take advantage of the weekend to replicate the mailboxes while no users were using them?

Arnaud

Arnaud Abélard

5:01 p.m.

Ah, I'm now getting errors in the logs, which would explain the increasing number of failed sync requests:

dovecot: imap(xxxxx)<2961235><Bs6w43rdQPAqAcsFiXEmAInUhhA3Rfqh>: Error: Mailbox INBOX: /vmail/l/i/xxxxx/dovecot.index reset, view is now inconsistent

And sure enough:

doveadm replicator status xxxxx

xxxxx none 00:02:54 07:11:28 - y
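
A small sketch of how status lines like this could be checked automatically. It assumes, based only on the sample output in this thread, that each user line has six whitespace-separated columns (username, priority, fast sync age, full sync age, last successful sync or `-`, and a failed flag that is `y` or `-`). The function and class names are made up for illustration.

```python
from typing import List, NamedTuple, Optional


class ReplicatorRow(NamedTuple):
    username: str
    priority: str
    fast_sync: str
    full_sync: str
    success_sync: str
    failed: bool


def parse_status_line(line: str) -> Optional[ReplicatorRow]:
    """Parse one user line; return None for blank, header or malformed lines."""
    fields = line.split()
    # Assumption: user lines have exactly six columns; anything else is skipped.
    if len(fields) != 6 or fields[0] == "username":
        return None
    return ReplicatorRow(fields[0], fields[1], fields[2], fields[3],
                         fields[4], fields[5] == "y")


def failed_users(output: str) -> List[str]:
    """Return the usernames whose failed flag is set."""
    rows = (parse_status_line(line) for line in output.splitlines())
    return [row.username for row in rows if row is not None and row.failed]
```

Feeding it the captured output of `doveadm replicator status '*'` would give the list of accounts to look at first.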

What could explain that error?

Arnaud


Paul Kudla (SCOM.CA Internet Services Inc.)

26 Apr 26 Apr

12:21 p.m.

Side issue:

If you are getting inconsistent dsyncs, there is no real way to fix this in the long run.

I know it's a pain (I already had to do it myself).

I needed to do a full sync: take one server offline, delete the user dir (with Dovecot offline) and then rsync (or somehow duplicate) the main server's user data over to the remote again.

Then bring the remote back up, and it kind of worked.

Best suggestion is to bring the main server down at night so the copy is clean?

If using Postfix, you can enable the soft bounce option and the mail will spool up until everything comes back online

(needs to be enabled on both servers).
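
The manual procedure above can be written down as an ordered command list. The sketch below only builds the commands and runs nothing. The host name, the mail directory path and the `service dovecot ...` invocations are assumptions for illustration and will differ per system; `soft_bounce` is the Postfix parameter behind the soft bounce option.

```python
def resync_plan(user_dir, main_host):
    """Return, in order, the commands to rebuild one user's mail dir on the replica.

    user_dir and main_host are placeholders supplied by the caller; the
    commands are returned as argv lists and are NOT executed here.
    """
    path = user_dir.rstrip("/")
    if not path:
        raise ValueError("refusing to build a plan for an empty or root path")
    src = "%s:%s/" % (main_host, path)
    dst = path + "/"
    return [
        # keep undeliverable mail queued instead of bouncing it (set on both servers)
        ["postconf", "-e", "soft_bounce=yes"],
        ["postfix", "reload"],
        # take Dovecot on the replica offline before touching its files
        ["service", "dovecot", "stop"],
        # drop the inconsistent copy, then pull a fresh one from the main server
        ["rm", "-rf", dst],
        ["rsync", "-a", src, dst],
        ["service", "dovecot", "start"],
        # back to normal bounce handling once both sides are up again
        ["postconf", "-e", "soft_bounce=no"],
        ["postfix", "reload"],
    ]
```

Printing the plan first, for example `for cmd in resync_plan("/vmail/example/user", "main.example.com"): print(" ".join(cmd))`, makes it easy to review the steps before running any of them by hand.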

Replication was still an issue on accounts with 300+ folders in them; I am still working on a fix for that.

Happy Tuesday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266


Reuben Farrelly

4:03 p.m.

I ran into this back in February and documented a reproducible test case (and sent it to this list). In short - I was able to reproduce this by having a valid and consistent mailbox on the source/local, creating a very standard empty Maildir/(new|cur|tmp) folder on the remote replica, and then initiating the replicate from the source. This consistently caused dsync to fail replication with the error "dovecot.index reset, view is now inconsistent" and sync aborted, leaving the replica mailbox in a screwed up inconsistent state. Client connections on the source replica were also dropped when this error occurred. You can see the error by enabling debug level logging if you initiate dsync manually on a test mailbox.
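
To make the setup step concrete, here is a minimal sketch of creating the empty Maildir skeleton described above (`new`, `cur`, `tmp`). The function name is hypothetical, and where the Maildir lives depends on the installation's `mail_location`. It only prepares the directory on the replica; the dsync run itself is still started from the source as described.

```python
import os


def make_empty_maildir(root):
    """Create root/new, root/cur and root/tmp (no messages, no index files)."""
    for sub in ("new", "cur", "tmp"):
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    # Return what the directory now contains, for a quick sanity check.
    return sorted(os.listdir(root))
```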

The only workaround I found was to remove the remote Maildir and let Dovecot create the whole thing from scratch. Dovecot did not like any existing folders on the destination replica, even if they had the same names as on the source and were completely empty. I was able to reproduce this with the bare minimum of folders - just an INBOX!

I have no idea if any of the developers saw my post or if the bug has been fixed for the next release. But it seemed to be quite a common problem over time (saw a few posts from people going back a long way with the same problem) and it is seriously disruptive to clients. The error message is not helpful in tracking down the problem either.

Secondly, I have also had an ongoing and longstanding problem using tcps: for replication. For some reason using tcps: (with no other changes at all to the config) results in a lot of timeout messages: "Error: dsync I/O has stalled, no activity for 600 seconds". This goes away if I revert to tcp: instead of tcps: - with tcp: I very rarely get timeouts. No idea why; I guess this is also a bug of some sort.

It's disappointing that there appears to be no way to have these sorts of problems addressed like there once was. I am not using Dovecot for commercial purposes so paying a fortune for a support contract for a high end installation just isn't going to happen, and this list seems to be quite ordinary for getting support and reporting bugs nowadays....

Reuben

On 26/04/2022 7:21 pm, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:

...

side issue

if you are getting inconsistant dsyncs there is no real way to fix this in the long run.

i know its a pain (already had to my self)

i needed to do a full sync, take one server offline, delete the user dir (with dovecot offline) and then rsync (or somehow duplicate the main server's user data) over the the remote again.

then bring remote back up and it kind or worked worked

best suggestion is to bring the main server down at night so the copy is clean?

if using postfix you can enable the soft bounce option and the mail will back spool until everything comes back online

(needs to be enable on bother servers)

replication was still an issue on accounts with 300+ folders in them, still working on a fix for that.

Happy Tuesday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

On 4/25/2022 10:01 AM, Arnaud Abélard wrote:

...
Ah, I'm now getting errors in the logs, that would explains the increasing number of failed sync requests:

dovecot: imap(xxxxx)<2961235><Bs6w43rdQPAqAcsFiXEmAInUhhA3Rfqh>: Error: Mailbox INBOX: /vmail/l/i/xxxxx/dovecot.index reset, view is now inconsistent

And sure enough:

dovecot replicator status xxxxx

xxxxx         none     00:02:54 07:11:28 -            y

What could explain that error?

Arnaud

On 25/04/2022 15:13, Arnaud Abélard wrote:

...
Hello,

On my side we are running Linux (Debian Buster).

I'm not sure my problem is actually the same as Paul or you Sebastian since I have a lot of boxes but those are actually small (quota of 110MB) so I doubt any of them have more than a dozen imap folders.

The main symptom is that I have tons of full sync requests awaiting but even though no other sync is pending the replicator just waits for something to trigger those syncs.

Today, with users back I can see that normal and incremental syncs are being done on the 15 connections, with an occasional full sync here or there and lots of "Waiting 'failed' requests":

Queued 'sync' requests        0

Queued 'high' requests        0

Queued 'low' requests         0

Queued 'failed' requests      122

Queued 'full resync' requests 28785

Waiting 'failed' requests     4294

Total number of known users   42512

So, why didn't the replicator take advantage of the weekend to replicate the mailboxes while no user were using them?

Arnaud

On 25/04/2022 13:54, Sebastian Marske wrote:

...
Hi there,

thanks for your insights and for diving deeper into this Paul!

For me, the users ending up in 'Waiting for dsync to finish' all have more than 256 Imap folders as well (ranging from 288 up to >5500; as per 'doveadm mailbox list -u <username> | wc -l'). For more details on my setup please see my post from February [1].

@Arnaud: What OS are you running on?

Best Sebastian

[1] https://dovecot.org/pipermail/dovecot/2022-February/124168.html

On 4/24/22 19:36, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:

...
Question having similiar replication issues

pls read everything below and advise the folder counts on the non-replicated users?

i find the total number of folders / account seems to be a factor and NOT the size of the mail box

ie i have customers with 40G of emails no problem over 40 or so folders and it works ok

300+ folders seems to be the issue

i have been going through the replication code

no errors being logged

i am assuming that the replication --> dhclient --> other server is timing out or not reading the folder lists correctly (ie dies after X folders read)

thus i am going through the code patching for log entries etc to find the issues.

see

[13:33:57] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot

ll

total 86 drwxr-xr-x 2 root wheel uarch    4B Apr 24 11:11 . drwxr-xr-x 4 root wheel uarch    4B Mar 8 2021 .. -rw-r--r-- 1 root wheel uarch   73B Apr 24 11:11 instances -rw-r--r-- 1 root wheel uarch 160K Apr 24 13:33 replicator.db

[13:33:58] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot

replicator.db seems to get updated ok but never processed properly.

sync.users

nick@elirpa.com                   high     00:09:41 463:47:01 -     y keith@elirpa.com                  high     00:09:23 463:45:43 -     y paul@scom.ca                      high     00:09:41 463:46:51 -     y ed@scom.ca                        high     00:09:43 463:47:01 -     y ed.hanna@dssmgmt.com              high     00:09:42 463:46:58 -     y paul@paulkudla.net                high     00:09:44 463:47:03 580:35:07 y

so ....

two things :

first to get the production stuff to work i had to write a script that whould find the bad sync's and the force a dsync between the servers

i run this every five minutes or each server.

in crontab

*/10    *                *    *    *    root /usr/bin/nohup /programs/common/sync.recover > /dev/null

python script to sort things out

cat /programs/common/sync.recover

#!/usr/local/bin/python3

#Force sync between servers that are reporting bad?

import os,sys,django,socket,subprocess
from optparse import OptionParser

#commands() and sendmail() are helpers from the local lib module (not shown)
from lib import *

#Sample Re-Index MB
#doveadm -D force-resync -u paul@scom.ca -f INBOX*

USAGE_TEXT = '''
usage: %%prog %s[options] '''

parser = OptionParser(usage=USAGE_TEXT % '', version='0.4')

parser.add_option("-m", "--send_to", dest="send_to", help="Send Email To")
parser.add_option("-e", "--email", dest="email_box", help="Box to Index")
parser.add_option("-d", "--detail",action='store_true', dest="detail",default =False, help="Detailed report")
parser.add_option("-i", "--index",action='store_true', dest="index",default =False, help="Index")

options, args = parser.parse_args()

print (options.email_box)
print (options.send_to)
print (options.detail)

#sys.exit()

print ('Getting Current User Sync Status')
command = commands("/usr/local/bin/doveadm replicator status '*'")

#print command

sync_user_status = command.output.split('\n')

#print sync_user_status

synced = []

#Skip the header row, then look at one user per line
for n in range(1,len(sync_user_status)) :
        user = sync_user_status[n]
        print ('Processing User : %s' %user.split(' ')[0])
        #If a single mailbox was requested with -e, skip everyone else
        if user.split(' ')[0] != options.email_box :
                if options.email_box != None :
                        continue

        if options.index == True :
                command = '/usr/local/bin/doveadm -D force-resync -u %s -f INBOX*' %user.split(' ')[0]
                command = commands(command)
                command = command.output

        #print user
        #Scan the line backwards looking for the failed flag
        for nn in range (len(user)-1,0,-1) :
                #print nn
                #print user[nn]

                if user[nn] == '-' :
                        #print 'skipping ... %s' %user.split(' ')[0]

                        break

                if user[nn] == 'y': #Found a Bad Mailbox
                        print ('syncing ... %s' %user.split(' ')[0])

                        if options.detail == True :
                                command = '/usr/local/bin/doveadm -D sync -u %s -d -N -l 30 -U' %user.split(' ')[0]
                                print (command)
                                command = commands(command)
                                command = command.output.split('\n')
                                print (command)
                                print ('Processed Mailbox for ... %s' %user.split(' ')[0] )
                                synced.append('Processed Mailbox for ... %s' %user.split(' ')[0])
                                for nnn in range(len(command)):
                                        synced.append(command[nnn] + '\n')
                                break

                        if options.detail == False :
                                #command = '/usr/local/bin/doveadm -D sync -u %s -d -N -l 30 -U' %user.split(' ')[0]
                                #print (command)
                                #command = os.system(command)
                                #Fire and forget, the sync runs in the background
                                command = subprocess.Popen(
                                        ["/usr/local/bin/doveadm sync -u %s -d -N -l 30 -U" %user.split(' ')[0] ],
                                        shell = True, stdin=None, stdout=None, stderr=None, close_fds=True)

                                print ( 'Processed Mailbox for ... %s' %user.split(' ')[0] )
                                synced.append('Processed Mailbox for ... %s' %user.split(' ')[0])
                                #sys.exit()
                                break

if len(synced) != 0 :
        #send email showing bad synced boxes ?

        if options.send_to != None :
                send_from = 'monitor@scom.ca'
                send_to = ['%s' %options.send_to]
                send_subject = 'Dovecot Bad Sync Report for : %s' %(socket.gethostname())
                send_text = '\n\n'
                for n in range (len(synced)) :
                        send_text = send_text + synced[n] + '\n'

                send_files = []
                sendmail (send_from, send_to, send_subject, send_text, send_files)

sys.exit()
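As a side note on the forced sync call, here is a rough sketch of building the same doveadm command as an argument list, so the username never passes through a shell. The flags are copied from the script above. The helper names and the 600 second timeout are assumptions for illustration only, not something doveadm requires.

```python
import subprocess

DOVEADM = "/usr/local/bin/doveadm"  # path as used in the script above


def build_sync_argv(username, debug=False):
    """Return the argv list for one forced sync of a single user."""
    argv = [DOVEADM]
    if debug:
        argv.append("-D")
    argv += ["sync", "-u", username, "-d", "-N", "-l", "30", "-U"]
    return argv


def run_sync(username, debug=False, timeout=600):
    """Run one sync and return (exit_code, combined stdout and stderr).

    The timeout is an arbitrary choice for this sketch; subprocess raises
    TimeoutExpired if doveadm runs longer than that.
    """
    result = subprocess.run(
        build_sync_argv(username, debug),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        timeout=timeout,
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    # Dry run: only show the command that would be executed.
    print(" ".join(build_sync_argv("user@example.com", debug=True)))
```

Capturing the exit code this way also gives the report email something more specific than "Processed Mailbox" when a sync fails.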

second :

i posted this a month ago - no response

please appreciate that i am trying to help ....

after much testing i can now reproduce the replication issues at hand

I am running on freebsd 12 & 13 stable (both test and production servers)

sdram drives etc ...

Basically replication works fine until reaching a folder quantity of ~ 256 or more

to reproduce using doveadm i created folders like

INBOX/folder-0 INBOX/folder-1 INBOX/folder-2 INBOX/folder-3 and so forth ......

I created 200 folders and they replicated ok on both servers

I created another 200 (400 total) and the replicator got stuck and would not update the mbox on the alternate server anymore, and it still shows as updating 4 days later ?

basically replicator goes so far and either hangs or more likely bails on an error that is not reported in the debug logging ?

however dsync will sync the two servers but only when run manually (ie all the folders will sync)

I have two test servers available if you need any kind of access - again here to help.

[07:28:42] mail18.scom.ca [root:0] ~

sync.status

Queued 'sync' requests        0
Queued 'high' requests        6
Queued 'low' requests         0
Queued 'failed' requests      0
Queued 'full resync' requests 0
Waiting 'failed' requests     0
Total number of known users   255

username                       type        status
paul@scom.ca                   normal      Waiting for dsync to finish
keith@elirpa.com               incremental Waiting for dsync to finish
ed.hanna@dssmgmt.com           incremental Waiting for dsync to finish
ed@scom.ca                     incremental Waiting for dsync to finish
nick@elirpa.com                incremental Waiting for dsync to finish
paul@paulkudla.net             incremental Waiting for dsync to finish

i have been going through the c code and it seems the replication gets requested ok

replicator.db does get updated ok with the replication request for the mbox in question.

however i am still looking for the actual replicator function in the libs that does the actual replication requests

the number of folders & subfolders is definitely the issue - not the physical size of the mbox as originally thought.

if someone can point me in the right direction, it seems either the replicator is not picking up on the number of folders to replicate properly or it has a hard set limit like 256 / 512 / 65535 etc and stops the replication request thereafter.

I am mainly a machine code programmer from the 80's and have concentrated on python as of late, 'c' i am starting to go through, just to give you a background on my talents.

It took 2 months to figure this out.

this issue also seems to be indirectly causing the duplicate message suppression not to work as well.

python programming to reproduce issue (loops are for last run started @ 200 - fyi) :

cat mbox.gen

#!/usr/local/bin/python2

#commands is the python2 standard module used for getoutput() below
import os,sys,commands

from lib import *

user = 'paul@paulkudla.net'

#first run: top level folders
"""
for count in range (0,600) :
        box = 'INBOX/folder-%s' %count
        print count
        command = '/usr/local/bin/doveadm mailbox create -s -u %s %s' %(user,box)
        print command
        a = commands.getoutput(command)
        print a
"""

#last run: subfolders under INBOX/folder-0
for count in range (0,600) :
        box = 'INBOX/folder-0/sub-%s' %count
        print count
        command = '/usr/local/bin/doveadm mailbox create -s -u %s %s' %(user,box)
        print command
        a = commands.getoutput(command)
        print a

        #sys.exit()
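For anyone who wants to repeat the test on Python 3, a rough equivalent could look like the sketch below. It only builds and prints the `doveadm mailbox create` commands (a dry run), using the same folder naming as above. `mailbox_names` and `create_commands` are hypothetical helper names, and the user address is a placeholder.

```python
import shlex

DOVEADM = "/usr/local/bin/doveadm"  # path as used in the script above


def mailbox_names(prefix, start, stop):
    """Return names such as INBOX/folder-0, INBOX/folder-1, ..."""
    return ["%s-%d" % (prefix, n) for n in range(start, stop)]


def create_commands(user, names):
    """One argv list per mailbox; -s also subscribes the new mailbox."""
    return [[DOVEADM, "mailbox", "create", "-s", "-u", user, name] for name in names]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    names = mailbox_names("INBOX/folder", 0, 3)
    for argv in create_commands("user@example.com", names):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

To actually create the folders, each argv list could be handed to `subprocess.run`, which is left out here so the sketch is safe to run anywhere.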

Happy Sunday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

Paul Kudla (SCOM.CA Internet Services Inc.)

4:38 p.m.

Agreed, there seems to be no way of posting these kinds of issues to see if they are even being addressed, or even known about, moving forward on new updates.

i read somewhere there is a new branch coming out but nothing as of yet?

2.4 maybe .... 5.0 ........

my previous replication issues (back in feb) went unanswered.

not faulting anyone, but the developers do seem to be disconnected from issues as of late? or concentrating on other issues.

I have no problem with support contracts for day to day maintenance, however as a programmer myself they usually don't work as the other end relies on the latest source code anyways. Thus they can not help.

I am trying to take apart the replicator C code based on 2.3.18, as most of it does work to some extent.

tcps just does not work (ie it hits the 600 seconds default in the C code)

My thought is that tcp itself works ok but the replicator fails in dsync-client.c when it is asked to return the folder list?

replicator-brain.c seems to control the overall process and timing.

replicator-queue.c seems to handle the queue file, which does seem to carry accurate info.

things in the source code are documented enough to figure this out but i am still going through all the related .h files documentation wise, which are all over the place.

there is no clear documentation on the .h lib files so i have to walk through the tree one at a time finding the relevant code.

since the dsync from doveadm does seem to work ok i have to assume the dsync-client compiled into the replicator is at fault somehow, or a call from it upstream?

Thanks for your input on the other issues noted below, i will keep that in mind when disassembling the source code.

No sense in fixing one thing and leaving something else behind, probably all related anyways.

i have two test servers available so i can play with all this offline to reproduce the issues

Unfortunately I have to make a living first, this will be addressed when possible as i don't like live systems running this way, and i currently only have 5 accounts with this issue (mine included)

Happy Tuesday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

On 4/26/2022 9:03 AM, Reuben Farrelly wrote:


I ran into this back in February and documented a reproducible test case (and sent it to this list). In short - I was able to reproduce this by having a valid and consistent mailbox on the source/local, creating a very standard empty Maildir/(new|cur|tmp) folder on the remote replica, and then initiating the replicate from the source. This consistently caused dsync to fail replication with the error "dovecot.index reset, view is now inconsistent" and sync aborted, leaving the replica mailbox in a screwed up inconsistent state. Client connections on the source replica were also dropped when this error occurred. You can see the error by enabling debug level logging if you initiate dsync manually on a test mailbox.

The only workaround I found was to remove the remote Maildir and let Dovecot create the whole thing from scratch. Dovecot did not like any existing folders on the destination replica even if they had the same names as the source and were completely empty. I was able to reproduce this with the bare minimum of folders - just an INBOX!

I have no idea if any of the developers saw my post or if the bug has been fixed for the next release. But it seemed to be quite a common problem over time (saw a few posts from people going back a long way with the same problem) and it is seriously disruptive to clients. The error message is not helpful in tracking down the problem either.

Secondly, I also have had an ongoing and longstanding problem using tcps: for replication. For some reason using tcps: (with no other changes at all to the config) results in a lot of timeout messages "Error: dsync I/O has stalled, no activity for 600 seconds". This goes away if I revert back to tcp: instead of tcps - with tcp: I very rarely get timeouts. No idea why, guess this is a bug of some sort also.

It's disappointing that there appears to be no way to have these sorts of problems addressed like there once was. I am not using Dovecot for commercial purposes so paying a fortune for a support contract for a high end installation just isn't going to happen, and this list seems to be quite ordinary for getting support and reporting bugs nowadays....

Reuben

On 26/04/2022 7:21 pm, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:

side issue

if you are getting inconsistent dsyncs there is no real way to fix this in the long run.

i know it's a pain (i already had to do this myself)

i needed to do a full sync, take one server offline, delete the user dir (with dovecot offline) and then rsync (or somehow duplicate the main server's user data) over to the remote again.

then bring the remote back up and it kind of worked

best suggestion is to bring the main server down at night so the copy is clean?

if using postfix you can enable the soft bounce option (soft_bounce) and the mail will back spool until everything comes back online

(needs to be enabled on both servers)

replication was still an issue on accounts with 300+ folders in them, still working on a fix for that.

Happy Tuesday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

On 4/25/2022 10:01 AM, Arnaud Abélard wrote:

Ah, I'm now getting errors in the logs, which would explain the increasing number of failed sync requests:

dovecot: imap(xxxxx)<2961235><Bs6w43rdQPAqAcsFiXEmAInUhhA3Rfqh>: Error: Mailbox INBOX: /vmail/l/i/xxxxx/dovecot.index reset, view is now inconsistent

And sure enough:

doveadm replicator status xxxxx

xxxxx         none     00:02:54 07:11:28 -            y

What could explain that error?

Arnaud

On 25/04/2022 15:13, Arnaud Abélard wrote:

Hello,

On my side we are running Linux (Debian Buster).

I'm not sure my problem is actually the same as Paul's or yours, Sebastian, since I have a lot of mailboxes but they are small (quota of 110MB), so I doubt any of them has more than a dozen IMAP folders.

The main symptom is that I have tons of full sync requests waiting, but even though no other sync is pending the replicator just waits for something to trigger those syncs.

Today, with users back, I can see that normal and incremental syncs are being done on the 15 connections, with an occasional full sync here or there and lots of "Waiting 'failed' requests":

Queued 'sync' requests        0

Queued 'high' requests        0

Queued 'low' requests         0

Queued 'failed' requests      122

Queued 'full resync' requests 28785

Waiting 'failed' requests     4294

Total number of known users   42512
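To put those counters in proportion, here is a small sketch that parses the summary into a dictionary and works out what share of the known users is queued for a full resync. The label strings are taken verbatim from the output above; `parse_summary` is a made-up helper name.

```python
import re

# A label, some whitespace, then an integer at the end of the line.
LINE_RE = re.compile(r"^(?P<label>.+?)\s+(?P<count>\d+)\s*$")


def parse_summary(text):
    """Map each counter label to its integer value; other lines are ignored."""
    summary = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            summary[match.group("label")] = int(match.group("count"))
    return summary


SAMPLE = """\
Queued 'sync' requests        0
Queued 'high' requests        0
Queued 'low' requests         0
Queued 'failed' requests      122
Queued 'full resync' requests 28785
Waiting 'failed' requests     4294
Total number of known users   42512
"""

summary = parse_summary(SAMPLE)
total = summary["Total number of known users"]
pending_full = summary["Queued 'full resync' requests"]
share = 100.0 * pending_full / total
print("%.1f%% of known users are queued for a full resync" % share)
```

Logging this figure at intervals would show whether the full resync queue is draining at all between the busy periods.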

So, why didn't the replicator take advantage of the weekend to replicate the mailboxes while no users were using them?

Arnaud

On 25/04/2022 13:54, Sebastian Marske wrote:

Hi there,

thanks for your insights and for diving deeper into this Paul!

For me, the users ending up in 'Waiting for dsync to finish' all have more than 256 IMAP folders as well (ranging from 288 up to >5500, as per 'doveadm mailbox list -u <username> | wc -l'). For more details on my setup please see my post from February [1].
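A quick way to compare accounts against that figure is sketched below. It assumes the output of `doveadm mailbox list -u <username>` has already been collected per user as plain text. The 256 threshold is only the approximate number reported in this thread, not a documented Dovecot limit, and the function names are invented for the example.

```python
FOLDER_THRESHOLD = 256  # approximate count reported in this thread, not a documented limit


def count_folders(mailbox_list_output):
    """Count non-empty lines, roughly what `... | wc -l` reports."""
    return sum(1 for line in mailbox_list_output.splitlines() if line.strip())


def users_over_threshold(listings, threshold=FOLDER_THRESHOLD):
    """listings maps username -> raw `doveadm mailbox list` output for that user."""
    counts = {user: count_folders(text) for user, text in listings.items()}
    return sorted((user, n) for user, n in counts.items() if n > threshold)


if __name__ == "__main__":
    # Made-up listings: one small account and one with 300 folders.
    listings = {
        "small@example.com": "INBOX\nSent\nTrash\n",
        "big@example.com": "\n".join("INBOX/folder-%d" % i for i in range(300)),
    }
    for user, n in users_over_threshold(listings):
        print(user, n)
```

Cross-checking this list against the users stuck in 'Waiting for dsync to finish' would show whether the folder count really separates the two groups.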

@Arnaud: What OS are you running on?

Best Sebastian

[1] https://dovecot.org/pipermail/dovecot/2022-February/124168.html

Aki Tuomi

27 Apr 27 Apr

4:01 p.m.

Hi!

This is probably going to get fixed in 2.3.19, this looks like an issue we are already fixing.

Aki

...

On 26/04/2022 16:38 Paul Kudla (SCOM.CA Internet Services Inc.) <paul@scom.ca> wrote:

Agreed there seems to be no way of posting these kinds of issues to see if they are even being addressed or even known about moving forward on new updates

i read somewhere there is a new branch soming out but nothing as of yet?

2.4 maybe .... 5.0 ........

my previous replication issues (back in feb) went unanswered.

not faulting anyone, but the developers do seem to be disconnected from issues as of late? or concentrating on other issues.

I have no problem with support contracts for day to day maintence however as a programmer myself they usually dont work as the other end relies on the latest source code anyways. Thus can not help.

I am trying to take a part the replicator c programming based on 2.3.18 as most of it does work to some extent.

tcps just does not work (ie 600 seconds default in the c programming)

My thoughts are tcp works ok but fails when the replicator through dsync-client.c when asked to return the folder list?

replicator-brain.c seems to control the overall process and timing.

replicator-queue.c seems to handle the que file that does seem to carry acurate info.

things in the source code are documented enough to figure this out but i am still going through all the related .h files documentation wise which are all over the place.

there is no clear documentation on the .h lib files so i have to walk through the tree one at a time finding relative code.

since the dsync from doveadm does see to work ok i have to assume the dsync-client used to compile the replicator is at fault somehow or a call from it upstream?

Thanks for your input on the other issues noted below, i will keep that in mind when disassembling the source code.

No sense in fixing one thing and leaving something else behind, probably all related anyways.

i have two test servers avaliable so i can play with all this offline to reproduce the issues

Unfortunately I have to make a living first, this will be addressed when possible as i dont like systems that are live running this way and currently only have 5 accounts with this issue (mine included)

Happy Tuesday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

On 4/26/2022 9:03 AM, Reuben Farrelly wrote:

...
I ran into this back in February and documented a reproducible test case (and sent it to this list). In short - I was able to reproduce this by having a valid and consistent mailbox on the source/local, creating a very standard empty Maildir/(new|cur|tmp) folder on the remote replica, and then initiating the replicate from the source. This consistently caused dsync to fail replication with the error "dovecot.index reset, view is now inconsistent" and sync aborted, leaving the replica mailbox in a screwed up inconsistent state. Client connections on the source replica were also dropped when this error occurred. You can see the error by enabling debug level logging if you initiate dsync manually on a test mailbox.

The only workaround I found was to remove the remote Maildir and let Dovecot create the whole thing from scratch. Dovecot did not like any existing folders on the destination replica even if they were the same names as the source and completely empty. I was able to reproduce this the bare minimum of folders - just an INBOX!

I have no idea if any of the developers saw my post or if the bug has been fixed for the next release. But it seemed to be quite a common problem over time (saw a few posts from people going back a long way with the same problem) and it is seriously disruptive to clients. The error message is not helpful in tracking down the problem either.

Secondly, I also have had an ongoing and longstanding problem using tcps: for replication. For some reason using tcps: (with no other changes at all to the config) results in a lot of timeout messages "Error: dsync I/O has stalled, no activity for 600 seconds". This goes away if I revert back to tcp: instead of tcps - with tcp: I very rarely get timeouts. No idea why, guess this is a bug of some sort also.

It's disappointing that there appears to be no way to have these sorts or problems addressed like there once was. I am not using Dovecot for commercial purposes so paying a fortune for a support contract for a high end installation just isn't going to happen, and this list seems to be quite ordinary for getting support and reporting bugs nowadays....

Reuben

On 26/04/2022 7:21 pm, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:

...
side issue

if you are getting inconsistant dsyncs there is no real way to fix this in the long run.

i know its a pain (already had to my self)

i needed to do a full sync, take one server offline, delete the user dir (with dovecot offline) and then rsync (or somehow duplicate the main server's user data) over the the remote again.

then bring remote back up and it kind or worked worked

best suggestion is to bring the main server down at night so the copy is clean?

if using postfix you can enable the soft bounce option and the mail will back spool until everything comes back online

(needs to be enable on bother servers)

replication was still an issue on accounts with 300+ folders in them, still working on a fix for that.

Happy Tuesday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

On 4/25/2022 10:01 AM, Arnaud Abélard wrote:

...
Ah, I'm now getting errors in the logs, that would explains the increasing number of failed sync requests:

dovecot: imap(xxxxx)<2961235><Bs6w43rdQPAqAcsFiXEmAInUhhA3Rfqh>: Error: Mailbox INBOX: /vmail/l/i/xxxxx/dovecot.index reset, view is now inconsistent

And sure enough:

dovecot replicator status xxxxx

xxxxx         none     00:02:54 07:11:28 -            y

What could explain that error?

Arnaud

On 25/04/2022 15:13, Arnaud Abélard wrote:

...
Hello,

On my side we are running Linux (Debian Buster).

I'm not sure my problem is actually the same as Paul or you Sebastian since I have a lot of boxes but those are actually small (quota of 110MB) so I doubt any of them have more than a dozen imap folders.

The main symptom is that I have tons of full sync requests awaiting but even though no other sync is pending the replicator just waits for something to trigger those syncs.

Today, with users back I can see that normal and incremental syncs are being done on the 15 connections, with an occasional full sync here or there and lots of "Waiting 'failed' requests":

Queued 'sync' requests        0

Queued 'high' requests        0

Queued 'low' requests         0

Queued 'failed' requests      122

Queued 'full resync' requests 28785

Waiting 'failed' requests     4294

Total number of known users   42512
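To track how these counters move over time (for example over a quiet weekend), the summary can be turned into numbers. This is a hedged Python sketch rather than anything from the original post: it assumes each summary line is a label followed by a single integer, as in the output above, and `parse_replicator_summary` is a hypothetical helper name.

```python
import re
from typing import Dict


def parse_replicator_summary(text: str) -> Dict[str, int]:
    """Map each 'label   number' line of the summary to an integer counter."""
    counters: Dict[str, int] = {}
    for line in text.splitlines():
        # Label is everything up to the last run of whitespace before the number.
        match = re.match(r"^(.*\S)\s+(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


sample = """\
Queued 'sync' requests        0
Queued 'high' requests        0
Queued 'low' requests         0
Queued 'failed' requests      122
Queued 'full resync' requests 28785
Waiting 'failed' requests     4294
Total number of known users   42512
"""

counters = parse_replicator_summary(sample)
backlog = counters["Queued 'full resync' requests"]
known = counters["Total number of known users"]
print("full resync backlog: %d of %d users" % (backlog, known))
```

With the figures above, roughly two thirds of the known users (28785 of 42512) are sitting in the full resync queue.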

So, why didn't the replicator take advantage of the weekend to replicate the mailboxes while no users were using them?

Arnaud

On 25/04/2022 13:54, Sebastian Marske wrote:

...
Hi there,

thanks for your insights and for diving deeper into this Paul!

For me, the users ending up in 'Waiting for dsync to finish' all have more than 256 IMAP folders as well (ranging from 288 up to >5500; as per 'doveadm mailbox list -u <username> | wc -l'). For more details on my setup please see my post from February [1].
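The folder-count check above can be repeated across many accounts with a short script. The Python sketch below is an illustration only: it wraps the same `doveadm mailbox list -u <username>` command, counts non-empty output lines the way `wc -l` would, and flags users above 256 folders. The helper names and the default limit of 256 are assumptions taken from the figures in this thread, not a documented Dovecot limit.

```python
import subprocess
from typing import Dict, List


def count_folders(listing: str) -> int:
    """Count non-empty lines, like piping the listing through `wc -l`."""
    return sum(1 for line in listing.splitlines() if line.strip())


def list_folders(username: str) -> str:
    """Run `doveadm mailbox list -u <username>` and return its output."""
    result = subprocess.run(
        ["doveadm", "mailbox", "list", "-u", username],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def users_over_limit(counts: Dict[str, int], limit: int = 256) -> List[str]:
    """Return the users whose folder count exceeds `limit`, sorted by name."""
    return sorted(user for user, count in counts.items() if count > limit)
```

On a live server this could be driven as `users_over_limit({u: count_folders(list_folders(u)) for u in users})`, where `users` is whatever list of accounts is being investigated.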

@Arnaud: What OS are you running on?

Best Sebastian

[1] https://dovecot.org/pipermail/dovecot/2022-February/124168.html


Paul Kudla (SCOM.CA Internet Services Inc.)

28 Apr

1:57 p.m.

Thanks for the update.

is this for both replication issues (folders +300 etc)

Just Asking - Any ETA

Happy Thursday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

On 4/27/2022 9:01 AM, Aki Tuomi wrote:

...

Hi!

This is probably going to get fixed in 2.3.19, this looks like an issue we are already fixing.

Aki

...
On 26/04/2022 16:38 Paul Kudla (SCOM.CA Internet Services Inc.) <paul@scom.ca> wrote:

Agreed there seems to be no way of posting these kinds of issues to see if they are even being addressed or even known about moving forward on new updates

i read somewhere there is a new branch coming out but nothing as of yet?

2.4 maybe .... 5.0 ........

my previous replication issues (back in feb) went unanswered.

not faulting anyone, but the developers do seem to be disconnected from issues as of late? or concentrating on other issues.

I have no problem with support contracts for day to day maintenance, however as a programmer myself they usually don't work as the other end relies on the latest source code anyways. Thus can not help.

I am trying to take apart the replicator c programming based on 2.3.18 as most of it does work to some extent.

tcps just does not work (ie 600 seconds default in the c programming)

My thoughts are tcp works ok but fails when the replicator, through dsync-client.c, is asked to return the folder list?

replicator-brain.c seems to control the overall process and timing.

replicator-queue.c seems to handle the queue file that does seem to carry accurate info.

things in the source code are documented enough to figure this out but i am still going through all the related .h files documentation wise which are all over the place.

there is no clear documentation on the .h lib files so i have to walk through the tree one at a time finding relative code.

since the dsync from doveadm does seem to work ok i have to assume the dsync-client used to compile the replicator is at fault somehow or a call from it upstream?

Thanks for your input on the other issues noted below, i will keep that in mind when disassembling the source code.

No sense in fixing one thing and leaving something else behind, probably all related anyways.

i have two test servers available so i can play with all this offline to reproduce the issues

Unfortunately I have to make a living first, this will be addressed when possible as i don't like systems that are live running this way and currently only have 5 accounts with this issue (mine included)

Happy Tuesday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

On 4/26/2022 9:03 AM, Reuben Farrelly wrote:

...
I ran into this back in February and documented a reproducible test case (and sent it to this list). In short - I was able to reproduce this by having a valid and consistent mailbox on the source/local, creating a very standard empty Maildir/(new|cur|tmp) folder on the remote replica, and then initiating the replicate from the source. This consistently caused dsync to fail replication with the error "dovecot.index reset, view is now inconsistent" and sync aborted, leaving the replica mailbox in a screwed up inconsistent state. Client connections on the source replica were also dropped when this error occurred. You can see the error by enabling debug level logging if you initiate dsync manually on a test mailbox.
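The "standard empty Maildir" starting state described above can be set up on a test replica with a few lines of Python. This is only a sketch of that one step, assuming the layout is nothing more than `new`, `cur` and `tmp` directories; `create_empty_maildir` is a hypothetical helper, the example writes to a scratch directory rather than a real mail location, and initiating the replication afterwards is not shown.

```python
import os
import tempfile


def create_empty_maildir(root: str) -> None:
    """Create the three empty Maildir subdirectories under `root`."""
    for sub in ("new", "cur", "tmp"):
        os.makedirs(os.path.join(root, sub), exist_ok=True)


# Demonstrate on a throwaway directory instead of a real mail store.
with tempfile.TemporaryDirectory() as scratch:
    maildir = os.path.join(scratch, "Maildir")
    create_empty_maildir(maildir)
    print(sorted(os.listdir(maildir)))
```

Only run this against a test mailbox, since the thread reports that this starting state leaves the replica inconsistent.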

The only workaround I found was to remove the remote Maildir and let Dovecot create the whole thing from scratch. Dovecot did not like any existing folders on the destination replica even if they were the same names as the source and completely empty. I was able to reproduce this with the bare minimum of folders - just an INBOX!

I have no idea if any of the developers saw my post or if the bug has been fixed for the next release. But it seemed to be quite a common problem over time (saw a few posts from people going back a long way with the same problem) and it is seriously disruptive to clients. The error message is not helpful in tracking down the problem either.

Secondly, I also have had an ongoing and longstanding problem using tcps: for replication. For some reason using tcps: (with no other changes at all to the config) results in a lot of timeout messages "Error: dsync I/O has stalled, no activity for 600 seconds". This goes away if I revert back to tcp: instead of tcps - with tcp: I very rarely get timeouts. No idea why, guess this is a bug of some sort also.

It's disappointing that there appears to be no way to have these sorts of problems addressed like there once was. I am not using Dovecot for commercial purposes so paying a fortune for a support contract for a high end installation just isn't going to happen, and this list seems to be quite ordinary for getting support and reporting bugs nowadays....

Reuben

On 26/04/2022 7:21 pm, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:

...
side issue

if you are getting inconsistent dsyncs there is no real way to fix this in the long run.

i know it's a pain (already had to myself)

i needed to do a full sync: take one server offline, delete the user dir (with dovecot offline) and then rsync (or somehow duplicate the main server's user data) over to the remote again.

then bring the remote back up and it kind of worked

best suggestion is to bring the main server down at night so the copy is clean?

if using postfix you can enable the soft_bounce option and the mail will back spool until everything comes back online

(needs to be enabled on both servers)

replication was still an issue on accounts with 300+ folders in them, still working on a fix for that.

Happy Tuesday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

On 4/25/2022 10:01 AM, Arnaud Abélard wrote:

...
Ah, I'm now getting errors in the logs, which would explain the increasing number of failed sync requests:

dovecot: imap(xxxxx)<2961235><Bs6w43rdQPAqAcsFiXEmAInUhhA3Rfqh>: Error: Mailbox INBOX: /vmail/l/i/xxxxx/dovecot.index reset, view is now inconsistent

And sure enough:

doveadm replicator status xxxxx

xxxxx         none     00:02:54 07:11:28 -            y

What could explain that error?

Arnaud

On 25/04/2022 15:13, Arnaud Abélard wrote:

...
Hello,

On my side we are running Linux (Debian Buster).

I'm not sure my problem is actually the same as Paul's or yours, Sebastian, since I have a lot of boxes but those are actually small (quota of 110MB), so I doubt any of them have more than a dozen IMAP folders.

The main symptom is that I have tons of full sync requests awaiting but even though no other sync is pending the replicator just waits for something to trigger those syncs.

Today, with users back I can see that normal and incremental syncs are being done on the 15 connections, with an occasional full sync here or there and lots of "Waiting 'failed' requests":

Queued 'sync' requests        0

Queued 'high' requests        0

Queued 'low' requests         0

Queued 'failed' requests      122

Queued 'full resync' requests 28785

Waiting 'failed' requests     4294

Total number of known users   42512

So, why didn't the replicator take advantage of the weekend to replicate the mailboxes while no users were using them?

Arnaud


Aki Tuomi

2:02 p.m.

2.3.19 is round the corner, so not long. I cannot yet promise an exact date but hopefully within a week or two.

Aki

...

On 28/04/2022 13:57 Paul Kudla (SCOM.CA Internet Services Inc.) <paul@scom.ca> wrote:

Thanks for the update.

Is this for both replication issues (folders +300 etc)?

Just asking - any ETA?

Happy Thursday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

On 4/27/2022 9:01 AM, Aki Tuomi wrote:

...
Hi!

This is probably going to get fixed in 2.3.19; this looks like an issue we are already fixing.

Aki

...
On 26/04/2022 16:38 Paul Kudla (SCOM.CA Internet Services Inc.) <paul@scom.ca> wrote:

Agreed, there seems to be no way of posting these kinds of issues to see if they are even being addressed or even known about moving forward on new updates.

i read somewhere there is a new branch coming out but nothing as of yet?

2.4 maybe .... 5.0 ........

my previous replication issues (back in feb) went unanswered.

not faulting anyone, but the developers do seem to be disconnected from issues as of late? or concentrating on other issues.

I have no problem with support contracts for day to day maintenance; however, as a programmer myself, they usually don't work as the other end relies on the latest source code anyways and thus cannot help.

I am trying to take apart the replicator C code based on 2.3.18, as most of it does work to some extent.

tcps just does not work (ie the 600 seconds default in the C code)

My thinking is that tcp works ok but fails when the replicator, through dsync-client.c, is asked to return the folder list?

replicator-brain.c seems to control the overall process and timing.

replicator-queue.c seems to handle the queue file, which does seem to carry accurate info.

things in the source code are documented enough to figure this out but i am still going through all the related .h files documentation-wise, which are all over the place.

there is no clear documentation on the .h lib files so i have to walk through the tree one at a time finding relevant code.

since the dsync from doveadm does seem to work ok i have to assume the dsync-client used to compile the replicator is at fault somehow or a call from it upstream?

Thanks for your input on the other issues noted below, i will keep that in mind when disassembling the source code.

No sense in fixing one thing and leaving something else behind, probably all related anyways.

i have two test servers available so i can play with all this offline to reproduce the issues

Unfortunately I have to make a living first, this will be addressed when possible as i don't like systems that are live running this way and currently only have 5 accounts with this issue (mine included)

Happy Tuesday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

On 4/26/2022 9:03 AM, Reuben Farrelly wrote:

...
I ran into this back in February and documented a reproducible test case (and sent it to this list). In short - I was able to reproduce this by having a valid and consistent mailbox on the source/local, creating a very standard empty Maildir/(new|cur|tmp) folder on the remote replica, and then initiating the replicate from the source. This consistently caused dsync to fail replication with the error "dovecot.index reset, view is now inconsistent" and sync aborted, leaving the replica mailbox in a screwed up inconsistent state. Client connections on the source replica were also dropped when this error occurred. You can see the error by enabling debug level logging if you initiate dsync manually on a test mailbox.

The only workaround I found was to remove the remote Maildir and let Dovecot create the whole thing from scratch. Dovecot did not like any existing folders on the destination replica even if they were the same names as the source and completely empty. I was able to reproduce this with the bare minimum of folders - just an INBOX!

I have no idea if any of the developers saw my post or if the bug has been fixed for the next release. But it seemed to be quite a common problem over time (saw a few posts from people going back a long way with the same problem) and it is seriously disruptive to clients. The error message is not helpful in tracking down the problem either.

Secondly, I also have had an ongoing and longstanding problem using tcps: for replication. For some reason using tcps: (with no other changes at all to the config) results in a lot of timeout messages "Error: dsync I/O has stalled, no activity for 600 seconds". This goes away if I revert back to tcp: instead of tcps - with tcp: I very rarely get timeouts. No idea why, guess this is a bug of some sort also.
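
A rough way to put numbers on the tcp: versus tcps: difference is to count the stall errors logged under each setting. This is only a sketch: it assumes the log is plain text containing the message quoted above, and `count_stalls` and the sample lines are made up for illustration.

```python
from typing import Iterable

# Substring of the error message quoted in this thread.
STALL_MARKER = "dsync I/O has stalled"


def count_stalls(lines: Iterable[str]) -> int:
    """Count log lines that contain the dsync stall error."""
    return sum(1 for line in lines if STALL_MARKER in line)


# Made-up sample lines. In practice an open log file can be passed instead,
# for whatever path your syslog setup writes Dovecot messages to.
sample = [
    "dovecot: doveadm: Error: dsync I/O has stalled, no activity for 600 seconds",
    "dovecot: imap-login: Login: user=<someone>",
]
print(count_stalls(sample))
```

Running it once against a log window captured with tcp: and once with tcps: would give two counts to compare over the same length of time.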

It's disappointing that there appears to be no way to have these sorts of problems addressed like there once was. I am not using Dovecot for commercial purposes so paying a fortune for a support contract for a high end installation just isn't going to happen, and this list seems to be quite ordinary for getting support and reporting bugs nowadays....

Reuben

On 26/04/2022 7:21 pm, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:

...
side issue

if you are getting inconsistent dsyncs there is no real way to fix this in the long run.

i know it's a pain (already had to do it myself)

i needed to do a full sync, take one server offline, delete the user dir (with dovecot offline) and then rsync (or somehow duplicate the main server's user data) over to the remote again.

then bring the remote back up and it kind of worked

best suggestion is to bring the main server down at night so the copy is clean?

if using postfix you can enable the soft bounce option and the mail will back spool until everything comes back online

(needs to be enabled on both servers)

replication was still an issue on accounts with 300+ folders in them, still working on a fix for that.

Happy Tuesday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

On 4/25/2022 10:01 AM, Arnaud Abélard wrote:

...
Ah, I'm now getting errors in the logs, that would explain the increasing number of failed sync requests:

dovecot: imap(xxxxx)<2961235><Bs6w43rdQPAqAcsFiXEmAInUhhA3Rfqh>: Error: Mailbox INBOX: /vmail/l/i/xxxxx/dovecot.index reset, view is now inconsistent

And sure enough:

# doveadm replicator status xxxxx

xxxxx         none     00:02:54 07:11:28 -            y

What could explain that error?

Arnaud
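
The per-user status line in this message can also be checked mechanically. A minimal sketch, assuming the layout seen in this thread, where the first column is the username and a trailing `y` marks a failed sync; `failed_users` is a hypothetical helper, and the header row and the second user in the sample are made up:

```python
from typing import List


def failed_users(status_output: str) -> List[str]:
    """Return usernames whose last column is 'y' (the failed flag)."""
    failed = []
    for line in status_output.splitlines():
        fields = line.split()
        # Skip blank lines and a header row starting with "username".
        if len(fields) < 2 or fields[0] == "username":
            continue
        if fields[-1] == "y":
            failed.append(fields[0])
    return failed


# First data line is copied from the message above; the rest is illustrative.
sample = """\
username      priority fast sync full sync success sync failed
xxxxx         none     00:02:54 07:11:28 -            y
someone       none     00:01:10 02:00:00 00:01:10     -
"""
print(failed_users(sample))
```

Splitting on whitespace and looking only at the last field avoids scanning the line character by character, so a `-` in the success column cannot hide the failed flag.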

On 25/04/2022 15:13, Arnaud Abélard wrote:
> Hello,
>
> On my side we are running Linux (Debian Buster).
>
> I'm not sure my problem is actually the same as Paul's or yours,
> Sebastian, since I have a lot of boxes but those are actually small
> (quota of 110MB) so I doubt any of them have more than a dozen IMAP
> folders.
>
> The main symptom is that I have tons of full sync requests waiting
> but even though no other sync is pending the replicator just waits
> for something to trigger those syncs.
>
> Today, with users back I can see that normal and incremental syncs
> are being done on the 15 connections, with an occasional full sync
> here or there and lots of "Waiting 'failed' requests":
>
> Queued 'sync' requests        0
> Queued 'high' requests        0
> Queued 'low' requests         0
> Queued 'failed' requests      122
> Queued 'full resync' requests 28785
> Waiting 'failed' requests     4294
> Total number of known users   42512
>
> So, why didn't the replicator take advantage of the weekend to
> replicate the mailboxes while no users were using them?
>
> Arnaud
>
> On 25/04/2022 13:54, Sebastian Marske wrote:
>> Hi there,
>>
>> thanks for your insights and for diving deeper into this Paul!
>>
>> For me, the users ending up in 'Waiting for dsync to finish' all have
>> more than 256 IMAP folders as well (ranging from 288 up to >5500; as per
>> 'doveadm mailbox list -u <username> | wc -l'). For more details on my
>> setup please see my post from February [1].
>>
>> @Arnaud: What OS are you running on?
>>
>> Best
>> Sebastian
>>
>> [1] https://dovecot.org/pipermail/dovecot/2022-February/124168.html
>>
>> On 4/24/22 19:36, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:
>>> Question having similar replication issues
>>>
>>> pls read everything below and advise the folder counts on the
>>> non-replicated users?
>>>
>>> i find the total number of folders / account seems to be a factor and
>>> NOT the size of the mail box
>>>
>>> ie i have customers with 40G of emails no problem over 40 or so folders
>>> and it works ok
>>>
>>> 300+ folders seems to be the issue
>>>
>>> i have been going through the replication code
>>>
>>> no errors being logged
>>>
>>> i am assuming that the replication --> dhclient --> other server is
>>> timing out or not reading the folder lists correctly (ie dies after X
>>> folders read)
>>>
>>> thus i am going through the code patching for log entries etc to find
>>> the issues.
>>>
>>> see
>>>
>>> [13:33:57] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot
>>> # ll
>>> total 86
>>> drwxr-xr-x 2 root wheel uarch    4B Apr 24 11:11 .
>>> drwxr-xr-x 4 root wheel uarch    4B Mar 8 2021 ..
>>> -rw-r--r-- 1 root wheel uarch   73B Apr 24 11:11 instances
>>> -rw-r--r-- 1 root wheel uarch 160K Apr 24 13:33 replicator.db
>>>
>>> [13:33:58] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot
>>> #
>>>
>>> replicator.db seems to get updated ok but never processed properly.
>>>
>>> # sync.users
>>> nick@elirpa.com                   high     00:09:41 463:47:01 -         y
>>> keith@elirpa.com                  high     00:09:23 463:45:43 -         y
>>> paul@scom.ca                      high     00:09:41 463:46:51 -         y
>>> ed@scom.ca                        high     00:09:43 463:47:01 -         y
>>> ed.hanna@dssmgmt.com              high     00:09:42 463:46:58 -         y
>>> paul@paulkudla.net                high     00:09:44 463:47:03 580:35:07 y
>>>
>>> so ....
>>>
>>> two things :
>>>
>>> first to get the production stuff to work i had to write a script that
>>> would find the bad syncs and then force a dsync between the servers
>>>
>>> i run this every five minutes on each server.
>>>
>>> in crontab
>>>
>>> */10    *                *    *    *    root /usr/bin/nohup /programs/common/sync.recover > /dev/null
>>>
>>> python script to sort things out
>>>
>>> # cat /programs/common/sync.recover
>>> #!/usr/local/bin/python3
>>>
>>> #Force sync between servers that are reporting bad?
>>>
>>> import os,sys,django,socket,subprocess
>>> from optparse import OptionParser
>>>
>>> #commands() and sendmail() are not defined here, so they presumably come from this local lib
>>> from lib import *
>>>
>>> #Sample Re-Index MB
>>> #doveadm -D force-resync -u paul@scom.ca -f INBOX*
>>>
>>> USAGE_TEXT = '''
>>> usage: %%prog %s[options]
>>> '''
>>>
>>> parser = OptionParser(usage=USAGE_TEXT % '', version='0.4')
>>>
>>> parser.add_option("-m", "--send_to", dest="send_to", help="Send Email To")
>>> parser.add_option("-e", "--email", dest="email_box", help="Box to Index")
>>> parser.add_option("-d", "--detail",action='store_true', dest="detail",default =False, help="Detailed report")
>>> parser.add_option("-i", "--index",action='store_true', dest="index",default =False, help="Index")
>>>
>>> options, args = parser.parse_args()
>>>
>>> print (options.email_box)
>>> print (options.send_to)
>>> print (options.detail)
>>>
>>> #sys.exit()
>>>
>>> print ('Getting Current User Sync Status')
>>> command = commands("/usr/local/bin/doveadm replicator status '*'")
>>>
>>> #print command
>>>
>>> sync_user_status = command.output.split('\n')
>>>
>>> #print sync_user_status
>>>
>>> synced = []
>>>
>>> #Start at 1 to skip the header row
>>> for n in range(1,len(sync_user_status)) :
>>>     user = sync_user_status[n]
>>>     print ('Processing User : %s' %user.split(' ')[0])
>>>     if user.split(' ')[0] != options.email_box :
>>>         if options.email_box != None :
>>>             continue
>>>
>>>     if options.index == True :
>>>         command = '/usr/local/bin/doveadm -D force-resync -u %s -f INBOX*' %user.split(' ')[0]
>>>         command = commands(command)
>>>         command = command.output
>>>
>>>     #print user
>>>     #Scan the status line backwards looking for the failed flag
>>>     for nn in range (len(user)-1,0,-1) :
>>>         #print nn
>>>         #print user[nn]
>>>
>>>         if user[nn] == '-' :
>>>             #print 'skipping ... %s' %user.split(' ')[0]
>>>             break
>>>
>>>         if user[nn] == 'y': #Found a Bad Mailbox
>>>             print ('syncing ... %s' %user.split(' ')[0])
>>>
>>>             if options.detail == True :
>>>                 command = '/usr/local/bin/doveadm -D sync -u %s -d -N -l 30 -U' %user.split(' ')[0]
>>>                 print (command)
>>>                 command = commands(command)
>>>                 command = command.output.split('\n')
>>>                 print (command)
>>>                 print ('Processed Mailbox for ... %s' %user.split(' ')[0] )
>>>                 synced.append('Processed Mailbox for ... %s' %user.split(' ')[0])
>>>                 for nnn in range(len(command)):
>>>                     synced.append(command[nnn] + '\n')
>>>                 break
>>>
>>>             if options.detail == False :
>>>                 #command = '/usr/local/bin/doveadm -D sync -u %s -d -N -l 30 -U' %user.split(' ')[0]
>>>                 #print (command)
>>>                 #command = os.system(command)
>>>                 command = subprocess.Popen(
>>>                     ["/usr/local/bin/doveadm sync -u %s -d -N -l 30 -U" %user.split(' ')[0] ],
>>>                     shell = True, stdin=None, stdout=None, stderr=None, close_fds=True)
>>>
>>>                 print ( 'Processed Mailbox for ... %s' %user.split(' ')[0] )
>>>                 synced.append('Processed Mailbox for ... %s' %user.split(' ')[0])
>>>                 #sys.exit()
>>>                 break
>>>
>>> if len(synced) != 0 :
>>>     #send email showing bad synced boxes ?
>>>
>>>     if options.send_to != None :
>>>         send_from = 'monitor@scom.ca'
>>>         send_to = ['%s' %options.send_to]
>>>         send_subject = 'Dovecot Bad Sync Report for : %s' %(socket.gethostname())
>>>         send_text = '\n\n'
>>>         for n in range (len(synced)) :
>>>             send_text = send_text + synced[n] + '\n'
>>>
>>>         send_files = []
>>>         sendmail (send_from, send_to, send_subject, send_text, send_files)
>>>
>>> sys.exit()
>>>
>>> second :
>>>
>>> i posted this a month ago - no response
>>>
>>> please appreciate that i am trying to help ....
>>>
>>> after much testing i can now reproduce the replication issues at hand
>>>
>>> I am running on freebsd 12 & 13 stable (both test and production
>>> servers)
>>>
>>> sdram drives etc ...
>>>
>>> Basically replication works fine until reaching a folder quantity of
>>> ~256 or more
>>>
>>> to reproduce using doveadm i created folders like
>>>
>>> INBOX/folder-0
>>> INBOX/folder-1
>>> INBOX/folder-2
>>> INBOX/folder-3
>>> and so forth ......
>>>
>>> I created 200 folders and they replicated ok on both servers
>>>
>>> I created another 200 (400 total) and the replicator got stuck and would
>>> not update the mbox on the alternate server anymore and is still
>>> updating 4 days later ?
>>>
>>> basically replicator goes so far and either hangs or more likely bails
>>> on an error that is not reported to the debug reporting ?
>>>
>>> however dsync will sync the two servers but only when run manually (ie
>>> all the folders will sync)
>>>
>>> I have two test servers available if you need any kind of access - again
>>> here to help.
>>>
>>> [07:28:42] mail18.scom.ca [root:0] ~
>>> # sync.status
>>> Queued 'sync' requests        0
>>> Queued 'high' requests        6
>>> Queued 'low' requests         0
>>> Queued 'failed' requests      0
>>> Queued 'full resync' requests 0
>>> Waiting 'failed' requests     0
>>> Total number of known users   255
>>>
>>> username                       type        status
>>> paul@scom.ca                   normal      Waiting for dsync to finish
>>> keith@elirpa.com               incremental Waiting for dsync to finish
>>> ed.hanna@dssmgmt.com           incremental Waiting for dsync to finish
>>> ed@scom.ca                     incremental Waiting for dsync to finish
>>> nick@elirpa.com                incremental Waiting for dsync to finish
>>> paul@paulkudla.net             incremental Waiting for dsync to finish
>>>
>>> i have been going through the c code and it seems the replication gets
>>> requested ok
>>>
>>> replicator.db does get updated ok with the replicated request for the
>>> mbox in question.
>>>
>>> however i am still looking for the actual replicator function in the
>>> libs that does the actual replication requests
>>>
>>> the number of folders & subfolders is definitely the issue - not the
>>> mbox physical size as thought originally.
>>>
>>> if someone can point me in the right direction, it seems either the
>>> replicator is not picking up on the number of folders to replicate
>>> properly or it has a hard set limit like 256 / 512 / 65535 etc and stops
>>> the replication request thereafter.
>>>
>>> I am mainly a machine code programmer from the 80's and have
>>> concentrated on python as of late, 'c' i am starting to go through just
>>> to give you a background on my talents.
>>>
>>> It took 2 months to figure this out.
>>>
>>> this issue also seems to be indirectly causing the duplicate message
>>> suppression not to work as well.
>>>
>>> python programming to reproduce issue (loops are for last run started @
>>> 200 - fyi) :
>>>
>>> # cat mbox.gen
>>> #!/usr/local/bin/python2
>>>
>>> import os,sys
>>>
>>> from lib import *
>>>
>>> user = 'paul@paulkudla.net'
>>>
>>> """
>>> for count in range (0,600) :
>>>     box = 'INBOX/folder-%s' %count
>>>     print count
>>>     command = '/usr/local/bin/doveadm mailbox create -s -u %s %s' %(user,box)
>>>     print command
>>>     a = commands.getoutput(command)
>>>     print a
>>> """
>>>
>>> for count in range (0,600) :
>>>     box = 'INBOX/folder-0/sub-%s' %count
>>>     print count
>>>     command = '/usr/local/bin/doveadm mailbox create -s -u %s %s' %(user,box)
>>>     print command
>>>     a = commands.getoutput(command)
>>>     print a
>>>
>>>     #sys.exit()
>>>
>>> Happy Sunday !!!
>>> Thanks - paul
>>>
>>> Paul Kudla
>>>
>>> Scom.ca Internet Services <http://www.scom.ca>
>>> 004-1009 Byron Street South
>>> Whitby, Ontario - Canada
>>> L1N 4S3
>>>
>>> Toronto 416.642.7266
>>> Main 1.866.411.7266
>>> Fax 1.888.892.7266
>>>
>>> On 4/24/2022 10:22 AM, Arnaud Abélard wrote:
>>>> Hello,
>>>>
>>>> I am working on replicating a server (and adding compression on the
>>>> other side) and since I had "Error: dsync I/O has stalled, no activity
>>>> for 600 seconds (version not received)" errors I upgraded both source
>>>> and destination server with the latest 2.3 version (2.3.18). While
>>>> before the upgrade all the 15 replication connections were busy, after
>>>> upgrading dovecot replicator dsync-status shows that most of the time
>>>> nothing is being replicated at all. I can see some brief replications
>>>> that last, but 99,9% of the time nothing is happening at all.
>>>>
>>>> I have a replication_full_sync_interval of 12 hours but I have
>>>> thousands of users with their last full sync over 90 hours ago.
>>>>
>>>> "doveadm replicator status" also shows that i have over 35,000 queued
>>>> full resync requests, but no sync, high or low queued requests so why
>>>> aren't the full requests occurring?
>>>>
>>>> There are no errors in the logs.
>>>>
>>>> Thanks,
>>>>
>>>> Arnaud
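
Several of the quoted messages tie the stuck syncs to accounts with more than roughly 256 folders, measured with `doveadm mailbox list -u <username> | wc -l`. The same check can be sketched in Python. The 256 threshold is only the figure reported in this thread, not a documented Dovecot limit, and the helper names are hypothetical.

```python
import subprocess
from typing import List

# Figure reported by posters in this thread; an assumption, not a known limit.
FOLDER_THRESHOLD = 256


def parse_folders(listing: str) -> List[str]:
    """Split `doveadm mailbox list` output into folder names, one per line."""
    return [line for line in listing.splitlines() if line.strip()]


def list_folders(user: str, doveadm: str = "doveadm") -> List[str]:
    """Run `doveadm mailbox list -u <user>` and return the folder names."""
    result = subprocess.run(
        [doveadm, "mailbox", "list", "-u", user],
        capture_output=True, text=True, check=True,
    )
    return parse_folders(result.stdout)


def over_threshold(folders: List[str], threshold: int = FOLDER_THRESHOLD) -> bool:
    """True when the account has more folders than the suspected limit."""
    return len(folders) > threshold


# Demonstration on made-up output, so no Dovecot install is needed to try it.
sample = "INBOX\nINBOX/folder-0\nINBOX/folder-1\n"
folders = parse_folders(sample)
print(len(folders), over_threshold(folders))
```

On a live server, `over_threshold(list_folders("user@example.com"))` would flag an account worth watching, with the address being a placeholder.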

Paul Kudla (SCOM.CA Internet Services Inc.)

2:29 p.m.

Thanks for the update

I don't push anyone when asking for updates

I am a programmer by trade as well and nothing ever goes as planned

prefer we all take our time and roll it out correctly rather than jumping the gun.

That's why I am trying to help elsewhere, as I have gotten pretty fluent with dovecot etc and can help users out with the day to day stuff.

I just can't help with ldap, never got around to that as i use pgsql databases that are replicated etc etc etc on all my configs.

Again thanks for the update.

Happy Thursday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

On 4/28/2022 7:02 AM, Aki Tuomi wrote:

...

2.3.19 is round the corner, so not long. I cannot yet promise an exact date but hopefully within a week or two.

Aki

On 4/25/2022 10:01 AM, Arnaud Abélard wrote: > Ah, I'm now getting errors in the logs, that would explains the > increasing number of failed sync requests: > > dovecot: imap(xxxxx)<2961235><Bs6w43rdQPAqAcsFiXEmAInUhhA3Rfqh>: > Error: Mailbox INBOX: /vmail/l/i/xxxxx/dovecot.index reset, view is > now inconsistent > > > And sure enough: > > # dovecot replicator status xxxxx > > xxxxx         none     00:02:54 07:11:28 -            y > > > What could explain that error? > > Arnaud > > > > On 25/04/2022 15:13, Arnaud Abélard wrote: >> Hello, >> >> On my side we are running Linux (Debian Buster). >> >> I'm not sure my problem is actually the same as Paul or you >> Sebastian since I have a lot of boxes but those are actually small >> (quota of 110MB) so I doubt any of them have more than a dozen imap >> folders. >> >> The main symptom is that I have tons of full sync requests awaiting >> but even though no other sync is pending the replicator just waits >> for something to trigger those syncs. >> >> Today, with users back I can see that normal and incremental syncs >> are being done on the 15 connections, with an occasional full sync >> here or there and lots of "Waiting 'failed' requests": >> >> Queued 'sync' requests        0 >> >> Queued 'high' requests        0 >> >> Queued 'low' requests         0 >> >> Queued 'failed' requests      122 >> >> Queued 'full resync' requests 28785 >> >> Waiting 'failed' requests     4294 >> >> Total number of known users   42512 >> >> >> >> So, why didn't the replicator take advantage of the weekend to >> replicate the mailboxes while no user were using them? >> >> Arnaud >> >> >> >> >> On 25/04/2022 13:54, Sebastian Marske wrote: >>> Hi there, >>> >>> thanks for your insights and for diving deeper into this Paul! >>> >>> For me, the users ending up in 'Waiting for dsync to finish' all have >>> more than 256 Imap folders as well (ranging from 288 up to >5500; >>> as per >>> 'doveadm mailbox list -u <username> | wc -l'). 
For more details on my >>> setup please see my post from February [1]. >>> >>> @Arnaud: What OS are you running on? >>> >>> >>> Best >>> Sebastian >>> >>> >>> [1] https://dovecot.org/pipermail/dovecot/2022-February/124168.html >>> >>> >>> On 4/24/22 19:36, Paul Kudla (SCOM.CA Internet Services Inc.) wrote: >>>> >>>> Question having similiar replication issues >>>> >>>> pls read everything below and advise the folder counts on the >>>> non-replicated users? >>>> >>>> i find the total number of folders / account seems to be a factor >>>> and >>>> NOT the size of the mail box >>>> >>>> ie i have customers with 40G of emails no problem over 40 or so >>>> folders >>>> and it works ok >>>> >>>> 300+ folders seems to be the issue >>>> >>>> i have been going through the replication code >>>> >>>> no errors being logged >>>> >>>> i am assuming that the replication --> dhclient --> other server is >>>> timing out or not reading the folder lists correctly (ie dies after X >>>> folders read) >>>> >>>> thus i am going through the code patching for log entries etc to find >>>> the issues. >>>> >>>> see >>>> >>>> [13:33:57] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot >>>> # ll >>>> total 86 >>>> drwxr-xr-x 2 root wheel uarch    4B Apr 24 11:11 . >>>> drwxr-xr-x 4 root wheel uarch    4B Mar 8 2021 .. >>>> -rw-r--r-- 1 root wheel uarch   73B Apr 24 11:11 instances >>>> -rw-r--r-- 1 root wheel uarch 160K Apr 24 13:33 replicator.db >>>> >>>> [13:33:58] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot >>>> # >>>> >>>> replicator.db seems to get updated ok but never processed properly. 
>>>> >>>> # sync.users >>>> nick@elirpa.com                   high     00:09:41 463:47:01 -     y >>>> keith@elirpa.com                  high     00:09:23 463:45:43 -     y >>>> paul@scom.ca                      high     00:09:41 463:46:51 -     y >>>> ed@scom.ca                        high     00:09:43 463:47:01 -     y >>>> ed.hanna@dssmgmt.com              high     00:09:42 463:46:58 -     y >>>> paul@paulkudla.net                high     00:09:44 463:47:03 >>>> 580:35:07 >>>>     y >>>> >>>> >>>> >>>> >>>> so .... >>>> >>>> >>>> >>>> two things : >>>> >>>> first to get the production stuff to work i had to write a script >>>> that >>>> whould find the bad sync's and the force a dsync between the servers >>>> >>>> i run this every five minutes or each server. >>>> >>>> in crontab >>>> >>>> */10    *                *    *    *    root /usr/bin/nohup >>>> /programs/common/sync.recover > /dev/null >>>> >>>> >>>> python script to sort things out >>>> >>>> # cat /programs/common/sync.recover >>>> #!/usr/local/bin/python3 >>>> >>>> #Force sync between servers that are reporting bad? >>>> >>>> import os,sys,django,socket >>>> from optparse import OptionParser >>>> >>>> >>>> from lib import * >>>> >>>> #Sample Re-Index MB >>>> #doveadm -D force-resync -u paul@scom.ca -f INBOX* >>>> >>>> >>>> >>>> USAGE_TEXT = '''
>>>> usage: %%prog %s[options] >>>> ''' >>>> >>>> parser = OptionParser(usage=USAGE_TEXT % '', version='0.4') >>>> >>>> parser.add_option("-m", "--send_to", dest="send_to", help="Send >>>> Email To") >>>> parser.add_option("-e", "--email", dest="email_box", help="Box to >>>> Index") >>>> parser.add_option("-d", "--detail",action='store_true', >>>> dest="detail",default =False, help="Detailed report") >>>> parser.add_option("-i", "--index",action='store_true', >>>> dest="index",default =False, help="Index") >>>> >>>> options, args = parser.parse_args() >>>> >>>> print (options.email_box) >>>> print (options.send_to) >>>> print (options.detail) >>>> >>>> #sys.exit() >>>> >>>> >>>> >>>> print ('Getting Current User Sync Status') >>>> command = commands("/usr/local/bin/doveadm replicator status '*'") >>>> >>>> >>>> #print command >>>> >>>> sync_user_status = command.output.split('\n') >>>> >>>> #print sync_user_status >>>> >>>> synced = [] >>>> >>>> for n in range(1,len(sync_user_status)) : >>>>          user = sync_user_status[n] >>>>          print ('Processing User : %s' %user.split(' ')[0]) >>>>          if user.split(' ')[0] != options.email_box : >>>>                  if options.email_box != None : >>>>                          continue >>>> >>>>          if options.index == True : >>>>                  command = '/usr/local/bin/doveadm -D force-resync >>>> -u %s >>>> -f INBOX*' %user.split(' ')[0] >>>>                  command = commands(command) >>>>                  command = command.output >>>> >>>>          #print user >>>>          for nn in range (len(user)-1,0,-1) : >>>>                  #print nn >>>>                  #print user[nn] >>>> >>>>                  if user[nn] == '-' : >>>>                          #print 'skipping ... %s' %user.split(' ')[0] >>>> >>>>                          break >>>> >>>> >>>> >>>>                  if user[nn] == 'y': #Found a Bad Mailbox >>>>                          print ('syncing ... 
%s' %user.split(' ')[0]) >>>> >>>> >>>>                          if options.detail == True : >>>>                                  command = '/usr/local/bin/doveadm -D >>>> sync -u %s -d -N -l 30 -U' %user.split(' ')[0] >>>>                                  print (command) >>>>                                  command = commands(command) >>>>                                  command = command.output.split('\n') >>>>                                  print (command) >>>>                                  print ('Processed Mailbox for ... >>>> %s' >>>> %user.split(' ')[0] ) >>>>                                  synced.append('Processed Mailbox >>>> for ... >>>> %s' %user.split(' ')[0]) >>>>                                  for nnn in range(len(command)): >>>> synced.append(command[nnn] + '\n') >>>>                                  break >>>> >>>> >>>>                          if options.detail == False : >>>>                                  #command = >>>> '/usr/local/bin/doveadm -D >>>> sync -u %s -d -N -l 30 -U' %user.split(' ')[0] >>>>                                  #print (command) >>>>                                  #command = os.system(command) >>>>                                  command = subprocess.Popen( >>>> ["/usr/local/bin/doveadm sync -u %s -d -N -l 30 -U" %user.split(' >>>> ')[0] >>>> ],
>>>>                                  shell = True, stdin=None, >>>> stdout=None, >>>> stderr=None, close_fds=True) >>>> >>>>                                  print ( 'Processed Mailbox for >>>> ... %s' >>>> %user.split(' ')[0] ) >>>>                                  synced.append('Processed Mailbox >>>> for ... >>>> %s' %user.split(' ')[0]) >>>>                                  #sys.exit() >>>>                                  break >>>> >>>> if len(synced) != 0 : >>>>          #send email showing bad synced boxes ? >>>> >>>>          if options.send_to != None : >>>>                  send_from = 'monitor@scom.ca' >>>>                  send_to = ['%s' %options.send_to] >>>>                  send_subject = 'Dovecot Bad Sync Report for : %s' >>>> %(socket.gethostname()) >>>>                  send_text = '\n\n' >>>>                  for n in range (len(synced)) : >>>>                          send_text = send_text + synced[n] + '\n' >>>> >>>>                  send_files = [] >>>>                  sendmail (send_from, send_to, send_subject, >>>> send_text, >>>> send_files) >>>> >>>> >>>> >>>> sys.exit() >>>> >>>> second : >>>> >>>> i posted this a month ago - no response >>>> >>>> please appreciate that i am trying to help .... >>>> >>>> after much testing i can now reporduce the replication issues at hand >>>> >>>> I am running on freebsd 12 & 13 stable (both test and production >>>> servers) >>>> >>>> sdram drives etc ... >>>> >>>> Basically replication works fine until reaching a folder quantity >>>> of ~ >>>> 256 or more >>>> >>>> to reproduce using doveadm i created folders like >>>> >>>> INBOX/folder-0 >>>> INBOX/folder-1 >>>> INBOX/folder-2 >>>> INBOX/folder-3 >>>> and so forth ...... >>>> >>>> I created 200 folders and they replicated ok on both servers >>>> >>>> I created another 200 (400 total) and the replicator got stuck and >>>> would >>>> not update the mbox on the alternate server anymore and is still >>>> updating 4 days later ? 
>>>> >>>> basically replicator goes so far and either hangs or more likely >>>> bails >>>> on an error that is not reported to the debug reporting ? >>>> >>>> however dsync will sync the two servers but only when run manually >>>> (ie >>>> all the folders will sync) >>>> >>>> I have two test servers avaliable if you need any kind of access - >>>> again >>>> here to help. >>>> >>>> [07:28:42] mail18.scom.ca [root:0] ~ >>>> # sync.status >>>> Queued 'sync' requests        0 >>>> Queued 'high' requests        6 >>>> Queued 'low' requests         0 >>>> Queued 'failed' requests      0 >>>> Queued 'full resync' requests 0 >>>> Waiting 'failed' requests     0 >>>> Total number of known users   255 >>>> >>>> username                       type        status >>>> paul@scom.ca                   normal      Waiting for dsync to >>>> finish >>>> keith@elirpa.com               incremental Waiting for dsync to >>>> finish >>>> ed.hanna@dssmgmt.com           incremental Waiting for dsync to >>>> finish >>>> ed@scom.ca                     incremental Waiting for dsync to >>>> finish >>>> nick@elirpa.com                incremental Waiting for dsync to >>>> finish >>>> paul@paulkudla.net             incremental Waiting for dsync to >>>> finish >>>> >>>> >>>> i have been going through the c code and it seems the replication >>>> gets >>>> requested ok >>>> >>>> replicator.db does get updated ok with the replicated request for the >>>> mbox in question. >>>> >>>> however i am still looking for the actual replicator function in the >>>> lib's that do the actual replication requests >>>> >>>> the number of folders & subfolders is defanately the issue - not the >>>> mbox pyhsical size as thought origionally. >>>> >>>> if someone can point me in the right direction, it seems either the >>>> replicator is not picking up on the number of folders to replicat >>>> properly or it has a hard set limit like 256 / 512 / 65535 etc and >>>> stops >>>> the replication request thereafter. 
>>>> >>>> I am mainly a machine code programmer from the 80's and have >>>> concentrated on python as of late, 'c' i am starting to go through >>>> just >>>> to give you a background on my talents. >>>> >>>> It took 2 months to finger this out. >>>> >>>> this issue also seems to be indirectly causing the duplicate messages >>>> supression not to work as well. >>>> >>>> python programming to reproduce issue (loops are for last run >>>> started @ >>>> 200 - fyi) : >>>> >>>> # cat mbox.gen >>>> #!/usr/local/bin/python2 >>>> >>>> import os,sys >>>> >>>> from lib import * >>>> >>>> >>>> user = 'paul@paulkudla.net' >>>> >>>> """ >>>> for count in range (0,600) : >>>>          box = 'INBOX/folder-%s' %count >>>>          print count >>>>          command = '/usr/local/bin/doveadm mailbox create -s -u %s >>>> %s' >>>> %(user,box) >>>>          print command >>>>          a = commands.getoutput(command) >>>>          print a >>>> """ >>>> >>>> for count in range (0,600) : >>>>          box = 'INBOX/folder-0/sub-%' %count >>>>          print count >>>>          command = '/usr/local/bin/doveadm mailbox create -s -u %s >>>> %s' >>>> %(user,box) >>>>          print command >>>>          a = commands.getoutput(command) >>>>          print a >>>> >>>> >>>> >>>>          #sys.exit() >>>> >>>> >>>> >>>> >>>> >>>> Happy Sunday !!! >>>> Thanks - paul >>>> >>>> Paul Kudla >>>> >>>> >>>> Scom.ca Internet Services <http://www.scom.ca> >>>> 004-1009 Byron Street South >>>> Whitby, Ontario - Canada >>>> L1N 4S3 >>>> >>>> Toronto 416.642.7266 >>>> Main 1.866.411.7266 >>>> Fax 1.888.892.7266 >>>> >>>> On 4/24/2022 10:22 AM, Arnaud Abélard wrote: >>>>> Hello, >>>>> >>>>> I am working on replicating a server (and adding compression on the >>>>> other side) and since I had "Error: dsync I/O has stalled, no >>>>> activity >>>>> for 600 seconds (version not received)" errors I upgraded both >>>>> source >>>>> and destination server with the latest 2.3 version (2.3.18). 
While >>>>> before the upgrade all the 15 replication connections were busy >>>>> after >>>>> upgrading dovecot replicator dsync-status shows that most of the >>>>> time >>>>> nothing is being replicated at all. I can see some brief >>>>> replications >>>>> that last, but 99,9% of the time nothing is happening at all. >>>>> >>>>> I have a replication_full_sync_interval of 12 hours but I have >>>>> thousands of users with their last full sync over 90 hours ago. >>>>> >>>>> "doveadm replicator status" also shows that i have over 35,000 >>>>> queued >>>>> full resync requests, but no sync, high or low queued requests so >>>>> why >>>>> aren't the full requests occuring? >>>>> >>>>> There are no errors in the logs. >>>>> >>>>> Thanks, >>>>> >>>>> Arnaud >>>>> >>>>> >>>>> >>>>> >>>>> >> >

Cassidy B. Larson

11 May

7:25 a.m.

Hi Aki,

We just installed 2.3.19 and are seeing a couple of users throwing the "INBOX/dovecot.index reset, view is now inconsistent" error, with their replicator status erroring out. We tried force-resync on the full mailbox, but to no avail so far. Wasn't this bug supposed to be fixed in 2.3.19?

Thanks,

Cassidy
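
To see which users the replicator currently marks as failed, the per-user status lines can be filtered on their last column. The following is a minimal sketch, not an official tool: it assumes the output of `doveadm replicator status '*'` has the shape quoted elsewhere in this thread (username first, a trailing `y` for a failed sync), and the header row and `other@example.com` in the sample are assumptions made for illustration.

```python
def failed_users(status_output):
    """Return usernames whose last column (the failed flag) is 'y'."""
    failed = []
    for line in status_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (assumed to start with "username").
        if len(fields) < 2 or fields[0] == "username":
            continue
        if fields[-1] == "y":
            failed.append(fields[0])
    return failed


# Sample rows: the first two are taken from this thread, the header and
# the last row are hypothetical.
sample = """\
username priority fast sync full sync success sync failed
xxxxx none 00:02:54 07:11:28 - y
paul@scom.ca high 00:09:41 463:46:51 - y
other@example.com none 00:01:00 02:00:00 00:01:00 -
"""

print(failed_users(sample))
```

Splitting on whitespace and checking only the last field avoids scanning the line character by character, so a `-` or `y` inside a username cannot be mistaken for the status flag.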

On Thu, Apr 28, 2022 at 5:02 AM Aki Tuomi <aki.tuomi@open-xchange.com> wrote:

...

2.3.19 is around the corner, so not long now. I cannot promise an exact date yet, but hopefully within a week or two.

Aki

...
On 28/04/2022 13:57 Paul Kudla (SCOM.CA Internet Services Inc.) <paul@scom.ca> wrote:

Thanks for the update.

Is this for both replication issues (300+ folders etc.)?

Just asking - any ETA?

Happy Thursday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

On 4/27/2022 9:01 AM, Aki Tuomi wrote:

...
Hi!

This is probably going to get fixed in 2.3.19; this looks like an issue we are already fixing.

...
Aki

...
On 26/04/2022 16:38 Paul Kudla (SCOM.CA Internet Services Inc.) <paul@scom.ca> wrote:

...
...
Agreed, there seems to be no way of posting these kinds of issues to see if they are even being addressed, or even known about, moving forward on new updates.

i read somewhere there is a new branch coming out but nothing as of yet?

2.4 maybe .... 5.0 ........

my previous replication issues (back in feb) went unanswered.

not faulting anyone, but the developers do seem to be disconnected from issues as of late? or concentrating on other issues.

I have no problem with support contracts for day to day maintenance, however as a programmer myself they usually don't work, as the other end relies on the latest source code anyways. Thus they can not help.

I am trying to take apart the replicator c programming based on 2.3.18, as most of it does work to some extent.

tcps just does not work (ie 600 seconds default in the c programming)

My thought is that tcp works ok, but the replicator fails in dsync-client.c when asked to return the folder list?

replicator-brain.c seems to control the overall process and timing.

replicator-queue.c seems to handle the queue file, which does seem to carry accurate info.

things in the source code are documented enough to figure this out, but i am still going through all the related .h files documentation-wise, which are all over the place.

there is no clear documentation on the .h lib files so i have to walk through the tree one at a time finding relevant code.

since the dsync from doveadm does seem to work ok, i have to assume the dsync-client used to compile the replicator is at fault somehow, or a call from it upstream?

Thanks for your input on the other issues noted below, i will keep that in mind when disassembling the source code.

No sense in fixing one thing and leaving something else behind, probably all related anyways.

i have two test servers available so i can play with all this offline to reproduce the issues

Unfortunately I have to make a living first; this will be addressed when possible, as i don't like live systems running this way, and i currently only have 5 accounts with this issue (mine included)

Happy Tuesday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

On 4/26/2022 9:03 AM, Reuben Farrelly wrote:

...
I ran into this back in February and documented a reproducible test case (and sent it to this list). In short - I was able to reproduce this by having a valid and consistent mailbox on the source/local, creating a very standard empty Maildir/(new|cur|tmp) folder on the remote replica, and then initiating the replication from the source. This consistently caused dsync to fail replication with the error "dovecot.index reset, view is now inconsistent" and sync aborted, leaving the replica mailbox in a screwed up inconsistent state. Client connections on the source replica were also dropped when this error occurred. You can see the error by enabling debug level logging if you initiate dsync manually on a test mailbox.
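
The "very standard empty Maildir" from this test case can be set up with a few lines of Python on a test replica. This is only a sketch of that one step, under the assumption that the user's mail lives in a `Maildir` directory below some home path; the function name and the temporary directory are made up for illustration and nothing here talks to Dovecot.

```python
import os
import tempfile


def create_empty_maildir(home):
    """Create an empty Maildir skeleton (new, cur, tmp) under `home`."""
    maildir = os.path.join(home, "Maildir")
    for sub in ("new", "cur", "tmp"):
        os.makedirs(os.path.join(maildir, sub), exist_ok=True)
    return maildir


# Demonstrate on a throwaway directory instead of a real mail home.
with tempfile.TemporaryDirectory() as home:
    maildir = create_empty_maildir(home)
    print(sorted(os.listdir(maildir)))
```

On a real test pair, the next step described above would be to start a sync manually from the source with debug logging, for example with the `doveadm -D sync -u <user> -d` form used in the script quoted in this thread, and watch for the "view is now inconsistent" error.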

The only workaround I found was to remove the remote Maildir and let Dovecot create the whole thing from scratch. Dovecot did not like any existing folders on the destination replica even if they were the same names as the source and completely empty. I was able to reproduce this with the bare minimum of folders - just an INBOX!

I have no idea if any of the developers saw my post or if the bug has been fixed for the next release. But it seemed to be quite a common problem over time (saw a few posts from people going back a long way with the same problem) and it is seriously disruptive to clients. The error message is not helpful in tracking down the problem either.

Secondly, I also have had an ongoing and longstanding problem using tcps: for replication. For some reason using tcps: (with no other changes at all to the config) results in a lot of timeout messages "Error: dsync I/O has stalled, no activity for 600 seconds". This goes away if I revert back to tcp: instead of tcps - with tcp: I very rarely get timeouts. No idea why, guess this is a bug of some sort also.
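
One way to put numbers on the tcp: versus tcps: difference is to count the stall errors in the log for a period with each setting. The sketch below assumes the log lines contain the error text exactly as quoted in this thread; the sample list is illustrative and the function name is hypothetical.

```python
import re

# Matches the stall error quoted in this thread, with any timeout value.
STALL_RE = re.compile(r"dsync I/O has stalled, no activity for \d+ seconds")


def count_stalls(log_lines):
    """Count log lines that report a stalled dsync connection."""
    return sum(1 for line in log_lines if STALL_RE.search(line))


sample_log = [
    "Error: dsync I/O has stalled, no activity for 600 seconds (version not received)",
    "Error: Mailbox INBOX: /vmail/l/i/xxxxx/dovecot.index reset, view is now inconsistent",
    "Error: dsync I/O has stalled, no activity for 600 seconds",
]

print(count_stalls(sample_log))
```

Running it over the same length of log for each transport gives a rough comparison; it does not explain why the stalls happen.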

It's disappointing that there appears to be no way to have these sorts of problems addressed like there once was. I am not using Dovecot for commercial purposes so paying a fortune for a support contract for a high end installation just isn't going to happen, and this list seems to be quite ordinary for getting support and reporting bugs nowadays....

Reuben

On 26/04/2022 7:21 pm, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:

...
side issue

if you are getting inconsistent dsyncs there is no real way to fix this in the long run.

i know it's a pain (i already had to do it myself)

i needed to do a full sync: take one server offline, delete the user dir (with dovecot offline) and then rsync (or somehow duplicate the main server's user data) over to the remote again.

then bring the remote back up, and it kind of worked

best suggestion is to bring the main server down at night so the copy is clean?

if using postfix you can enable the soft bounce option and the mail will back spool until everything comes back online

(needs to be enabled on both servers)

replication was still an issue on accounts with 300+ folders in them, still working on a fix for that.

Happy Tuesday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

On 4/25/2022 10:01 AM, Arnaud Abélard wrote:
> Ah, I'm now getting errors in the logs, that would explain the
> increasing number of failed sync requests:
>
> dovecot: imap(xxxxx)<2961235><Bs6w43rdQPAqAcsFiXEmAInUhhA3Rfqh>:
> Error: Mailbox INBOX: /vmail/l/i/xxxxx/dovecot.index reset, view is
> now inconsistent
>
> And sure enough:
>
> # dovecot replicator status xxxxx
>
> xxxxx none 00:02:54 07:11:28 - y
>
> What could explain that error?
>
> Arnaud
>
> On 25/04/2022 15:13, Arnaud Abélard wrote:
>> Hello,
>>
>> On my side we are running Linux (Debian Buster).
>>
>> I'm not sure my problem is actually the same as Paul or you
>> Sebastian since I have a lot of boxes but those are actually small
>> (quota of 110MB) so I doubt any of them have more than a dozen imap
>> folders.
>>
>> The main symptom is that I have tons of full sync requests awaiting
>> but even though no other sync is pending the replicator just waits
>> for something to trigger those syncs.
>>
>> Today, with users back I can see that normal and incremental syncs
>> are being done on the 15 connections, with an occasional full sync
>> here or there and lots of "Waiting 'failed' requests":
>>
>> Queued 'sync' requests        0
>> Queued 'high' requests        0
>> Queued 'low' requests         0
>> Queued 'failed' requests      122
>> Queued 'full resync' requests 28785
>> Waiting 'failed' requests     4294
>> Total number of known users   42512
>>
>> So, why didn't the replicator take advantage of the weekend to
>> replicate the mailboxes while no users were using them?
>>
>> Arnaud
>>
>> On 25/04/2022 13:54, Sebastian Marske wrote:
>>> Hi there,
>>>
>>> thanks for your insights and for diving deeper into this Paul!
>>>
>>> For me, the users ending up in 'Waiting for dsync to finish' all have
>>> more than 256 Imap folders as well (ranging from 288 up to >5500;
>>> as per
>>> 'doveadm mailbox list -u <username> | wc -l'). For more details on my
>>> setup please see my post from February [1].
>>>
>>> @Arnaud: What OS are you running on?
>>>
>>> Best
>>> Sebastian
>>>
>>> [1] https://dovecot.org/pipermail/dovecot/2022-February/124168.html
>>>
>>> On 4/24/22 19:36, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:
>>>>
>>>> Question having similar replication issues
>>>>
>>>> pls read everything below and advise the folder counts on the
>>>> non-replicated users?
>>>>
>>>> i find the total number of folders / account seems to be a factor
>>>> and
>>>> NOT the size of the mail box
>>>>
>>>> ie i have customers with 40G of emails no problem over 40 or so
>>>> folders
>>>> and it works ok
>>>>
>>>> 300+ folders seems to be the issue
>>>>
>>>> i have been going through the replication code
>>>>
>>>> no errors being logged
>>>>
>>>> i am assuming that the replication --> dhclient --> other server is
>>>> timing out or not reading the folder lists correctly (ie dies after X
>>>> folders read)
>>>>
>>>> thus i am going through the code patching for log entries etc to find
>>>> the issues.
>>>>
>>>> see
>>>>
>>>> [13:33:57] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot
>>>> # ll
>>>> total 86
>>>> drwxr-xr-x 2 root wheel uarch   4B Apr 24 11:11 .
>>>> drwxr-xr-x 4 root wheel uarch   4B Mar  8  2021 ..
>>>> -rw-r--r-- 1 root wheel uarch  73B Apr 24 11:11 instances
>>>> -rw-r--r-- 1 root wheel uarch 160K Apr 24 13:33 replicator.db
>>>>
>>>> [13:33:58] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot
>>>> #
>>>>
>>>> replicator.db seems to get updated ok but never processed properly.
>>>>
>>>> # sync.users
>>>> nick@elirpa.com high 00:09:41 463:47:01
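
Sebastian's check quoted above (`doveadm mailbox list -u <username> | wc -l`) can be run over several users to see who is above the folder count reported in this thread. The following is a rough sketch: the 256 threshold is only the figure mentioned by posters, not a documented Dovecot limit, `doveadm` is assumed to be on the PATH, and the example addresses and folder lists are made up so the counting logic can be tried without a server.

```python
import subprocess

FOLDER_THRESHOLD = 256  # figure reported in the thread, not a documented limit


def list_folders(user):
    """Return the output of `doveadm mailbox list -u <user>` (assumes doveadm is on PATH)."""
    result = subprocess.run(
        ["doveadm", "mailbox", "list", "-u", user],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def count_folders(mailbox_list_output):
    """Count non-empty lines, the equivalent of piping the list through `wc -l`."""
    return sum(1 for line in mailbox_list_output.splitlines() if line.strip())


def users_over_threshold(folder_counts, threshold=FOLDER_THRESHOLD):
    """Return the users whose folder count exceeds the threshold, sorted by name."""
    return sorted(user for user, n in folder_counts.items() if n > threshold)


# Hypothetical folder lists standing in for real doveadm output.
small = "\n".join("INBOX/folder-%d" % i for i in range(40))
large = "\n".join("INBOX/folder-%d" % i for i in range(300))
counts = {
    "small@example.com": count_folders(small),
    "large@example.com": count_folders(large),
}

print(counts)
print(users_over_threshold(counts))
```

On a live server, `count_folders(list_folders(user))` would replace the made-up lists; comparing the result with the users stuck in "Waiting for dsync to finish" would show whether the folder count correlates on a given setup.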

Aki Tuomi

8:31 a.m.

I was hoping it would be fixed; apparently not, then.

Can you enable mail_debug=yes and send the logs?

Aki

...

On 11/05/2022 07:25 Cassidy B. Larson <alandaluz@gmail.com> wrote:

Hi Aki,

We just installed 2.3.19 and are seeing a couple of users throwing the "INBOX/dovecot.index reset, view is now inconsistent" error, with their replicator status erroring out. We tried force-resync on the full mailbox, but to no avail so far. Wasn't this bug supposed to be fixed in 2.3.19?

Thanks,

Cassidy

On Thu, Apr 28, 2022 at 5:02 AM Aki Tuomi <aki.tuomi@open-xchange.com> wrote:

...
2.3.19 is around the corner, so not long now. I cannot promise an exact date yet, but hopefully within a week or two.

Aki

...
On 28/04/2022 13:57 Paul Kudla (SCOM.CA Internet Services Inc.) <paul@scom.ca> wrote:

Thanks for the update.

Is this for both replication issues (300+ folders etc.)?

Just asking - any ETA?

Happy Thursday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

On 4/27/2022 9:01 AM, Aki Tuomi wrote:

...
Hi!

This is probably going to get fixed in 2.3.19, this looks like an issue we are already fixing.

Aki

...
On 26/04/2022 16:38 Paul Kudla (SCOM.CA Internet Services Inc.) <paul@scom.ca> wrote:

Agreed, there seems to be no way of posting these kinds of issues to see if they are even being addressed, or even known about, moving forward on new updates.

i read somewhere there is a new branch coming out but nothing as of yet?

2.4 maybe .... 5.0 ........

my previous replication issues (back in feb) went unanswered.

not faulting anyone, but the developers do seem to be disconnected from issues as of late? or concentrating on other issues.

I have no problem with support contracts for day to day maintenance, however as a programmer myself they usually don't work, as the other end relies on the latest source code anyways. Thus they can not help.

I am trying to take apart the replicator c programming based on 2.3.18, as most of it does work to some extent.

tcps just does not work (ie 600 seconds default in the c programming)

My thought is that tcp works ok, but the replicator fails in dsync-client.c when asked to return the folder list?

replicator-brain.c seems to control the overall process and timing.

replicator-queue.c seems to handle the queue file, which does seem to carry accurate info.

things in the source code are documented enough to figure this out, but i am still going through all the related .h files documentation-wise, which are all over the place.

there is no clear documentation on the .h lib files so i have to walk through the tree one at a time finding relevant code.

since the dsync from doveadm does seem to work ok, i have to assume the dsync-client used to compile the replicator is at fault somehow, or a call from it upstream?

Thanks for your input on the other issues noted below, i will keep that in mind when disassembling the source code.

No sense in fixing one thing and leaving something else behind, probably all related anyways.

i have two test servers available so i can play with all this offline to reproduce the issues

Unfortunately I have to make a living first; this will be addressed when possible, as i don't like live systems running this way, and i currently only have 5 accounts with this issue (mine included)

Happy Tuesday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

On 4/26/2022 9:03 AM, Reuben Farrelly wrote:

...
I ran into this back in February and documented a reproducible test case (and sent it to this list). In short - I was able to reproduce this by having a valid and consistent mailbox on the source/local, creating a very standard empty Maildir/(new|cur|tmp) folder on the remote replica, and then initiating the replicate from the source. This consistently caused dsync to fail replication with the error "dovecot.index reset, view is now inconsistent" and sync aborted, leaving the replica mailbox in a screwed up inconsistent state. Client connections on the source replica were also dropped when this error occurred. You can see the error by enabling debug level logging if you initiate dsync manually on a test mailbox.
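For anyone wanting to recreate the precondition described above, here is a minimal sketch of the empty Maildir skeleton. It only creates the `new`, `cur` and `tmp` directories; the function name and the scratch path are placeholders of my own, and the sketch does not start any replication itself.

```python
import os
import tempfile


def make_empty_maildir(root):
    """Create an empty Maildir skeleton (new, cur, tmp) under root.

    Returns the sorted list of entries now present in root.
    """
    for sub in ("new", "cur", "tmp"):
        os.makedirs(os.path.join(root, sub), exist_ok=True)
    return sorted(os.listdir(root))


if __name__ == "__main__":
    # Hypothetical scratch location; on a real replica this would be the
    # test user's mail directory instead.
    with tempfile.TemporaryDirectory() as scratch:
        print(make_empty_maildir(os.path.join(scratch, "Maildir")))
```

Running it against a throwaway test account on the replica, and then starting dsync from the source, should match the situation described in the test case.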

The only workaround I found was to remove the remote Maildir and let Dovecot create the whole thing from scratch. Dovecot did not like any existing folders on the destination replica even if they were the same names as the source and completely empty. I was able to reproduce this with the bare minimum of folders - just an INBOX!

I have no idea if any of the developers saw my post or if the bug has been fixed for the next release. But it seemed to be quite a common problem over time (saw a few posts from people going back a long way with the same problem) and it is seriously disruptive to clients. The error message is not helpful in tracking down the problem either.

Secondly, I also have had an ongoing and longstanding problem using tcps: for replication. For some reason using tcps: (with no other changes at all to the config) results in a lot of timeout messages "Error: dsync I/O has stalled, no activity for 600 seconds". This goes away if I revert back to tcp: instead of tcps - with tcp: I very rarely get timeouts. No idea why, guess this is a bug of some sort also.
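To compare the two transports with numbers rather than impressions, the stall errors can be counted from a log file. The sketch below is only illustrative: it matches the error text quoted in this thread, and the function name and the way the log lines are supplied are assumptions.

```python
import re
from collections import Counter

# Matches the stall error quoted in this thread and captures the timeout value.
STALL_RE = re.compile(r"dsync I/O has stalled, no activity for (\d+) seconds")


def count_stalls(lines):
    """Count stall errors in an iterable of log lines, keyed by timeout seconds."""
    counts = Counter()
    for line in lines:
        match = STALL_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Illustrative input only: one line with the error text from this thread
    # and one unrelated line.
    sample = [
        "Error: dsync I/O has stalled, no activity for 600 seconds (version not received)",
        "some unrelated log line",
    ]
    print(dict(count_stalls(sample)))
```

Running it once over a log collected with tcps: and once over a log collected with tcp: would give comparable counts for the same period.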

It's disappointing that there appears to be no way to have these sorts of problems addressed like there once was. I am not using Dovecot for commercial purposes so paying a fortune for a support contract for a high end installation just isn't going to happen, and this list seems to be quite ordinary for getting support and reporting bugs nowadays....

Reuben


Paul Kudla (SCOM.CA Internet Services Inc.)

12 May

3:20 p.m.

Ok update from my end

under 2.3.18 (have not upgraded production to 2.3.19 yet)

replication issues as stated before

however i need to note that i had to manually sync a user that was not being listed as a replicator failure

this means i have to force a full sync between servers on all accounts regardless of replication status

this was discovered this morning on a customer's account that did not replicate between the servers properly, and thus emails were showing up days late because the client was accessing the other server.

it's one thing to be 10 minutes late etc, but a day late is not practical

again not complaining

I will load 2.3.19 on the test servers, try that, and advise; i will also test for the folder count replication issue and advise

please note NO errors are being thrown in the debug log: it reports the replication request, which gets queued but does not complete??
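Since nothing shows up in the logs, one way to spot stale accounts is to look at the per-user status rows instead. The sketch below assumes rows shaped like the ones posted in this thread (user, priority, fast sync age, full sync age, success sync age, failed flag); the function names are made up, and the 12 hour threshold is just the replication_full_sync_interval mentioned earlier in the thread.

```python
def age_to_hours(field):
    """Convert an 'HH:MM:SS' age field to hours; '-' means no value."""
    if field == "-":
        return None
    hours, minutes, seconds = (int(part) for part in field.split(":"))
    return hours + minutes / 60 + seconds / 3600


def needs_attention(rows, max_full_sync_hours=12.0):
    """Return users flagged as failed or whose last full sync is too old."""
    flagged = []
    for row in rows:
        fields = row.split()
        if len(fields) != 6:
            continue  # header or malformed line: skip it
        user, _priority, _fast, full, _success, failed = fields
        full_hours = age_to_hours(full)
        if failed == "y" or full_hours is None or full_hours > max_full_sync_hours:
            flagged.append(user)
    return flagged


if __name__ == "__main__":
    rows = [
        # Two rows as posted in this thread
        "nick@elirpa.com high 00:09:41 463:47:01 - y",
        "paul@paulkudla.net high 00:09:44 463:47:03 580:35:07 y",
        # Hypothetical healthy row, for contrast only
        "ok@example.com none 00:01:00 03:00:00 00:01:00 -",
    ]
    print(needs_attention(rows))
```

This only reads status output that has already been captured as text; it does not call doveadm or trigger any sync, so a user missing from the failed list but with an old full sync age would still be picked up by the age check.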

Happy Thursday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

On 5/11/2022 12:25 AM, Cassidy B. Larson wrote:

...

Hi Aki,

Thanks,

Cassidy

On Thu, Apr 28, 2022 at 5:02 AM Aki Tuomi <aki.tuomi@open-xchange.com> wrote:

2.3.19 is round the corner, so not long. I cannot yet promise an
exact date but hopefully within week or two.

Aki

 > On 28/04/2022 13:57 Paul Kudla (SCOM.CA Internet Services Inc.) <paul@scom.ca> wrote:
 >
 >
 > Thanks for the update.
 >
 > is this for both replication issues (folders +300 etc)
 >
 > Just Asking - Any ETA
 >
 >
 >
 >
 >
 > Happy Thursday !!!
 > Thanks - paul
 >
 > Paul Kudla
 >
 >
 > Scom.ca Internet Services <http://www.scom.ca>
 > 004-1009 Byron Street South
 > Whitby, Ontario - Canada
 > L1N 4S3
 >
 > Toronto 416.642.7266
 > Main 1.866.411.7266
 > Fax 1.888.892.7266
 >
 > On 4/27/2022 9:01 AM, Aki Tuomi wrote:
 > >
 > > Hi!
 > >
 > > This is probably going to get fixed in 2.3.19, this looks like
an issue we are already fixing.
 > >
 > > Aki
 > >
 > >> On 26/04/2022 16:38 Paul Kudla (SCOM.CA Internet Services Inc.) <paul@scom.ca> wrote:
 > >>
 > >>
 > >> Agreed there seems to be no way of posting these kinds of
issues to see
 > >> if they are even being addressed or even known about moving
forward on
 > >> new updates
 > >>
 > >> i read somewhere there is a new branch soming out but nothing
as of yet?
 > >>
 > >> 2.4 maybe ....
 > >> 5.0 ........
 > >>
 > >> my previous replication issues (back in feb) went unanswered.
 > >>
 > >> not faulting anyone, but the developers do seem to be
disconnected from
 > >> issues as of late? or concentrating on other issues.
 > >>
 > >> I have no problem with support contracts for day to day maintence
 > >> however as a programmer myself they usually dont work as the
other end
 > >> relies on the latest source code anyways. Thus can not help.
 > >>
 > >> I am trying to take a part the replicator c programming based
on 2.3.18
 > >> as most of it does work to some extent.
 > >>
 > >> tcps just does not work (ie 600 seconds default in the c
programming)
 > >>
 > >> My thoughts are tcp works ok but fails when the replicator through
 > >> dsync-client.c when asked to return the folder list?
 > >>
 > >>
 > >> replicator-brain.c seems to control the overall process and
timing.
 > >>
 > >> replicator-queue.c seems to handle the que file that does seem
to carry
 > >> acurate info.
 > >>
 > >>
 > >> things in the source code are documented enough to figure this
out but i
 > >> am still going through all the related .h files documentation
wise which
 > >> are all over the place.
 > >>
 > >> there is no clear documentation on the .h lib files so i have
to walk
 > >> through the tree one at a time finding relative code.
 > >>
 > >> since the dsync from doveadm does see to work ok i have to
assume the
 > >> dsync-client used to compile the replicator is at fault
somehow or a
 > >> call from it upstream?
 > >>
 > >> Thanks for your input on the other issues noted below, i will
keep that
 > >> in mind when disassembling the source code.
 > >>
 > >> No sense in fixing one thing and leaving something else
behind, probably
 > >> all related anyways.
 > >>
 > >> i have two test servers avaliable so i can play with all this
offline to
 > >> reproduce the issues
 > >>
 > >> Unfortunately I have to make a living first, this will be
addressed when
 > >> possible as i dont like systems that are live running this way and
 > >> currently only have 5 accounts with this issue (mine included)
 > >>
 > >>
 > >>
 > >>
 > >> Happy Tuesday !!!
 > >> Thanks - paul
 > >>
 > >> Paul Kudla
 > >>
 > >>
 > >> Scom.ca Internet Services <http://www.scom.ca>
 > >> 004-1009 Byron Street South
 > >> Whitby, Ontario - Canada
 > >> L1N 4S3
 > >>
 > >> Toronto 416.642.7266
 > >> Main 1.866.411.7266
 > >> Fax 1.888.892.7266
 > >>
 > >> On 4/26/2022 9:03 AM, Reuben Farrelly wrote:
 > >>>
 > >>> I ran into this back in February and documented a
reproducible test case
 > >>> (and sent it to this list).  In short - I was able to
reproduce this by
 > >>> having a valid and consistent mailbox on the source/local,
creating a
 > >>> very standard empty Maildir/(new|cur|tmp) folder on the
remote replica,
 > >>> and then initiating the replicate from the source. This
consistently
 > >>> caused dsync to fail replication with the error
"dovecot.index reset,
 > >>> view is now inconsistent" and sync aborted, leaving the
replica mailbox
 > >>> in a screwed up inconsistent state. Client connections on the
source
 > >>> replica were also dropped when this error occurred.  You can
see the
 > >>> error by enabling debug level logging if you initiate dsync
manually on
 > >>> a test mailbox.
 > >>>
 > >>> The only workaround I found was to remove the remote Maildir
and let
 > >>> Dovecot create the whole thing from scratch.  Dovecot did not
like any
 > >>> existing folders on the destination replica even if they were
the same
 > >>> names as the source and completely empty.  I was able to
reproduce this
 > >>> the bare minimum of folders - just an INBOX!
 > >>>
 > >>> I have no idea if any of the developers saw my post or if the
bug has
 > >>> been fixed for the next release.  But it seemed to be quite a
common
 > >>> problem over time (saw a few posts from people going back a
long way
 > >>> with the same problem) and it is seriously disruptive to
clients.  The
 > >>> error message is not helpful in tracking down the problem either.
 > >>>
 > >>> Secondly, I also have had an ongoing and longstanding problem
using
 > >>> tcps: for replication.  For some reason using tcps: (with no
other
 > >>> changes at all to the config) results in a lot of timeout
messages
 > >>> "Error: dsync I/O has stalled, no activity for 600 seconds". 
This goes
 > >>> away if I revert back to tcp: instead of tcps - with tcp: I
very rarely
 > >>> get timeouts.  No idea why, guess this is a bug of some sort
also.
 > >>>
 > >>> It's disappointing that there appears to be no way to have
these sorts
 > >>> or problems addressed like there once was.  I am not using
Dovecot for
 > >>> commercial purposes so paying a fortune for a support
contract for a
 > >>> high end installation just isn't going to happen, and this
list seems to
 > >>> be quite ordinary for getting support and reporting bugs
nowadays....
 > >>>
 > >>> Reuben
 > >>>
 > >>> On 26/04/2022 7:21 pm, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:
 > >>>
 > >>>>
 > >>>> side issue
 > >>>>
 > >>>> if you are getting inconsistant dsyncs there is no real way
to fix
 > >>>> this in the long run.
 > >>>>
 > >>>> i know its a pain (already had to my self)
 > >>>>
 > >>>> i needed to do a full sync, take one server offline, delete
the user
 > >>>> dir (with dovecot offline) and then rsync (or somehow
duplicate the
 > >>>> main server's user data) over the the remote again.
 > >>>>
 > >>>> then bring remote back up and it kind or worked worked
 > >>>>
 > >>>> best suggestion is to bring the main server down at night so
the copy
 > >>>> is clean?
 > >>>>
 > >>>> if using postfix you can enable the soft bounce option and
the mail
 > >>>> will back spool until everything comes back online
 > >>>>
 > >>>> (needs to be enable on bother servers)
 > >>>>
 > >>>> replication was still an issue on accounts with 300+ folders
in them,
 > >>>> still working on a fix for that.
 > >>>>
 > >>>>
 > >>>> Happy Tuesday !!!
 > >>>> Thanks - paul
 > >>>>
 > >>>> Paul Kudla
 > >>>>
 > >>>>
 > >>>> Scom.ca Internet Services <http://www.scom.ca>
 > >>>> 004-1009 Byron Street South
 > >>>> Whitby, Ontario - Canada
 > >>>> L1N 4S3
 > >>>>
 > >>>> Toronto 416.642.7266
 > >>>> Main 1.866.411.7266
 > >>>> Fax 1.888.892.7266
 > >>>>
 > >>>> On 4/25/2022 10:01 AM, Arnaud Abélard wrote:
 > >>>>> Ah, I'm now getting errors in the logs, that would explains the
 > >>>>> increasing number of failed sync requests:
 > >>>>>
 > >>>>> dovecot: imap(xxxxx)<2961235><Bs6w43rdQPAqAcsFiXEmAInUhhA3Rfqh>:
 > >>>>> Error: Mailbox INBOX: /vmail/l/i/xxxxx/dovecot.index reset, view is
 > >>>>> now inconsistent
 > >>>>>
 > >>>>>
 > >>>>> And sure enough:
 > >>>>>
 > >>>>> # doveadm replicator status xxxxx
 > >>>>>
 > >>>>> xxxxx         none     00:02:54  07:11:28  -            y
 > >>>>>
 > >>>>>
 > >>>>> What could explain that error?
 > >>>>>
 > >>>>> Arnaud
 > >>>>>
 > >>>>>
 > >>>>>
 > >>>>> On 25/04/2022 15:13, Arnaud Abélard wrote:
 > >>>>>> Hello,
 > >>>>>>
 > >>>>>> On my side we are running Linux (Debian Buster).
 > >>>>>>
 > >>>>>> I'm not sure my problem is actually the same as Paul or you
 > >>>>>> Sebastian since I have a lot of boxes but those are
actually small
 > >>>>>> (quota of 110MB) so I doubt any of them have more than a
dozen imap
 > >>>>>> folders.
 > >>>>>>
 > >>>>>> The main symptom is that I have tons of full sync requests
awaiting
 > >>>>>> but even though no other sync is pending the replicator
just waits
 > >>>>>> for something to trigger those syncs.
 > >>>>>>
 > >>>>>> Today, with users back I can see that normal and
incremental syncs
 > >>>>>> are being done on the 15 connections, with an occasional
full sync
 > >>>>>> here or there and lots of "Waiting 'failed' requests":
 > >>>>>>
 > >>>>>> Queued 'sync' requests        0
 > >>>>>>
 > >>>>>> Queued 'high' requests        0
 > >>>>>>
 > >>>>>> Queued 'low' requests         0
 > >>>>>>
 > >>>>>> Queued 'failed' requests      122
 > >>>>>>
 > >>>>>> Queued 'full resync' requests 28785
 > >>>>>>
 > >>>>>> Waiting 'failed' requests     4294
 > >>>>>>
 > >>>>>> Total number of known users   42512
 > >>>>>>
 > >>>>>>
 > >>>>>>
 > >>>>>> So, why didn't the replicator take advantage of the weekend to
 > >>>>>> replicate the mailboxes while no user were using them?
 > >>>>>>
 > >>>>>> Arnaud
 > >>>>>>
 > >>>>>>
 > >>>>>>
 > >>>>>>
 > >>>>>> On 25/04/2022 13:54, Sebastian Marske wrote:
 > >>>>>>> Hi there,
 > >>>>>>>
 > >>>>>>> thanks for your insights and for diving deeper into this
Paul!
 > >>>>>>>
 > >>>>>>> For me, the users ending up in 'Waiting for dsync to finish' all have
 > >>>>>>> more than 256 Imap folders as well (ranging from 288 up to >5500;
 > >>>>>>> as per
 > >>>>>>> 'doveadm mailbox list -u <username> | wc -l'). For more details on my
 > >>>>>>> setup please see my post from February [1].
 > >>>>>>>
 > >>>>>>> @Arnaud: What OS are you running on?
 > >>>>>>>
 > >>>>>>>
 > >>>>>>> Best
 > >>>>>>> Sebastian
 > >>>>>>>
 > >>>>>>>
 > >>>>>>> [1] https://dovecot.org/pipermail/dovecot/2022-February/124168.html
 > >>>>>>>
 > >>>>>>>
 > >>>>>>> On 4/24/22 19:36, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:
 > >>>>>>>>
 > >>>>>>>> Question having similiar replication issues
 > >>>>>>>>
 > >>>>>>>> pls read everything below and advise the folder counts
on the
 > >>>>>>>> non-replicated users?
 > >>>>>>>>
 > >>>>>>>> i find  the total number of folders / account seems to
be a factor
 > >>>>>>>> and
 > >>>>>>>> NOT the size of the mail box
 > >>>>>>>>
 > >>>>>>>> ie i have customers with 40G of emails no problem over
40 or so
 > >>>>>>>> folders
 > >>>>>>>> and it works ok
 > >>>>>>>>
 > >>>>>>>> 300+ folders seems to be the issue
 > >>>>>>>>
 > >>>>>>>> i have been going through the replication code
 > >>>>>>>>
 > >>>>>>>> no errors being logged
 > >>>>>>>>
 > >>>>>>>> i am assuming that the replication --> dhclient -->
other server is
 > >>>>>>>> timing out or not reading the folder lists correctly (ie
dies after X
 > >>>>>>>> folders read)
 > >>>>>>>>
 > >>>>>>>> thus i am going through the code patching for log
entries etc to find
 > >>>>>>>> the issues.
 > >>>>>>>>
 > >>>>>>>> see
 > >>>>>>>>
 > >>>>>>>> [13:33:57] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot
 > >>>>>>>> # ll
 > >>>>>>>> total 86
 > >>>>>>>> drwxr-xr-x  2 root  wheel  uarch    4B Apr 24 11:11 .
 > >>>>>>>> drwxr-xr-x  4 root  wheel  uarch    4B Mar  8  2021 ..
 > >>>>>>>> -rw-r--r--  1 root  wheel  uarch   73B Apr 24 11:11 instances
 > >>>>>>>> -rw-r--r--  1 root  wheel  uarch  160K Apr 24 13:33 replicator.db
 > >>>>>>>>
 > >>>>>>>> [13:33:58] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot
 > >>>>>>>> #
 > >>>>>>>>
 > >>>>>>>> replicator.db seems to get updated ok but never
processed properly.
 > >>>>>>>>
 > >>>>>>>> # sync.users
 > >>>>>>>> nick@elirpa.com                   high     00:09:41 463:47:01 -         y
 > >>>>>>>> keith@elirpa.com                  high     00:09:23 463:45:43 -         y
 > >>>>>>>> paul@scom.ca                      high     00:09:41 463:46:51 -         y
 > >>>>>>>> ed@scom.ca                        high     00:09:43 463:47:01 -         y
 > >>>>>>>> ed.hanna@dssmgmt.com              high     00:09:42 463:46:58 -         y
 > >>>>>>>> paul@paulkudla.net                high     00:09:44 463:47:03 580:35:07 y
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>> so ....
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>> two things :
 > >>>>>>>>
 > >>>>>>>> first to get the production stuff to work i had to write
a script
 > >>>>>>>> that
 > >>>>>>>> whould find the bad sync's and the force a dsync between
the servers
 > >>>>>>>>
 > >>>>>>>> i run this every five minutes or each server.
 > >>>>>>>>
 > >>>>>>>> in crontab
 > >>>>>>>>
 > >>>>>>>> */10    *                *    *    *    root /usr/bin/nohup
 > >>>>>>>> /programs/common/sync.recover > /dev/null
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>> python script to sort things out
 > >>>>>>>>
 > >>>>>>>> # cat /programs/common/sync.recover
 > >>>>>>>> #!/usr/local/bin/python3
 > >>>>>>>>
 > >>>>>>>> #Force sync between servers that are reporting bad?
 > >>>>>>>>
 > >>>>>>>> import os,sys,django,socket
 > >>>>>>>> from optparse import OptionParser
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>> from lib import *
 > >>>>>>>>
 > >>>>>>>> #Sample Re-Index MB
 > >>>>>>>> #doveadm -D force-resync -u paul@scom.ca
&lt;mailto:paul@scom.ca> -f INBOX*
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>> USAGE_TEXT = '''\
 > >>>>>>>> usage: %%prog %s[options]
 > >>>>>>>> '''
 > >>>>>>>>
 > >>>>>>>> parser = OptionParser(usage=USAGE_TEXT % '', version='0.4')
 > >>>>>>>>
 > >>>>>>>> parser.add_option("-m", "--send_to", dest="send_to",
help="Send
 > >>>>>>>> Email To")
 > >>>>>>>> parser.add_option("-e", "--email", dest="email_box",
help="Box to
 > >>>>>>>> Index")
 > >>>>>>>> parser.add_option("-d", "--detail",action='store_true',
 > >>>>>>>> dest="detail",default =False, help="Detailed report")
 > >>>>>>>> parser.add_option("-i", "--index",action='store_true',
 > >>>>>>>> dest="index",default =False, help="Index")
 > >>>>>>>>
 > >>>>>>>> options, args = parser.parse_args()
 > >>>>>>>>
 > >>>>>>>> print (options.email_box)
 > >>>>>>>> print (options.send_to)
 > >>>>>>>> print (options.detail)
 > >>>>>>>>
 > >>>>>>>> #sys.exit()
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>> print ('Getting Current User Sync Status')
 > >>>>>>>> command = commands("/usr/local/bin/doveadm replicator
status '*'")
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>> #print command
 > >>>>>>>>
 > >>>>>>>> sync_user_status = command.output.split('\n')
 > >>>>>>>>
 > >>>>>>>> #print sync_user_status
 > >>>>>>>>
 > >>>>>>>> synced = []
 > >>>>>>>>
 > >>>>>>>> for n in range(1,len(sync_user_status)) :
 > >>>>>>>>           user = sync_user_status[n]
 > >>>>>>>>           print ('Processing User : %s' %user.split(' ')[0])
 > >>>>>>>>           if user.split(' ')[0] != options.email_box :
 > >>>>>>>>                   if options.email_box != None :
 > >>>>>>>>                           continue
 > >>>>>>>>
 > >>>>>>>>           if options.index == True :
 > >>>>>>>>                   command = '/usr/local/bin/doveadm -D
force-resync
 > >>>>>>>> -u %s
 > >>>>>>>> -f INBOX*' %user.split(' ')[0]
 > >>>>>>>>                   command = commands(command)
 > >>>>>>>>                   command = command.output
 > >>>>>>>>
 > >>>>>>>>           #print user
 > >>>>>>>>           for nn in range (len(user)-1,0,-1) :
 > >>>>>>>>                   #print nn
 > >>>>>>>>                   #print user[nn]
 > >>>>>>>>
 > >>>>>>>>                   if user[nn] == '-' :
 > >>>>>>>>                           #print 'skipping ... %s'
%user.split(' ')[0]
 > >>>>>>>>
 > >>>>>>>>                           break
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>>                   if user[nn] == 'y': #Found a Bad Mailbox
 > >>>>>>>>                           print ('syncing ... %s'
%user.split(' ')[0])
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>>                           if options.detail == True :
 > >>>>>>>>                                   command =
'/usr/local/bin/doveadm -D
 > >>>>>>>> sync -u %s -d -N -l 30 -U' %user.split(' ')[0]
 > >>>>>>>>                                   print (command)
 > >>>>>>>>                                   command =
commands(command)
 > >>>>>>>>                                   command =
command.output.split('\n')
 > >>>>>>>>                                   print (command)
 > >>>>>>>>                                   print ('Processed
Mailbox for ...
 > >>>>>>>> %s'
 > >>>>>>>> %user.split(' ')[0] )
 > >>>>>>>>                                  
synced.append('Processed Mailbox
 > >>>>>>>> for ...
 > >>>>>>>> %s' %user.split(' ')[0])
 > >>>>>>>>                                   for nnn in
range(len(command)):
 > >>>>>>>> synced.append(command[nnn] + '\n')
 > >>>>>>>>                                   break
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>>                           if options.detail == False :
 > >>>>>>>>                                   #command =
 > >>>>>>>> '/usr/local/bin/doveadm -D
 > >>>>>>>> sync -u %s -d -N -l 30 -U' %user.split(' ')[0]
 > >>>>>>>>                                   #print (command)
 > >>>>>>>>                                   #command =
os.system(command)
 > >>>>>>>>                                   command =
subprocess.Popen(
 > >>>>>>>> ["/usr/local/bin/doveadm sync -u %s -d -N -l 30 -U"
%user.split('
 > >>>>>>>> ')[0]
 > >>>>>>>> ], \
 > >>>>>>>>                                   shell = True, stdin=None,
 > >>>>>>>> stdout=None,
 > >>>>>>>> stderr=None, close_fds=True)
 > >>>>>>>>
 > >>>>>>>>                                   print ( 'Processed
Mailbox for
 > >>>>>>>> ... %s'
 > >>>>>>>> %user.split(' ')[0] )
 > >>>>>>>>                                  
synced.append('Processed Mailbox
 > >>>>>>>> for ...
 > >>>>>>>> %s' %user.split(' ')[0])
 > >>>>>>>>                                   #sys.exit()
 > >>>>>>>>                                   break
 > >>>>>>>>
 > >>>>>>>> if len(synced) != 0 :
 > >>>>>>>>           #send email showing bad synced boxes ?
 > >>>>>>>>
 > >>>>>>>>           if options.send_to != None :
 > >>>>>>>>                   send_from = 'monitor@scom.ca
&lt;mailto:monitor@scom.ca>'
 > >>>>>>>>                   send_to = ['%s' %options.send_to]
 > >>>>>>>>                   send_subject = 'Dovecot Bad Sync
Report for : %s'
 > >>>>>>>> %(socket.gethostname())
 > >>>>>>>>                   send_text = '\n\n'
 > >>>>>>>>                   for n in range (len(synced)) :
 > >>>>>>>>                           send_text = send_text +
synced[n] + '\n'
 > >>>>>>>>
 > >>>>>>>>                   send_files = []
 > >>>>>>>>                   sendmail (send_from, send_to,
send_subject,
 > >>>>>>>> send_text,
 > >>>>>>>> send_files)
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>> sys.exit()
 > >>>>>>>>
 > >>>>>>>> second :
 > >>>>>>>>
 > >>>>>>>> i posted this a month ago - no response
 > >>>>>>>>
 > >>>>>>>> please appreciate that i am trying to help ....
 > >>>>>>>>
 > >>>>>>>> after much testing i can now reporduce the replication
issues at hand
 > >>>>>>>>
 > >>>>>>>> I am running on freebsd 12 & 13 stable (both test and
production
 > >>>>>>>> servers)
 > >>>>>>>>
 > >>>>>>>> sdram drives etc ...
 > >>>>>>>>
 > >>>>>>>> Basically replication works fine until reaching a folder
quantity
 > >>>>>>>> of ~
 > >>>>>>>> 256 or more
 > >>>>>>>>
 > >>>>>>>> to reproduce using doveadm i created folders like
 > >>>>>>>>
 > >>>>>>>> INBOX/folder-0
 > >>>>>>>> INBOX/folder-1
 > >>>>>>>> INBOX/folder-2
 > >>>>>>>> INBOX/folder-3
 > >>>>>>>> and so forth ......
 > >>>>>>>>
 > >>>>>>>> I created 200 folders and they replicated ok on both servers
 > >>>>>>>>
 > >>>>>>>> I created another 200 (400 total) and the replicator got
stuck and
 > >>>>>>>> would
 > >>>>>>>> not update the mbox on the alternate server anymore and
is still
 > >>>>>>>> updating 4 days later ?
 > >>>>>>>>
 > >>>>>>>> basically replicator goes so far and either hangs or
more likely
 > >>>>>>>> bails
 > >>>>>>>> on an error that is not reported to the debug reporting ?
 > >>>>>>>>
 > >>>>>>>> however dsync will sync the two servers but only when
run manually
 > >>>>>>>> (ie
 > >>>>>>>> all the folders will sync)
 > >>>>>>>>
 > >>>>>>>> I have two test servers avaliable if you need any kind
of access -
 > >>>>>>>> again
 > >>>>>>>> here to help.
 > >>>>>>>>
 > >>>>>>>> [07:28:42] mail18.scom.ca &lt;http://mail18.scom.ca> [root:0] ~
 > >>>>>>>> # sync.status
 > >>>>>>>> Queued 'sync' requests        0
 > >>>>>>>> Queued 'high' requests        6
 > >>>>>>>> Queued 'low' requests         0
 > >>>>>>>> Queued 'failed' requests      0
 > >>>>>>>> Queued 'full resync' requests 0
 > >>>>>>>> Waiting 'failed' requests     0
 > >>>>>>>> Total number of known users   255
 > >>>>>>>>
 > >>>>>>>> username                       type        status
 > >>>>>>>> paul@scom.ca                   normal      Waiting for dsync to finish
 > >>>>>>>> keith@elirpa.com               incremental Waiting for dsync to finish
 > >>>>>>>> ed.hanna@dssmgmt.com           incremental Waiting for dsync to finish
 > >>>>>>>> ed@scom.ca                     incremental Waiting for dsync to finish
 > >>>>>>>> nick@elirpa.com                incremental Waiting for dsync to finish
 > >>>>>>>> paul@paulkudla.net             incremental Waiting for dsync to finish
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>> i have been going through the c code and it seems the
replication
 > >>>>>>>> gets
 > >>>>>>>> requested ok
 > >>>>>>>>
 > >>>>>>>> replicator.db does get updated ok with the replicated
request for the
 > >>>>>>>> mbox in question.
 > >>>>>>>>
 > >>>>>>>> however i am still looking for the actual replicator
function in the
 > >>>>>>>> lib's that do the actual replication requests
 > >>>>>>>>
 > >>>>>>>> the number of folders & subfolders is defanately the
issue - not the
 > >>>>>>>> mbox pyhsical size as thought origionally.
 > >>>>>>>>
 > >>>>>>>> if someone can point me in the right direction, it seems
either the
 > >>>>>>>> replicator is not picking up on the number of folders to
replicat
 > >>>>>>>> properly or it has a hard set limit like 256 / 512 /
65535 etc and
 > >>>>>>>> stops
 > >>>>>>>> the replication request thereafter.
 > >>>>>>>>
 > >>>>>>>> I am mainly a machine code programmer from the 80's and have
 > >>>>>>>> concentrated on python as of late, 'c' i am starting to
go through
 > >>>>>>>> just
 > >>>>>>>> to give you a background on my talents.
 > >>>>>>>>
 > >>>>>>>> It took 2 months to finger this out.
 > >>>>>>>>
 > >>>>>>>> this issue also seems to be indirectly causing the
duplicate messages
 > >>>>>>>> supression not to work as well.
 > >>>>>>>>
 > >>>>>>>> python programming to reproduce issue (loops are for
last run
 > >>>>>>>> started @
 > >>>>>>>> 200 - fyi) :
 > >>>>>>>>
 > >>>>>>>> # cat mbox.gen
 > >>>>>>>> #!/usr/local/bin/python2
 > >>>>>>>>
 > >>>>>>>> import os,sys
 > >>>>>>>>
 > >>>>>>>> from lib import *
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>> user = 'paul@paulkudla.net &lt;mailto:paul@paulkudla.net>'
 > >>>>>>>>
 > >>>>>>>> """
 > >>>>>>>> for count in range (0,600) :
 > >>>>>>>>           box = 'INBOX/folder-%s' %count
 > >>>>>>>>           print count
 > >>>>>>>>           command = '/usr/local/bin/doveadm mailbox
create -s -u %s
 > >>>>>>>> %s'
 > >>>>>>>> %(user,box)
 > >>>>>>>>           print command
 > >>>>>>>>           a = commands.getoutput(command)
 > >>>>>>>>           print a
 > >>>>>>>> """
 > >>>>>>>>
 > >>>>>>>> for count in range (0,600) :
 > >>>>>>>>           box = 'INBOX/folder-0/sub-%s' %count
 > >>>>>>>>           print count
 > >>>>>>>>           command = '/usr/local/bin/doveadm mailbox
create -s -u %s
 > >>>>>>>> %s'
 > >>>>>>>> %(user,box)
 > >>>>>>>>           print command
 > >>>>>>>>           a = commands.getoutput(command)
 > >>>>>>>>           print a
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>>           #sys.exit()
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>> Happy Sunday !!!
 > >>>>>>>> Thanks - paul
 > >>>>>>>>
 > >>>>>>>> Paul Kudla
 > >>>>>>>>
 > >>>>>>>>
 > >>>>>>>> Scom.ca Internet Services &lt;http://www.scom.ca
&lt;http://www.scom.ca>>
 > >>>>>>>> 004-1009 Byron Street South
 > >>>>>>>> Whitby, Ontario - Canada
 > >>>>>>>> L1N 4S3
 > >>>>>>>>
 > >>>>>>>> Toronto 416.642.7266
 > >>>>>>>> Main 1.866.411.7266
 > >>>>>>>> Fax 1.888.892.7266
 > >>>>>>>>
 > >>>>>>>> On 4/24/2022 10:22 AM, Arnaud Abélard wrote:
 > >>>>>>>>> Hello,
 > >>>>>>>>>
 > >>>>>>>>> I am working on replicating a server (and adding
compression on the
 > >>>>>>>>> other side) and since I had "Error: dsync I/O has
stalled, no
 > >>>>>>>>> activity
 > >>>>>>>>> for 600 seconds (version not received)" errors I
upgraded both
 > >>>>>>>>> source
 > >>>>>>>>> and destination server with the latest 2.3 version
(2.3.18). While
 > >>>>>>>>> before the upgrade all the 15 replication connections
were busy
 > >>>>>>>>> after
 > >>>>>>>>> upgrading dovecot replicator dsync-status shows that
most of the
 > >>>>>>>>> time
 > >>>>>>>>> nothing is being replicated at all. I can see some brief
 > >>>>>>>>> replications
 > >>>>>>>>> that last, but 99,9% of the time nothing is happening
at all.
 > >>>>>>>>>
 > >>>>>>>>> I have a replication_full_sync_interval of 12 hours but
I have
 > >>>>>>>>> thousands of users with their last full sync over 90
hours ago.
 > >>>>>>>>>
 > >>>>>>>>> "doveadm replicator status" also shows that i have over
35,000
 > >>>>>>>>> queued
 > >>>>>>>>> full resync requests, but no sync, high or low queued
requests so
 > >>>>>>>>> why
 > >>>>>>>>> aren't the full requests occuring?
 > >>>>>>>>>
 > >>>>>>>>> There are no errors in the logs.
 > >>>>>>>>>
 > >>>>>>>>> Thanks,
 > >>>>>>>>>
 > >>>>>>>>> Arnaud
 > >>>>>>>>>
 > >>>>>>>>>
 > >>>>>>>>>
 > >>>>>>>>>
 > >>>>>>>>>
 > >>>>>>
 > >>>>>
 > >>>
 > >


Paul Kudla (SCOM.CA Internet Services Inc.)

26 Apr

12:36 p.m.

more specific to this issue

it looks like (and this was fun for me to figure out as well)

note replication does not work well on nfs file systems etc

i started with sdram drives on one server and nfs on the other and found i simply had to go sdram (or whatever) on the other one

smoothed all of this out a lot.

basically both servers need to be the same at the end of the day.

the replicator can be a bit fun to setup

i found tcpip (no ssl) worked best

i run one config file (not the 10- etc) so here is one side; the other side is the same except for the replicator ip address, which is just set accordingly. i also run a local backbone network (hence the 10.221.0.0/16), which smooths things out as there is only replication traffic & auth traffic running across this link. No bottlenecks at the end of the day.

i included sni, postgresql & sieve (for duplicates) as well for the complete picture.

took three months to get this going, mainly due to outdated documentation

dovecot works way better than cyrus and supports sni, but current (2.3.18) complete documentation covering setup from beginning to end would help!

I program for a living and even with me documentation always seems to take a back seat.

note that sni loads from a database and i wrote a python script to do that to support auto updating of yearly ssl certs.

/programs/common/getssl.cert

cat dovecot.conf

2.3.14 (cee3cbc0d): /usr/local/etc/dovecot/dovecot.conf

OS: FreeBSD 12.1-RELEASE amd64

Hostname: mail18.scom.ca

auth_debug = no
auth_debug_passwords = no

default_process_limit = 16384

mail_debug = no

#lock_method = dotlock
#mail_max_lock_timeout = 300s

#mbox_read_locks = dotlock
#mbox_write_locks = dotlock

mmap_disable = yes
dotlock_use_excl = no
mail_fsync = always
mail_nfs_storage = no
mail_nfs_index = no

auth_mechanisms = plain login
auth_verbose = yes
base_dir = /data/dovecot/run/
debug_log_path = syslog
disable_plaintext_auth = no
dsync_features = empty-header-workaround

#imapc_features = rfc822.size fetch-headers
#imapc_host = mail.scom.ca
#imapc_password = Pk554669
#imapc_user = paul@scom.ca

info_log_path = syslog
login_greeting = SCOM.CA Internet Services Inc. - Dovecot ready
login_log_format_elements = user=<%u> method=%m rip=%r lip=%l mpid=%e %c

mail_location = maildir:~/

mail_plugins = " virtual notify replication fts fts_lucene "
mail_prefetch_count = 20

protocols = imap pop3 lmtp sieve

protocol lmtp {
  mail_plugins = $mail_plugins sieve
  postmaster_address = monitor@scom.ca
}

service lmtp {
  process_limit=1000
  vsz_limit = 512m
  client_limit=1
  unix_listener /usr/home/postfix.local/private/dovecot-lmtp {
    group = postfix
    mode = 0600
    user = postfix
  }
}

protocol lda {
  mail_plugins = $mail_plugins sieve
}

service lda {
  process_limit=1000
  vsz_limit = 512m
}

service imap {
  process_limit=4096
  vsz_limit = 2g
  client_limit=1
}

service pop3 {
  process_limit=1000
  vsz_limit = 512m
  client_limit=1
}

namespace inbox {
  inbox = yes
  location =
  mailbox Drafts {
    auto = subscribe
    special_use = \Drafts
  }
  mailbox Sent {
    auto = subscribe
    special_use = \Sent
  }
  mailbox Trash {
    auto = subscribe
    special_use = \Trash
  }
  prefix =
  separator = /
}

passdb {
  args = /usr/local/etc/dovecot/dovecot-pgsql.conf
  driver = sql
}

doveadm_port = 12345
doveadm_password = secretxyyyyyy

service doveadm {
  process_limit = 0
  process_min_avail = 0
  idle_kill = 0
  client_limit = 1
  user = vmail
  inet_listener {
    port = 12345
  }
}

service config {
  unix_listener config {
    user = vmail
  }
}

dsync_remote_cmd = ssh -l%{login} %{host} doveadm dsync-server -u%u
#dsync_remote_cmd = doveadm sync -d -u%u

replication_dsync_parameters = -d -N -l 300 -U

plugin {
  mail_log_events = delete undelete expunge copy mailbox_delete mailbox_rename
  mail_log_fields = uid, box, msgid, from, subject, size, vsize, flags
  push_notification_driver = dlog

  sieve = file:~/sieve;active=~/sieve/.dovecot.sieve
  #sieve = ~/.dovecot.sieve
  sieve_duplicate_default_period = 1h
  sieve_duplicate_max_period = 1d
  sieve_extensions = +duplicate +notify +imapflags +vacation-seconds
  sieve_global_dir = /usr/local/etc/dovecot/sieve
  sieve_before = /usr/local/etc/dovecot/sieve/duplicates.sieve

  mail_replica = tcp:10.221.0.19:12345
  #mail_replica = remote:vmail@10.221.0.19
  #replication_sync_timeout = 2

  fts = lucene
  fts_lucene = whitespace_chars=@.
}

#sieve_extensions = vnd.dovecot.duplicate

#sieve_plugins = vnd.dovecot.duplicate

service anvil {
  process_limit = 1
  client_limit=5000
  vsz_limit = 512m
  unix_listener anvil {
    group = vmail
    mode = 0666
  }
}

service auth {
  process_limit = 1
  client_limit=5000
  vsz_limit = 1g

  unix_listener auth-userdb {
    mode = 0660
    user = vmail
    group = vmail
  }
  unix_listener /var/spool/postfix/private/auth {
    mode = 0666
  }
}

service stats {
  process_limit = 1000
  vsz_limit = 1g
  unix_listener stats-reader {
    group = vmail
    mode = 0666
  }
  unix_listener stats-writer {
    group = vmail
    mode = 0666
  }
}

userdb {
  args = /usr/local/etc/dovecot/dovecot-pgsql.conf
  driver = sql
}

protocol imap {
  mail_max_userip_connections = 50
  mail_plugins = $mail_plugins notify replication
}

protocol pop3 {
  mail_max_userip_connections = 50
  mail_plugins = $mail_plugins notify replication
}

protocol imaps {
  mail_max_userip_connections = 25
  mail_plugins = $mail_plugins notify replication
}

protocol pop3s {
  mail_max_userip_connections = 25
  mail_plugins = $mail_plugins notify replication
}

service managesieve-login {
  process_limit = 1000
  vsz_limit = 1g
  inet_listener sieve {
    port = 4190
  }
}

verbose_proctitle = yes

replication_max_conns = 100

replication_full_sync_interval = 1d

service replicator {
  client_limit = 0
  drop_priv_before_exec = no
  idle_kill = 4294967295s
  process_limit = 1
  process_min_avail = 0
  service_count = 0
  vsz_limit = 8g
  unix_listener replicator-doveadm {
    mode = 0600
    user = vmail
  }
  vsz_limit = 8192M
}

service aggregator {
  process_limit = 1000
  #vsz_limit = 1g
  fifo_listener replication-notify-fifo {
    user = vmail
    group = vmail
    mode = 0666
  }
}

service pop3-login {
  process_limit = 1000
  client_limit = 100
  vsz_limit = 512m
}

service imap-urlauth-login {
  process_limit = 1000
  client_limit = 1000
  vsz_limit = 1g
}

service imap-login {
  process_limit=1000
  client_limit = 1000
  vsz_limit = 1g
}

protocol sieve {
  managesieve_implementation_string = Dovecot Pigeonhole
  managesieve_max_line_length = 65536
}

#Addition ssl config
!include sni.conf

cat sni.conf

#sni.conf
ssl = yes
verbose_ssl = yes
ssl_dh =</usr/local/etc/dovecot/dh-4096.pem
ssl_prefer_server_ciphers = yes
#ssl_min_protocol = TLSv1.2

#Default *.scom.ca
ssl_key =</usr/local/etc/dovecot/scom.pem
ssl_cert =</usr/local/etc/dovecot/scom.pem
ssl_ca =</usr/local/etc/dovecot/scom.pem

local_name .scom.ca {
  ssl_key = /programs/common/getssl.cert -c *.scom.ca -q yes
  ssl_cert = /programs/common/getssl.cert -c *.scom.ca -q yes
  ssl_ca = /programs/common/getssl.cert -c *.scom.ca -q yes
}

local_name mail.clancyca.com {
  ssl_key = /programs/common/getssl.cert -c mail.clancyca.com -q yes
  ssl_cert = /programs/common/getssl.cert -c mail.clancyca.com -q yes
  ssl_ca = /programs/common/getssl.cert -c mail.clancyca.com -q yes
}

local_name secure.clancyca.com {
  ssl_key = /programs/common/getssl.cert -c secure.clancyca.com -q yes
  ssl_cert = /programs/common/getssl.cert -c secure.clancyca.com -q yes
  ssl_ca = /programs/common/getssl.cert -c secure.clancyca.com -q yes
}

local_name mail.paulkudla.net {
  ssl_key = /programs/common/getssl.cert -c mail.paulkudla.net -q yes
  ssl_cert = /programs/common/getssl.cert -c mail.paulkudla.net -q yes
  ssl_ca = /programs/common/getssl.cert -c mail.paulkudla.net -q yes
}

local_name mail.ekst.ca {
  ssl_key = /programs/common/getssl.cert -c mail.ekst.ca -q yes
  ssl_cert = /programs/common/getssl.cert -c mail.ekst.ca -q yes
  ssl_ca = /programs/common/getssl.cert -c mail.ekst.ca -q yes
}

local_name mail.hamletdevelopments.ca {
  ssl_key = /programs/common/getssl.cert -c mail.hamletdevelopments.ca -q yes
  ssl_cert = /programs/common/getssl.cert -c mail.hamletdevelopments.ca -q yes
  ssl_ca = /programs/common/getssl.cert -c mail.hamletdevelopments.ca -q yes
}

cat dovecot-pgsql.conf

driver = pgsql
connect = host=localhost port=5433 dbname=scom_billing user=pgsql password=Scom411400
default_pass_scheme = PLAIN

password_query = SELECT username as user, password FROM email_users WHERE username = '%u' and password <> 'alias' and status = True and destination = '%u'

user_query = SELECT home, uid, gid FROM email_users WHERE username = '%u' and password <> 'alias' and status = True and destination = '%u'

#iterate_query = SELECT user, password FROM email_users WHERE username = '%u' and password <> 'alias' and status = True and destination = '%u'

iterate_query = SELECT "username" as user, domain FROM email_users WHERE status = True and alias_flag = False

cat duplicates.sieve

require "duplicate"; # for dovecot >= 2.2.18

if duplicate {
  discard;
  stop;
}

cat /programs/common/getssl.cert

#!/usr/local/bin/python3
#update the ssl certificates for this mail server

import sys
import os
import string
import psycopg2

from optparse import OptionParser

USAGE_TEXT = '''\
usage: %%prog %s[options]
'''

parser = OptionParser(usage=USAGE_TEXT % '', version='0.4')
parser.add_option("-c", "--cert", dest="cert", help="Domain Certificate Requested")
parser.add_option("-k", "--key", dest="key", help="Domain Key Requested")
parser.add_option("-r", "--crt", dest="crt", help="Domain CRT Requested")
parser.add_option("-s", "--csr", dest="csr", help="Domain CSR Requested")
parser.add_option("-i", "--inter", dest="inter", help="Domain INTER Requested")
parser.add_option("-x", "--pem", dest="pem", help="Domain Pem Requested")
parser.add_option("-q", "--quiet", dest="quiet", help="Quiet")

options, args = parser.parse_args()

#print (options.quiet)

#domain to look up; set by whichever option was given
ssl = None

if options.cert != None :
        ssl = options.cert
        if options.quiet == None :
                print ('\nGetting Full Pem Certificate : %s\n' %options.cert)

if options.key != None :
        ssl = options.key
        if options.quiet == None :
                print ('\nGetting Key Certificate : %s\n' %options.key)

if options.crt != None :
        ssl = options.crt
        if options.quiet == None :
                print ('\nGetting CRT Certificate : %s\n' %options.crt)

if options.csr != None :
        ssl = options.csr
        if options.quiet == None :
                print ('\nGetting CSR Certificate : %s\n' %options.csr)

if options.inter != None :
        ssl = options.inter
        if options.quiet == None :
                print ('\nGetting Inter Certificate : %s\n' %options.inter)

if options.pem != None :
        ssl = options.pem
        if options.quiet == None :
                print ('\nGetting Pem Certificate : %s\n' %options.pem)

#without one of the options above there is nothing to look up
if ssl == None :
        parser.error('no certificate option given')

#sys.exit()

#from lib import *

#print ('Opening the Database ....')
conn = psycopg2.connect(host='localhost', port = 5433, database='scom_billing', user='pgsql', password='Scom411400')
pg = conn.cursor()

#print ('Connected !')

#Ok now go get the email keys
#the domain is passed as a query parameter rather than formatted into the sql
command = """select domain,ssl_key,ssl_cert,ssl_csr,ssl_chain from email_ssl_certificates where domain = %s"""
#print (command)

pg.execute(command, (ssl,))
certs = pg.fetchone()

#print (certs)

#fetchone() returns None when the domain has no row
if certs == None :
        conn.close()
        sys.exit('No certificate found for %s' %ssl)

#ok from here we have to decide the output ?
domain = certs[0]

if options.cert != None :
        key = '#SSL Pem file (Key / Certificate / Intermediate) for %s\n\n#Key\n\n' %domain + certs[1] + '\n\n#Certificate\n' + certs[2] + '\n\n#Intermediate\n' + certs[4]

if options.key != None :
        key = '#SSL Key file for %s\n\n' %domain + certs[1]

if options.crt != None :
        key = '#SSL CERT file for %s\n\n' %domain + certs[2]

if options.csr != None :
        key = '#SSL CSR Request file for %s\n\n' %domain + certs[3]

if options.inter != None :
        key = '#SSL Intermediate file for %s\n\n' %domain + certs[4]

if options.pem != None :
        key = '#SSL Pem (Certificate / Intermediate) file for %s\n\n#Certificate\n\n' %domain + certs[2] + '\n\n#Intermediate\n' + certs[4]

key = key.replace('\r','')

print (key)

conn.close()
sys.exit()

Happy Tuesday !!!
Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca>
004-1009 Byron Street South
Whitby, Ontario - Canada
L1N 4S3

Toronto 416.642.7266
Main 1.866.411.7266
Fax 1.888.892.7266

On 4/25/2022 9:13 AM, Arnaud Abélard wrote:

...

Hello,

On my side we are running Linux (Debian Buster).

I'm not sure my problem is actually the same as Paul or you Sebastian since I have a lot of boxes but those are actually small (quota of 110MB) so I doubt any of them have more than a dozen imap folders.

The main symptom is that I have tons of full sync requests awaiting but even though no other sync is pending the replicator just waits for something to trigger those syncs.

Today, with users back I can see that normal and incremental syncs are being done on the 15 connections, with an occasional full sync here or there and lots of "Waiting 'failed' requests":

Queued 'sync' requests        0

Queued 'high' requests        0

Queued 'low' requests         0

Queued 'failed' requests      122

Queued 'full resync' requests 28785

Waiting 'failed' requests     4294

Total number of known users   42512
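To watch this backlog over time rather than eyeballing it, the summary can be turned into numbers. What follows is only a minimal Python sketch, assuming the summary keeps the "label, then count" layout shown above; `parse_replicator_summary` is a made-up helper name, not something Dovecot ships.

```python
import re

# Matches summary lines such as "Queued 'full resync' requests 28785":
# a label, some whitespace, then a trailing integer.
SUMMARY_LINE = re.compile(r"^(?P<label>.*\S)\s+(?P<count>\d+)\s*$")


def parse_replicator_summary(text):
    """Return {label: count} for every 'label  number' line in the summary."""
    counts = {}
    for line in text.splitlines():
        match = SUMMARY_LINE.match(line)
        if match:
            counts[match.group("label")] = int(match.group("count"))
    return counts


# Sample copied from the status output quoted in this thread.
sample = """\
Queued 'sync' requests        0
Queued 'high' requests        0
Queued 'low' requests         0
Queued 'failed' requests      122
Queued 'full resync' requests 28785
Waiting 'failed' requests     4294
Total number of known users   42512
"""

summary = parse_replicator_summary(sample)
backlog = summary["Queued 'full resync' requests"]
users = summary["Total number of known users"]
print("full resync backlog: %d of %d users" % (backlog, users))
```

On a live server the text would come from running `doveadm replicator status` and capturing its output; logging the parsed counts every few minutes would show whether the full resync queue ever drains while the other queues sit at zero.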

So, why didn't the replicator take advantage of the weekend to replicate the mailboxes while no users were using them?

Arnaud

On 25/04/2022 13:54, Sebastian Marske wrote:

...
Hi there,

thanks for your insights and for diving deeper into this Paul!

For me, the users ending up in 'Waiting for dsync to finish' all have more than 256 Imap folders as well (ranging from 288 up to >5500; as per 'doveadm mailbox list -u <username> | wc -l'). For more details on my setup please see my post from February [1].
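The per-user folder check can be scripted for a whole user list. This is a rough sketch under stated assumptions: the 256 threshold is only the approximate figure reported in this thread (not a documented Dovecot limit), the helper names are invented for illustration, and the example counts are made up.

```python
import subprocess

# Assumption: the approximate folder count at which posters in this thread
# saw replication stall. It is not a documented Dovecot limit.
FOLDER_THRESHOLD = 256


def count_folders(listing):
    """Count non-empty lines, roughly what `| wc -l` reports for the listing."""
    return sum(1 for line in listing.splitlines() if line.strip())


def list_folders(user, doveadm="doveadm"):
    """Run `doveadm mailbox list -u <user>` and return its stdout."""
    result = subprocess.run(
        [doveadm, "mailbox", "list", "-u", user],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def users_over_threshold(folder_counts, threshold=FOLDER_THRESHOLD):
    """Return (user, count) pairs above the threshold, largest first."""
    over = [(user, n) for user, n in folder_counts.items() if n > threshold]
    return sorted(over, key=lambda item: item[1], reverse=True)


# Made-up counts for illustration; on a real server each value would be
# count_folders(list_folders(user)).
example_counts = {
    "alice@example.com": 40,
    "bob@example.com": 288,
    "carol@example.com": 5500,
}
for user, n in users_over_threshold(example_counts):
    print(user, n)
```

Comparing the flagged users against the ones stuck in 'Waiting for dsync to finish' would show how well the folder-count theory holds on a given server.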

@Arnaud: What OS are you running on?

Best Sebastian

[1] https://dovecot.org/pipermail/dovecot/2022-February/124168.html

On 4/24/22 19:36, Paul Kudla (SCOM.CA Internet Services Inc.) wrote:

...
Question having similar replication issues

pls read everything below and advise the folder counts on the non-replicated users?

i find the total number of folders / account seems to be a factor and NOT the size of the mail box

ie i have customers with 40G of emails no problem over 40 or so folders and it works ok

300+ folders seems to be the issue

i have been going through the replication code

no errors being logged

i am assuming that the replication --> dhclient --> other server is timing out or not reading the folder lists correctly (ie dies after X folders read)

thus i am going through the code patching for log entries etc to find the issues.

see

[13:33:57] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot

# ll

total 86
drwxr-xr-x  2 root  wheel  uarch    4B Apr 24 11:11 .
drwxr-xr-x  4 root  wheel  uarch    4B Mar  8  2021 ..
-rw-r--r--  1 root  wheel  uarch   73B Apr 24 11:11 instances
-rw-r--r--  1 root  wheel  uarch  160K Apr 24 13:33 replicator.db

[13:33:58] mail18.scom.ca [root:0] /usr/local/var/lib/dovecot

replicator.db seems to get updated ok but never processed properly.

# sync.users

nick@elirpa.com                   high     00:09:41 463:47:01 -         y
keith@elirpa.com                  high     00:09:23 463:45:43 -         y
paul@scom.ca                      high     00:09:41 463:46:51 -         y
ed@scom.ca                        high     00:09:43 463:47:01 -         y
ed.hanna@dssmgmt.com              high     00:09:42 463:46:58 -         y
paul@paulkudla.net                high     00:09:44 463:47:03 580:35:07 y
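Per-user lines like these can be parsed by column instead of scanning characters from the end of the line. A minimal sketch, assuming one user per line and the column order username, priority, fast sync, full sync, success sync, failed (an assumption inferred from the listing here); the function names are hypothetical and the 24 hour interval mirrors replication_full_sync_interval = 1d from the config earlier in the thread.

```python
def parse_age(field):
    """Convert 'HHH:MM:SS' to seconds; '-' (never) becomes None."""
    if field == "-":
        return None
    hours, minutes, seconds = (int(part) for part in field.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_user_line(line):
    """Split one per-user status line into named fields.

    Assumed column order: username, priority, fast sync, full sync,
    success sync, failed.
    """
    user, priority, fast, full, success, failed = line.split()
    return {
        "user": user,
        "priority": priority,
        "fast_sync_age": parse_age(fast),
        "full_sync_age": parse_age(full),
        "success_sync_age": parse_age(success),
        "failed": failed == "y",
    }


def overdue_users(lines, full_sync_interval_hours):
    """Return failed users whose last full sync is older than the interval."""
    limit = full_sync_interval_hours * 3600
    rows = [parse_user_line(line) for line in lines if line.strip()]
    return [
        row["user"] for row in rows
        if row["failed"]
        and row["full_sync_age"] is not None
        and row["full_sync_age"] > limit
    ]


# Two lines copied from the listing in this thread.
sample = [
    "nick@elirpa.com                   high     00:09:41 463:47:01 -     y",
    "paul@paulkudla.net                high     00:09:44 463:47:03 580:35:07 y",
]
print(overdue_users(sample, full_sync_interval_hours=24))
```

Splitting on whitespace avoids a false match on a 'y' or '-' inside the address itself, which a character-by-character scan from the end of the line could eventually hit on lines missing the trailing flag.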

so ....

two things :

first to get the production stuff to work i had to write a script that would find the bad syncs and then force a dsync between the servers

i run this every five minutes on each server.

in crontab

*/10    *                *    *    *    root            /usr/bin/nohup /programs/common/sync.recover > /dev/null

python script to sort things out

cat /programs/common/sync.recover

#!/usr/local/bin/python3

#Force sync between servers that are reporting bad?

import os,sys,django,socket
from optparse import OptionParser

from lib import *

#Sample Re-Index MB
#doveadm -D force-resync -u paul@scom.ca -f INBOX*

USAGE_TEXT = '''
usage: %%prog %s[options]
'''

parser = OptionParser(usage=USAGE_TEXT % '', version='0.4')

parser.add_option("-m", "--send_to", dest="send_to", help="Send Email To")
parser.add_option("-e", "--email", dest="email_box", help="Box to Index")
parser.add_option("-d", "--detail",action='store_true', dest="detail",default =False, help="Detailed report")
parser.add_option("-i", "--index",action='store_true', dest="index",default =False, help="Index")

options, args = parser.parse_args()

print (options.email_box)
print (options.send_to)
print (options.detail)

#sys.exit()

print ('Getting Current User Sync Status')
command = commands("/usr/local/bin/doveadm replicator status '*'")

#print command

sync_user_status = command.output.split('\n')

#print sync_user_status

synced = []

for n in range(1,len(sync_user_status)) :
    user = sync_user_status[n]
    print ('Processing User : %s' %user.split(' ')[0])
    if user.split(' ')[0] != options.email_box :
        if options.email_box != None :
            continue

    if options.index == True :
        command = '/usr/local/bin/doveadm -D force-resync -u %s -f INBOX*' %user.split(' ')[0]
        command = commands(command)
        command = command.output

    #print user
    for nn in range (len(user)-1,0,-1) :
        #print nn
        #print user[nn]

        if user[nn] == '-' :
            #print 'skipping ... %s' %user.split(' ')[0]

            break

        if user[nn] == 'y': #Found a Bad Mailbox
            print ('syncing ... %s' %user.split(' ')[0])

            if options.detail == True :
                command = '/usr/local/bin/doveadm -D sync -u %s -d -N -l 30 -U' %user.split(' ')[0]
                print (command)
                command = commands(command)
                command = command.output.split('\n')
                print (command)
                print ('Processed Mailbox for ... %s' %user.split(' ')[0] )
                synced.append('Processed Mailbox for ... %s' %user.split(' ')[0])
                for nnn in range(len(command)):
                    synced.append(command[nnn] + '\n')
                break

            if options.detail == False :
                #command = '/usr/local/bin/doveadm -D sync -u %s -d -N -l 30 -U' %user.split(' ')[0]
                #print (command)
                #command = os.system(command)
                command = subprocess.Popen(
                    ["/usr/local/bin/doveadm sync -u %s -d -N -l 30 -U" %user.split(' ')[0] ],
                    shell = True, stdin=None, stdout=None, stderr=None, close_fds=True)

                print ( 'Processed Mailbox for ... %s' %user.split(' ')[0] )
                synced.append('Processed Mailbox for ... %s' %user.split(' ')[0])
                #sys.exit()
                break

if len(synced) != 0 :
    #send email showing bad synced boxes ?

    if options.send_to != None :
        send_from = 'monitor@scom.ca'
        send_to = ['%s' %options.send_to]
        send_subject = 'Dovecot Bad Sync Report for : %s' %(socket.gethostname())
        send_text = '\n\n'
        for n in range (len(synced)) :
            send_text = send_text + synced[n] + '\n'

        send_files = []
        sendmail (send_from, send_to, send_subject, send_text, send_files)

sys.exit()
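The forced sync in the script is launched through a shell string. As a hedged alternative, the sketch below builds the same `doveadm sync -u <user> -d -N -l 30 -U` invocation as an argument list, so the username is never interpreted by a shell. The names `build_sync_command` and `force_sync` are illustrative and not part of the original script, and unlike the original this variant discards the child's output rather than inheriting it.

```python
import subprocess

DOVEADM = "/usr/local/bin/doveadm"  # path used in the original script


def build_sync_command(user, debug=False):
    """Return the argv for the forced sync used in sync.recover."""
    argv = [DOVEADM]
    if debug:
        argv.append("-D")  # same debug switch the script uses with --detail
    argv += ["sync", "-u", user, "-d", "-N", "-l", "30", "-U"]
    return argv


def force_sync(user, debug=False):
    """Start the sync in the background and return the Popen handle."""
    return subprocess.Popen(
        build_sync_command(user, debug),
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
```

Keeping the command construction separate from the launch makes it easy to print or log the exact command before running it.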

second :

I posted this a month ago and got no response.

Please appreciate that I am trying to help.

After much testing I can now reproduce the replication issue at hand.

I am running FreeBSD 12 and 13 stable (both test and production servers), sdram drives etc.

Basically, replication works fine until an account reaches roughly 256 folders or more.

To reproduce, I used doveadm to create folders like:

INBOX/folder-0
INBOX/folder-1
INBOX/folder-2
INBOX/folder-3
and so forth.

I created 200 folders and they replicated OK on both servers.

I then created another 200 (400 total); the replicator got stuck, stopped updating the mailbox on the alternate server, and is still shown as updating 4 days later.

Basically the replicator gets so far and then either hangs or, more likely, bails out on an error that never reaches the debug log.

However, dsync will sync the two servers when run manually (i.e. all the folders sync).

I have two test servers available if you need any kind of access. Again, here to help.

[07:28:42] mail18.scom.ca [root:0] ~

sync.status

Queued 'sync' requests        0
Queued 'high' requests        6
Queued 'low' requests         0
Queued 'failed' requests      0
Queued 'full resync' requests 0
Waiting 'failed' requests     0
Total number of known users   255

username                       type        status
paul@scom.ca                   normal      Waiting for dsync to finish
keith@elirpa.com               incremental Waiting for dsync to finish
ed.hanna@dssmgmt.com           incremental Waiting for dsync to finish
ed@scom.ca                     incremental Waiting for dsync to finish
nick@elirpa.com                incremental Waiting for dsync to finish
paul@paulkudla.net             incremental Waiting for dsync to finish
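For monitoring, the counter lines in this output can be turned into numbers and compared between runs. A minimal sketch, assuming each counter line is a label followed by whitespace and an integer, as in the paste above; `parse_replicator_summary` is an illustrative name:

```python
import re


def parse_replicator_summary(text):
    """Map each counter label to its integer value.

    Lines that do not end in an integer (for example the per-user
    dsync status rows) are ignored.
    """
    counters = {}
    for line in text.splitlines():
        match = re.match(r"^(.*\S)\s+(\d+)\s*$", line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


if __name__ == "__main__":
    sample = (
        "Queued 'sync' requests        0\n"
        "Queued 'high' requests        6\n"
        "Total number of known users   255\n"
    )
    print(parse_replicator_summary(sample))
```

A 'high' queue that stays non-zero across several runs while nothing else moves would match the stuck state described here.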

I have been going through the C code, and it seems the replication gets requested OK.

replicator.db does get updated with the replication request for the mailbox in question.

However, I am still looking for the function in the libraries that actually carries out the replication requests.

The number of folders and subfolders is definitely the issue, not the physical size of the mailbox as I originally thought.

If someone can point me in the right direction: it seems that either the replicator is not picking up the number of folders to replicate properly, or it has a hard-set limit (256 / 512 / 65535 etc.) and stops the replication request thereafter.
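One way to test the hard-limit hypothesis is to narrow down the smallest folder count at which replication stalls, starting from the known-good 200 and the known-bad 400. The sketch below is a plain binary search; `replicates_ok` is a hypothetical, caller-supplied check (on a real system it would bring a test account to `n` folders and see whether the replicator finishes), and the search assumes that once replication fails, adding folders never makes it succeed again.

```python
def first_failing_count(replicates_ok, low, high):
    """Return the smallest count in (low, high] for which replicates_ok is False.

    low must be a count known to replicate, high a count known to fail.
    """
    while high - low > 1:
        mid = (low + high) // 2
        if replicates_ok(mid):
            low = mid   # still fine, the limit is above mid
        else:
            high = mid  # already failing, the limit is at or below mid
    return high


if __name__ == "__main__":
    # Simulated check standing in for a real replication test.
    print(first_failing_count(lambda n: n < 256, 200, 400))
```

With 200 and 400 as bounds this needs at most eight real replication tests, which matters when each one can take minutes.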

To give you some background on my skills: I was mainly a machine code programmer in the 80s and have concentrated on Python of late; C I am only starting to go through.

It took 2 months to figure this out.

This issue also seems to be indirectly causing duplicate message suppression not to work.

Python program to reproduce the issue (the loops are as of the last run, which started at 200, FYI):

cat mbox.gen

#!/usr/local/bin/python2

import os,sys

from lib import *

user = 'paul@paulkudla.net'

"""
for count in range (0,600) :
    box = 'INBOX/folder-%s' %count
    print count
    command = '/usr/local/bin/doveadm mailbox create -s -u %s %s' %(user,box)
    print command
    a = commands.getoutput(command)
    print a
"""

for count in range (0,600) :
    box = 'INBOX/folder-0/sub-%s' %count
    print count
    command = '/usr/local/bin/doveadm mailbox create -s -u %s %s' %(user,box)
    print command
    a = commands.getoutput(command)
    print a

    #sys.exit()
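For anyone repeating the test under Python 3, here is a hedged dry-run sketch of the same idea. It only generates the `doveadm mailbox create -s -u <user> <box>` commands that mbox.gen issues and prints them, so the list can be reviewed before anything runs. `create_commands` and the example address are illustrative, and `shlex.join` needs Python 3.8 or later.

```python
import shlex

DOVEADM = "/usr/local/bin/doveadm"  # path used in mbox.gen


def create_commands(user, prefix, start, stop):
    """Yield one argv per folder, mirroring the doveadm call in mbox.gen."""
    for count in range(start, stop):
        box = "%s%d" % (prefix, count)
        yield [DOVEADM, "mailbox", "create", "-s", "-u", user, box]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in create_commands("user@example.com", "INBOX/folder-", 0, 3):
        print(shlex.join(argv))
```

To actually create the folders, each argv could be passed to `subprocess.run(argv, check=True)`.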

Happy Sunday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266


Daniel Lange

27 Apr 27 Apr

3:57 p.m.

New subject: Better not post your email password on a public mailing list, was: Re: no full syncs after upgrading to dovecot 2.3.18

Am 26.04.22 um 11:36 schrieb Paul Kudla (SCOM.CA Internet Services Inc.):

...

#imapc_host = mail.scom.ca
#imapc_password = [redacted]
#imapc_user = paul@scom.ca

I suggest changing that password immediately.

$ openssl s_client -crlf -connect mail.scom.ca:993
CONNECTED(00000003)

* OK [CAPABILITY IMAP4rev1 SASL-IR LOGIN-REFERRALS ID ENABLE IDLE LITERAL+ AUTH=PLAIN AUTH=LOGIN] SCOM.CA Internet Services Inc. - Dovecot ready
A login paul@scom.ca [redacted]
A OK [CAPABILITY IMAP4rev1 SASL-IR LOGIN-REFERRALS ID ENABLE IDLE SORT SORT=DISPLAY THREAD=REFERENCES THREAD=REFS THREAD=ORDEREDSUBJECT MULTIAPPEND URL-PARTIAL CATENATE UNSELECT CHILDREN NAMESPACE UIDPLUS LIST-EXTENDED I18NLEVEL=1 CONDSTORE QRESYNC ESEARCH ESORT SEARCHRES WITHIN CONTEXT=SEARCH LIST-STATUS BINARY MOVE SNIPPET=FUZZY PREVIEW=FUZZY PREVIEW STATUS=SIZE SAVEDATE LITERAL+ NOTIFY SPECIAL-USE] Logged in
A status INBOX (messages)
* STATUS INBOX (MESSAGES 344)
A OK Status completed (0.002 + 0.000 + 0.001 secs).
^C

Kind regards, Daniel

Sebastian Nielsen

28 Apr 28 Apr

12:21 a.m.

New subject: Sv: Better not post your email password on a public mailing list, was: Re: no full syncs after upgra

Even more stupid that the IMAP port is available to the public. It should have been firewalled to authorized IPs only; then it wouldn't have mattered that the password had leaked.


Paul Kudla (SCOM.CA Internet Services Inc.)

1:53 p.m.

New subject: Better not post your email password on a public mailing list, was: Re: no full syncs after upgrading to dovecot 2.3.18

thanks

I love to share but sometimes forget what's noted inside a config file.

Been meaning to change this for a while anyways.
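A small pre-posting filter can help with exactly this kind of slip. The sketch below masks the value of any `name = value` setting whose name contains `password`, `passwd` or `secret`, including commented-out lines. The keyword list and the name `redact_secrets` are assumptions for illustration; it will not catch credentials embedded in URLs or connect strings, so a manual read-through is still needed.

```python
import re

# Optional comment marker, a setting name containing a sensitive keyword,
# then "=" and a non-empty value. Group 1 is everything up to the value.
_SECRET = re.compile(
    r"^(\s*#?\s*[\w.]*(?:password|passwd|secret)[\w.]*\s*=\s*)(\S.*)$",
    re.IGNORECASE,
)


def redact_secrets(text):
    """Return text with the values of secret-looking settings masked."""
    cleaned = []
    for line in text.splitlines():
        cleaned.append(_SECRET.sub(r"\1<redacted>", line))
    return "\n".join(cleaned)


if __name__ == "__main__":
    config = "#imapc_host = mail.example.com\n#imapc_password = hunter2\n"
    print(redact_secrets(config))
```

It could be used as a filter on the output of `doveconf -n` before pasting it into a mail.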

Happy Thursday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266


Paul Kudla (SCOM.CA Internet Services Inc.)

26 Apr 26 Apr

12:43 p.m.

OK, until someone fixes the folder count issue (or I figure it out):

The errors stay on the replicator side, and I use that error status (knowing those users are not replicating) to trigger a background dsync through doveadm from a Python script (seems to work).

But this is a hack.

[05:41:37] mail18.scom.ca [root:0] /programs/lib

sync.users

nick@elirpa.com                   high     00:06:41 503:54:00 -         y
keith@elirpa.com                  high     00:06:20 503:52:42 -         y
paul@scom.ca                      high     00:06:40 503:53:50 -         y
ed@scom.ca                        high     00:06:42 503:54:00 -         y
ed.hanna@dssmgmt.com              high     00:06:41 503:53:57 -         y
paul@paulkudla.net                high     00:06:43 503:54:02 620:42:06 y

[05:41:46] mail18.scom.ca [root:0] /programs/lib

cat /programs/common/sync.users

doveadm replicator status '*' | grep ' y'
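The same filter can be done in Python once the status output has been captured. A minimal sketch, assuming (as the recovery script and the `grep ' y'` above do) that a trailing `y` in the last column marks a user whose replication failed; `failed_users` is an illustrative name:

```python
def failed_users(status_output):
    """Return usernames whose last column is 'y' in replicator status output."""
    users = []
    for line in status_output.splitlines():
        fields = line.split()
        # A data row has at least a username and a flag; the header row
        # ends in the word "failed", so it is skipped by the 'y' test.
        if len(fields) >= 2 and fields[-1] == "y":
            users.append(fields[0])
    return users


if __name__ == "__main__":
    sample = (
        "nick@elirpa.com      high 00:06:41 503:54:00 -         y\n"
        "paul@paulkudla.net   high 00:06:43 503:54:02 620:42:06 y\n"
    )
    print(failed_users(sample))
```

Splitting on whitespace avoids the character-by-character scan from the end of each line and does not depend on column widths.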

crontab :

*/10 * * * * root /usr/bin/nohup /programs/common/sync.recover > /dev/null

cat /programs/common/sync.recover

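#!/usr/local/bin/python3

#Force sync between servers that are reporting bad?

import os,sys,django,socket
from optparse import OptionParser

from lib import *

#Sample Re-Index MB
#doveadm -D force-resync -u paul@scom.ca -f INBOX*

USAGE_TEXT = '''
usage: %%prog %s[options]
'''

parser = OptionParser(usage=USAGE_TEXT % '', version='0.4')

parser.add_option("-m", "--send_to", dest="send_to", help="Send Email To")
parser.add_option("-e", "--email", dest="email_box", help="Box to Index")
parser.add_option("-d", "--detail",action='store_true', dest="detail",default =False, help="Detailed report")
parser.add_option("-i", "--index",action='store_true', dest="index",default =False, help="Index")

options, args = parser.parse_args()

print (options.email_box)
print (options.send_to)
print (options.detail)

#sys.exit()

print ('Getting Current User Sync Status')
command = commands("/usr/local/bin/doveadm replicator status '*'")

#print command

sync_user_status = command.output.split('\n')

#print sync_user_status

synced = []

for n in range(1,len(sync_user_status)) :
    user = sync_user_status[n]
    print ('Processing User : %s' %user.split(' ')[0])
    if user.split(' ')[0] != options.email_box :
        if options.email_box != None :
            continue

    if options.index == True :
        command = '/usr/local/bin/doveadm -D force-resync -u %s -f INBOX*' %user.split(' ')[0]
        command = commands(command)
        command = command.output

    #print user
    for nn in range (len(user)-1,0,-1) :
        #print nn
        #print user[nn]

        if user[nn] == '-' :
            #print 'skipping ... %s' %user.split(' ')[0]

            break

        if user[nn] == 'y': #Found a Bad Mailbox
            print ('syncing ... %s' %user.split(' ')[0])

            if options.detail == True :
                command = '/usr/local/bin/doveadm -D sync -u %s -d -N -l 30 -U' %user.split(' ')[0]
                print (command)
                command = commands(command)
                command = command.output.split('\n')
                print (command)
                print ('Processed Mailbox for ... %s' %user.split(' ')[0] )
                synced.append('Processed Mailbox for ... %s' %user.split(' ')[0])
                for nnn in range(len(command)):
                    synced.append(command[nnn] + '\n')
                break

            if options.detail == False :
                #command = '/usr/local/bin/doveadm -D sync -u %s -d -N -l 30 -U' %user.split(' ')[0]
                #print (command)
                #command = os.system(command)
                command = subprocess.Popen(
                    ["/usr/local/bin/doveadm sync -u %s -d -N -l 30 -U" %user.split(' ')[0] ],
                    shell = True, stdin=None, stdout=None, stderr=None, close_fds=True)

                print ( 'Processed Mailbox for ... %s' %user.split(' ')[0] )
                synced.append('Processed Mailbox for ... %s' %user.split(' ')[0])
                #sys.exit()
                break

if len(synced) != 0 :
    #send email showing bad synced boxes ?

    if options.send_to != None :
        send_from = 'monitor@scom.ca'
        send_to = ['%s' %options.send_to]
        send_subject = 'Dovecot Bad Sync Report for : %s' %(socket.gethostname())
        send_text = '\n\n'
        for n in range (len(synced)) :
            send_text = send_text + synced[n] + '\n'

        send_files = []
        sendmail (send_from, send_to, send_subject, send_text, send_files)

sys.exit()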

Lib3 (lib.py), my general library load:

cat lib3.py

#Load the libraries for the system

import os,sys,time,socket
import string
from ftplib import FTP
from decimal import *
from datetime import date
import datetime
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.base import MIMEBase
from email.mime.text import MIMEText
from email.utils import COMMASPACE, formatdate
from email import encoders
import subprocess

getcontext().prec = 20

class commands:
    def __init__(self,command) :
        self.command = command
        #print (self.command)
        self.output = 'Error'
        self.status = '255'

        #sample
        #rc, gopath = subprocess.getstatusoutput('ls -a')

        self.status, self.output = subprocess.getstatusoutput(self.command)

        try:
            self.cr = self.output.split('\n')
        except :
            self.cr = []
        try:
            self.count = len(self.cr)
        except :
            self.count = 0

        self.status = int(self.status)

        #return count=number of lines, cr = lines split, output = actual output returned, status = return code

        return

#Email with attachment
class sendmail:
    def __init__(self, send_from, send_to, send_subject, send_text, send_files):
        #send_from, send_to, send_subject, send_text, send_files):
        #print ('lib.py sending email')
        assert type(send_to)==list
        assert type(send_files)==list

        msg = MIMEMultipart()
        msg['From'] = send_from
        msg['To'] = COMMASPACE.join(send_to)
        msg['Date'] = formatdate(localtime=True)
        msg['Subject'] = send_subject

        msg.attach( MIMEText(send_text) )

        for f in send_files:
            part = MIMEBase('application', "octet-stream")
            part.set_payload( open(f,"rb").read() )
            encoders.encode_base64(part) #module is imported as lowercase 'encoders'
            part.add_header('Content-Disposition', 'attachment; filename="%s"' % os.path.basename(f))
            msg.attach(part)

        try : #Send Local?
            smtp = smtplib.SMTP('mail.local.scom.ca')
            #smtp.login('backup@scom.ca','[redacted]')
            #print ('Sending Email to : %s' %send_to)
            smtp.sendmail(send_from, send_to, msg.as_string())
            smtp.close()

        except :
            smtp = smtplib.SMTP('mail.scom.ca')
            smtp.login('backup@scom.ca','[redacted]')
            #print ('Sending Email to : %s' %send_to)
            smtp.sendmail(send_from, send_to, msg.as_string())
            smtp.close()

class getdatetime:
    def __init__(self):
        self.datetime = datetime.date.today()
        self.datetime_now = datetime.datetime.now()
        self.date = str( time.strftime("%Y-%m-%d %H:%M:%S") )
        self.date_long = str( time.strftime("%Y-%m-%d %H:%M:%S") )
        self.date_short = str( time.strftime("%Y-%m-%d") )
        self.time = str( time.strftime("%H:%M:%S") )
        self.date_time_sec = self.datetime_now.strftime ("%Y-%m-%d %H:%M:%S.%f")

class create_ascii :
    def __init__(self,string_data) :
        self.string_data = str(string_data)
        import string
        self.printable = set(string.printable)
        self.list = list(filter(lambda x: x in self.printable, self.string_data))
        #print (self.list)

        self.ascii = ''
        for n in range (0,len(self.list)) :
            self.ascii = self.ascii + self.list[n]
        self.ascii = str(self.ascii)

        return

#Return edi senddate string (short) 2011-10-31 into 111031
class edi_send_date_short:
    def __init__(self, senddate):
        self.date = senddate
        self.result = self.date[2] + self.date[3] + self.date[5] + self.date[6] + self.date[8] + self.date[9]

    def __str__(self):
        return '%s' % self.result

##Return edi senddate string (long) 2011-10-31 into 20111031
class edi_send_date_long:
    def __init__(self, senddate):
        self.date = senddate
        self.result1 = self.date[0] + self.date[1] + self.date[2] + self.date[3] + self.date[5] + self.date[6] + self.date[8] + self.date[9]
        self.result2 = self.date[2] + self.date[3] + self.date[5] + self.date[6] + self.date[8] + self.date[9]

    def __str__(self):
        #a single %s cannot format a 2-tuple; return the long form
        return '%s' % self.result1

class gpsdeg:
    def __init__(self, dms):
        self.dms = dms
        self.is_positive = self.dms >= 0
        self.dms = abs(self.dms)
        self.minutes,self.seconds = divmod(self.dms*3600,60)
        self.degrees,self.minutes = divmod(self.minutes,60)
        self.degrees = self.degrees if self.is_positive else -self.degrees

    def __str__(self):
        #one placeholder per value; a single %s raised TypeError here
        return '%s %s %s' % (self.degrees,self.minutes,self.seconds)

class degdir:
    def __init__(self, degrees):
        self.direction_data = ['N','348.75','11.25','NNE', '11.25','33.75',
            'NE','33.75','56.25','ENE', '56.25','78.75',
            'E','78.75','101.25','ESE','101.25','123.75',
            'SE','123.75','146.25','SSE','146.25','168.75',
            'S','168.75','191.25','SSW','191.25','213.75',
            'SW','213.75','236.25','WSW','236.25','258.75',
            'W','258.75','281.25','WNW','281.25','303.75',
            'NW','303.75','326.25','NNW','326.25','348.75']

    def __str__(self):
        return '%s' % (self.direction)

class gettime:
    def __init__(self):
        self.uu = time.localtime()

        self.todaystime = str(self.uu[3]) #get the hr

        if int(self.uu[3]) < 10: #add a zero
            self.todaystime = '0' + self.todaystime
        if int(self.uu[4]) < 10: #add a zero in front
            self.todaystime = self.todaystime +":0"+str(self.uu[4])
        else:
            self.todaystime = self.todaystime +":"+str(self.uu[4])

    def __str__(self):
        return self.todaystime

class array2dbstring:
    def __init__(self,array):
        self.data = array
        for self.nn in range(0,len(self.data)):
            print ('Data %s \t\t %s' % (str(self.data[self.nn]),str( type(self.data[self.nn])) ) )
            #change all data into strings
            self.a = type(self.data[self.nn])
            self.a = str(self.a)
            if 'Decimal' in self.a :
                self.n = str(self.data[self.nn])
                #self.n = self.n.lstrip("'")
                #self.n = self.n.rstrip("'")
                #self.data[self.nn] = float(self.data[self.nn])
                self.data[self.nn] = str('0.00')
                print (self.n)

            if 'NoneType' in self.a :
                self.data[self.nn] = ''
            if 'datetime.datetime' in self.a :
                #self.data[self.nn] = str(self.data[self.nn])
                #self.data[self.nn].replace
                self.data[self.nn] = '2012-01-25 00:00:00'
        self.data = str(self.data)
        self.data = self.data.lstrip('[')
        self.data = self.data.rstrip(']')
        self.data = self.data.replace("'NULL'","NULL")
        #self.data = self.data.replace(" '',", ",")
        #self.data = self.data.replace(" '0.00'","'100'")

    def __str__(self):
        return self.data

class get_hostname:
    def __init__(self):
        self.hostname = socket.gethostname()
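Since the recovery script leans on the `commands` wrapper from this library, here is a compact stand-alone sketch of the same idea built on `subprocess.run`, for readers who want to try it without the rest of the file. The class name `Command` is illustrative. It exposes the same `status`, `output`, `cr` and `count` attributes, with one deliberate difference: empty output yields an empty line list rather than a list holding one empty string.

```python
import subprocess


class Command:
    """Run a shell command and keep its exit status and combined output."""

    def __init__(self, command):
        done = subprocess.run(
            command,
            shell=True,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into stdout
            text=True,
        )
        self.command = command
        self.status = done.returncode
        self.output = done.stdout.rstrip("\n")
        # cr holds the output split into lines, count is the number of lines
        self.cr = self.output.split("\n") if self.output else []
        self.count = len(self.cr)
```

A call such as `Command("/usr/local/bin/doveadm replicator status '*'").cr` would then give the status rows as a list of strings.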

Happy Tuesday !!! Thanks - paul

Paul Kudla

Scom.ca Internet Services <http://www.scom.ca> 004-1009 Byron Street South Whitby, Ontario - Canada L1N 4S3

Toronto 416.642.7266 Main 1.866.411.7266 Fax 1.888.892.7266

On 4/25/2022 7:54 AM, Sebastian Marske wrote:

...

Hi there,

thanks for your insights and for diving deeper into this Paul!

For me, the users ending up in 'Waiting for dsync to finish' all have more than 256 IMAP folders as well (ranging from 288 up to more than 5500, as per 'doveadm mailbox list -u <username> | wc -l'). For more details on my setup please see my post from February [1].
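The per-user folder count can be collected for many accounts at once and compared against the suspected limit. A minimal sketch, assuming the doveadm path from Paul's scripts and a threshold of 256 taken from the observations in this thread; `count_folders` and `over_threshold` are illustrative names:

```python
import subprocess

DOVEADM = "/usr/local/bin/doveadm"  # assumption: adjust to the local install


def count_folders(user):
    """Count mailboxes for one user via `doveadm mailbox list -u <user>`."""
    result = subprocess.run(
        [DOVEADM, "mailbox", "list", "-u", user],
        capture_output=True, text=True, check=True,
    )
    return len(result.stdout.splitlines())


def over_threshold(folder_counts, threshold=256):
    """Return (user, count) pairs above the threshold, largest first."""
    flagged = [(u, n) for u, n in folder_counts.items() if n > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

Feeding it `{user: count_folders(user) for user in users}` and comparing the result with the users stuck in 'Waiting for dsync to finish' would show how well the two sets line up.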

@Arnaud: What OS are you running on?

Best Sebastian

[1] https://dovecot.org/pipermail/dovecot/2022-February/124168.html

