[Dovecot] dsync backup gets stuck... fails
More dsync issues. We were running 2.1.7 and we updated to 2.1.9. Same problem with both versions. I'm getting an error 75 on about 40 boxes out of 1800. It is the same list of boxes every time we use 'dsync backup' to back up the server. dsync seems to stop communicating with the backup box (over ssh). strace just shows it sitting at an epoll_wait. Once the program quits (times out?), a 'du' shows the destination is smaller (200kbyte in one case). Has anyone else seen an exit code of 75? Nothing in the documentation mentions what exit code 75 could mean. What can I do to help the developers locate the bug?
...Jeff
Hi Jeff,
Jeff Gustafson wrote:
More dsync issues. We were running 2.1.7 and we updated to 2.1.9. Same problem with both versions. I'm getting an error 75 on about 40 boxes out of 1800. It is the same list of boxes every time we use 'dsync backup' to back up the server. dsync seems to stop communicating with the backup box (over ssh). strace just shows it sitting at an epoll_wait. Once the program quits (times out?), a 'du' shows the destination is smaller (200kbyte in one case). What can I do to help the developers locate the bug?
Please start by following the instructions at http://dovecot.org/bugreport.html and post your 'doveconf -n' output in order to provide possibly important information about your system and configs.
Regards Daniel
On Sat, 2012-08-11 at 00:56 +0200, Daniel Parthey wrote:
Hi Jeff, Please start by following the instructions at http://dovecot.org/bugreport.html and post your 'doveconf -n' output in order to provide possibly important information about your system and configs.
Storage is local hardware-based RAID (SATA drives). FS is ext4. No core dump that I know of. Command used (mailbox is over 2GB):
# su vmail -c "dsync -u user@domain.com backup ssh vmail@bk05 dsync -c /etc/dovecot/dovecot.conf -u user@domain.com"
Config file:

# 2.1.9: /etc/dovecot/dovecot.conf
# OS: Linux 2.6.32-220.17.1.el6.x86_64 x86_64 CentOS release 6.2 (Final)
auth_mechanisms = plain login
default_client_limit = 15000
default_process_limit = 10000
disable_plaintext_auth = no
listen = *
mail_gid = vmail
mail_location = mdbox:~/mdbox
mail_plugins = zlib quota stats
mail_uid = vmail
mmap_disable = yes
namespace {
  inbox = yes
  location =
  prefix = INBOX.
  separator = .
}
passdb {
  args = /etc/dovecot/conf.d/dovecot-sql.conf.ext
  driver = sql
}
plugin {
  quota = dict:User noenforcing quota::file:%h/dovecot-quota
  stats_refresh = 30 secs
  stats_track_cmds = yes
  zlib_save = gz
}
protocols = imap pop3
service auth {
  client_limit = 10000
  unix_listener auth-userdb {
    mode = 0666
  }
}
service imap-postlogin {
  executable = script-login /usr/bin/postlogin-imap.sh
  user = $default_internal_user
}
service imap {
  drop_priv_before_exec = yes
  executable = imap
  process_limit = 10000
}
service pop3-postlogin {
  executable = script-login /usr/bin/postlogin-pop.sh
  user = $default_internal_user
}
service pop3 {
  drop_priv_before_exec = yes
  executable = pop3
  process_limit = 2500
}
service stats {
  fifo_listener stats-mail {
    mode = 0600
    user = vmail
  }
}
ssl_cert =
Jeff Gustafson wrote:
On Sat, 2012-08-11 at 00:56 +0200, Daniel Parthey wrote:
Hi Jeff, Please start by following the instructions at http://dovecot.org/bugreport.html and post your 'doveconf -n' output in order to provide possibly important information about your system and configs.
Storage is local hardware-based RAID (SATA drives). FS is ext4. No core dump that I know of. Command used (mailbox is over 2GB):
# su vmail -c "dsync -u user@domain.com backup ssh vmail@bk05 dsync -c /etc/dovecot/dovecot.conf -u user@domain.com"
Config file:
# 2.1.9: /etc/dovecot/dovecot.conf
# OS: Linux 2.6.32-220.17.1.el6.x86_64 x86_64 CentOS release 6.2 (Final)
Maybe you have run into the epoll kernel bug under RHEL/CentOS:
http://dovecot.org/oldnews.html
Thu Mar 22 14:38:53 EET 2012
Red Hat/CentOS users: A recent kernel update causes Dovecot to start failing after it has reached 1000 child processes. To fix this, downgrade your kernel until Red Hat releases a fixed kernel.
See also: https://bugzilla.redhat.com/show_bug.cgi?id=681578
You should try another kernel: https://rhn.redhat.com/errata/RHSA-2012-1129.html
Regards Daniel
On Sat, 2012-08-11 at 15:50 +0200, Daniel Parthey wrote:
Maybe you have run into the epoll kernel bug under RHEL/CentOS:
:) Yeah... been there, done that. We found that bug within *minutes* of ksplice updating the kernel. I don't think this is an epoll thing because, if it was, customers wouldn't be able to connect to our services. I think there is something else going on. I think it is a bug, but I suppose it could be a setting somewhere. Something is timing out or getting stuck.
...Jeff
I ran an rsync on the mailboxes that I was having issues with. I re-ran rsync until I had a full sync with no further updates. Then I ran a dsync. dsync was able to run without issue. If I wipe out the target directory and re-run dsync, I'm back to dsync getting stuck. Running rsync on mdbox files is not optimal. What else can I do to track down the issue? I've contacted Timo's company about paid support so we can get a fix for this issue. I hope to hear from them soon.
...Jeff
On 11.8.2012, at 0.54, Jeff Gustafson wrote:
More dsync issues. We were running 2.1.7 and we updated to 2.1.9. Same problem with both versions. I'm getting an error 75 on about 40 boxes out of 1800. It is the same list of boxes every time we use 'dsync backup' to back up the server. dsync seems to stop communicating with the backup box (over ssh). strace just shows it sitting at an epoll_wait.
So you can easily reproduce this by running dsync for a specific user?
Once the program quits (times out?), a 'du' shows the destination is smaller (200kbyte in one case).
As in, some of the mails didn't get synced? (doveadm fetch could be used to do a better comparison, file sizes don't necessarily mean anything.)
Has anyone else seen an exit code of 75? Nothing in the documentation mentions what exit code 75 could mean.
"temporary failure".
What can I do to help the developers locate the bug?
Those hangs are a little bit annoying to debug, and the whole code has been rewritten for v2.2 already in a way that should make the hangs pretty much impossible. Annoyingly v2.2 isn't ready yet..
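The doveadm fetch comparison suggested above could be sketched roughly as follows. This is an illustration only, not something posted in the thread: the helper name missing_on_backup and the file names are made up, and it assumes that fetching the guid field with the search query "all" yields one line per message on both hosts and that dsync preserves message GUIDs. Both assumptions are worth checking against your Dovecot version.

```shell
#!/bin/sh
# missing_on_backup SRC_LIST DST_LIST
# Hypothetical helper: prints every line (assumed to be one message GUID
# per line) that occurs in SRC_LIST but not in DST_LIST.
missing_on_backup() {
    src_sorted=$(mktemp) || return 1
    dst_sorted=$(mktemp) || { rm -f "$src_sorted"; return 1; }
    # comm needs both inputs sorted in the same collation order.
    LC_ALL=C sort -u "$1" > "$src_sorted"
    LC_ALL=C sort -u "$2" > "$dst_sorted"
    # -2 drops lines found only in DST, -3 drops lines common to both.
    LC_ALL=C comm -23 "$src_sorted" "$dst_sorted"
    rm -f "$src_sorted" "$dst_sorted"
}

# Assumed way to produce the two lists (not run here):
#   doveadm fetch -u user@domain.com guid all > src-guids.txt
#   ssh vmail@bk05 doveadm fetch -u user@domain.com guid all > dst-guids.txt
#   missing_on_backup src-guids.txt dst-guids.txt
```

An empty result would suggest that the backup holds every message even though 'du' reports a smaller size, since on-disk sizes can legitimately differ between the two hosts.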
On Tue, 2012-08-14 at 23:23 +0300, Timo Sirainen wrote:
On 11.8.2012, at 0.54, Jeff Gustafson wrote:
More dsync issues. We were running 2.1.7 and we updated to 2.1.9. Same problem with both versions. I'm getting an error 75 on about 40 boxes out of 1800. It is the same list of boxes every time we use 'dsync backup' to back up the server. dsync seems to stop communicating with the backup box (over ssh). strace just shows it sitting at an epoll_wait.
So you can easily reproduce this by running dsync for a specific user?
Yes. There is a subset of mailboxes that always time out.
Once the program quits (times out?), a 'du' shows the destination is smaller (200kbyte in one case).
As in, some of the mails didn't get synced? (doveadm fetch could be used to do a better comparison, file sizes don't necessarily mean anything.)
True, I will dump out the mailboxes and see if it truly was incomplete.
Those hangs are a little bit annoying to debug, and the whole code has been rewritten for v2.2 already in a way that should make the hangs pretty much impossible. Annoyingly v2.2 isn't ready yet..
I have found a manual workaround. I use rsync to get the files over to the backup machines, then I let the backup script keep things up to date. It is not the best way to go, but at least I have backups. I suppose I can check the log and continue to rsync things over until 2.2 comes out.
...Jeff
On Tue, 2012-08-14 at 23:23 +0300, Timo Sirainen wrote:
On 11.8.2012, at 0.54, Jeff Gustafson wrote:
What can I do to help the developers locate the bug?
Those hangs are a little bit annoying to debug, and the whole code has been rewritten for v2.2 already in a way that should make the hangs pretty much impossible. Annoyingly v2.2 isn't ready yet..
I have an issue related to this problem. dsync returns an error 75 when it detects the source mailbox is empty (client probably pop3'd all of their email). It also returns an error 75 when I get the timeout error. For now I am parsing the error to find out which is which and act accordingly. It would be much nicer if dsync returned a different error code for empty source mailboxes.
...Jeff
On 15.8.2012, at 5.41, Jeff Gustafson wrote:
I have an issue related to this problem. dsync returns an error 75 when it detects the source mailbox is empty (client probably pop3'd all of their email). It also returns an error 75 when I get the timeout error.
You mean this?
dsync-local(tss): Fatal: dsync backup: Looks like you're trying to run backup in wrong direction. Source is empty and destination is not.
Maybe it needs a force setting, or change the detection somehow..
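Until dsync distinguishes the two cases itself, a backup script can tell them apart by inspecting the error text, as Jeff describes. The following is a minimal sketch, assuming the fatal message quoted above reaches stderr and keeps its wording across versions; the function name and the labels it prints are invented for illustration. Exit code 75 corresponds to EX_TEMPFAIL in sysexits.h.

```shell
#!/bin/sh
# classify_dsync_result EXIT_CODE STDERR_TEXT
# Hypothetical helper: maps a dsync exit code plus its captured stderr
# to one of: ok, empty-source, temporary-failure, error.
classify_dsync_result() {
    code=$1
    stderr_text=$2
    if [ "$code" -eq 0 ]; then
        echo ok
    elif [ "$code" -eq 75 ]; then
        # Exit 75 covers both cases, so fall back to matching the message.
        case $stderr_text in
            *"Source is empty and destination is not"*) echo empty-source ;;
            *) echo temporary-failure ;;
        esac
    else
        echo error
    fi
}

# Assumed usage inside a backup loop (not run here):
#   err=$(dsync -u "$user" backup ssh vmail@bk05 dsync -u "$user" 2>&1 >/dev/null)
#   result=$(classify_dsync_result "$?" "$err")
```

Matching on message text is fragile, so the pattern should be rechecked after every Dovecot upgrade.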
On Wed, 2012-08-15 at 12:27 +0300, Timo Sirainen wrote:
On 15.8.2012, at 5.41, Jeff Gustafson wrote:
I have an issue related to this problem. dsync returns an error 75 when it detects the source mailbox is empty (client probably pop3'd all of their email). It also returns an error 75 when I get the timeout error.
You mean this?
dsync-local(tss): Fatal: dsync backup: Looks like you're trying to run backup in wrong direction. Source is empty and destination is not.
That's the one!
Maybe it needs a force setting, or change the detection somehow..
That would be nice. A backup script is executing the command, so it should never do it in the wrong direction. A force setting might be the simplest way to go.
...Jeff
On 15.8.2012, at 22.23, Jeff Gustafson wrote:
dsync-local(tss): Fatal: dsync backup: Looks like you're trying to run backup in wrong direction. Source is empty and destination is not.
That's the one!
Maybe it needs a force setting, or change the detection somehow..
That would be nice. A backup script is executing the command, so it should never do it in the wrong direction. A force setting might be the simplest way to go.
This should do it: http://hg.dovecot.org/dovecot-2.1/rev/5f280c1ec9fd
participants (3)
- Daniel Parthey
- Jeff Gustafson
- Timo Sirainen