[Dovecot] dsync timeout?
I'm using dsync for a regular backup. The backup system uses flock so that two runs cannot happen at the same time, which is generally a good thing. The problem is that dsync sometimes seems to go off into the weeds and never come back, leaving a process running and doing nothing forever, hogging the lock and causing my backups never to run again. I finally figured out that what was keeping the backups from running on this system was this process:
root 17836 0.0 0.0 40888 1600 ? S 2012 0:00 ssh -i /root/.ssh/backmaildir_id_rsa backmaildir@arg /usr/bin/dsync -u foobar server
yeah, that has been running since 2012 :(
root:/tmp# strace -p 17836
Process 17836 attached - interrupt to quit
select(8, [4], [], NULL, NULL
very exciting...
There doesn't seem to be a timeout in dsync, but perhaps there should be? At this point my only option is to write a cronjob that looks for dsync processes older than a certain age and kills them. After I do that I will need to take a shower, because that is a very dirty solution :P
thanks for any ideas, or help! micah
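For reference, the cronjob idea described above could be sketched roughly as follows. This is only an illustration, not something from the thread: the helper names (`parse_ps`, `stale_processes`, `list_stale`) and the six-hour limit are made up, and it assumes a Linux `ps` from procps-ng that supports the `etimes` (elapsed seconds) column. It only reports candidates; killing them is left to the caller.

```python
import subprocess

def parse_ps(output):
    """Parse lines of 'PID ELAPSED_SECONDS COMMAND...' into (pid, seconds, command) tuples."""
    rows = []
    for line in output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # skip blank or incomplete lines
        pid, elapsed, command = parts
        rows.append((int(pid), int(elapsed), command))
    return rows

def stale_processes(rows, needle, max_age_seconds):
    """Return the rows whose command contains `needle` and that are older than the limit."""
    return [row for row in rows if needle in row[2] and row[1] > max_age_seconds]

def list_stale(needle="dsync", max_age_seconds=6 * 3600):
    """Ask ps for every process and return the stale ones (hypothetical defaults)."""
    # Assumption: procps-ng ps; 'etimes' is the elapsed time in seconds,
    # and an empty header after '=' suppresses the header line.
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etimes=", "-o", "args="],
        capture_output=True, text=True, check=True,
    )
    return stale_processes(parse_ps(result.stdout), needle, max_age_seconds)
```

Matching on the substring "dsync" also catches the ssh transport shown in the ps output above, since its command line contains /usr/bin/dsync.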
On 31.1.2013, at 0.06, Micah Anderson <micah@riseup.net> wrote:
I'm using dsync for a regular backup. The backup system uses flock so that two runs cannot happen at the same time, which is generally a good thing. The problem is that dsync sometimes seems to go off into the weeds and never come back, leaving a process running and doing nothing forever, hogging the lock and causing my backups never to run again. I finally figured out that what was keeping the backups from running on this system was this process:
root 17836 0.0 0.0 40888 1600 ? S 2012 0:00 ssh -i /root/.ssh/backmaildir_id_rsa backmaildir@arg /usr/bin/dsync -u foobar server
yeah, that has been running since 2012 :(
So that's the ssh process. What about the dsync process that started it? Does/did it exist?
There doesn't seem to be a timeout in dsync, but perhaps there should be? At this point my only option is to write a cronjob that looks for dsync processes older than a certain age and kills them. After I do that I will need to take a shower, because that is a very dirty solution :P
There is a 15 minute timeout in dsync after which it stops itself. Normally the child process should also die. v2.2 now makes sure that the child process dies: http://hg.dovecot.org/dovecot-2.2/rev/070ca24e5846
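Until that fix is available, a backup wrapper can enforce its own deadline and take the child down with it. The sketch below is not how dsync does it internally; it is a generic pattern, and the function name `run_with_deadline` and the 5 second grace period are invented for illustration. It assumes a POSIX system.

```python
import os
import signal
import subprocess

def run_with_deadline(argv, deadline_seconds):
    """Run argv; if it exceeds the deadline, signal its whole process group.

    Returns the exit code, or None if the deadline was hit.
    """
    # start_new_session=True gives the child its own session and process group,
    # so helpers it spawns (such as an ssh transport) are signalled with it.
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=deadline_seconds)
    except subprocess.TimeoutExpired:
        # The child leads the new group, so its pid is also the group id.
        os.killpg(proc.pid, signal.SIGTERM)
        try:
            proc.wait(timeout=5)  # hypothetical grace period
        except subprocess.TimeoutExpired:
            os.killpg(proc.pid, signal.SIGKILL)
            proc.wait()
        return None
```

A process that moves itself into a different session would escape this, so it is a safety net rather than a guarantee.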
Timo Sirainen <tss@iki.fi> writes:
On 31.1.2013, at 0.06, Micah Anderson <micah@riseup.net> wrote:
I'm using dsync for a regular backup. The backup system uses flock so that two runs cannot happen at the same time, which is generally a good thing. The problem is that dsync sometimes seems to go off into the weeds and never come back, leaving a process running and doing nothing forever, hogging the lock and causing my backups never to run again. I finally figured out that what was keeping the backups from running on this system was this process:
root 17836 0.0 0.0 40888 1600 ? S 2012 0:00 ssh -i /root/.ssh/backmaildir_id_rsa backmaildir@arg /usr/bin/dsync -u foobar server
yeah, that has been running since 2012 :(
So that's the ssh process. What about the dsync process that started it? Does/did it exist?
Seems that only the above process was still around and no other dsync processes. I have three machines that all have this happening it seems.
I wonder if there is a ssh configuration option I could set to make these die off.
There doesn't seem to be a timeout in dsync, but perhaps there should be? At this point my only option is to write a cronjob that looks for dsync processes older than a certain age and kills them. After I do that I will need to take a shower, because that is a very dirty solution :P
There is a 15 minute timeout in dsync after which it stops itself. Normally the child process should also die. v2.2 now makes sure that the child process dies: http://hg.dovecot.org/dovecot-2.2/rev/070ca24e5846
Interesting... I wonder why the child is not dying off properly, maybe the wrong signal is sent?
looking forward to using 2.2! micah
--
On Jan 30, 2013, at 3:46 PM, micah anderson <micah@riseup.net> wrote:
Seems that only the above process was still around and no other dsync processes. I have three machines that all have this happening it seems.
I wonder if there is a ssh configuration option I could set to make these die off.
If the ssh process isn't sending anything, and just waiting for read()s, and keepalives are turned off, the SSH session might never know the remote side is long gone. . .
If any data were transmitted, it would discover the remote side is turned off.
See man ssh_config and the option TCPKeepAlive.
BTW: Since it's not on the command line, it's likely in /etc/ssh_config or /etc/ssh/ssh_config. Or ~/.ssh/config.
Sean
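To illustrate the point about waiting on reads: the strace output earlier in the thread shows select() on one descriptor with a NULL timeout, which means "block until something arrives". A small sketch of the difference a timeout makes, using a local socket pair as a stand-in for the two ends of the connection (the helper name `wait_readable` is made up for this example):

```python
import select
import socket

def wait_readable(sock, timeout=None):
    """Return True if sock becomes readable, False if the timeout expires first.

    timeout=None corresponds to the NULL timeout in the strace output:
    the call blocks for as long as the peer stays silent.
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    return bool(readable)

# A connected pair of sockets stands in for the client and server ends.
local, remote = socket.socketpair()
print(wait_readable(local, timeout=0.2))  # False: the peer is silent
remote.sendall(b"x")
print(wait_readable(local, timeout=0.2))  # True: data is pending
```

With no timeout and no keepalive traffic, nothing ever wakes the first call up if the other end simply vanishes.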
Sean Kamath <kamath@moltingpenguin.com> writes:
On Jan 30, 2013, at 3:46 PM, micah anderson <micah@riseup.net> wrote:
Seems that only the above process was still around and no other dsync processes. I have three machines that all have this happening it seems.
I wonder if there is a ssh configuration option I could set to make these die off.
If the ssh process isn't sending anything, and just waiting for read()s, and keepalives are turned off, the SSH session might never know the remote side is long gone. . .
If any data were transmitted, it would discover the remote side is turned off.
See man ssh_config and the option TCPKeepAlive.
BTW: Since it's not on the command line, it's likely in /etc/ssh_config or /etc/ssh/ssh_config. Or ~/.ssh/config.
In /etc/ssh/sshd_config on the server I'm sending to, TCPKeepAlive yes is set.
The default on this system, according to the man page, seems to be to have TCPKeepAlive set.
Perhaps I should set ServerAliveInterval?
micah
On Feb 1, 2013, at 8:09 AM, micah anderson <micah@riseup.net> wrote:
In /etc/ssh/sshd_config on the server I'm sending to, TCPKeepAlive yes is set.
Did you check ~/.ssh/config for the user running the dsync?
The default on this system, according to the man page, seems to be to have TCPKeepAlive set.
Perhaps I should set ServerAliveInterval?
Perhaps. That sets how often ssh sends a keepalive message to the server when nothing has been received from it.
There are many settings that can affect this, including the ssh_config settings
ServerAliveCountMax
ServerAliveInterval
TCPKeepAlive
There are also the sshd_config settings
ClientAliveCountMax
ClientAliveInterval
TCPKeepAlive
At this point, I think you need to see what's happening on both sides of the SSH connection. I don't recall what system you're on, but on Linux you can use netstat -anp (as root) to find out which process is connected to which port, and on Linux and other systems you can use lsof to find out what is connected to ports.
Maybe the TCP port is open and valid and there's no data coming through? This can happen if, for example, you have any port forwarding or X session forwarding through SSH (i.e., if ssh -X is the default) and something is accidentally holding that port open. This can happen in your regular shell too: if you start an X application and forget about it because you backgrounded it, your logout from the server will hang until the X applications are closed. Note that it isn't always a visible client that will do this. :-(
Sean
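As a rough sketch of how the client-side settings above could be applied per invocation rather than in a config file, the backup script could pass them with ssh's -o option. The function name `ssh_argv` and the values 30 and 3 are only examples, not recommendations from the thread; with those values ssh should give up after roughly 90 seconds of an unresponsive server.

```python
def ssh_argv(host, remote_command, identity_file=None,
             alive_interval=30, alive_count_max=3):
    """Build an ssh command line with application-level keepalives enabled.

    ServerAliveInterval: seconds of silence before ssh probes the server.
    ServerAliveCountMax: unanswered probes tolerated before ssh disconnects.
    """
    argv = [
        "ssh",
        "-o", f"ServerAliveInterval={alive_interval}",
        "-o", f"ServerAliveCountMax={alive_count_max}",
    ]
    if identity_file is not None:
        argv += ["-i", identity_file]
    argv.append(host)
    argv += list(remote_command)
    return argv

# Example mirroring the command seen in the ps output earlier in the thread.
print(ssh_argv("backmaildir@arg",
               ["/usr/bin/dsync", "-u", "foobar", "server"],
               identity_file="/root/.ssh/backmaildir_id_rsa"))
```

The same two options can instead go into ~/.ssh/config or the system-wide ssh_config for the user running the backup.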
Sean Kamath <kamath@moltingpenguin.com> writes:
If the ssh process isn't sending anything, and just waiting for read()s, and keepalives are turned off, the SSH session might never know the remote side is long gone. . .
This time I managed to capture a process that was stuck and look at it from the server side, and the client side:
on the server:
2000 19470 0.0 0.0 7512 3816 ? Ss Feb05 0:01 /usr/bin/dsync dsync-server -E -u foo
# strace -s 1024 -F -p 19470
Process 19470 attached - interrupt to quit
write(2, "dsync-remote(foo): Error: mdbox /srv/maildirbackups/foo/daily.1/storage: Duplicate GUID 96860517f68aa94f8b51000097f19f0b in m.41:682501 and m.37:653225\n", 167
on the client:
root 19001 0.0 0.0 41308 1600 ? S Feb05 0:00 ssh -i /root/.ssh/backmaildir_id_rsa backmaildir@hoopoe-pn /usr/bin/dsync -u foo server
# strace -s 1024 -F -p 19001
Process 19001 attached - interrupt to quit
select(8, [4], [], NULL, NULL
interestingly, now that I've been watching this more, the same users keep getting wedged.
When I attempt to do a dsync of that user by hand, I get this:
dsync-local(foo): Error: Unexpected reply from server: 13 d2a100118c45d24f760f000097f19f0b 3561 128 \Recent 1353980259
I tried one of the other users that was stuck, and it gave me:
dsync-remote(bar): Error: Corrupted dbox file /srv/maildirbackups/bar/daily.1/storage/m.130 (around offset=22532): msg header has bad magic value
This looks like something is corrupted in the dbox for the user on the client side. Is there something I can do to repair those?
If any data were transmitted, it would discover the remote side is turned off.
One thing I am doing is using a ssh controlmaster socket, and if I kill the process on the client's side, the server side process also dies.
micah
participants (4): micah anderson, Micah Anderson, Sean Kamath, Timo Sirainen