[Dovecot] dsync timeout?
I'm using dsync for a regular backup. The backup system flocks so that two backups cannot run at the same time, which is generally a good thing. The problem is that dsync sometimes seems to go off into the weeds and never come back, leaving a process running and doing nothing forever, hogging the lock and causing my backups never to run again. I just finally figured out that this process was what was keeping the backups from running on this system:
root 17836 0.0 0.0 40888 1600 ? S 2012 0:00 ssh -i /root/.ssh/backmaildir_id_rsa backmaildir@arg /usr/bin/dsync -u foobar server
yeah, that has been running since 2012 :(
root:/tmp# strace -p 17836
Process 17836 attached - interrupt to quit
select(8, [4], [], NULL, NULL
very exciting...
There doesn't seem to be a timeout in dsync, but perhaps there should be? At this point my only option is to write a cron job that looks for dsync processes older than a certain age and kills them. After doing that I will need to take a shower, because that is a very dirty solution :P
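Something like this is what I have in mind (untested sketch; the one-hour cutoff and the pattern it matches are just guesses, and ps's etimes column needs a reasonably recent procps):

#!/bin/sh
# kill any leftover dsync ssh helpers that have been running for over an hour
# (etimes = elapsed time in seconds; the bracketed [d] keeps the pipeline from matching itself)
ps -eo pid,etimes,args \
  | awk '/[d]sync -u .* server/ && $2 > 3600 { print $1 }' \
  | xargs -r kill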
thanks for any ideas, or help! micah
On 31.1.2013, at 0.06, Micah Anderson micah@riseup.net wrote:
I'm using dsync for a regular backup. The backup system flocks so that two backups cannot run at the same time, which is generally a good thing. The problem is that dsync sometimes seems to go off into the weeds and never come back, leaving a process running and doing nothing forever, hogging the lock and causing my backups never to run again. I just finally figured out that this process was what was keeping the backups from running on this system:
root 17836 0.0 0.0 40888 1600 ? S 2012 0:00 ssh -i /root/.ssh/backmaildir_id_rsa backmaildir@arg /usr/bin/dsync -u foobar server
yeah, that has been running since 2012 :(
So that's the ssh process. What about the dsync process that started it? Does/did it exist?
There doesn't seem to be a timeout in dsync, but perhaps there should be? At this point my only option is to write a cron job that looks for dsync processes older than a certain age and kills them. After doing that I will need to take a shower, because that is a very dirty solution :P
There is a 15-minute timeout in dsync after which it stops itself. Normally the child process should also die. v2.2 will now make sure that the child process dies: http://hg.dovecot.org/dovecot-2.2/rev/070ca24e5846
Timo Sirainen tss@iki.fi writes:
On 31.1.2013, at 0.06, Micah Anderson micah@riseup.net wrote:
I'm using dsync for a regular backup. The backup system flocks so that two backups cannot run at the same time, which is generally a good thing. The problem is that dsync sometimes seems to go off into the weeds and never come back, leaving a process running and doing nothing forever, hogging the lock and causing my backups never to run again. I just finally figured out that this process was what was keeping the backups from running on this system:
root 17836 0.0 0.0 40888 1600 ? S 2012 0:00 ssh -i /root/.ssh/backmaildir_id_rsa backmaildir@arg /usr/bin/dsync -u foobar server
yeah, that has been running since 2012 :(
So that's the ssh process. What about the dsync process that started it? Does/did it exist?
It seems that only the above process was still around, and no other dsync processes. I have three machines where this is happening.
I wonder if there is an ssh configuration option I could set to make these die off.
There doesn't seem to be a timeout in dsync, but perhaps there should be? At this point my only option is to write a cron job that looks for dsync processes older than a certain age and kills them. After doing that I will need to take a shower, because that is a very dirty solution :P
There is a 15-minute timeout in dsync after which it stops itself. Normally the child process should also die. v2.2 will now make sure that the child process dies: http://hg.dovecot.org/dovecot-2.2/rev/070ca24e5846
Interesting... I wonder why the child is not dying off properly; maybe the wrong signal is being sent?
looking forward to using 2.2! micah
On Jan 30, 2013, at 3:46 PM, micah anderson micah@riseup.net wrote:
It seems that only the above process was still around, and no other dsync processes. I have three machines where this is happening.
I wonder if there is an ssh configuration option I could set to make these die off.
If the ssh process isn't sending anything, and just waiting for read()s, and keepalives are turned off, the SSH session might never know the remote side is long gone. . .
If any data were transmitted, it would discover the remote side is turned off.
See man ssh_config and the option TCPKeepAlive.
BTW: Since it's not on the command line, it's likely in /etc/ssh_config or /etc/ssh/ssh_config. Or ~/.ssh/config.
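Something like this in the client's ~/.ssh/config would turn on application-level keepalives (host alias and numbers are only an example; tune to taste):

Host arg
    # probe the server after 60s of silence, give up after 3 missed replies (~3 minutes)
    ServerAliveInterval 60
    ServerAliveCountMax 3
    TCPKeepAlive yes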
Sean
Sean Kamath kamath@moltingpenguin.com writes:
On Jan 30, 2013, at 3:46 PM, micah anderson micah@riseup.net wrote:
It seems that only the above process was still around, and no other dsync processes. I have three machines where this is happening.
I wonder if there is an ssh configuration option I could set to make these die off.
If the ssh process isn't sending anything, and just waiting for read()s, and keepalives are turned off, the SSH session might never know the remote side is long gone. . .
If any data were transmitted, it would discover the remote side is turned off.
See man ssh_config and the option TCPKeepAlive.
BTW: Since it's not on the command line, it's likely in /etc/ssh_config or /etc/ssh/ssh_config. Or ~/.ssh/config.
In /etc/ssh/sshd_config on the server I'm sending to, TCPKeepAlive yes is set.
The default on this system, according to the man page, seems to be to have TCPKeepAlive set.
Perhaps I should set ServerAliveInterval?
micah
On Feb 1, 2013, at 8:09 AM, micah anderson micah@riseup.net wrote:
Sean Kamath kamath@moltingpenguin.com writes:
On Jan 30, 2013, at 3:46 PM, micah anderson micah@riseup.net wrote:
It seems that only the above process was still around, and no other dsync processes. I have three machines where this is happening.
I wonder if there is an ssh configuration option I could set to make these die off.
If the ssh process isn't sending anything, and just waiting for read()s, and keepalives are turned off, the SSH session might never know the remote side is long gone. . .
If any data were transmitted, it would discover the remote side is turned off.
See man ssh_config and the option TCPKeepAlive.
BTW: Since it's not on the command line, it's likely in /etc/ssh_config or /etc/ssh/ssh_config. Or ~/.ssh/config.
In /etc/ssh/sshd_config on the server I'm sending to, TCPKeepAlive yes is set.
Did you check ~/.ssh/config for the user running the dsync?
The default on this system, according to the man page, seems to be to have TCPKeepAlive set.
Perhaps I should set ServerAliveInterval?
Perhaps. That sets how long ssh waits, with no data arriving from the server, before it sends an application-level keepalive probe.
There are many client-side (ssh_config) settings that can affect this, including ServerAliveCountMax, ServerAliveInterval, and TCPKeepAlive. There are also the corresponding sshd_config settings: ClientAliveCountMax, ClientAliveInterval, and TCPKeepAlive. (A rough example of the server-side settings follows below.)
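Roughly, the server-side knobs live in sshd_config on the machine you ssh into (the values here are only illustrative):

# /etc/ssh/sshd_config on the remote/backup host
# probe a silent client every 60s and drop the connection after 3 missed replies
ClientAliveInterval 60
ClientAliveCountMax 3
TCPKeepAlive yes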
At this point, I think you need to see what's happening on both sides of the SSH connection. I don't recall what system you're on, but on Linux you can use netstat -anp (as root) to find out which process is connected to which port, and on Linux and other systems you can use lsof to find out what is connected to which ports.
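For example (the grep patterns are just illustrations):

# Linux, as root: map sockets to the processes that own them
netstat -anp | grep ssh
# lsof works on Linux and most other systems: show established TCP connections for ssh/dsync
lsof -nP -iTCP -sTCP:ESTABLISHED | egrep 'ssh|dsync'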
Maybe the TCP port is open and valid and there's just no data coming through? This can happen if, for example, you have port forwarding or X session forwarding through SSH (i.e., if ssh -X is the default) and something is accidentally holding that port open. It can happen in a regular shell, too: if you open an X application and forget about it because you backgrounded it, your logout from the server will hang until the X applications are closed. Note that it isn't always a visible client that does this. :-(
Sean
Sean Kamath kamath@moltingpenguin.com writes:
On Jan 30, 2013, at 3:46 PM, micah anderson micah@riseup.net wrote:
It seems that only the above process was still around, and no other dsync processes. I have three machines where this is happening.
I wonder if there is an ssh configuration option I could set to make these die off.
If the ssh process isn't sending anything, and just waiting for read()s, and keepalives are turned off, the SSH session might never know the remote side is long gone. . .
This time I managed to capture a process that was stuck and look at it from the server side, and the client side:
on the server:
2000 19470 0.0 0.0 7512 3816 ? Ss Feb05 0:01 /usr/bin/dsync dsync-server -E -u foo

# strace -s 1024 -F -p 19470
Process 19470 attached - interrupt to quit
write(2, "dsync-remote(foo): Error: mdbox /srv/maildirbackups/foo/daily.1/storage: Duplicate GUID 96860517f68aa94f8b51000097f19f0b in m.41:682501 and m.37:653225\n", 167
on the client:
root 19001 0.0 0.0 41308 1600 ? S Feb05 0:00 ssh -i /root/.ssh/backmaildir_id_rsa backmaildir@hoopoe-pn /usr/bin/dsync -u foo server
# strace -s 1024 -F -p 19001
Process 19001 attached - interrupt to quit
select(8, [4], [], NULL, NULL
interestingly, now that I've been watching this more, the same users keep getting wedged.
When I attempt to do a dsync of that user by hand, I get this:
dsync-local(foo): Error: Unexpected reply from server: 13 d2a100118c45d24f760f000097f19f0b 3561 128 \Recent 1353980259
I tried one of the other users that was stuck, and it gave me:
dsync-remote(bar): Error: Corrupted dbox file /srv/maildirbackups/bar/daily.1/storage/m.130 (around offset=22532): msg header has bad magic value
This looks like there is something corrupted in the dbox for the user on the client side. Is there something I can do to repair those?
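(I'm guessing that something like doveadm force-resync run against the backup copy might rescan and rebuild the broken mdbox storage — the mail_location below is only my assumption based on the paths in the errors above, so treat it as a sketch rather than a recipe:)

# run on the backup host; point doveadm at the damaged backup copy and rescan/rebuild it
doveadm -o mail_location=mdbox:/srv/maildirbackups/bar/daily.1 force-resync -u bar '*'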
If any data were transmitted, it would discover the remote side is turned off.
One thing I am doing is using an ssh ControlMaster socket, and if I kill the process on the client side, the server-side process also dies.
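For reference, the ControlMaster bits in my ssh config look roughly like this (host and values are only an example; ControlPersist is just an idea for putting an upper bound on how long an idle master socket lingers):

Host hoopoe-pn
    ControlMaster auto
    ControlPath ~/.ssh/ctl-%r@%h:%p
    # close an idle master connection after 10 minutes instead of keeping it around forever
    ControlPersist 10m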
micah