On 20 Apr 2015, at 19:10, Dennis Kuhn <d.kuhn@heinlein-support.de> wrote:
We have some replication issues. From time to time a doveadm-server process takes 100% CPU in the state recv_mailbox_tree_deletes on the replica. The process runs forever until it is manually killed. Running strace on this process doesn't show anything. Sometimes we have several doveadm-server processes in this state, all for the same account, all at 100% CPU load.
It's some bug, but there needs to be a way to reproduce it; otherwise it's pretty much impossible to find what the bug is and get it fixed.
My workaround is to delete the user directory on the replica so that the whole account is replicated again. This solves the problem for this specific account.
So killing the doveadm-server process will cause it to hang again for the same user? That's good, since it means it can be reproduced by taking a copy of the mailboxes and trying to run "doveadm sync" manually on them locally, e.g.:
doveadm -D -o mail=mdbox:/tmp/mdbox1 sync mdbox:/tmp/mdbox2
Does that hang? If yes, we can get further with it. The -D parameter is also helpful here: v2.2.16 produces much more useful debug logging for dsync, which can also help catch these kinds of hangs. Even if you can't reproduce the hang the above way, setting mail_debug=yes for dsync and getting the debug logs from a hanging session would be useful. (But be aware that a hang might then start flooding your logs with debug messages and eat up all the disk space.)
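If you want to run the manual sync test above unattended, something like the following rough sketch could wrap it in a time limit so a hang is reported instead of spinning forever. The run_with_limit helper and the 300 second limit are only illustrations, not anything Dovecot provides, and it assumes GNU coreutils timeout, which exits with status 124 when the limit is reached. The /tmp/mdbox1 and /tmp/mdbox2 paths are the copies from the command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Dovecot): run a command under a time
# limit and report whether it finished or was still running at the limit.
# Assumes GNU coreutils `timeout`, which exits with 124 on timeout.
run_with_limit() {
    limit="$1"
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "hang: still running after ${limit}s"
    else
        echo "finished: exit status $status"
    fi
}

# Example use against copies of the two mailboxes (limit chosen arbitrarily):
# run_with_limit 300 doveadm -D -o mail=mdbox:/tmp/mdbox1 sync mdbox:/tmp/mdbox2
```

A normal sync of a large account can also take a while, so pick the limit generously; a "hang" report only means the process was still running at that point, and the time limit also caps how much debug output a real hang can write.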