Auth process sometimes stop responding after upgrade

Simone Lazzaris s.lazzaris at interactive.eu
Sat Sep 8 15:26:24 EEST 2018


In data venerdì 7 settembre 2018 19:11:24 CEST, Timo Sirainen ha scritto:
> On 7 Sep 2018, at 19.43, Timo Sirainen <tss at iki.fi> wrote:
> > On 7 Sep 2018, at 16.50, Simone Lazzaris <s.lazzaris at interactive.eu 
<mailto:s.lazzaris at interactive.eu>> wrote:
> >> Some more information: the issue has just occurred, again on an instance
> >> without the "service_count = 0" configuration directive on pop3-login.
> >> 
> >> I've observed that while the issue is occurring, the director process
> >> goes 100% CPU. I've straced the process. It is seemingly looping:
> >> 
> >> ...
> >> ...
> >> epoll_ctl(13, EPOLL_CTL_ADD, 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP,
> >> {u32=149035320, u64=149035320}}) = 0 epoll_ctl(13, EPOLL_CTL_DEL, 78,
> >> {0, {u32=149035320, u64=149035320}}) = 0 epoll_ctl(13, EPOLL_CTL_ADD,
> >> 78, {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320,
> >> u64=149035320}}) = 0 epoll_ctl(13, EPOLL_CTL_DEL, 78, {0,
> >> {u32=149035320, u64=149035320}}) = 0 epoll_ctl(13, EPOLL_CTL_ADD, 78,
> >> {EPOLLIN|EPOLLPRI|EPOLLERR|EPOLLHUP, {u32=149035320, u64=149035320}}) =
> >> 0 epoll_ctl(13, EPOLL_CTL_DEL, 78, {0, {u32=149035320, u64=149035320}})
> >> = 0> 
> > Nothing else but these epoll_ctl() calls? So it's gone to some loop where
> > it keeps calling io_add() and io_remove().
> I'm guessing it's because of doveadm command handling issues, since there's
> some weirdness in the code. Although I couldn't figure out exactly why it
> would go to infinite loop there. But attached a patch that may fix it, if
> you're able to test. We haven't noticed such infinite looping in other
> installations or automated director stresstests though..
> >> FD 13 is "anon_inode:[eventpoll]"
> > 
> > What about fd 78? I guess some socket.
> > 
> > Could you also try two more things when it happens again:
> > 
> > ltrace -tt -e '*' -o ltrace.log -p <pid>
> > (My guess this isn't going to be very useful, but just in case it might
> > be..)
> > 
> > gdb -p <pid>
> > bt full
> > quit
> > 
> > Preferably install dovecot-dbg package also so the gdb backtrace output
> > will be better.
> These would still be useful to verify whether I'm even on the right track.

Timo, unfortunately the patch doesn't compile;  I've moved the declaration of 
"conn" one line up to make it work. 

In other words, the function becomes like this:
void doveadm_connections_ring_synced(void)
{
        struct doveadm_connection *conn = 
                        doveadm_ring_sync_pending_connections;
        while (doveadm_ring_sync_pending_connections != NULL &&
               conn->dir->ring_synced) {
                doveadm_connection_ring_sync_callback_t *callback =
                        conn->ring_sync_callback;

                conn->ring_sync_callback = NULL;
                doveadm_connection_ring_sync_list_move(conn);
                doveadm_connection_set_io(conn);
                io_set_pending(conn->io);
                callback(conn);
        }
        if (conn->dir->ring_synced)
                doveadm_connections_continue_reset_cmds();
}

I've tried the patched version but without luck: the director keeps crashing:

Sep  8 14:10:56 imapfront2 dovecot: director: Connecting to 
212.183.164.163:9090 (as 212.183.164.162): Initial connection
Sep  8 14:10:56 imapfront2 dovecot: director: Incoming connection from 
director 212.183.164.161/in
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.163:9090/
out): Handshake finished in 0.002 secs (bytes in=61, bytes out=473, 0+0 USERs 
received, last input 0.000 s ago, last output 0.000 s ago, connected 0.002 s 
ago, 207 peak output buffer size, 0.000 CPU secs since connected)
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.142 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.143 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.144 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.145 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.216 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.217 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.218 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.219 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Handshake finished in 0.005 secs (bytes in=67951, bytes out=56, 1788+0 USERs 
received, last input 0.000 s ago, last output 0.005 s ago, connected 0.005 s 
ago, 0 peak output buffer size, 0.004 CPU secs since connected)
Sep  8 14:10:56 imapfront2 dovecot: pop3-login: Warning: Auth connection 
closed with 1 pending requests (max 0 secs, pid=22703, EOF)
Sep  8 14:10:56 imapfront2 dovecot: auth: Warning: auth client 22703 
disconnected with 1 pending requests: EOF
Sep  8 14:10:56 imapfront2 dovecot: director: Fatal: master: 
service(director): child 23037 killed with signal 11 (core dumps disabled)
Sep  8 14:10:56 imapfront2 dovecot: pop3-login: Disconnected (auth process 
communication failure): user=<>, method=DIGEST-MD5, rip=95.253.200.113, 
lip=80.76.81.122, session=<CLkuBFt1icNf/chx>
Sep  8 14:10:56 imapfront2 dovecot: pop3-login: proxy(paola at cosmocarta.it): 
started proxying to 192.168.1.142:110: user=<paola at cosmocarta.it>, 
method=PLAIN, rip=85.32.56.86, lip=80.76.81.122, session=<KbkuBFt1NXVVIDhW>
Sep  8 14:10:56 imapfront2 dovecot: director: Connecting to 
212.183.164.163:9090 (as 212.183.164.162): Initial connection
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.163:9090/
out): Handshake finished in 0.000 secs (bytes in=61, bytes out=383, 0+0 USERs 
received, last input 0.000 s ago, last output 0.000 s ago, connected 0.000 s 
ago, 207 peak output buffer size, 0.000 CPU secs since connected)
Sep  8 14:10:56 imapfront2 dovecot: director: Incoming connection from 
director 212.183.164.161/in
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.142 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.143 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.144 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.145 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.216 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.217 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.218 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Host 192.168.1.219 vhost count changed from 100 to 100
Sep  8 14:10:56 imapfront2 dovecot: director: director(212.183.164.161/in): 
Handshake finished in 0.005 secs (bytes in=67921, bytes out=56, 1788+0 USERs 
received, last input 0.000 s ago, last output 0.005 s ago, connected 0.005 s 
ago, 0 peak output buffer size, 0.004 CPU secs since connected)

I'm thinking to downgrade to version 2.2.30.2, which I use on another cluster 
without issues.


-- 
Simone Lazzaris
Qcom SpA




More information about the dovecot mailing list