[Dovecot] imap-login hanging when firewall blocks ssl handshaking
imap-login processes are hanging (using 100% of CPU) when connected from a client that is partially blocked by a firewall. It appears that imap-login is stuck in a loop trying to complete an ssl handshake. imap-login is working fine for other clients not blocked by the firewall (including localhost).
This is dovecot 2.1.10 under Mac OS X 10.8.2 (compiled from sources); the firewall is Little Snitch 3.0.1 blocking port 993, which appears to let the connection initiate but then squashes and disconnects the socket during ssl handshaking.
gdb backtrace and Activity Monitor's "Sample Process" show that imap-login is stuck calling ioloop-kqueue's io_loop_handler_run -> io_loop_call_io -> ssl_step repeatedly; dtruss shows that it is repeatedly making system calls to kevent and read, the latter returning -1 with errno 57=ENOTCONN="Socket is not connected". (I also tried ./configure --with-ioloop=poll and --with-iopoll=select instead of the default best = kqueue but the results were the same; --with-iopoll=epoll didn't work because epoll is not available on this machine.) The client, initiated by the command "openssl s_client -connect SERVER:993", first responds "CONNECTED(00000003)" but then immediately the error "60278:error:140790E5:SSL routines:SSL23_WRITE:ssl handshake failure:/SourceCache/OpenSSL098/OpenSSL098-44/src/ssl/s23_lib.c:182:". The infinite loop is in src/lib/ioloop.c in the function "io_loop_run" where the statement "while (ioloop->running) io_loop_handler_run(ioloop)" is executed.
While I think the firewall is not doing the right thing (i.e., it lets a connection start but then kills it), it seems that imap-login should fail gracefully when the socket is no longer connected.
Ideas? Thanks.
/usr/local/sbin/dovecot -n
# 2.1.10: /usr/local/etc/dovecot/dovecot.conf
# OS: Darwin 12.2.0 x86_64
default_internal_user = _dovecot
default_login_user = _dovenull
namespace {
hidden = no
list = yes
location = mbox:/Users/%u/Documents/Mailboxes
prefix = mbox/
separator = /
type = private
}
namespace {
inbox = yes
location = maildir:/Users/%u/Documents/Maildir
mailbox Drafts {
special_use = \Drafts
}
mailbox Junk {
special_use = \Junk
}
mailbox Sent {
special_use = \Sent
}
mailbox Trash {
special_use = \Trash
}
prefix =
separator = /
type = private
}
passdb {
driver = pam
}
protocols = imap lmtp
ssl = required
ssl_cert =
gdb's "bt full" gives various results depending on where in the loop, but they are they same from ssl23_accept on down.
(gdb) bt full #0 0x00007fff92d19f63 in sha1_block_data_order () #1 0x00007fff92d1a4ae in SHA1_Final () #2 0x00007fff92d45430 in EVP_DigestFinal_ex () #3 0x00007fff92d17d7a in ssleay_rand_add () #4 0x00007fff974529be in ssl23_accept () #5 0x000000010cac0729 in ssl_handshake [inlined] () at /private/tmp/johnsone/dovecot-2.1.10/src/login-common/ssl-proxy-openssl.c:465 ret = -949939536 #6 0x000000010cac0729 in ssl_step (proxy=0x7fff53152b90) at ssl-proxy-openssl.c:528 ret = -949939536 #7 0x000000010cb11bd1 in io_loop_call_io (io=0x7f7fa0c1c240) at ioloop.c:379 ioloop = (struct ioloop *) 0x7f7fa0c03b00 t_id = 2 #8 0x000000010cb12c69 in io_loop_handler_run (ioloop=0x0) at ioloop-kqueue.c:148 tv = { tv_sec = 130, tv_usec = 124797 } ctx = (struct ioloop_handler_context *) 0x7f7fa0c1a890 ret = 1 ts = { tv_sec = 130, tv_nsec = 124797000 } events = (struct kevent *) 0x10cb4ee78 #9 0x000000010cb11b88 in io_loop_run (ioloop=0x7fff53152b90) at ioloop.c:398 No locals. #10 0x000000010cafde9b in master_service_run (service=0x7f7fa0c039f0, callback=0x7f7fa0c1c2d8) at master-service.c:543 No locals. #11 0x000000010cabdcd1 in main_deinit [inlined] () at /private/tmp/johnsone/dovecot-2.1.10/src/login-common/main.c:406 set_pool = (pool_t) 0x7f7fa100b420 login_socket = 0x10cb4f7f8 "?9??" #12 0x000000010cabdcd1 in login_binary_run (argc=1, argv=0x7f7fa0c037c0, binary=0x7fff53152b90) at main.c:407 set_pool = (pool_t) 0x7f7fa100b420 login_socket = 0x10cb4f7f8 "?9??" #13 0x00007fff8ac847e1 in start ()
(gdb) bt full #0 0x00007fff96e4c110 in malloc_zone_malloc () #1 0x00007fff96e4cba7 in malloc () #2 0x00007fff92d2c9e2 in CRYPTO_malloc () #3 0x00007fff92d4535f in EVP_DigestInit_ex () #4 0x00007fff92d17caa in ssleay_rand_add () #5 0x00007fff974529be in ssl23_accept () ...
#0 0x00007fff94775ffa in read () #1 0x00007fff92cf41e6 in sock_read () #2 0x00007fff92d15598 in BIO_read () #3 0x00007fff9745294a in ssl23_read_bytes () #4 0x00007fff97452c3e in ssl23_get_client_hello () #5 0x00007fff97452b4d in ssl23_accept () ...
On 10.11.2012, at 12.44, Erik A Johnson wrote:
imap-login processes are hanging (using 100% of CPU) when connected from a client that is partially blocked by a firewall. It appears that imap-login is stuck in a loop trying to complete an ssl handshake. imap-login is working fine for other clients not blocked by the firewall (including localhost).
This is dovecot 2.1.10 under Mac OS X 10.8.2 (compiled from sources); the firewall is Little Snitch 3.0.1 blocking port 993, which appears to let the connection initiate but then squashes and disconnects the socket during ssl handshaking.
gdb backtrace and Activity Monitor's "Sample Process" show that imap-login is stuck calling ioloop-kqueue's io_loop_handler_run -> io_loop_call_io -> ssl_step repeatedly; dtruss shows that it is repeatedly making system calls to kevent and read, the latter returning -1 with errno 57=ENOTCONN="Socket is not connected". (I also tried ./configure --with-ioloop=poll and --with-iopoll=select instead of the default best = kqueue but the results were the same; --with-iopoll=epoll didn't work because epoll is not available on this machine.) The client, initiated by the command "openssl s_client -connect SERVER:993", first responds "CONNECTED(00000003)" but then immediately the error "60278:error:140790E5:SSL routines:SSL23_WRITE:ssl handshake failure:/SourceCache/OpenSSL098/OpenSSL098-44/src/ssl/s23_lib.c:182:". The infinite loop is in src/lib/ioloop.c in the function "io_loop_run" where the statement "while (ioloop->running) io_loop_handler_run(ioloop)" is executed.
I wonder if this fixes it? http://hg.dovecot.org/dovecot-2.1/rev/e95479f439aa
Thanks, Timo. Nope, still an infinite loop. Anything I can try using gdb to trace?
On Nov 22, 2012, at 10:52 PM, Timo Sirainen tss@iki.fi wrote:
On 10.11.2012, at 12.44, Erik A Johnson wrote:
imap-login processes are hanging (using 100% of CPU) when connected from a client that is partially blocked by a firewall. It appears that imap-login is stuck in a loop trying to complete an ssl handshake. imap-login is working fine for other clients not blocked by the firewall (including localhost).
This is dovecot 2.1.10 under Mac OS X 10.8.2 (compiled from sources); the firewall is Little Snitch 3.0.1 blocking port 993, which appears to let the connection initiate but then squashes and disconnects the socket during ssl handshaking.
gdb backtrace and Activity Monitor's "Sample Process" show that imap-login is stuck calling ioloop-kqueue's io_loop_handler_run -> io_loop_call_io -> ssl_step repeatedly; dtruss shows that it is repeatedly making system calls to kevent and read, the latter returning -1 with errno 57=ENOTCONN="Socket is not connected". (I also tried ./configure --with-ioloop=poll and --with-iopoll=select instead of the default best = kqueue but the results were the same; --with-iopoll=epoll didn't work because epoll is not available on this machine.) The client, initiated by the command "openssl s_client -connect SERVER:993", first responds "CONNECTED(00000003)" but then immediately the error "60278:error:140790E5:SSL routines:SSL23_WRITE:ssl handshake failure:/SourceCache/OpenSSL098/OpenSSL098-44/src/ssl/s23_lib.c:182:". The infinite loop is in src/lib/ioloop.c in the function "io_loop_run" where the statement "while (ioloop->running) io_loop_handler_run(ioloop)" is executed.
I wonder if this fixes it? http://hg.dovecot.org/dovecot-2.1/rev/e95479f439aa
Could you try with the attached patch, and with only the problematic client running? What does it log (the beginning of the session until it starts repeating the same lines)?
On Sat, 2012-11-24 at 00:16 -0800, Erik A Johnson wrote:
Thanks, Timo. Nope, still an infinite loop. Anything I can try using gdb to trace?
On Nov 22, 2012, at 10:52 PM, Timo Sirainen tss@iki.fi wrote:
On 10.11.2012, at 12.44, Erik A Johnson wrote:
imap-login processes are hanging (using 100% of CPU) when connected from a client that is partially blocked by a firewall. It appears that imap-login is stuck in a loop trying to complete an ssl handshake. imap-login is working fine for other clients not blocked by the firewall (including localhost).
This is dovecot 2.1.10 under Mac OS X 10.8.2 (compiled from sources); the firewall is Little Snitch 3.0.1 blocking port 993, which appears to let the connection initiate but then squashes and disconnects the socket during ssl handshaking.
gdb backtrace and Activity Monitor's "Sample Process" show that imap-login is stuck calling ioloop-kqueue's io_loop_handler_run -> io_loop_call_io -> ssl_step repeatedly; dtruss shows that it is repeatedly making system calls to kevent and read, the latter returning -1 with errno 57=ENOTCONN="Socket is not connected". (I also tried ./configure --with-ioloop=poll and --with-iopoll=select instead of the default best = kqueue but the results were the same; --with-iopoll=epoll didn't work because epoll is not available on this machine.) The client, initiated by the command "openssl s_client -connect SERVER:993", first responds "CONNECTED(00000003)" but then immediately the error "60278:error:140790E5:SSL routines:SSL23_WRITE:ssl handshake failure:/SourceCache/OpenSSL098/OpenSSL098-44/src/ssl/s23_lib.c:182:". The infinite loop is in src/lib/ioloop.c in the function "io_loop_run" where the statement "while (ioloop->running) io_loop_handler_run(ioloop)" is executed.
I wonder if this fixes it? http://hg.dovecot.org/dovecot-2.1/rev/e95479f439aa
Here's the log:
Nov 28 21:28:11 macbookpro-e17d.home dovecot[54139]: master: Dovecot v2.1.10 starting up (core dumps disabled) Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: ssl_step() Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: ssl_handshake: SSL_accept()=-1 Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: SSL_get_error() = 2 Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: - want_read Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: ssl_set_io(0) [last 5 lines are repeated until process is killed]
On Nov 26, 2012, as 11:38PM PST, Timo Sirainen tss@iki.fi wrote:
Could you try with the attached patch, and with only the problematic client running? What does it log (the beginning of the session until it starts repeating the same lines)?
On 10.11.2012, at 12.44, Erik A Johnson wrote:
imap-login processes are hanging (using 100% of CPU) when connected from a client that is partially blocked by a firewall. It appears that imap-login is stuck in a loop trying to complete an ssl handshake. imap-login is working fine for other clients not blocked by the firewall (including localhost).
This is dovecot 2.1.10 under Mac OS X 10.8.2 (compiled from sources); the firewall is Little Snitch 3.0.1 blocking port 993, which appears to let the connection initiate but then squashes and disconnects the socket during ssl handshaking.
gdb backtrace and Activity Monitor's "Sample Process" show that imap-login is stuck calling ioloop-kqueue's io_loop_handler_run -> io_loop_call_io -> ssl_step repeatedly; dtruss shows that it is repeatedly making system calls to kevent and read, the latter returning -1 with errno 57=ENOTCONN="Socket is not connected". (I also tried ./configure --with-ioloop=poll and --with-iopoll=select instead of the default best = kqueue but the results were the same; --with-iopoll=epoll didn't work because epoll is not available on this machine.) The client, initiated by the command "openssl s_client -connect SERVER:993", first responds "CONNECTED(00000003)" but then immediately the error "60278:error:140790E5:SSL routines:SSL23_WRITE:ssl handshake failure:/SourceCache/OpenSSL098/OpenSSL098-44/src/ssl/s23_lib.c:182:". The infinite loop is in src/lib/ioloop.c in the function "io_loop_run" where the statement "while (ioloop->running) io_loop_handler_run(ioloop)" is executed.
This is either OSX bug or OpenSSL bug.. Apparently what happens is:
- Client sends SYN packet to Dovecot
- Dovecot accept()s the connection (sends SYN-ACK) and goes into OpenSSL code
- Client doesn't send ACK to Dovecot. Does it send RST or nothing or something else? I don't know.
- OSX notices anyway that something is wrong with the socket, and kqueue says that the socket is ready for reading
- OpenSSL read()s, which fails with ENOTCONN. But OpenSSL thinks this is a non-fatal error and simply asks to be notified again when something can be read
- goto 4
So, whose bug is it? OpenSSL's ENOTCONN handling probably makes sense for client connections where connect() hasn't finished yet. But then again, this is accept()ed connection where it typically should fail like that. Except I guess it might be correct behavior if read() is done after SYN-ACK but before receiving ACK.
While OSX is receiving ACK from the client, it shouldn't say that the fd is readable. It probably doesn't. But after it receives <something> it realizes that the socket is disconnected. So read() probably shouldn't be returning ENOTCONN anymore at this point, but instead ECONNRESET or ETIMEDOUT.
See if the attached patch helps.
On 29.11.2012, at 7.45, Erik A Johnson wrote:
Here's the log:
Nov 28 21:28:11 macbookpro-e17d.home dovecot[54139]: master: Dovecot v2.1.10 starting up (core dumps disabled) Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: ssl_step() Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: ssl_handshake: SSL_accept()=-1 Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: SSL_get_error() = 2 Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: - want_read Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: ssl_set_io(0) [last 5 lines are repeated until process is killed]
On Nov 26, 2012, as 11:38PM PST, Timo Sirainen tss@iki.fi wrote:
Could you try with the attached patch, and with only the problematic client running? What does it log (the beginning of the session until it starts repeating the same lines)?
On 10.11.2012, at 12.44, Erik A Johnson wrote:
imap-login processes are hanging (using 100% of CPU) when connected from a client that is partially blocked by a firewall. It appears that imap-login is stuck in a loop trying to complete an ssl handshake. imap-login is working fine for other clients not blocked by the firewall (including localhost).
This is dovecot 2.1.10 under Mac OS X 10.8.2 (compiled from sources); the firewall is Little Snitch 3.0.1 blocking port 993, which appears to let the connection initiate but then squashes and disconnects the socket during ssl handshaking.
gdb backtrace and Activity Monitor's "Sample Process" show that imap-login is stuck calling ioloop-kqueue's io_loop_handler_run -> io_loop_call_io -> ssl_step repeatedly; dtruss shows that it is repeatedly making system calls to kevent and read, the latter returning -1 with errno 57=ENOTCONN="Socket is not connected". (I also tried ./configure --with-ioloop=poll and --with-iopoll=select instead of the default best = kqueue but the results were the same; --with-iopoll=epoll didn't work because epoll is not available on this machine.) The client, initiated by the command "openssl s_client -connect SERVER:993", first responds "CONNECTED(00000003)" but then immediately the error "60278:error:140790E5:SSL routines:SSL23_WRITE:ssl handshake failure:/SourceCache/OpenSSL098/OpenSSL098-44/src/ssl/s23_lib.c:182:". The infinite loop is in src/lib/ioloop.c in the function "io_loop_run" where the statement "while (ioloop->running) io_loop_handler_run(ioloop)" is executed.
No, the test to bug out doesn't work because net_geterror(proxy->fd_ssl) returns 0 in the statement
if (!proxy->client_proxy && net_geterror(proxy->fd_ssl) == ENOTCONN) {
However, errno is indeed ENOTCONN. Changing the test to
if (!proxy->client_proxy && errno == ENOTCONN) {
does do the job (i.e., stopping the process with the "syscall failed" error message), though I'm sure I don't understand the ins and outs well enough to know if that is an appropriate test. The full log line is:
Nov 29 07:39:14 SERVER dovecot[24362]: imap-login: Disconnected (disconnected before greeting, waited 0 secs): user=<>, rip=CLIENTIP, lip=SERVERIP, TLS handshaking: SSL_accept() syscall failed: Socket is not connected, session=<d5DNFaTP8QAKAAFi>
Thoughts?
On Nov 28, 2012, at 10:18PM PST, Timo Sirainen tss@iki.fi wrote:
This is either OSX bug or OpenSSL bug.. Apparently what happens is:
- Client sends SYN packet to Dovecot
- Dovecot accept()s the connection (sends SYN-ACK) and goes into OpenSSL code
- Client doesn't send ACK to Dovecot. Does it send RST or nothing or something else? I don't know.
- OSX notices anyway that something is wrong with the socket, and kqueue says that the socket is ready for reading
- OpenSSL read()s, which fails with ENOTCONN. But OpenSSL thinks this is a non-fatal error and simply asks to be notified again when something can be read
- goto 4
So, whose bug is it? OpenSSL's ENOTCONN handling probably makes sense for client connections where connect() hasn't finished yet. But then again, this is accept()ed connection where it typically should fail like that. Except I guess it might be correct behavior if read() is done after SYN-ACK but before receiving ACK.
While OSX is receiving ACK from the client, it shouldn't say that the fd is readable. It probably doesn't. But after it receives <something> it realizes that the socket is disconnected. So read() probably shouldn't be returning ENOTCONN anymore at this point, but instead ECONNRESET or ETIMEDOUT.
See if the attached patch helps.
On 29.11.2012, at 7.45, Erik A Johnson wrote:
Here's the log:
Nov 28 21:28:11 macbookpro-e17d.home dovecot[54139]: master: Dovecot v2.1.10 starting up (core dumps disabled) Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: ssl_step() Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: ssl_handshake: SSL_accept()=-1 Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: SSL_get_error() = 2 Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: - want_read Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: ssl_set_io(0) [last 5 lines are repeated until process is killed]
On Nov 26, 2012, as 11:38PM PST, Timo Sirainen tss@iki.fi wrote:
Could you try with the attached patch, and with only the problematic client running? What does it log (the beginning of the session until it starts repeating the same lines)?
On 10.11.2012, at 12.44, Erik A Johnson wrote:
imap-login processes are hanging (using 100% of CPU) when connected from a client that is partially blocked by a firewall. It appears that imap-login is stuck in a loop trying to complete an ssl handshake. imap-login is working fine for other clients not blocked by the firewall (including localhost).
This is dovecot 2.1.10 under Mac OS X 10.8.2 (compiled from sources); the firewall is Little Snitch 3.0.1 blocking port 993, which appears to let the connection initiate but then squashes and disconnects the socket during ssl handshaking.
gdb backtrace and Activity Monitor's "Sample Process" show that imap-login is stuck calling ioloop-kqueue's io_loop_handler_run -> io_loop_call_io -> ssl_step repeatedly; dtruss shows that it is repeatedly making system calls to kevent and read, the latter returning -1 with errno 57=ENOTCONN="Socket is not connected". (I also tried ./configure --with-ioloop=poll and --with-iopoll=select instead of the default best = kqueue but the results were the same; --with-iopoll=epoll didn't work because epoll is not available on this machine.) The client, initiated by the command "openssl s_client -connect SERVER:993", first responds "CONNECTED(00000003)" but then immediately the error "60278:error:140790E5:SSL routines:SSL23_WRITE:ssl handshake failure:/SourceCache/OpenSSL098/OpenSSL098-44/src/ssl/s23_lib.c:182:". The infinite loop is in src/lib/ioloop.c in the function "io_loop_run" where the statement "while (ioloop->running) io_loop_handler_run(ioloop)" is executed.
On 29.11.2012, at 17.42, Erik A Johnson wrote:
No, the test to bug out doesn't work because net_geterror(proxy->fd_ssl) returns 0 in the statement
if (!proxy->client_proxy && net_geterror(proxy->fd_ssl) == ENOTCONN) {
However, errno is indeed ENOTCONN. Changing the test to
if (!proxy->client_proxy && errno == ENOTCONN) {
This change isn't very reliable, since the previous call might not have been read().. I wonder if something like would work:
if (!proxy->client && read(proxy->fd_ssl, &err, 0) < 0 && errno == ENOTCONN) {
At 10PM +0200 on 29/11/12 Timo Sirainen wrote:
On 29.11.2012, at 17.42, Erik A Johnson wrote:
No, the test to bug out doesn't work because net_geterror(proxy->fd_ssl) returns 0 in the statement
if (!proxy->client_proxy && net_geterror(proxy->fd_ssl) == ENOTCONN) {
However, errno is indeed ENOTCONN. Changing the test to
if (!proxy->client_proxy && errno == ENOTCONN) {
This change isn't very reliable, since the previous call might not have been read().. I wonder if something like would work:
if (!proxy->client && read(proxy->fd_ssl, &err, 0) < 0 && errno == ENOTCONN) {
How about calling getpeername on fd_ssl? That should reliably tell you if the socket is connected or not. http://cr.yp.to/docs/connect.html suggests that read is not always a reliable test for that.
Ben
On 30.11.2012, at 0.12, Ben Morrow wrote:
This change isn't very reliable, since the previous call might not have been read().. I wonder if something like would work:
if (!proxy->client && read(proxy->fd_ssl, &err, 0) < 0 && errno == ENOTCONN) {
How about calling getpeername on fd_ssl? That should reliably tell you if the socket is connected or not. http://cr.yp.to/docs/connect.html suggests that read is not always a reliable test for that.
Yes, that sounds like it would work better:
if (!proxy->client && net_getpeername(proxy->fd_ssl, NULL, NULL) < 0 && errno == ENOTCONN) {
On Nov 29, 2012, at 7:42 AM, Erik A Johnson johnsone@usc.edu wrote:
No, the test to bug out doesn't work because net_geterror(proxy->fd_ssl) returns 0 in the statement
if (!proxy->client_proxy && net_geterror(proxy->fd_ssl) == ENOTCONN) {
I was wrong here: the FIRST time, net_geterror returns EBADF = "The argument socket is not a valid file descriptor" (and then 0 on subsequent calls).
On November 29, 2012 at 12:43:42 PM PST, Timo Sirainen tss@iki.fi wrote:
I wonder if something like would work:
if (!proxy->client && read(proxy->fd_ssl, &err, 0) < 0 && errno == ENOTCONN) {
On November 29, 2012 at 2:12:18 PM PST, Ben Morrow ben@morrow.me.uk wrote:
How about calling getpeername on fd_ssl? That should reliably tell you if the socket is connected or not. http://cr.yp.to/docs/connect.html suggests that read is not always a reliable test for that.
Thanks, Ben.
On November 29, 2012 at 2:39:51 PM PST, Timo Sirainen tss@iki.fi wrote:
Yes, that sounds like it would work better:
if (!proxy->client && net_getpeername(proxy->fd_ssl, NULL, NULL) < 0 && errno == ENOTCONN) {
Using getpeername or net_getpeername, errno is set to EINVAL = "socket has been shut down", so we could instead use
if (!proxy->client && net_getpeername(proxy->fd_ssl, NULL, NULL) < 0 && errno == EINVAL) {
So it seems that we have the following options:
- net_geterror(proxy->fd_ssl) == EBADF
- read(proxy->fd_ssl, &err, 0) < 0 && errno == ENOTCONN
- net_getpeername(proxy->fd_ssl, NULL, NULL) < 0 && errno == EINVAL
Which is preferable? Should the "#ifdef __APPLE__" remain? or would any of these tests be appropriate for other platforms as well?
On Nov 28, 2012, at 10:18PM PST, Timo Sirainen tss@iki.fi wrote:
This is either OSX bug or OpenSSL bug.. Apparently what happens is:
- Client sends SYN packet to Dovecot
- Dovecot accept()s the connection (sends SYN-ACK) and goes into OpenSSL code
- Client doesn't send ACK to Dovecot. Does it send RST or nothing or something else? I don't know.
- OSX notices anyway that something is wrong with the socket, and kqueue says that the socket is ready for reading
- OpenSSL read()s, which fails with ENOTCONN. But OpenSSL thinks this is a non-fatal error and simply asks to be notified again when something can be read
- goto 4
So, whose bug is it? OpenSSL's ENOTCONN handling probably makes sense for client connections where connect() hasn't finished yet. But then again, this is accept()ed connection where it typically should fail like that. Except I guess it might be correct behavior if read() is done after SYN-ACK but before receiving ACK.
While OSX is receiving ACK from the client, it shouldn't say that the fd is readable. It probably doesn't. But after it receives <something> it realizes that the socket is disconnected. So read() probably shouldn't be returning ENOTCONN anymore at this point, but instead ECONNRESET or ETIMEDOUT.
See if the attached patch helps.
On 29.11.2012, at 7.45, Erik A Johnson wrote:
Here's the log:
Nov 28 21:28:11 macbookpro-e17d.home dovecot[54139]: master: Dovecot v2.1.10 starting up (core dumps disabled) Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: ssl_step() Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: ssl_handshake: SSL_accept()=-1 Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: SSL_get_error() = 2 Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: - want_read Nov 28 21:30:19 macbookpro-e17d.home dovecot[54141]: imap-login: Debug: ssl_set_io(0) [last 5 lines are repeated until process is killed]
On Nov 26, 2012, as 11:38PM PST, Timo Sirainen tss@iki.fi wrote: > > Could you try with the attached patch, and with only the problematic > client running? What does it log (the beginning of the session until it > starts repeating the same lines)? > > On 10.11.2012, at 12.44, Erik A Johnson wrote: >> imap-login processes are hanging (using 100% of CPU) when connected from a client that is partially blocked by a firewall. It appears that imap-login is stuck in a loop trying to complete an ssl handshake. imap-login is working fine for other clients not blocked by the firewall (including localhost). >> >> This is dovecot 2.1.10 under Mac OS X 10.8.2 (compiled from sources); the firewall is Little Snitch 3.0.1 blocking port 993, which appears to let the connection initiate but then squashes and disconnects the socket during ssl handshaking. >> >> gdb backtrace and Activity Monitor's "Sample Process" show that imap-login is stuck calling ioloop-kqueue's io_loop_handler_run -> io_loop_call_io -> ssl_step repeatedly; dtruss shows that it is repeatedly making system calls to kevent and read, the latter returning -1 with errno 57=ENOTCONN="Socket is not connected". (I also tried ./configure --with-ioloop=poll and --with-iopoll=select instead of the default best = kqueue but the results were the same; --with-iopoll=epoll didn't work because epoll is not available on this machine.) The client, initiated by the command "openssl s_client -connect SERVER:993", first responds "CONNECTED(00000003)" but then immediately the error "60278:error:140790E5:SSL routines:SSL23_WRITE:ssl handshake failure:/SourceCache/OpenSSL098/OpenSSL098-44/src/ssl/s23_lib.c:182:". The infinite loop is in src/lib/ioloop.c in the function "io_loop_run" where the statement "while (ioloop->running) io_loop_handler_run(ioloop)" is executed.
At 10AM -0800 on 1/12/12 Erik A Johnson wrote:
On November 29, 2012 at 2:39:51 PM PST, Timo Sirainen tss@iki.fi wrote:
Yes, that sounds like it would work better:
if (!proxy->client && net_getpeername(proxy->fd_ssl, NULL, NULL) < 0 && errno == ENOTCONN) {
Using getpeername or net_getpeername, errno is set to EINVAL = "socket has been shut down", so we could instead use
if (!proxy->client && net_getpeername(proxy->fd_ssl, NULL, NULL) < 0 && errno == EINVAL) {
So it seems that we have the following options:
- net_geterror(proxy->fd_ssl) == EBADF
- read(proxy->fd_ssl, &err, 0) < 0 && errno == ENOTCONN
- net_getpeername(proxy->fd_ssl, NULL, NULL) < 0 && errno == EINVAL
Which is preferable?
I think the djb page I mentioned before comes down in favour of
if (!proxy->client && net_getpeername(...) < 0 &&
read(proxy->fd_ssl, &err, 1) < 0 && errno == ENOTCONN) {
where the read is of 1 byte rather than 0 since a read of 0 apparently sometimes succeeds even if the socket isn't connected, and the getpeername is to protect against actually reading from a connected socket. However, doing all that every time we try to read seems like a lot of wasted effort; if possible, it would be better to identify the circumstances when it might happen (for instance, is it true that if we've done at least one successful SSL read on this socket that this error won't occur?).
Should the "#ifdef __APPLE__" remain? or would any of these tests be appropriate for other platforms as well?
I had a go at reproducing this on FreeBSD and failed, but I don't believe we've seen a packet trace yet so I wasn't entirely sure what might provoke it. There is definitely a bug in the OS here somewhere, unless the socket never gets as far as SYN-SYN/ACK-ACK, since ENOTCONN should only be returned *before* the socket has connected successfully. An ordinary disconnected socket should simply return EOF from read, and a socket that got RST should return ECONNRESET.
Are you able to reproduce this and get a tcpdump packet trace (on the dovecot side of any firewalls)? Also, when this happens, does it happen straight away or is there a delay until the connection times out?
(I don't suppose you know if the source for the OSX network stack is online anywhere? I'd be interested to see how different it is from FreeBSD's.)
Ben
At 10AM -0800 on 1/12/12 Erik A Johnson wrote:
Should the "#ifdef __APPLE__" remain? or would any of these tests be appropriate for other platforms as well?
On 12/1/2012 at 11:07:36am PST, Ben Morrow ben@morrow.me.uk wrote:
I had a go at reproducing this on FreeBSD and failed, but I don't believe we've seen a packet trace yet so I wasn't entirely sure what might provoke it. There is definitely a bug in the OS here somewhere, unless the socket never gets as far as SYN-SYN/ACK-ACK, since ENOTCONN should only be returned *before* the socket has connected successfully. An ordinary disconnected socket should simply return EOF from read, and a socket that got RST should return ECONNRESET.
Are you able to reproduce this and get a tcpdump packet trace (on the dovecot side of any firewalls)?
Attached is the output of "sudo tcpdump -i en0 -vv -A host CLIENT" run on the SERVER.
I know enough about tcpdump to be dangerous; if there is additional output that would be helpful, please let me know what options to use for tcpdump (the OS X tcpdump man page is at http://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPa...).
Also, when this happens, does it happen straight away or is there a delay until the connection times out?
It happens straight away.
(I don't suppose you know if the source for the OSX network stack is online anywhere? I'd be interested to see how different it is from FreeBSD's.)
Would it be somewhere at http://opensource.apple.com/release/mac-os-x-1082/ ?
At 2AM -0800 on 3/12/12 you (Erik A Johnson) wrote:
At 10AM -0800 on 1/12/12 Erik A Johnson wrote:
Should the "#ifdef __APPLE__" remain? or would any of these tests be appropriate for other platforms as well?
On 12/1/2012 at 11:07:36am PST, Ben Morrow ben@morrow.me.uk wrote:
I had a go at reproducing this on FreeBSD and failed, but I don't believe we've seen a packet trace yet so I wasn't entirely sure what might provoke it. There is definitely a bug in the OS here somewhere, unless the socket never gets as far as SYN-SYN/ACK-ACK, since ENOTCONN should only be returned *before* the socket has connected successfully. An ordinary disconnected socket should simply return EOF from read, and a socket that got RST should return ECONNRESET.
Are you able to reproduce this and get a tcpdump packet trace (on the dovecot side of any firewalls)?
Attached is the output of "sudo tcpdump -i en0 -vv -A host CLIENT" run on the SERVER.
I know enough about tcpdump to be dangerous; if there is additional output that would be helpful, please let me know what options to use for tcpdump (the OS X tcpdump man page is at http://developer.apple.com/library/mac/#documentation/Darwin/Reference/ ManPages/man1/tcpdump.1.html).
Also, when this happens, does it happen straight away or is there a delay until the connection times out?
It happens straight away.
(I don't suppose you know if the source for the OSX network stack is online anywhere? I'd be interested to see how different it is from FreeBSD's.)
Would it be somewhere at http://opensource.apple.com/release/mac-os-x-1082/ ?
Well, it looks to me as though xnu/bsd/kern/uipc_socket.c:soreceive will indeed return ENOTCONN for a socket which was once successfully connected but has now been disconnected. This happens when the socket is in the DEFUNCT state, which is a state that doesn't exist in FreeBSD; it's not completely clear but I suspect firewalls may be able to put arbitrary sockets into that state.
If this happens, OpenSSL's assumption that ENOTCONN always means 'we haven't finished accepting the connection yet' will no longer be correct. That suggests to me that until this gets fixed in OpenSSL (or, ideally, the OSX kernel), Dovecot should attempt to test for this condition, under #ifdef __APPLE__. Alternatively, it might be worth investigating the SO_DEFUNCTOK socket option: it looks to me as though clearing this prevents a socket from going into this state, but only root is allowed to do that.
This trace seems to bear out that idea, at least as far as it makes sense. (I've taken out the packet dumps, since they don't tell me anything.)
02:40:58.721596 IP (tos 0x0, ttl 64, id 17809, offset 0, flags [DF], proto TCP (6), length 64) CLIENT.51819 > SERVER.imaps: Flags [S], cksum 0xe780 (correct), seq 2439614483, win 65535, options [mss 1460,nop,wscale 3,nop,nop,TS val 1051593167 ecr 0,sackOK,eol], length 0 02:40:58.721803 IP (tos 0x0, ttl 64, id 55036, offset 0, flags [DF], proto TCP (6), length 64) SERVER.imaps > CLIENT.51819: Flags [S.], cksum 0x277c (correct), seq 947774533, ack 2439614484, win 65535, options [mss 1460,nop,wscale 4,nop,nop,TS val 877424355 ecr 1051593167,sackOK,eol], length 0 02:40:58.726646 IP (tos 0x0, ttl 64, id 34733, offset 0, flags [DF], proto TCP (6), length 52) CLIENT.51819 > SERVER.imaps: Flags [.], cksum 0x6748 (correct), seq 1, ack 1, win 65535, options [nop,nop,TS val 1051593169 ecr 877424355], length 0
These three packets are the standard TCP 3-way handshake. At this point both ends ought to consider the connection successfully established.
02:40:58.726754 IP (tos 0x0, ttl 64, id 6073, offset 0, flags [DF], proto TCP (6), length 52) SERVER.imaps > CLIENT.51819: Flags [.], cksum 0x4718 (correct), seq 1, ack 1, win 8235, options [nop,nop,TS val 877424360 ecr 1051593169], length 0
This is a window update from the server, indicating that the previous window of 65535 should be expanded to 131760 (that is, 8235 from the 'win' field in this packet, shifted left by the 4 from the 'wscale' option in the server's SYN/ACK packet). This is normal.
02:40:58.732338 IP (tos 0x0, ttl 64, id 20960, offset 0, flags [DF], proto TCP (6), length 52) SERVER.imaps > CLIENT.51819: Flags [F.], cksum 0x4712 (correct), seq 1, ack 1, win 8235, options [nop,nop,TS val 877424365 ecr 1051593169], length 0
This is what's weird. It's a FIN packet from the server, which usually indicates the server deliberately closed the connection. I don't see any reason for Dovecot or OpenSSL to do that, but I suspect this is what happens if the firewall makes the socket DEFUNCT. (I haven't yet managed to follow the code well enough to prove that...)
02:40:58.746788 IP (tos 0x0, ttl 64, id 37781, offset 0, flags [DF], proto TCP (6), length 182) CLIENT.51819 > SERVER.imaps: Flags [P.], cksum 0xfb11 (correct), seq 1:131, ack 1, win 65535, options [nop,nop,TS val 1051593169 ecr 877424355], length 130 02:40:58.746793 IP (tos 0x0, ttl 64, id 50564, offset 0, flags [DF], proto TCP (6), length 52) CLIENT.51819 > SERVER.imaps: Flags [.], cksum 0x66ba (correct), seq 131, ack 2, win 65535, options [nop,nop,TS val 1051593170 ecr 877424365], length 0
This is a data packet from the client trying to initiate the SSL conversation (presumably it sent this before it saw the FIN) followed by an ACK acknowledging the server's FIN.
02:40:58.746842 IP (tos 0x0, ttl 64, id 49645, offset 0, flags [DF], proto TCP (6), length 40) SERVER.imaps > CLIENT.51819: Flags [R], cksum 0xad98 (correct), seq 947774534, win 0, length 0 02:40:58.746843 IP (tos 0x0, ttl 64, id 42942, offset 0, flags [DF], proto TCP (6), length 40) SERVER.imaps > CLIENT.51819: Flags [R], cksum 0xad97 (correct), seq 947774535, win 0, length 0
The server at this point is being rude: it sends RSTs to both those packets, which mean that the server doesn't think these packets belong to a connection it knows about. This supports the idea that the socket has been made DEFUNCT: an ordinary closed socket shouldn't do this.
02:40:58.747318 IP (tos 0x0, ttl 64, id 27131, offset 0, flags [DF], proto TCP (6), length 52) CLIENT.51819 > SERVER.imaps: Flags [F.], cksum 0x66b9 (correct), seq 131, ack 2, win 65535, options [nop,nop,TS val 1051593170 ecr 877424365], length 0 02:40:58.747335 IP (tos 0x0, ttl 64, id 48383, offset 0, flags [DF], proto TCP (6), length 40) SERVER.imaps > CLIENT.51819: Flags [R], cksum 0xad97 (correct), seq 947774535, win 0, length 0
The client continues trying to be polite, and sends its own FIN/ACK indicating it is closing the connection. The server sends another RST, since it still doesn't know about this connection any more.
So, it looks to me as though you have a firewall problem. You may be able to get more information by setting the kern.ipc.sodefunctlog sysctl to 1: this should make the kernel log to syslog (or wherever the OSX kernel logs go) when sockets are made DEFUNCT and when reads fail for that reason.
Ben
At 12PM +0000 on 4/12/12 Ben Morrow wrote:
Well, it looks to me as though xnu/bsd/kern/uipc_socket.c:soreceive will indeed return ENOTCONN for a socket which was once successfully connected but has now been disconnected. This happens when the socket is in the DEFUNCT state, which is a state that doesn't exist in FreeBSD; it's not completely clear but I suspect firewalls may be able to put arbitrary sockets into that state.
Investigating a little further, it should be possible to test for this situation directly. Assuming I'm correct about what's going on here, this should be both cleaner and safer than mucking about looking for ENOTCONN and guessing about what's happening. Erik, does this make the problem go away? I left out the proxy->client_proxy test, since AFAICT this is just as likely to happen on a client socket. Ben --- src/login-common/ssl-proxy-openssl.c~ 2012-07-28 17:56:31 +0100 +++ src/login-common/ssl-proxy-openssl.c 2012-12-04 14:12:03 +0000 @@ -407,6 +407,21 @@ switch (err) { case SSL_ERROR_WANT_READ: +#ifdef SO_ISDEFUNCT + /* Some Apple firewalls appear to be able to mark a + * socket DEFUNCT, at which point all reads return + * immediately with ENOTCONN. Since that is supposed to + * mean 'the socket hasn't finished connecting yet', + * OpenSSL keeps asking for another read and we go into + * an infinite loop. + */ + if (getsockopt(proxy->fd_ssl, SOL_SOCKET, SO_ISDEFUNCT, + (void *)&err, sizeof(err)) == 0 && err) { + errstr = t_strdup_printf( + "%s: socket is defunct", func_name); + break; + } +#endif ssl_set_io(proxy, SSL_ADD_INPUT); break; case SSL_ERROR_WANT_WRITE:
participants (4)
-
Ben Morrow
-
Erik A Johnson
-
Erik A Johnson
-
Timo Sirainen