Hi All, We are using dovecot 2.3.19.1 We created a account with more than 1000 mail folders in Maildir format to reproduce the issue. After weeks of testing, we have found a logic that may cause dsync to encounter the error - no activity for 900 seconds.... The function, dsync_ibc_stream_input, is the callback function after some data are ready for be read. This is part of what it does. o_stream_cork(ibc->output); ibc->ibc.io_callback(ibc->ibc.io_context); o_stream_uncork(ibc->output); Normally, ibc->ibc.io_callback(ibc->ibc.io_context) reads some data and then processes it. But when dsync connects over tcps, it uses function implementations in lib-ssl-iostream to send and receive data. And this simplified call stack would result in some data are read when calling o_stream_uncork o_stream_uncork => o_stream_flush => o_stream_ssl_flush_buffer => openssl_iostream_bio_sync => openssl_iostream_bio_input If some data arrive after ibc->ibc.io_callback(ibc->ibc.io_context) and before o_stre
am_uncork, o_stream_uncork would read the data and then return. After o_stream_uncork returns, dsync then waits for new data to be read or written. But because the data had been read in o_stream_uncork, and there may be no new data to be read, dsync may then wait until timeout is met. It may happen, but it is hard to reproduce. If you also encounter this situation, you may try to use dsync over tcp connection. It does not read data when calling o_stream_uncork. As a result, it may not stuck. Back to the error itself, Maybe openssl-stream should not read data when doing uncork(flush)? Song-Lin Yang
Hi,
But when dsync connects over tcps, it uses function implementations in lib-ssl-iostream to send and receive data. And this simplified call stack would result in some data are read when calling o_stream_uncork
o_stream_uncork => o_stream_flush => o_stream_ssl_flush_buffer => openssl_iostream_bio_sync => openssl_iostream_bio_input
If some data arrive after ibc->ibc.io_callback(ibc->ibc.io_context) and before o_stream_uncork, o_stream_uncork would read the data and then return. After o_stream_uncork returns, dsync then waits for new data to be read or written. But because the data had been read in o_stream_uncork, and there may be no new data to be read, dsync may then wait until timeout is met.
If this happens, it's supposed to mark the IO as having data "pending", which should prevent the timeout. The logic seems to me like it should work, although I'm now wondering if it can hang when there is some internal SSL state data that needs to be read even though there is no actual data being sent. So:
It may happen, but it is hard to reproduce.
Are you able to try if this patch happens to fix it? : diff --git a/src/lib-ssl-iostream/iostream-openssl.c b/src/lib-ssl-iostream/iostream-openssl.c index fa9afb22f7..094abaa56c 100644 --- a/src/lib-ssl-iostream/iostream-openssl.c +++ b/src/lib-ssl-iostream/iostream-openssl.c @@ -550,10 +550,7 @@ openssl_iostream_bio_input(struct ssl_iostream *ssl_io, } } if (bytes_read || i_stream_get_data_size(ssl_io->plain_input) > 0) { - if (i_stream_get_data_size(ssl_io->plain_input) > 0 || - (type != OPENSSL_IOSTREAM_SYNC_TYPE_FIRST_READ && - type != OPENSSL_IOSTREAM_SYNC_TYPE_CONTINUE_READ)) - i_stream_set_input_pending(ssl_io->ssl_input, TRUE); + i_stream_set_input_pending(ssl_io->ssl_input, TRUE); ssl_io->want_read = FALSE; } return (bytes_read ? 1 : 0);
On 2. Jan 2023, at 3.21, Timo Sirainen
It may happen, but it is hard to reproduce.
Are you able to try if this patch happens to fix it? :
Actually, looking at it slightly more, there may be another bug in it, although I'm not sure how it affects the behavior. So try this instead - I'll also run it through our CI to see if it causes any issues: diff --git a/src/lib-ssl-iostream/iostream-openssl.c b/src/lib-ssl-iostream/iostream-openssl.c index fa9afb22f7..0b22a07a84 100644 --- a/src/lib-ssl-iostream/iostream-openssl.c +++ b/src/lib-ssl-iostream/iostream-openssl.c @@ -548,14 +548,10 @@ openssl_iostream_bio_input(struct ssl_iostream *ssl_io, ssl_io->ostream_flush_waiting_input = FALSE; o_stream_set_flush_pending(ssl_io->plain_output, TRUE); } - } - if (bytes_read || i_stream_get_data_size(ssl_io->plain_input) > 0) { - if (i_stream_get_data_size(ssl_io->plain_input) > 0 || - (type != OPENSSL_IOSTREAM_SYNC_TYPE_FIRST_READ && - type != OPENSSL_IOSTREAM_SYNC_TYPE_CONTINUE_READ)) - i_stream_set_input_pending(ssl_io->ssl_input, TRUE); ssl_io->want_read = FALSE; } + if (bytes_read || i_stream_get_data_size(ssl_io->plain_input) > 0) + i_stream_set_input_pending(ssl_io->ssl_input, TRUE); return (bytes_read ? 1 : 0); }
On 2. Jan 2023, at 3.26, Timo Sirainen timo@sirainen.com wrote:
On 2. Jan 2023, at 3.21, Timo Sirainen timo@sirainen.com wrote:
It may happen, but it is hard to reproduce.
Are you able to try if this patch happens to fix it? :
Actually, looking at it slightly more, there may be another bug in it, although I'm not sure how it affects the behavior. So try this instead - I'll also run it through our CI to see if it causes any issues:
At least CI passed with this, so it's not badly broken. :)
We haven't found any specific bugs with lib-ssl-iostream so far, but we did find istream-multiplex bug that could cause hangs with doveadm-doveadm connections. Could be worth testing if it helps: https://github.com/dovecot/core/commit/bbe546bc637a6ac5c9e91fc8abefce62e4950...
On 30. Dec 2022, at 14.37, songliny songliny@synology.com wrote:
Hi All, We are using dovecot 2.3.19.1 We created a account with more than 1000 mail folders in Maildir format to reproduce the issue. After weeks of testing, we have found a logic that may cause dsync to encounter the error - no activity for 900 seconds.... The function, dsync_ibc_stream_input, is the callback function after some data are ready for be read. This is part of what it does. o_stream_cork(ibc->output); ibc->ibc.io_callback(ibc->ibc.io_context); o_stream_uncork(ibc->output); Normally, ibc->ibc.io_callback(ibc->ibc.io_context) reads some data and then processes it. But when dsync connects over tcps, it uses function implementations in lib-ssl-iostream to send and receive data. And this simplified call stack would result in some data are read when calling o_stream_uncork
o_stream_uncork => o_stream_flush => o_stream_ssl_flush_buffer => openssl_iostream_bio_sync => openssl_iostream_bio_input
If some data arrive after ibc->ibc.io_callback(ibc->ibc.io_context) and before o_stream_uncork, o_stream_uncork would read the data and then return. After o_stream_uncork returns, dsync then waits for new data to be read or written. But because the data had been read in o_stream_uncork, and there may be no new data to be read, dsync may then wait until timeout is met. It may happen, but it is hard to reproduce. If you also encounter this situation, you may try to use dsync over tcp connection. It does not read data when calling o_stream_uncork. As a result, it may not stuck. Back to the error itself, Maybe openssl-stream should not read data when doing uncork(flush)? Song-Lin Yang
participants (2)
-
songliny
-
Timo Sirainen