On 21/01/2021 15:10, Alexey Panov wrote:
In some cases (the exact conditions are still unknown), dovecot sends binary data (attachments) to SOLR for indexing. This dramatically reduces the efficiency of the index and of FTS overall.

In extreme cases (the example below is 20 MB), dovecot’s hardwired 60 s timeout is triggered during the HTTP exchange with SOLR on just a single file. This leaves the index unfinished, so during initial indexing the run is restarted over and over. With multiple affected mailboxes, this can cause an I/O overload of the whole system even under moderate usage.

Message example (doveadm fetch text): https://filebin.ca/5oy5Wc1QrBK3/fetch-text.obfuscated.txt
Corresponding raw log data: https://filebin.ca/5oy6yqLSCr3H/rawlog.obfuscated.txt

(Both files were processed with perl doveadm-obfuscate.pl; the script doesn’t replace non-Latin characters, so those were replaced with ‘R’ manually.)

Workaround: there is a useful patch by John Fawcett that allows setting a maximum message body size for FTS indexing. It works perfectly, but affected messages are completely ignored by FTS.

This bug report is a summarised result of this discussion

Alexey

Just a couple of questions. I expect that messages whose size exceeds the configurable limit introduced by my patch submission are not completely ignored, and that their headers are still indexed. I don't have time to check it now, but I'm pretty sure about it. Do you have evidence that the messages are not being indexed at all? The intended behaviour of the fts_max_size setting in my patch was to bypass message body indexing only, not to bypass indexing completely.

Are you requesting a different behaviour from the one provided by the patch? I imagine that people would find it useful to still parse the message body up to the limit. That would be a little more tricky, but potentially a good idea for a further enhancement.
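
To make the two behaviours concrete, here is a minimal Python sketch. It is not taken from dovecot or from the patch; the Message class, the text_to_index function and the truncate flag are hypothetical names used only for illustration. It contrasts skipping the body of an oversized message (headers still indexed) with indexing the body up to the limit.

```python
from dataclasses import dataclass


@dataclass
class Message:
    # Hypothetical stand-in for a mail message; not a dovecot structure.
    headers: str
    body: bytes


def text_to_index(msg: Message, max_size: int, truncate: bool = False):
    """Return the (headers, body) pair that would be sent to the indexer.

    Headers are always returned. The body is returned in full only when
    it fits within max_size bytes.
    """
    if len(msg.body) <= max_size:
        return msg.headers, msg.body
    if truncate:
        # Possible enhancement: index the body up to the limit.
        return msg.headers, msg.body[:max_size]
    # Intended behaviour as described above: bypass the body only.
    return msg.headers, b""


msg = Message(headers="Subject: report", body=b"x" * 10)
headers_only = text_to_index(msg, max_size=4)
truncated = text_to_index(msg, max_size=4, truncate=True)
```

With the limit at 4 bytes, the first call keeps the headers and drops the body, while the second keeps the first 4 bytes of the body. A real implementation of the truncating variant would also have to avoid cutting a multi-byte character in half, which is part of what makes it more tricky.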

Thanks

John