Hello,
I have configured dovecot with solr and want solr to index the content of attachments. For testing I used the biabam command-line tool to generate emails with attachments.
I found that dovecot with fts_decoder decodes the attachments generated by biabam incorrectly, so pdftotext reports a corrupted PDF.
The problem is that biabam generates a Content-Type header with charset=binary, and dovecot's message decoder then tries to process the body as UTF-8 or non-UTF-8 text instead of leaving it as binary data.
============================================================
--biabam.ZxWVLybiabam.ZxWVLy
Content-Type: application/pdf; charset=binary
Content-Disposition: attachment; filename="bacula-jobs.pdf"
Content-Transfer-Encoding: base64

JVBERi0xLjQKJcfsj6IKNSAwIG9iago8PC9MZW5ndGggNiAwIFIvRmlsdGVy
IC9GbGF0ZURlY29kZT4+CnN0cmVhbQp4nF2PT0+EMBDF7/0U7yYYWdqFXdbe
1vgnMXpQezMeClSoQNltweh+egvLyczhN3kz703mCLpioFMtLDoSv2aoHKGo
.....
The original PDF begins with:
============================================================
0000000: 2550 4446 2d31 2e34 0a25 c7ec 8fa2 0a35  %PDF-1.4.%.....5
0000010: 2030 206f 626a 0a3c 3c2f 4c65 6e67 7468   0 obj.<</Length
0000020: 2036 2030 2052 2f46 696c 7465 7220 2f46   6 0 R/Filter /F
0000030: 6c61 7465 4465 636f 6465 3e3e 0a73 7472  lateDecode>>.str
But dovecot passes the following data to the fts_decoder script:
============================================================
0000000: 2550 4446 2d31 2e34 0a25 c3a4 c3bc c3b6  %PDF-1.4.%......
0000010: c39f 0a32 2030 206f 626a 0a3c 3c2f 4c65  ...2 0 obj.<</Le
0000020: 6e67 7468 2033 2030 2052 2f46 696c 7465  ngth 3 0 R/Filte
0000030: 722f 466c 6174 6544 6563 6f64 653e 3e0a  r/FlateDecode>>.
As you can see, the binary data is mangled.
Alpine and Thunderbird do not write charset=binary to the Content-Type header, and with their messages searching works perfectly.
I searched the source code and found one relevant place. If I replace the following code in dovecot-2.1.7/src/lib-mail/message-decoder.c at line 241 with the new version, dovecot's message decoder decodes the message correctly and pdftotext can convert the attached PDF.
Original code:
241:    ctx->binary_input = ctx->content_charset == NULL &&
242:            (ctx->flags & MESSAGE_DECODER_FLAG_RETURN_BINARY) != 0 &&
243:            (part->flags & (MESSAGE_PART_FLAG_TEXT |
244:                            MESSAGE_PART_FLAG_MESSAGE_RFC822)) == 0;
My update:
241:    ctx->binary_input = ((ctx->content_charset != NULL) && (strcasecmp(ctx->content_charset, "binary") == 0)) || (ctx->content_charset == NULL &&
242:            (ctx->flags & MESSAGE_DECODER_FLAG_RETURN_BINARY) != 0 &&
243:            (part->flags & (MESSAGE_PART_FLAG_TEXT |
244:                            MESSAGE_PART_FLAG_MESSAGE_RFC822)) == 0);
This will set ctx->binary_input for the attachment with charset set to "binary".
I don't know whether this is the correct fix, but with it searching also works for biabam's binary attachments.
Could you please verify this problem and maybe update the code?
Thank you very much.
# dovecot --version
2.1.7
Config:
plugin {
  fts = solr
  fts_solr = url=http://localhost:8080/solr/
  fts_decoder = decode2text
}

service decode2text {
  executable = script /etc/dovecot/scripts/decode2text.sh
  user = dovecot
  unix_listener decode2text {
    mode = 0666
  }
}
Regards,
Robert Wolf.