Hello,
I have configured Dovecot with Solr and wanted Solr to index the content of attachments. For testing I used the biabam command-line tool to generate emails with attachments.
I have found that Dovecot with fts_decoder decodes these biabam attachments incorrectly, so pdftotext reports a corrupted PDF.
The problem is that biabam generates a Content-Type header with charset=binary, and Dovecot's message decoder then tries to process the body as UTF-8 or non-UTF-8 text data.
============================================================
--biabam.ZxWVLybiabam.ZxWVLy
Content-Type: application/pdf; charset=binary
Content-Disposition: attachment; filename="bacula-jobs.pdf"
Content-Transfer-Encoding: base64

JVBERi0xLjQKJcfsj6IKNSAwIG9iago8PC9MZW5ndGggNiAwIFIvRmlsdGVy
IC9GbGF0ZURlY29kZT4+CnN0cmVhbQp4nF2PT0+EMBDF7/0U7yYYWdqFXdbe
1vgnMXpQezMeClSoQNltweh+egvLyczhN3kz703mCLpioFMtLDoSv2aoHKGo
.....
The original PDF begins with:
============================================================
0000000: 2550 4446 2d31 2e34 0a25 c7ec 8fa2 0a35  %PDF-1.4.%.....5
0000010: 2030 206f 626a 0a3c 3c2f 4c65 6e67 7468   0 obj.<</Length
But Dovecot passes the following data to the fts_decoder script:
============================================================
0000000: 2550 4446 2d31 2e34 0a25 c3a4 c3bc c3b6  %PDF-1.4.%......
0000010: c39f 0a32 2030 206f 626a 0a3c 3c2f 4c65  ...2 0 obj.<</Le
As you can see, the binary data is mangled.
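One plausible way such damage arises (an assumption on my side, not something confirmed from the Dovecot source) is that the bytes are treated as text in a single-byte charset and converted to UTF-8. The sketch below uses a hypothetical helper, latin1_to_utf8, to show the effect for ISO-8859-1: every byte at or above 0x80 is expanded to two bytes, which corrupts any binary format.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only (not Dovecot code).
 * Converts ISO-8859-1 bytes to UTF-8 and returns the number of bytes
 * written. dst must have room for at least 2 * len bytes. */
static size_t latin1_to_utf8(const unsigned char *src, size_t len,
                             unsigned char *dst)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        if (src[i] < 0x80) {
            /* ASCII bytes are copied unchanged. */
            dst[n++] = src[i];
        } else {
            /* Bytes 0x80..0xff become a two-byte UTF-8 sequence. */
            dst[n++] = (unsigned char)(0xC0 | (src[i] >> 6));
            dst[n++] = (unsigned char)(0x80 | (src[i] & 0x3F));
        }
    }
    return n;
}
```

For example, the four input bytes 25 e4 fc 0a come out as the six bytes 25 c3 a4 c3 bc 0a, the same "c3 xx" shape that is visible in the data passed to the fts_decoder script.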
Alpine and Thunderbird do not write charset=binary in the Content-Type header, and searching works perfectly with their messages.
I searched the source code and found one relevant place. If I replace the following code at line 241 of dovecot-2.1.7/src/lib-mail/message-decoder.c with the new version, Dovecot's message decoder decodes the message correctly and pdftotext can convert the attached PDF.
Original code:
241:    ctx->binary_input = ctx->content_charset == NULL &&
242:            (ctx->flags & MESSAGE_DECODER_FLAG_RETURN_BINARY) != 0 &&
243:            (part->flags & (MESSAGE_PART_FLAG_TEXT |
244:                            MESSAGE_PART_FLAG_MESSAGE_RFC822)) == 0;
My update:
241:    ctx->binary_input = ((ctx->content_charset != NULL) && (strcasecmp(ctx->content_charset, "binary") == 0)) || (ctx->content_charset == NULL &&
242:            (ctx->flags & MESSAGE_DECODER_FLAG_RETURN_BINARY) != 0 &&
243:            (part->flags & (MESSAGE_PART_FLAG_TEXT |
244:                            MESSAGE_PART_FLAG_MESSAGE_RFC822)) == 0);
This sets ctx->binary_input for an attachment whose charset is "binary", in addition to the existing case where no charset is given.
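To make the proposed condition easier to check in isolation, here is a standalone sketch of the same logic as a small predicate. The function name demo_is_binary_input and the DEMO_* constants are hypothetical stand-ins with made-up values; they are not Dovecot's real definitions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical stand-ins for Dovecot's flag constants. The real values
 * are defined in Dovecot's headers and are not reproduced here. */
enum {
    DEMO_DECODER_FLAG_RETURN_BINARY = 0x01,
    DEMO_PART_FLAG_TEXT             = 0x01,
    DEMO_PART_FLAG_MESSAGE_RFC822   = 0x02
};

/* Mirrors the proposed condition: the input is binary if the charset is
 * literally "binary" (case-insensitive), or if no charset is given, the
 * caller asked for binary output, and the part is neither text nor
 * message/rfc822. */
static bool demo_is_binary_input(const char *content_charset,
                                 unsigned int decoder_flags,
                                 unsigned int part_flags)
{
    if (content_charset != NULL)
        return strcasecmp(content_charset, "binary") == 0;

    return (decoder_flags & DEMO_DECODER_FLAG_RETURN_BINARY) != 0 &&
           (part_flags & (DEMO_PART_FLAG_TEXT |
                          DEMO_PART_FLAG_MESSAGE_RFC822)) == 0;
}
```

With this sketch, a part with charset "binary" is treated as binary regardless of the flags, while a part with any other charset (for example "utf-8") is not, which matches the behaviour described for the patched line.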
I don't know whether this is the correct fix, but with this change searching also works for biabam's binary attachments.
Could you please verify this problem and maybe update the code?
Thank you very much.
# dovecot --version
2.1.7
Config:
plugin {
  fts = solr
  fts_solr = url=http://localhost:8080/solr/
  fts_decoder = decode2text
}

service decode2text {
  executable = script /etc/dovecot/scripts/decode2text.sh
  user = dovecot
  unix_listener decode2text {
    mode = 0666
  }
}
Regards,
Robert Wolf.