Yes, once again: the output of the decoder is fine. I also put logging inside the dovecot core to check whether the data is properly transmitted, and the result is that it is (i.e. dovecot core receives the proper output of pdftotext via the decoder).
Now, that data is /not/ what is sent from dovecot core to the fts plugin (and this is the same issue for solr and all other plugins).
Of course, the stemming will show good results (as the PDF content will be stemmed), but the problem remains.
How can I make sure the data sent to the FTS plugins (xapian, solr, whatever...) is the output of the decoder and /not/ the original data?
On 2021-02-08 21:11, Stuart Henderson wrote:
On 2021-02-08, Joan Moreau <jom@grosjo.net> wrote:
Well, in the function xxx_build_more of the FTS plugin, the data received is
the original PDF, not the output of pdftotext.
Can you clarify where you put your log in the solr plugin, so I can
check the situation in the xapian plugin?
The log is particular to fts_solr, you set it with e.g.
"fts_solr = url=http://127.0.0.1:8983/solr/dovecot/ rawlog_dir=/tmp/solr"
Confirmed it works for me, i.e. passes text from inside the pdf, and not
the whole pdf itself.
Did you check that decode2text.sh works ok on your system (when running
as the relevant uid)?
cat foo.pdf | sudo -u dovecot /usr/libexec/dovecot/decode2text.sh application/pdf
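
Both checks above come down to the same question: is the data still a raw PDF, or is it extracted text? A minimal sketch of how that could be tested from the shell, relying on the fact that a PDF file starts with the bytes "%PDF-". The helper names is_raw_pdf and contains_pdf_magic are hypothetical (not part of dovecot), and the idea that raw PDF bytes would show up verbatim in the rawlog files is an assumption, not something confirmed in this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the data on stdin starts with the PDF
# magic bytes "%PDF-", i.e. it still looks like a raw PDF and not like text.
is_raw_pdf() {
    # Only the first 5 bytes matter; drop NUL bytes so the shell does not
    # complain when it is handed binary input.
    [ "$(head -c 5 | tr -d '\0')" = "%PDF-" ]
}

# Hypothetical helper: succeeds if the file named in $1 contains the PDF
# magic anywhere (-a treats binary files as text, -F matches literally).
contains_pdf_magic() {
    LC_ALL=C grep -a -q -F '%PDF-' "$1"
}

# Example use (paths taken from the messages above; the layout of the
# rawlog directory is an assumption):
#
#   cat foo.pdf | sudo -u dovecot /usr/libexec/dovecot/decode2text.sh application/pdf \
#       | is_raw_pdf && echo "decoder returned the raw PDF"
#
#   for f in /tmp/solr/*; do
#       contains_pdf_magic "$f" && echo "raw PDF data found in $f"
#   done
```

If the first command prints nothing and the loop reports no files, the text handed to the plugin is at least not the untouched PDF. It says nothing about whether the extracted text is complete.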