Hello
Checking further, and after adding log statements throughout the dovecot code, I see that the core sends the document twice: FIRST the initial document (not decoded), then SECOND the decoded version
This is really weird, and the indexer then indexes a lot of binary crap
I am struggling to find where in the code this double call is made.
Does anyone know?
On 2021-02-10 00:05, John Fawcett wrote:
On 09/02/2021 15:33, Joan Moreau wrote:
If I place the following code in the plugin's fts_backend_xxx_update_build_more function (for lucene, squat and xapian, as solr refuses to work properly on my setup):
    {
        /* Log the first 20 bytes of the chunk, or "EMPTY" if there is no data */
        char *s = i_strdup("EMPTY");
        if (data != NULL) {
            i_free(s);
            s = i_strndup(data, 20);
        }
        i_info("fts_backend_update_build_more: data like '%s'", s);
        i_free(s);
    }
and if I send a PDF by email, the data shown in the log is "%PDF-1.7 "
so it means the decoded data is not properly transmitted to the plugin
Something is wrong in the data transmission
Joan
I too see something similar with fts_solr. I do see the raw %PDF string and PDF binary data being passed through to the fts_backend_xxx_update_build_more function, but I disagree with the conclusion you draw from it.
After the raw data I also see the decoded data, so at least in my case it is possible to see both the raw and decoded data in the fts_backend_xxx_update_build_more function. In the rawlog I no longer see the binary data (only some blank lines), so something is filtering it. I do see the decoded data in the rawlog. I do get hits on the solr search for the decoded text.
John
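
To tell the raw and decoded calls apart in the logs, something like the following could be called from the build_more function instead of printing the bytes as-is. This is only a sketch: chunk_looks_binary and chunk_preview are hypothetical helper names, not part of dovecot, the 5% control-character threshold is an arbitrary guess, and the code uses plain libc so that it can be tried outside the plugin.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not a dovecot function): heuristic check that
   returns true when more than 5% of the bytes are control characters
   other than tab, LF and CR. Bytes >= 0x80 are not counted, so UTF-8
   text is not flagged. */
bool chunk_looks_binary(const unsigned char *data, size_t size)
{
    size_t i, suspicious = 0;

    if (data == NULL || size == 0)
        return false;
    for (i = 0; i < size; i++) {
        unsigned char c = data[i];

        if (c == 0x7f || (c < 0x20 && c != '\t' && c != '\n' && c != '\r'))
            suspicious++;
    }
    return suspicious * 20 > size;
}

/* Hypothetical helper (not a dovecot function): writes a log-safe preview
   of at most max_bytes input bytes into out. Printable ASCII is copied,
   every other byte is written as \xNN. The output is always
   NUL-terminated; the return value is its length. */
size_t chunk_preview(const unsigned char *data, size_t size, size_t max_bytes,
                     char *out, size_t out_size)
{
    size_t i, pos = 0;

    if (out == NULL || out_size == 0)
        return 0;
    if (data == NULL)
        size = 0;
    for (i = 0; i < size && i < max_bytes; i++) {
        unsigned char c = data[i];

        if (c >= 0x20 && c < 0x7f) {
            if (pos + 1 >= out_size)
                break; /* no room for one char plus the NUL */
            out[pos++] = (char)c;
        } else {
            if (pos + 4 >= out_size)
                break; /* no room for \xNN plus the NUL */
            snprintf(out + pos, out_size - pos, "\\x%02x", (unsigned int)c);
            pos += 4;
        }
    }
    out[pos] = '\0';
    return pos;
}
```

Inside the plugin, the preview buffer and the result of chunk_looks_binary could then be passed to i_info together with the chunk size, which should make it visible whether a given call carries the raw attachment or the decoded text.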