Hello,
I am trying to handle email attachments properly in the fts-xapian plugin.
I tried the default script with a PDF file.
The data I receive in the FTS plugin ("xxx_build_more") is the original document, not the output of pdftotext.
Is there anything I am missing?
Here is my config:
plugin {
  plugin = fts_xapian managesieve sieve
  fts = xapian
  fts_xapian = partial=2 full=20 verbose=1 attachments=1
  fts_autoindex = yes
  fts_enforced = yes
  fts_autoindex_exclude = \Trash
  fts_autoindex_exclude2 = \Drafts
  fts_decoder = decode2text
  sieve = /data/mail/%d/%n/local.sieve
  sieve_after = /data/mail/after.sieve
  sieve_before = /data/mail/before.sieve
  sieve_dir = /data/mail/%d/%n/sieve
  sieve_global_dir = /data/mail
  sieve_global_path = /data/mail/global.sieve
}
...
service decode2text {
  executable = script /usr/libexec/dovecot/decode2text.sh
  user = dovecot
  unix_listener decode2text {
    mode = 0666
  }
}
Thank you
A bit more on this: after adding logging to decode2text.sh, I can see that pdftotext outputs the right data, but that data is /not/ transmitted to the FTS plugin for indexing (only the original PDF data is).
More info: I am running the Dovecot git version.
More info: the function fts_parser_script_more in plugins/fts/fts-parser.c properly reads the output of the script.
Still, the data is not sent to the FTS plugins (xapian or any other).
On 07/02/2021 18:51, Joan Moreau wrote:
More info: the function fts_parser_script_more in plugins/fts/fts-parser.c properly reads the output of the script.
Still, the data is not sent to the FTS plugins (xapian or any other).
Joan
I'm not sure I can be of much use for xapian, but looking at your configuration I did notice some differences from the documentation. I don't know if they are relevant to the issue you're seeing.
First of all, I don't see the
mail_plugins = fts
plugin = fts
settings, which are both mentioned in the xapian documentation.
Also, the documentation states that attachments=1 can only index text attachments. Maybe you should be using attachments=0 and let fts_decoder handle the attachments.
Failing that, I can only advise turning on some debugging and seeing what that brings.
best regards
John
Well, thank you for the answer, but the actual issue is that the data sent by the decoder (specified in the conf file) is properly collected by Dovecot core, but /not/ sent to the plugin: the plugin receives the original data.
This is not linked to a particular plugin (xapian, solr, squat, etc.) but seems to be a general issue in Dovecot core.
On 08/02/2021 15:22, Joan Moreau wrote:
Well, thank you for the answer, but the actual issue is that the data sent by the decoder (specified in the conf file) is properly collected by Dovecot core, but /not/ sent to the plugin: the plugin receives the original data.
This is not linked to a particular plugin (xapian, solr, squat, etc.) but seems to be a general issue in Dovecot core.
Hi Joan
As far as I can see, there is no general issue in Dovecot core with using the decoder. It works for me. I see the text extracted from the PDF sent to solr (I enabled the rawlog feature to see the actual data being sent). Also, when I query solr, I get a search hit for the attachment text.
John
Well, in the xxx_build_more function of the FTS plugin, the data received is the original PDF, not the output of pdftotext.
Can you clarify where you put your logging in the solr plugin, so I can check the situation in the xapian plugin?
On 2021-02-08, Joan Moreau <jom@grosjo.net> wrote:
Well, in the xxx_build_more function of the FTS plugin, the data received is the original PDF, not the output of pdftotext.
Can you clarify where you put your logging in the solr plugin, so I can check the situation in the xapian plugin?
The log is specific to fts_solr; you set it with, e.g.:
"fts_solr = url=http://127.0.0.1:8983/solr/dovecot/ rawlog_dir=/tmp/solr"
Confirmed it works for me, i.e. passes text from inside the pdf, and not the whole pdf itself.
Did you check that decode2text.sh works ok on your system (when running as the relevant uid)?
cat foo.pdf | sudo -u dovecot /usr/libexec/dovecot/decode2text.sh application/pdf
Yes, once again: the output of the decoder is fine. I also put logging inside Dovecot core to check whether the data is properly transmitted, and it is (i.e. Dovecot core receives the proper output of pdftotext via the decoder).
Now, that data is /not/ what is sent from Dovecot core to the FTS plugin (and this is the same issue for solr and all other plugins).
Of course, the stemming will show good results (as the PDF content will be stemmed), but the problem remains.
How can I make sure the data sent to the FTS plugins (xapian, solr, whatever...) is the output of the decoder and /not/ the original data?
On 2021/02/08 21:33, Joan Moreau wrote:
Yes, once again: the output of the decoder is fine. I also put logging inside Dovecot core to check whether the data is properly transmitted, and it is (i.e. Dovecot core receives the proper output of pdftotext via the decoder).
Now, that data is /not/ what is sent from Dovecot core to the FTS plugin (and this is the same issue for solr and all other plugins).
It seems that something in your setup differs from John's and mine, then, as the fts_solr rawlog (which is just the HTTP request split into .in and .out files) has the decoded file for us.
Did you try with the actual fts_solr plugin so it's a direct comparison with what we see? There is no need for a real solr server, just point it at any http server (or I guess netcat listening on a port will also do) with
mail_plugins = fts fts_solr
plugin {
  fts_autoindex = yes
  fts = solr
  fts_solr = url=http://127.0.0.1:80/ rawlog_dir=/tmp/solr
}
If that is not showing decoded for you then I suppose there's some problem on the way into/through fts. And if it does show as decoded then perhaps fts_solr is doing something slightly different than the places you're examining in fts and your plugin, and that might give a point to work backwards from.
I'd also recommend that Joan look into some of the potential configuration issues I mentioned in my first reply and, if the problem persists, post some clear evidence.
John
If I place the following code in the plugin's fts_backend_xxx_update_build_more function (lucene, squat and xapian, as solr refuses to work properly on my setup):
{
    char *s = i_strdup("EMPTY");
    if (data != NULL) {
        i_free(s);
        s = i_strndup(data, 20);
    }
    i_info("fts_backend_update_build_more: data like '%s'", s);
    i_free(s);
}
and if I send a PDF by email, the data shown in the log is "%PDF-1.7 ".
So it does mean the decoder data is not properly transmitted to the plugin.
Something is wrong in the data transmission.
Joan
I too see something similar with fts_solr. I do see the raw %PDF string and PDF binary data being passed through to the fts_backend_xxx_update_build_more function, but I disagree with the conclusion you draw from it.
After the raw data I also see the decoded data, so at least in my case it is possible to see both the raw and the decoded data in the fts_backend_xxx_update_build_more function. In the rawlog I no longer see the binary data (only some blank lines), so something is filtering it. I do see the decoded data in the rawlog. I do get hits on the solr search for the decoded text.
John
Hello
Checking further, and putting logs more or less everywhere in the Dovecot code, I see that the core sends FIRST the initial document (not decoded) and SECOND the decoded version.
This is really weird, and the indexer then indexes a lot of binary crap.
I am struggling to find where in the code this double call is made.
Does anyone know?
Created a PR
https://github.com/dovecot/core/pull/155
On 11/02/2021 14:25, Joan Moreau wrote:
Hello
Checking further, and putting logs more or less everywhere in the Dovecot code, I see that the core sends FIRST the initial document (not decoded) and SECOND the decoded version.
This is really weird, and the indexer then indexes a lot of binary crap.
I am struggling to find where in the code this double call is made.
Does anyone know?
Joan
I didn't get round to working out where it happens, but your observation is in line with what I see for the solr plugin. The only difference is that, as far as I can see, the raw data does not make it to solr. That the rawlog does not contain the data is a good indication, but the proof is that searching for the PDF string on solr does not get a hit on the messages.
John
On 08/02/2021 21:35, Joan Moreau wrote:
Well, in the xxx_build_more function of the FTS plugin, the data received is the original PDF, not the output of pdftotext.
Can you clarify where you put your logging in the solr plugin, so I can check the situation in the xapian plugin?
I used the following setting in fts_solr parameter
rawlog_dir=<directory>
and made sure the directory was writeable by dovecot (777 just to be sure)
John
participants (3): Joan Moreau, John Fawcett, Stuart Henderson