Inaccurate results while searching for a phrase in subject (fts-flatcurve)
Hi,
I have been using the lucene FTS plugin for a decade now and it has served me well. I thought of upgrading to the new and current stuff and came across the flatcurve plugin, which seems very promising (xapian, on the other hand, was creating indexes larger than my mailboxes themselves). I am using the following configuration in dovecot.conf:
fts = flatcurve
fts_filters_en = lowercase english-possessive stopwords
fts_languages = en
fts_tokenizers = generic email-address
fts_autoindex = no
fts_enforced = yes
A search command like this:
doveadm -D search -u john@doe.com mailbox INBOX SUBJECT "/home/johndoe/render.php"
should show the messages with subject: "CRON: /home/johndoe/render.php OK" but produces a lot of extra undesired results and I think the second line in this debug output indicates the reason:
May 23 07:44:13 doveadm(john@doe.com): Debug: fts-flatcurve(INBOX): Query (hdr_subject:/home/johndoe/render.php*) matches=0 uids=
May 23 07:44:13 doveadm(john@doe.com): Debug: fts-flatcurve(INBOX): Query (hdr_subject:php* AND hdr_subject:render* AND hdr_subject:johndoe* AND hdr_subject:home*) matches=272 uids=67041,67085,67188,67223,67257,67290,67323,67355,67395,67564,67770,67817,67863,67985,68819,69512,69572,69635,69737,70017,70058,70086,70125,70147,70191,70296,70304,70331,70340,70350,70354,70375,70407,70417,70427,70449,70499,70521:70522,70535:70550,70555,70561:70563,70591,70597:70599,70662,70685,70702,70708,70718:70719,70724,70727:70728,70730:70733,70735,70746:70747,70754,70775,70777,70794,70811:70812,70822,70866,70942,70948,70971,71017,71021,71040,71042,71075,71079,71084,71113,71128:71129,71131,71152,71160,71184,71188,71208,71214,71225,71255,71269,71297,71300,71331,71375,71422,71449,71457,71467,71469,71495,71515,71605,71626,71632,71649,71672,71681:71682,71689,71692,71699,71716,71757,71770,71777,71782:71785,71790,71795,71797,71814,71818:71819,71828,71838:71842,71845,71859:71860,71937,71947,71954,71960,71963:71964,71977,71990,72014,72021:72022,72030,72034:72042,72045:72046,72049,72056,72061,72063,72073:72074,72083,72088,72090,72092,72101,72108,72129,72131:72132,72134,72136:72140,72159,72163,72172:72173,72186,72212,72218:72223,72237,72239,72246,72267,72288,72387,72410,72446,72469,72476:72477,72514,72541,72543,72568:72569,72572:72574,72598,72604,72606,72609,72644,72674,72687,72691,72694,72734,72772,72791,72797,72799,72803,72832:72833,72835:72841,72856:72857,72866:72867,72873:72874,72901,72930,72938,72948,72960,72965,72976,73018,73037,73071,73081,73116,73158,73249,73307,73352,73392,73466,73533,73601,73670,73733,73775,73784:73786,73804,73807,73811,73815,73819,73823,73825,73831,73842,73846,74005,74199,74390,74540,74684,74854,75017,75192,75354,75525,75710,75839:75843,75845,75903,75984:75985,76091,76263,76447,76624,76816,76989,77091:77092,77097,77119,77155,77293,77460,77608,77761,77908,78066,78218,78393,78400:78401,78522:78523,78560,78728,78921,79104,79298,79504,79555,79898,80027,80031:80032,80034:80035,80037,80056,80071,80073,80077:80079,80082:80084,80086,80089
I tried rebuilding the indexes with "fts_flatcurve_substring_search = yes" too, but that didn't change anything. It works as expected with the lucene plugin because in that case header search is performed via dovecot indexes instead of FTS. Maybe I am not doing something right in configuring this new FTS? I would really appreciate some pointers here.
Thanks, Sam
See below.
On 05/23/2023 2:14 AM MDT ss17@fea.st wrote:
I have been using the lucene FTS plugin for a decade now and it has served me well. I thought of upgrading to the new and current stuff and came across the flatcurve plugin, which seems very promising (xapian, on the other hand, was creating indexes larger than my mailboxes themselves). I am using the following configuration in dovecot.conf:
fts = flatcurve
fts_filters_en = lowercase english-possessive stopwords
fts_languages = en
fts_tokenizers = generic email-address
^^^ FTS input is being tokenized, so the phrase "/home/johndoe/render.php" will be indexed not as a full string but instead separately as "home", "johndoe", "render", and "php".
See: https://doc.dovecot.org/settings/plugin/fts-plugin/#plugin_setting-fts-fts_t...
This has nothing to do with flatcurve (or any FTS driver) - Dovecot will never send the full "/home/johndoe/render.php" to the driver to be indexed.
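To make the tokenization point concrete, here is a minimal Python sketch. It is not Dovecot's tokenizer: the function name simple_tokenize and the split rule (break on any run of non-alphanumeric characters, then lowercase) are assumptions chosen only because they reproduce the four terms visible in the debug output above.

```python
import re

def simple_tokenize(text: str) -> list[str]:
    """Hypothetical stand-in for a word tokenizer plus lowercase filter.

    Splits on runs of non-alphanumeric ASCII characters and lowercases
    each piece. Dovecot's real "generic" tokenizer has more rules; this
    only mimics the behaviour seen in the debug log.
    """
    return [piece.lower() for piece in re.split(r"[^0-9A-Za-z]+", text) if piece]

# The search phrase never survives as one string: it becomes four terms.
print(simple_tokenize("/home/johndoe/render.php"))

# The subject line is broken up the same way at indexing time.
print(simple_tokenize("CRON: /home/johndoe/render.php OK"))
```

Under these assumptions the slashes and the dot act purely as separators, so the index holds the individual words and no record of the punctuation between them.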
fts_autoindex = no
fts_enforced = yes
A search command like this:
doveadm -D search -u john@doe.com mailbox INBOX SUBJECT "/home/johndoe/render.php"
should show the messages with subject: "CRON: /home/johndoe/render.php OK" but produces a lot of extra undesired results and I think the second line in this debug output indicates the reason:
May 23 07:44:13 doveadm(john@doe.com): Debug: fts-flatcurve(INBOX): Query (hdr_subject:/home/johndoe/render.php*) matches=0 uids=
This is correct, since "/home/johndoe/render.php" was not indexed so there should be zero results.
May 23 07:44:13 doveadm(john@doe.com): Debug: fts-flatcurve(INBOX): Query (hdr_subject:php* AND hdr_subject:render* AND hdr_subject:johndoe* AND hdr_subject:home*) matches=272
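And this is also correct: the search is attempted both with the phrase as a full string and with all of its tokenized components. (Both the original text and all search terms are processed through the tokenizer before being passed to an FTS driver.) Because the second query only requires every term to appear somewhere in the subject, it also matches subjects that contain those words in a different order or arrangement.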
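The effect of the second query can be sketched in Python. This is only an illustration of "AND of prefix terms" matching, not flatcurve's or Xapian's implementation; the helper names and the second and third sample subjects are hypothetical, and the tokenizer is the same simplified assumption (split on non-alphanumerics, lowercase).

```python
import re

def tokenize(text: str) -> list[str]:
    # Simplified, assumed tokenizer: split on non-alphanumerics, lowercase.
    return [piece.lower() for piece in re.split(r"[^0-9A-Za-z]+", text) if piece]

def matches_all_prefixes(subject: str, phrase: str) -> bool:
    """True if every term of the phrase is a prefix of some subject token.

    Mirrors the shape of the logged query
    (php* AND render* AND johndoe* AND home*): order and adjacency
    of the terms are not checked.
    """
    subject_tokens = tokenize(subject)
    return all(
        any(token.startswith(term) for token in subject_tokens)
        for term in tokenize(phrase)
    )

phrase = "/home/johndoe/render.php"
subjects = [
    "CRON: /home/johndoe/render.php OK",             # subject from the report
    "CRON: /home/johndoe/php/render-all.sh FAILED",  # hypothetical subject
    "CRON: /home/johndoe/backup.sh OK",              # hypothetical subject
]
for subject in subjects:
    print(matches_all_prefixes(subject, phrase), subject)
```

The first two subjects satisfy the query even though only the first contains the literal path, which is the kind of over-matching reported; the third lacks "render" and "php" and is rejected.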
I tried rebuilding the indexes with "fts_flatcurve_substring_search = yes" too, but that didn't change anything. It works as expected with the lucene plugin because in that case header search is performed via dovecot indexes instead of FTS. Maybe I am not doing something right in configuring this new FTS?
I'm not a lucene expert... but with the old lucene plugin, you were almost certainly using it without Dovecot tokenization support, since the plugin predates it (I think). Using Dovecot tokenization would have required 'use_libfts' to be present in the fts_lucene setting (which I doubt was ever documented). I believe Dovecot was just doing simple whitespace tokenization instead, so the lucene code/library was likely receiving the full string and doing its own internal tokenization.
michael
Thanks, Michael, for that explanation. So with the addition of tokenization, has Dovecot lost the ability to search phrases, irrespective of the FTS engine? That would be a real bummer if true.
Participants (2): Michael Slusarz, ss17@fea.st