Re: FTS Tokenization filters normalizer-icu vs lowercase

20 Jan 2022


On 01/20/2022 9:20 AM Alessio Cecchi <alessio@skye.it> wrote:
I'm trying to set up fts-flatcurve with tokenization.
What are the differences/benefits with "fts_filters = normalizer-icu" vs "fts_filters = lowercase"?
Reading the docs, I found that normalizer-icu is described as "This is potentially very resource intensive." and lowercase as "Supports UTF8, when compiled with libicu".
So, is using lowercase almost the same as normalizer-icu, but faster?
No, these are 2 different actions.
Lowercase tries to use language rules to map characters to a "lowercase" equivalent, which is character/language dependent.
Normalization tries to take a string and reduce it to a unique, normalized form that can be directly compared to other normalized strings.  Unicode text, for example, can contain strings that display the same to the user but contain very different byte data.  It is possible to create more complicated glyphs either by using a single precomposed code point (which takes several bytes in UTF-8) or by using a sequence of code points (a base character plus combining marks) that, when combined, display identically.
Normalization is a very complicated topic.  https://en.wikipedia.org/wiki/Unicode_equivalence might help with further understanding.
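To make the equivalence problem concrete, here is a minimal sketch using Python's standard unicodedata module. It is only a stand-in for illustration: Dovecot's normalizer-icu filter uses libicu, and the exact transform it applies may differ from plain NFC. The variable names and the nfc() helper are hypothetical.

```python
import unicodedata

# Two encodings of the same visible character, "é".
precomposed = "\u00e9"   # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def nfc(text: str) -> str:
    """Reduce text to Unicode Normalization Form C (composed form)."""
    return unicodedata.normalize("NFC", text)


# The raw strings differ, both as code points and as UTF-8 bytes.
raw_equal = precomposed == decomposed
precomposed_bytes = precomposed.encode("utf-8")
decomposed_bytes = decomposed.encode("utf-8")

# After normalization the two strings compare equal.
normalized_equal = nfc(precomposed) == nfc(decomposed)

print(raw_equal, normalized_equal)
print(precomposed_bytes, decomposed_bytes)
```

Without normalization, a search for one form would not match text indexed in the other form, even though both look the same on screen.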
The ICU library deals with general internationalization support, and these two filters use different parts of that library to do different things.  They are not replacements for each other; they are complementary.  You could normalize a string and then lowercase it, for example.
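As a rough sketch of why the two steps complement each other, the Python below compares lowercasing alone with normalizing and then lowercasing. It assumes NFC as the normalization form and uses Python's built-in str.lower(); neither is claimed to match what Dovecot's filters do internally, and the function names are made up for this example.

```python
import unicodedata


def lowercase_only(token: str) -> str:
    """Hypothetical stand-in for a lowercase-only filter chain."""
    return token.lower()


def normalize_then_lowercase(token: str) -> str:
    """Hypothetical stand-in for normalizing first, then lowercasing."""
    return unicodedata.normalize("NFC", token).lower()


indexed = "CAF\u00c9"   # "CAFÉ" with a precomposed É
query = "cafe\u0301"    # "café" written as e + combining acute accent

# Lowercasing fixes the case difference but not the encoding difference.
match_lowercase_only = lowercase_only(indexed) == lowercase_only(query)

# Normalizing first removes the encoding difference as well.
match_both = normalize_then_lowercase(indexed) == normalize_then_lowercase(query)

print(match_lowercase_only, match_both)
```

Here lowercasing alone leaves the two tokens different, while the combined chain makes them equal.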
michael
...
FYI
To use fts-flatcurve with the Dovecot RPM packages from repo.dovecot.org, you have to rebuild them with --with-icu --with-stemmer --with-textcat and the related libraries.
Thanks
--
Alessio Cecchi
Postmaster @ http://www.qboxmail.it
https://www.linkedin.com/in/alessice

Michael Slusarz