[Dovecot] search and UTF-8 normalization forms (NFD)
Hello,
on a system with Dovecot 2.2 I've got a mailbox containing multiple mails from a person called Krüger, but with the From: header encoded differently. Some are encoded in UTF-8 normalization form decomposed (NFD, as used by Mac OS X), that is, u and the combining diaeresis as separate code points instead of one precomposed ü:
From: =?utf-8?Q?replaced_Kru=CC=88ger?= <krueger@some.domain>
Searching within Roundcube webmail for "krüger" as sender misses these mails.
Roundcube sends (dovecot rawlog): A0003 UID THREAD REFS UTF-8 ALL HEADER FROM {7+}krüger
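The mismatch can be reproduced outside of Dovecot. The following Python sketch uses only the standard library and is an illustration, not a description of what Dovecot does internally. It decodes the header shown above and compares it with the 7-byte search key from the rawlog. The variable names are made up for this example.

```python
import unicodedata
from email.header import decode_header, make_header

# Header value from the mail above (RFC 2047 encoded-word, NFD inside).
raw_from = "=?utf-8?Q?replaced_Kru=CC=88ger?= <krueger@some.domain>"
decoded_from = str(make_header(decode_header(raw_from)))

# Search key as sent by Roundcube: precomposed ü (U+00FC), 7 bytes in UTF-8,
# which matches the {7+} literal size in the rawlog.
search_key = "kr\u00fcger"
assert len(search_key.encode("utf-8")) == 7

# A plain case-insensitive substring match fails, because the header
# contains "u" + U+0308 (COMBINING DIAERESIS) instead of U+00FC.
naive_match = search_key.lower() in decoded_from.lower()

# After normalizing both sides to the same form (NFC here) the match succeeds.
def nfc(s: str) -> str:
    return unicodedata.normalize("NFC", s)

normalized_match = nfc(search_key).lower() in nfc(decoded_from).lower()

print(naive_match, normalized_match)
```

Any normalization form would do here, as long as the same one is applied to both the search key and the header.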
Is this supposed to work? I haven't done any more debugging (other search variants) or read the RFCs. As a user I would expect Unicode equivalence rules to be applied (see http://en.wikipedia.org/wiki/Unicode_equivalence).
Regards, Lutz
On 25.4.2013, at 16.39, Lutz Preßler <Lutz.Pressler@SerNet.DE> wrote:
IMAP requires using i;unicode-casemap by default, as specified by RFC 5051. Then again, other collations could be supported as well, and nothing really forbids the search from matching more flexibly. Anyway, that's what Dovecot currently has implemented, and I guess it doesn't do what you want it to do. But there is a partial solution for this:
http://dovecot.org/patches/2.1/icu-1.2.tar.gz
It probably does what you want, but it only works with fts-lucene.
Hello Timo, On Thu, 02 May 2013, Timo Sirainen wrote:
I'm trying to test it with the 2.2.1 installation, but have a problem doing so: after a seemingly smooth compilation and installation, I get
May 10 14:15:18 host dovecot: imap: Error: Module is for different ABI version 2.2.1 (we have 2.2.ABIv0(2.2.1)): /usr/lib/dovecot/modules/lib20_icu_plugin.so
May 10 14:15:18 host dovecot: imap: Fatal: Couldn't load required plugins
Any idea?
Greetings, Lutz
On 02.05.2013 17:53, Timo Sirainen wrote:
Could you elaborate a bit why you think i;unicode-casemap does not handle this case?
Is it only applied to the query, but not the header, or vice versa? It seems to me that Step 2 should map both inputs to LATIN CAPITAL LETTER U + COMBINING DIAERESIS.
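That expectation can be sketched roughly in Python. This approximates the RFC 5051 steps with str.upper() and NFKD, which is not identical to the collation (the RFC uses the titlecase property from UnicodeData.txt and does not reorder combining marks), but it agrees for this input. casemap_key is a hypothetical helper name.

```python
import unicodedata

def casemap_key(s: str) -> str:
    """Approximate i;unicode-casemap: case-map, then fully decompose.

    Assumption: str.upper() and NFKD stand in for the RFC 5051
    titlecase and decomposition steps; they agree for this example
    but are not a faithful implementation of the collation.
    """
    return unicodedata.normalize("NFKD", s.upper())

query = "kr\u00fcger"    # precomposed ü, as sent by Roundcube
header = "Kru\u0308ger"  # u + COMBINING DIAERESIS, as stored in the mail

# Both map to K R U U+0308 G E R, so the keys compare equal.
print(casemap_key(query) == casemap_key(header))
```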
Regards, Florian
On 10.05.2013 15:24, Florian Zeitz wrote:
So... I had a look at this. Turns out that the current implementation of Unicode decomposition (Step 2(b) in i;unicode-casemap) in Dovecot is broken. It only handles decomposition properties that include a tag. I've attached a hg export that fixes this.
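To illustrate the difference (this is a Python sketch, not the Dovecot code or the attached patch): the decomposition field in UnicodeData.txt is either a bare list of code points, as in "0075 0308" for ü, or a list prefixed with a tag, as in "<compat> 0066 0069" for the fi ligature. Python's unicodedata.decomposition() returns that field as a string. Both function names below are made up for this example.

```python
import unicodedata

def decomposition_tagged_only(ch: str) -> str:
    """Mimics the described bug: only mappings with a <tag> are applied."""
    field = unicodedata.decomposition(ch)
    if not field.startswith("<"):
        return ch  # untagged (canonical) mappings are skipped
    codepoints = field.split()[1:]
    return "".join(chr(int(cp, 16)) for cp in codepoints)

def decompose(ch: str) -> str:
    """Applies tagged and untagged mappings, recursively."""
    field = unicodedata.decomposition(ch)
    if not field:
        return ch  # no decomposition property at all
    parts = field.split()
    if parts[0].startswith("<"):
        parts = parts[1:]  # ignore the tag, keep the code points
    return "".join(decompose(chr(int(cp, 16))) for cp in parts)

print(decomposition_tagged_only("\u00fc") == "\u00fc")  # ü is left untouched
print(decompose("\u00fc") == "u\u0308")                 # u + COMBINING DIAERESIS
print(decompose("\ufb01") == "fi")                      # tagged mapping still applied
```

With the tagged-only variant, a precomposed ü in the search key never becomes u + U+0308, so it cannot match a header that already stores the decomposed form.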
On 11.5.2013, at 18.13, Florian Zeitz <florob@babelmonkeys.de> wrote:
Thanks, added to v2.1 and v2.2 hg.
On Wed, 15 May 2013, Timo Sirainen wrote:
Thanks, but there still seems to be a problem left. Without fts_lucene, a sender search yields all Krüger mails. But with fts_lucene enabled, and index files present in lucene-indexes/, it does not. (If I delete the lucene-index files and search for the sender, the result is correct, but only until they are recreated.)
Lutz
On 21.5.2013, at 14.41, Lutz Preßler <Lutz.Pressler@SerNet.DE> wrote:
Fixed finally: http://hg.dovecot.org/dovecot-2.2/rev/7e54af474ea4
Add plugin { fts_lucene = normalize no_snowball } setting (NOTE: this change causes all the existing lucene indexes to be rebuilt).
This fts-lucene is getting rather annoying. I wonder if all of this is somehow magically solved in Solr.