[Dovecot] search and UTF-8 normalization forms (NFD)

Lutz Preßler

25 Apr 2013

4:39 p.m.

Hello,

On a system with Dovecot 2.2 I have a mailbox containing multiple mails from a person called Krüger, but with the From: header encoded differently. Some are encoded in the decomposed UTF-8 normalization form (NFD, as used by Mac OS X), that is, u and the umlaut accent as separate code points (base letter plus combining diaeresis) instead of one precomposed ü:

From: =?utf-8?Q?replaced_Kru=CC=88ger?= <krueger@some.domain>
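The difference between the two encodings can be made visible with a short Python sketch. It uses only the standard library (`email.header` and `unicodedata`); the variable names are illustrative and not part of the original report.

```python
import unicodedata
from email.header import decode_header

# The encoded word from the From: header above (display name only).
raw = "=?utf-8?Q?replaced_Kru=CC=88ger?="

# decode_header returns (data, charset) pairs; data is bytes for encoded words.
data, charset = decode_header(raw)[0]
name = data.decode(charset) if isinstance(data, bytes) else data

# The same name written with the precomposed U+00FC.
composed = "replaced Kr\u00fcger"

print([hex(ord(c)) for c in name[-5:]])                # u, U+0308, g, e, r
print(name == composed)                                # False: different code points
print(unicodedata.normalize("NFC", name) == composed)  # True: canonically equivalent
```

The two strings render identically but only compare equal after both are brought to the same normalization form.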

Searching within Roundcube webmail for "krüger" as sender misses these mails.

Roundcube sends (dovecot rawlog): A0003 UID THREAD REFS UTF-8 ALL HEADER FROM {7+}krüger
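The literal length `{7+}` in the rawlog line hints at which form the search key was sent in. Assuming the rawlog shows the literal exactly as sent, the octet counts can be checked with a small sketch:

```python
import unicodedata

# The search key in composed (NFC) and decomposed (NFD) form.
query_nfc = unicodedata.normalize("NFC", "kr\u00fcger")
query_nfd = unicodedata.normalize("NFD", query_nfc)

# An IMAP literal announces its length in octets, so count UTF-8 bytes.
nfc_octets = len(query_nfc.encode("utf-8"))
nfd_octets = len(query_nfd.encode("utf-8"))

print(nfc_octets, nfd_octets)  # 7 8
```

Seven octets matches the composed form, so the query uses the precomposed ü while the affected headers use u plus a combining diaeresis.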

Is this supposed to work? I haven't done any more debugging (other search variants) or read the RFCs. As a user I would expect Unicode equivalence rules to be applied (see http://en.wikipedia.org/wiki/Unicode_equivalence).

Regards, Lutz

Timo Sirainen

2 May 2013

6:53 p.m.

On 25.4.2013, at 16.39, Lutz Preßler <Lutz.Pressler@SerNet.DE> wrote:

> On a system with Dovecot 2.2 I have a mailbox containing multiple mails from a person called Krüger, but with the From: header encoded differently. Some are encoded in the decomposed UTF-8 normalization form (NFD, as used by Mac OS X), that is, u and the umlaut accent as separate code points (base letter plus combining diaeresis) instead of one precomposed ü:
>
> From: =?utf-8?Q?replaced_Kru=CC=88ger?= <krueger@some.domain>
>
> Searching within Roundcube webmail for "krüger" as sender misses these mails.
>
> Roundcube sends (dovecot rawlog): A0003 UID THREAD REFS UTF-8 ALL HEADER FROM {7+}krüger
>
> Is this supposed to work? I haven't done any more debugging (other search variants) or read the RFCs. As a user I would expect Unicode equivalence rules to be applied (see http://en.wikipedia.org/wiki/Unicode_equivalence).

IMAP requires using i;unicode-casemap by default, as specified by RFC 5051. Then again, other collations could be supported as well, and nothing really forbids the search from matching more flexibly. Anyway, that's what Dovecot currently implements, and I guess it doesn't do what you want it to do. But there is a partial solution for this:

http://dovecot.org/patches/2.1/icu-1.2.tar.gz

It probably does what you want, but it only works with fts-lucene.

Lutz Preßler

10 May 2013

3:21 p.m.

Hello Timo,

On Thu, 02 May 2013, Timo Sirainen wrote:

> IMAP requires using i;unicode-casemap by default, as specified by RFC 5051. Then again, other collations could be supported as well, and nothing really forbids the search from matching more flexibly. Anyway, that's what Dovecot currently implements, and I guess it doesn't do what you want it to do. But there is a partial solution for this:
>
> http://dovecot.org/patches/2.1/icu-1.2.tar.gz
>
> It probably does what you want, but it only works with fts-lucene.

I'm trying to test it with the 2.2.1 installation, but have a problem doing so: after seemingly smooth compilation and installation, I get

May 10 14:15:18 host dovecot: imap: Error: Module is for different ABI version 2.2.1 (we have 2.2.ABIv0(2.2.1)): /usr/lib/dovecot/modules/lib20_icu_plugin.so
May 10 14:15:18 host dovecot: imap: Fatal: Couldn't load required plugins

Any idea?

Greetings, Lutz

Florian Zeitz

10 May 2013

4:24 p.m.

On 02.05.2013 17:53, Timo Sirainen wrote:

> IMAP requires using i;unicode-casemap by default, as specified by RFC 5051. Then again, other collations could be supported as well, and nothing really forbids the search from matching more flexibly. Anyway, that's what Dovecot currently implements, and I guess it doesn't do what you want it to do. But there is a partial solution for this:
>
> http://dovecot.org/patches/2.1/icu-1.2.tar.gz
>
> It probably does what you want, but it only works with fts-lucene.

Could you elaborate a bit why you think i;unicode-casemap does not handle this case?

Is it only applied to the query, but not the header, or vice versa? It seems to me that Step 2 should map both inputs to LATIN CAPITAL LETTER U + COMBINING DIAERESIS.

Regards, Florian
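Florian's expectation can be illustrated with a rough Python approximation of the two per-character steps mentioned in the thread (titlecase mapping, then decomposition). This is a sketch, not Dovecot's code: `casemap_key` is a hypothetical helper, `str.title()` on a single character stands in for the simple titlecase mapping, and `NFKD` stands in for the full decomposition, which is close but not guaranteed identical to RFC 5051 in every corner case.

```python
import unicodedata

def casemap_key(text: str) -> bytes:
    """Approximate comparison key: titlecase each character, then decompose it."""
    out = []
    for ch in text:
        titled = ch.title()  # assumption: approximates the simple titlecase mapping
        out.append(unicodedata.normalize("NFKD", titled))  # full decomposition
    return "".join(out).encode("utf-8")

query = "kr\u00fcger"    # precomposed form, as sent by the webmail client
header = "Kru\u0308ger"  # decomposed form, as found in the From: header

print(casemap_key(query) == casemap_key(header))  # True
print(casemap_key(query).decode("utf-8").encode("unicode_escape"))
```

Both inputs end up as `KRU` followed by U+0308 and `GER`, so an octet comparison of the keys would match.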

Florian Zeitz

11 May 2013

6:13 p.m.

On 10.05.2013 15:24, Florian Zeitz wrote:

> Could you elaborate a bit why you think i;unicode-casemap does not handle this case?
>
> Is it only applied to the query, but not the header, or vice versa? It seems to me that Step 2 should map both inputs to LATIN CAPITAL LETTER U + COMBINING DIAERESIS.
>
> Regards, Florian

So... I had a look at this. Turns out that the current implementation of Unicode decomposition (Step 2(b) in i;unicode-casemap) in Dovecot is broken. It only handles decomposition properties that include a tag. I've attached a hg export that fixes this.
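The kind of bug described here can be reproduced in miniature with Python's view of the UnicodeData.txt decomposition field. The sketch below is an illustration only (Dovecot's implementation is in C and is not shown in this thread); `decompose` and its `tagged_only` flag are hypothetical names used to contrast the two behaviours.

```python
import unicodedata

def decompose(ch: str, tagged_only: bool = False) -> str:
    """Recursively expand the decomposition mapping of a single character.

    With tagged_only=True, mappings without a <tag> (canonical ones) are
    skipped, which mimics the behaviour described as broken.
    """
    entry = unicodedata.decomposition(ch)
    if not entry:
        return ch  # no decomposition mapping for this character
    parts = entry.split()
    if parts[0].startswith("<"):
        parts = parts[1:]  # drop the tag, e.g. <compat>
    elif tagged_only:
        return ch  # canonical mapping ignored
    return "".join(decompose(chr(int(p, 16)), tagged_only) for p in parts)

print(unicodedata.decomposition("\u00fc"))  # 0075 0308 (no tag)
print(unicodedata.decomposition("\ufb01"))  # <compat> 0066 0069
print(decompose("\u00fc", tagged_only=True) == "\u00fc")  # True: ü left composed
print(decompose("\u00fc") == "u\u0308")                   # True: ü decomposed
```

With only tagged mappings handled, a precomposed ü never becomes u plus combining diaeresis, so it cannot match a header stored in NFD.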

Timo Sirainen

15 May 2013

1:16 p.m.

On 11.5.2013, at 18.13, Florian Zeitz <florob@babelmonkeys.de> wrote:

> So... I had a look at this. Turns out that the current implementation of Unicode decomposition (Step 2(b) in i;unicode-casemap) in Dovecot is broken. It only handles decomposition properties that include a tag. I've attached a hg export that fixes this.

Thanks, added to v2.1 and v2.2 hg.

Lutz Preßler

21 May 2013

2:41 p.m.

On Wed, 15 May 2013, Timo Sirainen wrote:

> Thanks, added to v2.1 and v2.2 hg.

Thanks, but there still seems to be a problem left. Sender search yields all Krüger mails without fts_lucene. But with fts_lucene enabled, and files existing in lucene-indexes/, it does not. (If I delete the lucene-index files and search for the sender, the result is correct, but only until they are recreated.)

Lutz

Timo Sirainen

9 Jun 2013

3:14 a.m.

On 21.5.2013, at 14.41, Lutz Preßler <Lutz.Pressler@SerNet.DE> wrote:

> Thanks, but there still seems to be a problem left. Sender search yields all Krüger mails without fts_lucene. But with fts_lucene enabled, and files existing in lucene-indexes/, it does not. (If I delete the lucene-index files and search for the sender, the result is correct, but only until they are recreated.)

Fixed finally: http://hg.dovecot.org/dovecot-2.2/rev/7e54af474ea4

Add plugin { fts_lucene = normalize no_snowball } setting (NOTE: this change causes all the existing lucene indexes to be rebuilt).

This fts-lucene is getting rather annoying. I wonder if all of this is somehow magically solved in Solr.
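Why a normalization setting forces existing indexes to be rebuilt can be shown with a toy inverted index. This is only a model of the idea, not of how fts-lucene works internally; `normalize`, `build_index` and `search` are hypothetical helpers, and NFKC plus case folding is an assumed stand-in for whatever normalization the plugin applies.

```python
import unicodedata
from collections import defaultdict

def normalize(term: str) -> str:
    # Assumption: NFKC plus case folding as a stand-in normalizer.
    return unicodedata.normalize("NFKC", term).casefold()

def build_index(messages, normalizer=None):
    """Map each sender term to the set of message UIDs containing it."""
    index = defaultdict(set)
    for uid, sender in messages.items():
        for term in sender.split():
            key = normalizer(term) if normalizer else term.lower()
            index[key].add(uid)
    return index

def search(index, query, normalizer=None):
    key = normalizer(query) if normalizer else query.lower()
    return sorted(index.get(key, set()))

# UID 1 uses the precomposed ü, UID 2 uses u + combining diaeresis.
messages = {1: "Kr\u00fcger", 2: "Kru\u0308ger"}

old_index = build_index(messages)             # terms stored as found
new_index = build_index(messages, normalize)  # terms normalized at index time

print(search(old_index, "kr\u00fcger"))             # [1]
print(search(old_index, "kr\u00fcger", normalize))  # [1]
print(search(new_index, "kr\u00fcger", normalize))  # [1, 2]
```

Normalizing only the query is not enough: the terms already stored in the old index keep their original form, so the index has to be rebuilt with the same normalization applied at indexing time.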
