[Dovecot] search and UTF-8 normalization forms (NFD)

Fri May 10 16:24:55 EEST 2013

On 02.05.2013 17:53, Timo Sirainen wrote:
> On 25.4.2013, at 16.39, Lutz Preßler <Lutz.Pressler at SerNet.DE> wrote:
> 
>> on a system with dovecot 2.2 I've got a mailbox containing multiple mails
>> from a person called Krüger, but From: header encoded differently.
>> Some are encoded in UTF-8 normalization form decomposed (NFD, as used by Mac OS X),
>> that is, u and the umlaut accent as separate combining code points
>> instead of one ü:
>>
>>  From: =?utf-8?Q?replaced_Kru=CC=88ger?= <krueger at some.domain>
>>
>> Searching within Roundcube webmail for "krüger" as sender
>> misses these mails.
>>
>> Roundcube sends (dovecot rawlog):
>> A0003 UID THREAD REFS UTF-8 ALL HEADER FROM {7+}krüger
>>
>> Is this supposed to work? I haven't done any more debugging
>> (other search variants) or read the RFCs. As a user I would expect
>> Unicode equivalence rules to be applied (see
>> http://en.wikipedia.org/wiki/Unicode_equivalence)
> 
> IMAP requires using i;unicode-casemap by default, as specified by RFC 5051. Then again, other comparators could be supported as well, and nothing really forbids the search from matching more flexibly. Anyway, i;unicode-casemap is what Dovecot currently has implemented, and I guess it doesn't do what you want it to do. But there is a partial solution for this:
> 
> http://dovecot.org/patches/2.1/icu-1.2.tar.gz
> 
> It probably does what you want, but it only works with fts-lucene.
> 
Could you elaborate a bit why you think i;unicode-casemap does not
handle this case?

Is it only applied to the query, but not the header, or vice versa?
It seems to me that Step 2 should map both inputs to LATIN CAPITAL
LETTER U + COMBINING DIAERESIS.
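
To make the question concrete, here is a rough Python sketch of how I read
that step. It is not Dovecot's code: casemap() and matches() are made-up
helper names, str.upper() stands in for the RFC's per-character titlecase
mapping, and NFKD stands in for the recursive decomposition (NFKD also
reorders combining marks, which RFC 5051 does not ask for). The {7+} literal
in the rawlog is 7 bytes long, so the query carries the precomposed ü.

```python
import unicodedata
from email.header import decode_header


def casemap(text: str) -> str:
    """Approximate i;unicode-casemap: case-map, then fully decompose.

    Assumption: upper() + NFKD is only a stand-in for the RFC 5051 steps.
    """
    return unicodedata.normalize("NFKD", text.upper())


def matches(query: str, header: str) -> bool:
    """Substring match after mapping both sides the same way (hypothetical helper)."""
    return casemap(query) in casemap(header)


# The encoded word from the From: header above (u followed by U+0308).
raw, charset = decode_header("=?utf-8?Q?replaced_Kru=CC=88ger?=")[0]
header = raw.decode(charset)

# The search key as sent by Roundcube: precomposed U+00FC, 7 bytes in UTF-8.
query = "kr\u00fcger"

# A plain case-insensitive comparison misses the decomposed spelling.
naive = query.lower() in header.lower()
print(naive)                   # False

# With both sides case-mapped and decomposed, the search key is found.
print(matches(query, header))  # True
```

If both the header and the search key went through such a mapping, the
decomposed and precomposed spellings would compare equal, which is why I
would have expected the search to find these mails.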

Regards,
Florian