charset-specific searches, and continuation lines

Michael M Slusarz slusarz at curecanti.org
Fri Sep 12 07:24:55 UTC 2014


Quoting Eric Abrahamsen <eric at ericabrahamsen.net>:

> Michael M Slusarz <slusarz at curecanti.org> writes:
>
>> Quoting Eric Abrahamsen <eric at ericabrahamsen.net>:
>>
>>> While I've got you here, I hope you'll answer one more question: what's
>>> the format for searching multiple terms with non-ascii strings? Is it
>>> possible in one run to find a utf-8 encoded subject, and a utf-8 encoded
>>> body?
>>
>> IMAP interaction would look like this:
>>
>> C: . UID SEARCH CHARSET UTF-8 SUBJECT {4}
>> S: + OK
>> C: aéb BODY {4}
>> S: + OK
>> C: aéb
>> S: * SEARCH XXX
>> S: . OK
>>
>> Even better... if the server supports LITERAL+, you can send
>> non-synchronizing literals and don't have to wait for the server's
>> continuation response, which saves 2 round-trips:
>>
>> C: . UID SEARCH CHARSET UTF-8 SUBJECT {4+}
>> C: aéb BODY {4+}
>> C: aéb[CRLF]
>> S: * SEARCH XXX
>> S: . OK
>>
>> michael
>
> One other question:
>
> I've set up full text search indexing via Lucene, and it works great.
> But how is this index encoded? Specifically, if I use the above method
> to search for non-ascii strings, am I still benefiting from the speedups
> of the search index?
>
> I know that some people who are indexing non-ascii, non-UTF-8 messages
> are running them through some sort of decoder to force them into UTF-8,
> so that Lucene can index them properly. Is this still necessary if I'm
> using the method above?

I have no insight on Lucene internals.

michael
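
For anyone scripting this, here is a minimal Python sketch of how a client might assemble the command from the quoted exchange above. The function name build_search_segments and its parameters are hypothetical, not part of Dovecot or any library. The point it illustrates is that the number in braces is the byte length of the encoded string, not its character count, which is why "aéb" is announced as {4} in UTF-8.

```python
def build_search_segments(criteria, charset="UTF-8", tag=".", literal_plus=False):
    """Split a UID SEARCH command into the chunks a client sends.

    criteria: list of (search_key, text) pairs, e.g. [("SUBJECT", "a\u00e9b")].
    Every chunk except the last ends with a literal header such as {4}
    (or {4+} when literal_plus is set); the last chunk is the final
    literal's data followed by CRLF.
    """
    if not criteria:
        raise ValueError("at least one search criterion is required")
    marker = "+" if literal_plus else ""
    segments = []
    pending = f"{tag} UID SEARCH CHARSET {charset}".encode("ascii")
    for key, text in criteria:
        data = text.encode(charset)  # the literal size counts bytes, not characters
        header = f" {key} {{{len(data)}{marker}}}\r\n".encode("ascii")
        segments.append(pending + header)
        pending = data
    segments.append(pending + b"\r\n")
    return segments


# "a\u00e9b" is "aéb": three characters, four bytes in UTF-8.
for chunk in build_search_segments([("SUBJECT", "a\u00e9b"), ("BODY", "a\u00e9b")]):
    print(chunk)
```

Without LITERAL+, the client has to wait for the server's continuation response after each chunk except the last. With literal_plus=True the chunks can simply be concatenated and written in one go.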


