charset-specific searches, and continuation lines

Thu Sep 11 08:37:12 UTC 2014

Michael M Slusarz <slusarz at curecanti.org> writes:

> Quoting Eric Abrahamsen <eric at ericabrahamsen.net>:
>
>> While I've got you here, I hope you'll answer one more question: what's
>> the format for searching multiple terms with non-ASCII strings? Is it
>> possible in one run to find a utf-8 encoded subject, and a utf-8 encoded
>> body?
>
> IMAP interaction would look like this:
>
> C: . UID SEARCH CHARSET UTF-8 SUBJECT {4}
> S: + OK
> C: aéb BODY {4}
> S: + OK
> C: aéb
> S: * SEARCH XXX
> S: . OK
>
> Even better... if the server supports LITERAL+, you can send
> non-synchronizing literals and don't have to wait for the server's
> continuation response, which saves 2 round-trips to the server:
>
> C: . UID SEARCH CHARSET UTF-8 SUBJECT {4+}
> C: aéb BODY {4+}
> C: aéb[CRLF]
> S: * SEARCH XXX
> S: . OK
>
> michael
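
The {4} in the exchange above is the number of octets in the UTF-8
encoding of "aéb" (é takes two bytes), not the number of characters.
As a rough illustration only, here is a minimal Python sketch of how a
client might assemble those lines. The helper name build_search and its
parameters are hypothetical and not part of any IMAP library; it only
builds the bytes and does not talk to a server.

```python
def build_search(tag, criteria, literal_plus=False):
    """Return the byte chunks a client would send for a UID SEARCH.

    criteria is a list of (key, text) pairs, e.g. [("SUBJECT", "aéb")].
    With synchronizing literals ({N}) the client must wait for the
    server's "+" continuation after every chunk except the last.
    With LITERAL+ ({N+}) all chunks can be sent back to back.
    """
    marker = "+" if literal_plus else ""
    chunks = []
    current = f"{tag} UID SEARCH CHARSET UTF-8".encode("ascii")
    for key, text in criteria:
        data = text.encode("utf-8")
        # The literal count is the octet length of the encoded text.
        current += f" {key} {{{len(data)}{marker}}}\r\n".encode("ascii")
        chunks.append(current)
        # The literal data starts the next chunk.
        current = data
    # The command ends with CRLF after the final literal.
    chunks.append(current + b"\r\n")
    return chunks
```

Calling build_search(".", [("SUBJECT", "aéb"), ("BODY", "aéb")]) yields
three chunks matching the three C: lines of the first exchange; passing
literal_plus=True produces the {4+} form of the second one.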

One other question:

I've set up full text search indexing via Lucene, and it works great.
But how is this index encoded? Specifically, if I use the above method
to search for non-ASCII strings, do I still benefit from the speedups
of the search index?

I know that some people who are indexing non-ASCII, non-UTF-8 messages
are running them through some sort of decoder to force them into UTF-8,
so that Lucene can index them properly. Is this still necessary if I'm
using the method above?

Thanks!
Eric