Re: charset-specific searches, and continuation lines

12 Sep 2014


Quoting Eric Abrahamsen <eric@ericabrahamsen.net>:
...
Michael M Slusarz <slusarz@curecanti.org> writes:
...
Quoting Eric Abrahamsen <eric@ericabrahamsen.net>:
...
While I've got you here, I hope you'll answer one more question: what's
the format for searching for multiple terms that contain non-ASCII
strings? Is it possible, in a single command, to search for both a UTF-8
encoded subject and a UTF-8 encoded body?
IMAP interaction would look like this:
C: . UID SEARCH CHARSET UTF-8 SUBJECT {4}
S: + OK
C: aéb BODY {4}
S: + OK
C: aéb
S: * SEARCH XXX
S: . OK
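
The {4} in the exchange above is the length of the literal in octets, not in characters: "aéb" is three characters but four bytes once encoded as UTF-8. A minimal Python sketch of that calculation follows. The helper name literal_prefix is made up for illustration and is not part of imaplib or any other IMAP library.

```python
def literal_prefix(text, charset="utf-8", non_sync=False):
    """Return (prefix, octets) for sending `text` as an IMAP literal.

    Hypothetical helper for illustration only. The count in braces is
    the number of octets after encoding, not the number of characters.
    """
    octets = text.encode(charset)
    # "{N}" is a synchronizing literal; "{N+}" is the LITERAL+ form.
    marker = "+" if non_sync else ""
    return "{%d%s}" % (len(octets), marker), octets


term = "a\u00e9b"  # "aéb", with é written as an escape to avoid encoding doubt
prefix, octets = literal_prefix(term)
print(len(term), prefix, octets)
```

Run as-is, this prints 3, then {4}, then the four encoded bytes, which is why the literal is announced as {4} even though the search term looks three characters long.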
Even better: if the server supports LITERAL+, you can send non-synchronizing
literals and don't have to wait for the server's continuation response,
which saves two round-trips to the server:
C: . UID SEARCH CHARSET UTF-8 SUBJECT {4+}
C: aéb BODY {4+}
C: aéb[CRLF]
S: * SEARCH XXX
S: . OK
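
Because nothing has to be read back between literals in the LITERAL+ case, the whole command can be assembled up front and written to the connection in one go. A rough Python sketch, assuming the server has advertised LITERAL+ in its capabilities; build_search is a hypothetical helper rather than an API of any existing library, and it does no validation of the search keys.

```python
def build_search(tag, criteria, charset="UTF-8"):
    """Assemble a UID SEARCH command using non-synchronizing literals.

    Hypothetical helper for illustration. `criteria` is a list of
    (search_key, value) pairs such as [("SUBJECT", "..."), ("BODY", "...")].
    Assumes the server supports LITERAL+.
    """
    out = ("%s UID SEARCH CHARSET %s" % (tag, charset)).encode("ascii")
    for key, value in criteria:
        data = value.encode(charset)
        # "{N+}" announces N octets; the CRLF ends the line and the
        # literal data follows immediately, with no server response.
        out += (" %s {%d+}\r\n" % (key, len(data))).encode("ascii") + data
    return out + b"\r\n"  # final CRLF terminates the command


command = build_search(".", [("SUBJECT", "a\u00e9b"), ("BODY", "a\u00e9b")])
print(command)
```

The resulting bytes correspond line for line to the three C: lines above. With plain synchronizing literals the same pieces would have to be sent one at a time, waiting for a continuation response after each {4}.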
michael
One other question:
I've set up full text search indexing via Lucene, and it works great.
But how is this index encoded? Specifically, if I use the above method
to search for non-ASCII strings, am I still benefiting from the speedups
of the search index?
I know that some people who are indexing non-ASCII, non-UTF-8 messages
are running them through some sort of decoder to force them into UTF-8,
so that Lucene can index them properly. Is this still necessary if I'm
using the method above?
I have no insight on Lucene internals.
michael