Quoting Eric Abrahamsen <eric@ericabrahamsen.net>:
Michael M Slusarz <slusarz@curecanti.org> writes:
Quoting Eric Abrahamsen <eric@ericabrahamsen.net>:
While I've got you here, I hope you'll answer one more question: what's the format for searching multiple terms with non-ASCII strings? Is it possible, in one run, to find a UTF-8 encoded subject and a UTF-8 encoded body?
IMAP interaction would look like this:
C: . UID SEARCH CHARSET UTF-8 SUBJECT {4}
S: + OK
C: aéb BODY {4}
S: + OK
C: aéb
S: * SEARCH XXX
S: . OK
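A rough sketch, in Python, of how a client might cut that command into the chunks it sends between continuation responses. The helper name search_segments is made up for illustration and is not part of imaplib or any other library. The main point is that the number in braces is the UTF-8 byte length of the string, not its character count (aéb is 3 characters but 4 bytes).

```python
def search_segments(tag, criteria):
    """Split a UID SEARCH into the chunks sent between '+' continuations.

    criteria is a list of (search_key, text) pairs, e.g. ("SUBJECT", "aéb").
    Hypothetical helper for illustration only.
    """
    segments = []
    current = f"{tag} UID SEARCH CHARSET UTF-8".encode("ascii")
    for key, text in criteria:
        data = text.encode("utf-8")
        # Announce the literal by its size in bytes, then stop and wait
        # for the server's continuation before sending the data itself.
        current += f" {key} {{{len(data)}}}\r\n".encode("ascii")
        segments.append(current)
        current = data
    # The last literal is followed by the CRLF that ends the command.
    segments.append(current + b"\r\n")
    return segments


if __name__ == "__main__":
    for chunk in search_segments(".", [("SUBJECT", "aéb"), ("BODY", "aéb")]):
        print(chunk)
```

With two search terms this yields three chunks, so the client waits for the server twice before the search actually runs.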
Even better... if the server supports LITERAL+, you can send non-synchronizing literals ({4+} instead of {4}) and don't have to wait for the server's continuation response after each one, which saves 2 round-trips to the server:
C: . UID SEARCH CHARSET UTF-8 SUBJECT {4+}
C: aéb BODY {4+}
C: aéb[CRLF]
S: * SEARCH XXX
S: . OK
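A rough sketch, in Python, of building the LITERAL+ form as a single buffer. The helper name build_literal_plus_search is made up for illustration and is not a library function, and the sketch assumes the client has already seen LITERAL+ in the server's CAPABILITY response. Because nothing waits on the server, the whole command can go out in one write.

```python
def build_literal_plus_search(tag, criteria):
    """Build a UID SEARCH that uses non-synchronizing ({N+}) literals.

    criteria is a list of (search_key, text) pairs, e.g. ("SUBJECT", "aéb").
    Hypothetical helper; only valid if the server advertises LITERAL+.
    """
    out = f"{tag} UID SEARCH CHARSET UTF-8".encode("ascii")
    for key, text in criteria:
        data = text.encode("utf-8")
        # {N+} announces N bytes of literal data; the data follows
        # immediately after the CRLF, with no continuation wait.
        out += f" {key} {{{len(data)}+}}\r\n".encode("ascii") + data
    # CRLF terminates the command after the last literal.
    return out + b"\r\n"


if __name__ == "__main__":
    print(build_literal_plus_search(".", [("SUBJECT", "aéb"), ("BODY", "aéb")]))
```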
michael
One other question:
I've set up full-text search indexing via Lucene, and it works great. But how is this index encoded? Specifically, if I use the above method to search for non-ASCII strings, am I still benefiting from the speedups of the search index?
I know that some people who are indexing non-ASCII, non-UTF-8 messages are running them through some sort of decoder to force them into UTF-8, so that Lucene can index them properly. Is this still necessary if I'm using the method above?
I have no insight into Lucene internals.
michael