charset-specific searches, and continuation lines
Hi there,
I'm looking into improving IMAP search support for the Gnus Emacs mail client, and trying to add the ability to search non-ascii characters. So far as I know, I start this invocation with something like:
. UID SEARCH CHARSET UTF-8 TEXT {NNN}
Where NNN is the number of bytes in my search string. Dovecot then responds with:
+ OK
So... what do I do then? I don't actually know what the next statement is, to provide the actual search string itself. Googling has proved unhelpful, as most of the examples online don't actually show this "+ OK" response. Can someone just briefly outline what's meant to happen next? I've tried including the search string immediately after the byte-size, separated by various combinations of \n\r, but that always gives me a "Missing LF after literal size" error.
I'm using the Archlinux dovecot package, which reports version 2.2.13-1.
Thanks! Eric
Quoting Eric Abrahamsen <eric@ericabrahamsen.net>:
Your example, assuming your search text is "aéb":
. UID SEARCH CHARSET UTF-8 TEXT {4}
+ OK
aéb[CRLF]
* SEARCH XXX
. OK
Literal length is the number of octets in the string - not the number
of characters - so not sure if that was tripping you up.
michael
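The octet-versus-character point above can be checked with a short Python sketch. The helper name literal_marker is made up for illustration and is not part of any IMAP library; the search text is written with an explicit escape so that "é" is the precomposed U+00E9, which is an assumption about how the string was typed.

```python
def literal_marker(text: str, charset: str = "utf-8") -> bytes:
    """Return the IMAP literal size marker for text, e.g. b"{4}".

    Hypothetical helper: the size is the number of octets after
    encoding, not the number of characters.
    """
    payload = text.encode(charset)
    return b"{%d}" % len(payload)


search = "a\u00e9b"  # "aéb" with a precomposed é (assumption)

n_chars = len(search)                   # characters in the string
n_octets = len(search.encode("utf-8"))  # octets sent on the wire
marker = literal_marker(search)

print(n_chars, n_octets, marker)
```

The marker is what goes at the end of the command line; the encoded search string is only sent after the server's "+" continuation response.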
Hi Michael,
Well that's embarrassing, I could have sworn that was the first thing I tried. I knew about the octets, and had tried inputting a\303\251b as the search string, but was sure I'd also tried the plain old search string. Thanks!
While I've got you here, I hope you'll answer one more question: what's the format for searching multiple terms with non-ascii strings? Is it possible in one run to find a utf-8 encoded subject, and a utf-8 encoded body?
Thanks again, Eric
Quoting Eric Abrahamsen <eric@ericabrahamsen.net>:
IMAP interaction would look like this:
C: . UID SEARCH CHARSET UTF-8 SUBJECT {4}
S: + OK
C: aéb BODY {4}
S: + OK
C: aéb
S: * SEARCH XXX
S: . OK
Even better... if the server supports LITERAL+, you can use non-synchronizing literals ({4+} instead of {4}) and don't have to wait for the server's continuation response, which saves 2 round-trips here:
C: . UID SEARCH CHARSET UTF-8 SUBJECT {4+}
C: aéb BODY {4+}
C: aéb[CRLF]
S: * SEARCH XXX
S: . OK
michael
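For a client that has to generate these commands, the two exchanges above could be built roughly as follows. This is a sketch under stated assumptions, not how Gnus or any library does it: build_search_chunks is a hypothetical helper, it only handles search keys that take a single string argument, and it does no network I/O.

```python
from typing import List, Tuple


def build_search_chunks(
    tag: str,
    criteria: List[Tuple[str, str]],
    literal_plus: bool = False,
) -> List[bytes]:
    """Split a UID SEARCH command into chunks that each end in CRLF.

    Hypothetical helper. With synchronizing literals, send one chunk,
    wait for the server's "+" continuation, then send the next.
    With LITERAL+ the chunks can be written back to back.
    """
    plus = b"+" if literal_plus else b""
    chunks = []
    line = tag.encode("ascii") + b" UID SEARCH CHARSET UTF-8"
    for key, value in criteria:
        payload = value.encode("utf-8")
        # The literal size counts octets of the encoded value.
        line += b" " + key.encode("ascii") + b" {%d" % len(payload) + plus + b"}\r\n"
        chunks.append(line)
        # The literal data starts the next chunk.
        line = payload
    chunks.append(line + b"\r\n")
    return chunks


terms = [("SUBJECT", "a\u00e9b"), ("BODY", "a\u00e9b")]
sync_chunks = build_search_chunks(".", terms)
plus_chunks = build_search_chunks(".", terms, literal_plus=True)
```

In the synchronizing case the caller still has to read the "+" line between chunks; with LITERAL+ the joined chunks can go out in a single write, provided the server advertises that capability.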
Brilliant, thanks a lot! Not something I would have guessed on my own, and surprisingly hard to find online -- I'm learning to read the RFCs...
Thanks again, Eric
One other question:
I've set up full text search indexing via Lucene, and it works great. But how is this index encoded? Specifically, if I use the above method to search for non-ascii strings, am I still benefiting from the speedups of the search index?
I know that some people who are indexing non-ascii, non-UTF-8 messages are running them through some sort of decoder to force them into UTF-8, so that Lucene can index them properly. Is this still necessary if I'm using the method above?
Thanks! Eric
Quoting Eric Abrahamsen <eric@ericabrahamsen.net>:
I have no insight on Lucene internals.
michael
participants (2)
- Eric Abrahamsen
- Michael M Slusarz