[Dovecot] better body and text search?
(this is a resend, sorry if duplicate but my first post didn't get through).
I was wondering what plans dovecot has for text and body search?
This is one of those issues that *no* imap implementation ever seems to document :(. Not cyrus, courier, bincimap, or dovecot....
i speed-read some of the sources and found src/lib-mail/message-body-search.c. After a quick scan, it seems:
- it uses no text index at all, but does a linear search over the folder.
- it searches only messages with content-type starting with "text/" or "message/"
- it does no special parsing of "text/html", so tags and attributes would match
- it appears to exclude any mime messages, since it skips "multipart/mixed" for example.
- it does deal with content encoding and charset.
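To make the behaviour listed above concrete, here is a minimal sketch of a linear body search written with Python's standard email package. It is not Dovecot's code: the names body_matches and search_folder are made up for illustration, a folder is assumed to be a plain dict of UID to raw message bytes, and only text/ parts are searched (message/ parts are left out for brevity).

```python
from email import message_from_bytes


def body_matches(raw: bytes, needle: str) -> bool:
    """Return True if any text/* part of the message contains needle."""
    msg = message_from_bytes(raw)
    wanted = needle.lower()
    for part in msg.walk():
        # multipart/* containers have no body of their own; walk() still
        # visits their children, so nothing is lost by skipping them here.
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "us-ascii"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:  # unknown charset name
            text = payload.decode("latin-1")
        # No HTML parsing: tags and attribute values are matched like any text.
        if wanted in text.lower():
            return True
    return False


def search_folder(folder: dict[int, bytes], needle: str) -> list[int]:
    """Linear scan: every message is parsed and searched on every query."""
    return [uid for uid, raw in sorted(folder.items()) if body_matches(raw, needle)]
```

The cost grows with the total size of the folder on every query, which is the point being raised here.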
There is a lucene-based text indexing utility for IMAP stores here: http://www.tropo.com/techno/java/lucene/imap.html
But ideally there would be pluggable text indexing builtin....
-mda
On Fri, 2003-08-29 at 08:24, Mark Anderson wrote:
(this is a resend, sorry if duplicate but my first post didn't get through).
Looks like it came just now :)
I was wondering what plans dovecot has for text and body search?
This is one of those issues that *no* imap implementation ever seems to document :(. Not cyrus, courier, bincimap, or dovecot....
Well, because it's already pretty strictly documented in IMAP4 RFC.
i speed-read some of the sources and found src/lib-mail/message-body-search.c. After a quick scan, it seems:
- it uses no text index at all, but does a linear search over the folder.
The only RFC-compatible index (you need _exact_ matching) would be what Cyrus squat does. And not many people seem to use it, so it seems like it's not worth the trouble either. I might try it some day though. One problem with the Cyrus code was that it didn't have incremental updates.
- it searches only messages with content-type starting with "text/" or "message/"
For now, yes. Makes it much faster :)
- it does no special parsing of "text/html", so tags and attributes would match
I think doing that wouldn't be RFC-compatible..
- it appears to exclude any mime messages, since it skips "multipart/mixed" for example.
No, it shouldn't. It skips the multipart/mixed body itself (which doesn't really exist anyway), but not its children.
- it does deal with content encoding and charset.
But it doesn't do case-insensitive matching for non-ASCII characters. I'd need a utf8_toupper() function..
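A small illustration of the gap described here, as a sketch only. The helper names below are hypothetical, and Python's str.casefold() stands in for the kind of Unicode-aware case mapping that a utf8_toupper() would have to provide.

```python
def ascii_upper(s: str) -> str:
    """Uppercase only a-z, leaving every other character untouched."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


def contains_ascii_ci(haystack: str, needle: str) -> bool:
    """Case-insensitive substring match that only understands ASCII letters."""
    return ascii_upper(needle) in ascii_upper(haystack)


def contains_unicode_ci(haystack: str, needle: str) -> bool:
    """Case-insensitive substring match using Unicode case folding."""
    return needle.casefold() in haystack.casefold()
```

With the ASCII-only version, searching for "päivää" does not find "PÄIVÄÄ", because "ä" and "Ä" are never mapped to each other. The case-folding version finds it.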
There is a lucene-based text indexing utility for IMAP stores here: http://www.tropo.com/techno/java/lucene/imap.html
I'm not really sure how that is supposed to be useful.. There are a few header fields which Dovecot also has in its indexes. Then there's this "contents" string which looks like it's all the text data in the message? There's not much point in copying the whole mailbox data into an index file.
But ideally there would be pluggable text indexing builtin....
Yes.
- could optionally support scanning inside file attachments and use
plugins to extract text out of them (word, excel, pdf, etc. etc.)
- use a trie index for fast text searching, like cyrus squat?
- Create our own extension: When searching with TEXT/BODY, return
the message text surrounding the keywords just like web search engines
do. like: SEARCH X-PRINT-MATCHES TEXT "hello" -> * SEARCH 1 "He said:
Hello world!" 2 "Hello, I'm ...". This would be especially useful with
the above attachment scanning.
And a less strict search command extension would be useful as well..
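The X-PRINT-MATCHES idea above could look roughly like this on the server side. This is a sketch only: X-PRINT-MATCHES is a proposed extension rather than an existing one, the function names are made up, and messages are assumed to be already decoded into a dict of sequence number to text. Line breaks inside snippets, which IMAP quoted strings cannot carry, are ignored here.

```python
import re


def match_snippets(messages: dict[int, str], keyword: str,
                   context: int = 20) -> list[tuple[int, str]]:
    """Return (sequence number, surrounding text) for each matching message."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    results = []
    for seq in sorted(messages):
        text = messages[seq]
        m = pattern.search(text)
        if m is None:
            continue
        # Keep up to `context` characters on either side of the first match.
        start = max(0, m.start() - context)
        end = min(len(text), m.end() + context)
        results.append((seq, text[start:end]))
    return results


def format_response(results: list[tuple[int, str]]) -> str:
    """Render the results as an untagged SEARCH line with quoted snippets."""
    def quote(s: str) -> str:
        return '"' + s.replace("\\", "\\\\").replace('"', '\\"') + '"'

    parts = " ".join(f"{seq} {quote(snippet)}" for seq, snippet in results)
    return f"* SEARCH {parts}" if parts else "* SEARCH"
```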
Timo Sirainen wrote:
On Fri, 2003-08-29 at 08:24, Mark Anderson wrote:
(this is a resend, sorry if duplicate but my first post didn't get through).
Looks like it came just now :)
I think there might be a race condition regarding the mailman confirmation. I sent my post the instant that i got the confirmation message. Anyhow, not too important.
This is one of those issues that *no* imap implementation ever seems to document :(. Not cyrus, courier, bincimap, or dovecot....
Well, because it's already pretty strictly documented in IMAP4 RFC.
Well, what it should do is in the RFC. How it is implemented is another matter; having text indexing is kind of a big implementation detail.
i speed-read some of the sources and found src/lib-mail/message-body-search.c. After a quick scan, it seems:
- it uses no text index at all, but does a linear search over the folder.
The only RFC-compatible index (you need _exact_ matching) would be what Cyrus squat does. And not many people seem to use it, so it seems like it's not worth the trouble either. I might try it some day though. One problem with the Cyrus code was that it didn't have incremental updates.
[aside: Understanding cyrus squat also requires source examination, it seems :(. i must say i find your code a lot more readable than cyrus. And of course uw-imap is impossible.]
exact matching is fine for me.
this feature is useful for using IMAP in front of large mailing list archives, for example.
or those of us with INBOX folders containing every message we've ever received in the past 10 years :).
quite a few email clients with local stores create text indexes (local storage either because on unix they have direct access to mailboxes, or local storage made from keeping copies of what is read). but those clients lose that feature when connecting to IMAP when configured to not make local copies.
it makes a big difference in user experience to get back search results in subsecond time, versus having to wait a minute or more, which is my experience with every IMAP server i've tried on my larger stores.
- it appears to exclude any mime messages, since it skips "multipart/mixed" for example.
No, it shouldn't. It skips the multipart/mixed body itself (which doesn't really exist anyway), but not its children.
ah, i missed the loop in message_body_search_ctx(). it is not recursive, so it would deal only with one level of MIME nesting, but this would cover almost all MIME messages, for practical purposes.
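To show what is at stake with nesting, here is a sketch that compares a one-level loop against a recursive walk, using Python's email package. It says nothing about what message_body_search_ctx() really does. The function names and the sample message are invented for the comparison.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def text_parts_one_level(msg):
    """Look only at the top-level body or its direct children."""
    parts = msg.get_payload() if msg.is_multipart() else [msg]
    return [p for p in parts if p.get_content_maintype() == "text"]


def text_parts_recursive(msg):
    """Descend through every multipart container to reach all text parts."""
    if msg.is_multipart():
        found = []
        for child in msg.get_payload():
            found.extend(text_parts_recursive(child))
        return found
    return [msg] if msg.get_content_maintype() == "text" else []


def build_nested():
    """multipart/mixed holding a text part and a nested multipart/alternative."""
    inner = MIMEMultipart("alternative")
    inner.attach(MIMEText("deeply nested plain text", "plain"))
    inner.attach(MIMEText("<p>deeply nested html</p>", "html"))
    outer = MIMEMultipart("mixed")
    outer.attach(MIMEText("top-level note", "plain"))
    outer.attach(inner)
    return outer
```

On the sample message the one-level version sees only the top-level note, while the recursive version also reaches the two parts inside multipart/alternative.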
- it does deal with content encoding and charset.
But it doesn't do case-insensitive matching for non-ASCII characters. I'd need a utf8_toupper() function..
ah. probably more an issue for you guys on the other side of the pond :).
There is a lucene-based text indexing utility for IMAP stores here: http://www.tropo.com/techno/java/lucene/imap.html
I'm not really sure how that is supposed to be useful.. There are a few header fields which Dovecot also has in its indexes. Then there's this "contents" string which looks like it's all the text data in the message? There's not much point in copying the whole mailbox data into an index file.
well, for example a web-based email application might complement the IMAP server with that external text index, for search operations.
what sort of header field indexing does dovecot have, if any? glancing at index-mail.c and mail-index.c it appears that:
- there is no header field indexing, in the sense of hashing or database lookup.
- the persistent "index" file per folder is mmap'd. it has a record per message, which contains a linked list of all headers in the message, and some summary information about the message.
- to perform header search, it iterates through the headers in the mmap'd records for the folder.
-mda
On Friday, Aug 29, 2003, at 19:10 Europe/Helsinki, Mark Anderson wrote:
Well, what it should do is in the RFC. How it is implemented is another matter; having text indexing is kind of a big implementation detail.
Yes, but the implementation can't really optimize it that much if it wants to stay RFC-compatible..
The only RFC-compatible index (you need _exact_ matching) would be what Cyrus squat does. And not many people seem to use it, so it seems like it's not worth the trouble either. I might try it some day though. One problem with the Cyrus code was that it didn't have incremental updates.
[aside: Understanding cyrus squat also requires source examination, it seems :(.
I think the comment at the beginning of squat.c (or something) was enough to understand it. It basically generates all possible combinations of letters up to 4 characters (IIRC) that are in the messages and stores a list of UIDs where they're found. It's quite a large file since each message has a lot of 4-letter combinations in it..
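The idea described above can be sketched as a small n-gram index. This is not the Cyrus squat format or code, only an illustration of the principle: it indexes exactly 4-character sequences (the description above says "up to 4"), keeps everything in memory, and uses made-up function names.

```python
from collections import defaultdict

N = 4  # length of the indexed character sequences (assumed)


def ngrams(text: str, n: int = N) -> set[str]:
    """All distinct n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def build_index(messages: dict[int, str]) -> dict[str, set[int]]:
    """Map every 4-gram to the set of UIDs whose text contains it."""
    index = defaultdict(set)
    for uid, text in messages.items():
        for gram in ngrams(text.lower()):
            index[gram].add(uid)
    return index


def search(index: dict[str, set[int]], messages: dict[int, str],
           query: str) -> list[int]:
    """Exact substring search: narrow down with the index, then verify."""
    q = query.lower()
    candidates = set(messages)
    # Queries shorter than N have no 4-grams, so every message stays a candidate.
    for gram in ngrams(q):
        candidates &= index.get(gram, set())
    # A message can contain every 4-gram of the query without containing the
    # query itself, so each candidate is still checked against the real text.
    return sorted(uid for uid in candidates if q in messages[uid].lower())
```

The index only rules messages out. For example, "shell yellow" contains both "hell" and "ello" but not "hello", so the final verification step is what keeps the result exact.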
I'm not sure how much this would really help searching. Maybe enough to make it useful..
it makes a big difference in user experience to get back search results in subsecond time, versus having to wait a minute or more, which is my experience with every IMAP server i've tried on my larger stores.
It'd be much nicer if I was able to create the text indexes with binary trees, so that you could only search from the beginning of the word, not from the middle of it. But IMAP's search command doesn't allow that kind of index.
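A rough sketch of such a word-prefix index follows. It uses a sorted list with binary search as a stand-in for a binary tree, and the function names and the tokenization on \w+ are assumptions made for the example.

```python
import bisect
import re


def build_word_index(messages: dict[int, str]) -> list[tuple[str, int]]:
    """Sorted list of (word, uid) pairs, one per distinct word per message."""
    entries = set()
    for uid, text in messages.items():
        for word in re.findall(r"\w+", text.lower()):
            entries.add((word, uid))
    return sorted(entries)


def prefix_search(index: list[tuple[str, int]], prefix: str) -> list[int]:
    """UIDs of messages that contain a word starting with prefix."""
    p = prefix.lower()
    # (p,) sorts before every (word, uid) whose word starts with p.
    start = bisect.bisect_left(index, (p,))
    uids = set()
    for word, uid in index[start:]:
        if not word.startswith(p):
            break
        uids.add(uid)
    return sorted(uids)
```

A lookup for "hel" finds "hello" and "help" quickly, but a lookup for "ello" finds nothing even though "Hello" and "Othello" contain it. IMAP's SEARCH requires that substring match, which is why this kind of index is not enough on its own.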
ah, i missed the loop in message_body_search_ctx(). it is not recursive, so it would deal only with one level of MIME nesting, but this would cover almost all MIME messages, for practical purposes.
Oh? I guess that's a bug then.
what sort of header field indexing does dovecot have, if any?
There's ENVELOPE caching which contains several headers. The CVS version has real header caching. It caches whatever headers you specifically ask for with FETCH or SEARCH commands.
glancing at index-mail.c and mail-index.c it appears that:
- there is no header field indexing, in the sense of hashing or database lookup.
Right. That wouldn't really be useful for IMAP's searching since you have to be able to match substrings as well. It doesn't help much to be able to say quickly which messages definitely match if you still have to check the others the slow way.
- the persistent "index" file per folder is mmap'd. it has a record per message, which contains a linked list of all headers
Hmm. Not really a linked list. I'm not sure if you're talking about CVS or 0.99.10. I don't think .10 had linked lists at all. In CVS it's a linked list of "cache records" which may contain multiple cached fields including some headers.
- to perform header search, it iterates through the headers in the mmap'd records for the folder.
Yep.
Participants (2): Mark Anderson, Timo Sirainen