On Fri, 2003-08-29 at 08:24, Mark Anderson wrote:
(this is a resend, sorry if duplicate but my first post didn't get through).
Looks like it came just now :)
I was wondering what plans dovecot has for text and body search?
This is one of those issues that *no* imap implementation ever seems to document :(. Not cyrus, courier, bincimap, or dovecot....
Well, because it's already pretty strictly documented in the IMAP4 RFC.
I speed-read some of the sources and found src/lib-mail/message-body-search.c. After a quick scan, it seems:
- it uses no text index at all, but does a linear search over the folder.
The only RFC-compatible index (you need _exact_ matching) would be what Cyrus squat does. And not many people seem to use it, so it seems like it's not worth the trouble either. I might try it some day though. One problem with the Cyrus code was that it didn't have incremental updates.
- it searches only messages with content-type starting with "text/" or "message/"
For now, yes. Makes it much faster :)
- it does no special parsing of "text/html", so tags and attributes would match
I think doing that wouldn't be RFC-compatible..
- it appears to exclude any mime messages, since it skips "multipart/mixed" for example.
No, it shouldn't. It skips the multipart/mixed body itself (which doesn't really exist anyway), but not its children.
- it does deal with content encoding and charset.
But it doesn't do case-insensitive matching for non-ASCII characters. I'd need a utf8_toupper() function..
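
To make the behaviour discussed above concrete, here is a rough sketch in Python using only the standard library's email package. It is not Dovecot's code and makes no claim about how message-body-search.c is written; the function names are made up for illustration. It walks the MIME tree, skips container parts but still visits their children, looks only at text/* leaf parts, undoes the transfer encoding and charset, and then does a linear substring match. str.casefold() stands in for the missing utf8_toupper(); note that it folds more aggressively than a plain uppercase would (for example "ß" becomes "ss").

```python
from email import message_from_bytes
from email.message import Message


def searchable_text_parts(msg: Message):
    """Yield the decoded text of every text/* leaf part (hypothetical helper)."""
    for part in msg.walk():
        # Containers such as multipart/mixed or message/rfc822 have no body
        # of their own; walk() still descends into their children.
        if part.is_multipart():
            continue
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "us-ascii"
        try:
            yield payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to a byte-preserving decode.
            yield payload.decode("latin-1")


def body_contains(raw_message: bytes, needle: str) -> bool:
    """Linear, case-insensitive substring search over the text parts."""
    folded = needle.casefold()
    msg = message_from_bytes(raw_message)
    return any(folded in text.casefold() for text in searchable_text_parts(msg))
```

With this, a base64-encoded UTF-8 text part containing "Köln" would match a search for "KÖLN", while words that occur only inside an application/* attachment would not match.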
There is a lucene-based text indexing utility for IMAP stores here: http://www.tropo.com/techno/java/lucene/imap.html
I'm not really sure how that is supposed to be useful.. There are a few header fields which Dovecot also has in its indexes. Then there's this "contents" string which looks like it's all the text data in the message? There's not much point in copying the whole mailbox data to an index file.
But ideally there would be pluggable text indexing builtin....
Yes.
- could optionally support scanning inside file attachments and use
plugins to extract text out of them (word, excel, pdf, etc. etc.)
- use a trie index for fast text searching, like cyrus squat?
- Create our own extension: When searching with TEXT/BODY, return
the message text surrounding the keywords just like web search engines
do. For example: SEARCH X-PRINT-MATCHES TEXT "hello" -> * SEARCH 1 "He said:
Hello world!" 2 "Hello, I'm ...". This would be especially useful with
the above attachment scanning.
And a less strict search command extension would be useful as well..
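
For what the proposed extension might return, here is a small Python sketch. X-PRINT-MATCHES is only the idea from this thread, not an existing IMAP extension, and the function names, the `context` parameter and the dict-of-texts input are assumptions made for the example. It finds the first case-insensitive hit in each message, cuts out some surrounding text, and formats the result in the shape of the example response above.

```python
import re
from typing import Dict, Optional


def snippet(text: str, keyword: str, context: int = 20) -> Optional[str]:
    """Return the text around the first match of keyword, or None."""
    # re.search gives offsets into the original text, so slicing is safe
    # even when case folding would change string lengths.
    m = re.search(re.escape(keyword), text, re.IGNORECASE)
    if m is None:
        return None
    start = max(0, m.start() - context)
    end = min(len(text), m.end() + context)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix


def search_with_snippets(messages: Dict[int, str], keyword: str,
                         context: int = 20) -> str:
    """Build a hypothetical '* SEARCH <seq> "<snippet>" ...' response line."""
    parts = []
    for seq in sorted(messages):
        s = snippet(messages[seq], keyword, context)
        if s is None:
            continue
        # Collapse newlines and runs of spaces, then escape for a quoted string.
        s = " ".join(s.split())
        s = s.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{seq} "{s}"')
    return "* SEARCH " + " ".join(parts) if parts else "* SEARCH"
```

The `messages` mapping here would be fed with already-decoded text, which is where text extracted from attachments by plugins could be included as well.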