[Dovecot] better body and text search?
Timo Sirainen
tss at iki.fi
Fri Aug 29 11:33:47 EEST 2003
On Fri, 2003-08-29 at 08:24, Mark Anderson wrote:
> (this is a resend, sorry if duplicate but my first post didn't get through).
Looks like it came just now :)
> I was wondering what plans dovecot has for text and body search?
>
> This is one of those issues that *no* imap implementation ever
> seems to document :(.
> Not cyrus, courier, bincimap, or dovecot....
Well, because it's already pretty strictly specified in the IMAP4 RFC.
> i speed-read some of the sources and found src/lib-mail/message-body-search.c
> After a quick scan, it seems:
> - it uses no text index at all, but does a linear search over the folder.
The only RFC-compatible index (you need _exact_ matching) would be what
Cyrus squat does. Not many people seem to use it, so it doesn't seem
worth the trouble either. I might try it some day though. One
problem with the Cyrus code was that it didn't have incremental updates.
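
To illustrate the kind of index meant here, a rough Python sketch of a substring index that still gives exact matches and can be updated one message at a time. This is not Cyrus squat's actual format or code; the class name, the 3-character window and the in-memory dicts are assumptions made for the example, and matching is case-sensitive to keep it short.

```python
from collections import defaultdict


class TrigramIndex:
    """Toy substring index: maps every 3-character sequence to message UIDs."""

    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> set of UIDs containing it
        self.texts = {}                   # UID -> text, used to verify candidates

    def add(self, uid, text):
        # Incremental update: only the newly added message is scanned.
        self.texts[uid] = text
        for i in range(len(text) - 2):
            self.postings[text[i:i + 3]].add(uid)

    def search(self, key):
        if len(key) < 3:
            # Key is too short for the index, so fall back to a linear scan.
            candidates = set(self.texts)
        else:
            grams = [key[i:i + 3] for i in range(len(key) - 2)]
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        # The index only narrows the candidates; the final substring check
        # is what keeps the result exact.
        return sorted(uid for uid in candidates if key in self.texts[uid])
```

The index never decides a match by itself: it only cuts down the set of messages that need the exact comparison.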
> - it searches only messages with content-type starting with "text/" or "message/"
For now, yes. Makes it much faster :)
> - it does no special parsing of "text/html", so tags and attributes would match
I think doing that wouldn't be RFC-compatible..
> - it appears to exclude any mime messages, since it skips "multipart/mixed" for example.
No, it shouldn't. It skips the multipart/mixed body itself (which
doesn't really exist anyway), but not its children.
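
The behaviour described above (skip the container, search the text children) can be sketched with Python's standard email package. This is not the code in message-body-search.c; the function name body_matches is made up for the example. In Python's model a message/rfc822 part is itself a container, so walk() descends into it and its text parts are searched the same way.

```python
from email import message_from_bytes


def body_matches(raw_message: bytes, key: str) -> bool:
    """Return True if any text/* part of the message contains key."""
    msg = message_from_bytes(raw_message)
    for part in msg.walk():
        if part.is_multipart():
            # Containers have no searchable body of their own;
            # walk() still yields their children afterwards.
            continue
        if part.get_content_maintype() != "text":
            # Attachments such as application/* are skipped entirely.
            continue
        # decode=True undoes base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "us-ascii"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name: fall back to a byte-preserving decode.
            text = payload.decode("latin-1")
        if key.lower() in text.lower():
            return True
    return False
```

Note that a text/html part is searched as-is here, so tags and attributes match too.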
> - it does deal with content encoding and charset.
But it doesn't do case-insensitive matching for non-ASCII characters.
I'd need a utf8_toupper() function..
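
A small Python sketch of the difference: folding only a-z misses a non-ASCII match that Unicode-aware folding finds. The helper names are hypothetical, and str.casefold() is Python's own facility standing in for the missing utf8_toupper(), not anything Dovecot provides.

```python
def ascii_upper(s: str) -> str:
    # Upper-cases only a-z and leaves every other character untouched.
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


def contains_ascii_ci(text: str, key: str) -> bool:
    # Case-insensitive for ASCII letters only.
    return ascii_upper(key) in ascii_upper(text)


def contains_unicode_ci(text: str, key: str) -> bool:
    # Case-insensitive using full Unicode case folding.
    return key.casefold() in text.casefold()
```

With these, searching "Hyvää päivää" for "PÄIVÄÄ" fails with the ASCII-only version and succeeds with the Unicode one.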
> There is a lucene-based text indexing utility for IMAP stores here:
> http://www.tropo.com/techno/java/lucene/imap.html
I'm not really sure how that is supposed to be useful.. There are a few
header fields which Dovecot also has in its indexes. Then there's this
"contents" string which looks like it's all the text data in the
message? There's not much point in copying the whole mailbox data into
an index file.
> But ideally there would be pluggable text indexing builtin....
Yes.
- could optionally support scanning inside file attachments and use
plugins to extract text out of them (word, excel, pdf, etc. etc.)
- use a trie index for fast text searching, like cyrus squat?
- Create our own extension: When searching with TEXT/BODY, return
the message text surrounding the keywords just like web search engines
do. Like: SEARCH X-PRINT-MATCHES TEXT "hello" -> * SEARCH 1 "He said:
Hello world!" 2 "Hello, I'm ...". This would be especially useful with
the above attachment scanning.
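
A rough Python sketch of how a server might build the reply for such an extension. X-PRINT-MATCHES is only the idea floated above, not an existing IMAP extension; the function names, the context parameter and the dict of already-extracted message texts are assumptions for the example.

```python
def match_snippet(text, key, context=10):
    """Return the text around the first case-insensitive match, or None."""
    # Note: lower() can change string length for a few non-ASCII characters,
    # so this offset arithmetic is only safe for simple text.
    pos = text.lower().find(key.lower())
    if pos < 0:
        return None
    start = max(0, pos - context)
    end = min(len(text), pos + len(key) + context)
    return text[start:end]


def search_print_matches(messages, key, context=10):
    """Build a '* SEARCH' line with a quoted snippet after each sequence number.

    messages maps sequence number -> message text.
    """
    parts = ["* SEARCH"]
    for seq in sorted(messages):
        snippet = match_snippet(messages[seq], key, context)
        if snippet is None:
            continue
        # Collapse line breaks and escape characters that are special
        # inside an IMAP quoted string.
        snippet = " ".join(snippet.split())
        snippet = snippet.replace("\\", "\\\\").replace('"', '\\"')
        parts.append(f'{seq} "{snippet}"')
    return " ".join(parts)
```

The same function would work for attachment scanning as long as the extracted attachment text ends up in the message text passed in.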
And a less strict search command extension would be useful as well..