On Friday, Aug 29, 2003, at 19:10 Europe/Helsinki, Mark Anderson wrote:
Well, what it should do is in the RFC. How it is implemented is another matter; having text indexing is kind of a big implementation detail.
Yes, but the implementation can't really optimize it that much if it wants to stay RFC-compatible.
The only RFC-compatible index (you need _exact_ matching) would be what Cyrus squat does. And not many people seem to use it, so it seems like it's not worth the trouble either. I might try it some day though. One problem with the Cyrus code was that it didn't have incremental updates.
[aside: Understanding Cyrus squat also requires source examination, it seems :(.]
I think the comment at the beginning of squat.c (or something) was enough to understand it. It basically generates all possible combinations of letters up to 4 characters (IIRC) that are in the messages and stores a list of UIDs where they're found. It's quite a large file since each message has a lot of 4-letter combinations in it.
I'm not sure how much this would really help searching. Maybe enough to make it useful.
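To make the squat idea above concrete, here is a minimal sketch in Python. It is not the Cyrus code or its file format: the names `build_index` and `search`, the in-memory dict, and the fixed length of 4 (recalled above only as "IIRC") are all assumptions for illustration. The index only narrows the candidate set; each candidate is still checked with a plain substring test, which is what keeps the matching exact. Case folding is ignored here.

```python
from collections import defaultdict

NGRAM = 4  # assumed substring length; the mail above only recalls it as "4 (IIRC)"


def build_index(messages):
    """Map every NGRAM-character substring to the set of UIDs containing it."""
    index = defaultdict(set)
    for uid, text in messages.items():
        for i in range(len(text) - NGRAM + 1):
            index[text[i:i + NGRAM]].add(uid)
    return index


def search(index, messages, needle):
    """Return the sorted UIDs whose text contains needle as an exact substring."""
    candidates = set(messages)
    if len(needle) >= NGRAM:
        # Every NGRAM-sized piece of the needle must occur in a matching message.
        for i in range(len(needle) - NGRAM + 1):
            candidates &= index.get(needle[i:i + NGRAM], set())
    # Needles shorter than NGRAM cannot use the index, so all messages stay
    # candidates. Either way, verify each candidate to keep the match exact.
    return sorted(uid for uid in candidates if needle in messages[uid])


messages = {1: "hello world", 2: "help wanted", 3: "worldwide web"}
index = build_index(messages)
print(search(index, messages, "world"))
```

The size problem mentioned above shows up directly: a message of n characters contributes up to n - 3 index entries.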
it makes a big difference in user experience to get back search results in subsecond time, versus having to wait a minute or more, which is my experience with every IMAP server i've tried on my larger stores.
It'd be much nicer if I was able to create the text indexes with binary trees, so that you could only search from the beginning of a word, not from the middle of it. But IMAP's search command doesn't allow that kind of index.
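A rough sketch of why a word-start index falls short of IMAP's substring matching. A sorted list searched with `bisect` stands in for the binary tree here; `build_word_index` and `prefix_search` are made-up names, and the word splitting rule is an assumption.

```python
import bisect
import re


def build_word_index(messages):
    """Return a sorted list of (word, uid) pairs; a stand-in for a binary tree."""
    entries = set()
    for uid, text in messages.items():
        for word in re.findall(r"\w+", text.lower()):
            entries.add((word, uid))
    return sorted(entries)


def prefix_search(entries, prefix):
    """Return the sorted UIDs that contain a word starting with prefix."""
    prefix = prefix.lower()
    # (prefix,) sorts before every (word, uid) whose word starts with prefix.
    start = bisect.bisect_left(entries, (prefix,))
    uids = set()
    for word, uid in entries[start:]:
        if not word.startswith(prefix):
            break  # sorted order: no later word can match
        uids.add(uid)
    return sorted(uids)


messages = {1: "Hello world", 2: "worldwide web", 3: "sword fight"}
entries = build_word_index(messages)
print(prefix_search(entries, "wor"))
```

Message 3 contains "wor" inside "sword", so a substring search has to return it, but the word-start index never sees it.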
ah, i missed the loop in message_body_search_ctx(). it is not recursive, so it would deal only with one level of MIME nesting, but this would cover almost all MIME messages, for practical purposes.
Oh? I guess that's a bug then.
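For comparison, a recursive walk that reaches text parts at any nesting depth could look roughly like this. It uses Python's standard `email` package, not Dovecot's message parser; `body_contains` is a hypothetical helper, and the UTF-8 and Latin-1 fallbacks are assumptions.

```python
from email.message import Message


def body_contains(part: Message, needle: str) -> bool:
    """Recursively search every text leaf of a MIME tree for needle."""
    if part.is_multipart():
        # Descend into each child, however deeply the parts are nested.
        return any(body_contains(sub, needle) for sub in part.get_payload())
    if part.get_content_maintype() != "text":
        return False
    payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
    if payload is None:
        return False
    charset = part.get_content_charset() or "utf-8"  # assumed default
    try:
        text = payload.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset label: fall back to a lossless byte mapping.
        text = payload.decode("latin-1")
    return needle.lower() in text.lower()
```

A loop over only the top-level parts would miss text inside a multipart/alternative that is itself nested in a multipart/mixed.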
what sort of header field indexing does dovecot have, if any?
There's ENVELOPE caching which contains several headers. The CVS version has real header caching. It caches whatever headers you specifically ask for with FETCH or SEARCH commands.
glancing at index-mail.c and mail-index.c it appears that:
- there is no header field indexing, in the sense of hashing or database lookup.
Right. That wouldn't really be useful for IMAP's searching since you have to be able to match substrings as well. It doesn't help much to be able to quickly say which messages definitely match if you still have to check the others the slow way.
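A small illustration of that point, with an exact-value hash index over a single header. The names `build_exact_index` and `substring_search` and the sample subjects are invented for the example. The hash lookup finds the definite matches at once, but every other message still has to be scanned.

```python
from collections import defaultdict


def build_exact_index(subjects):
    """Hash index: lowercased header value -> set of UIDs with exactly that value."""
    index = defaultdict(set)
    for uid, subject in subjects.items():
        index[subject.lower()].add(uid)
    return index


def substring_search(index, subjects, needle):
    """Return (sorted matching UIDs, number of messages scanned the slow way)."""
    needle = needle.lower()
    fast_hits = set(index.get(needle, set()))  # definite matches from the hash
    hits = set(fast_hits)
    scanned = 0
    for uid, subject in subjects.items():
        if uid in fast_hits:
            continue
        scanned += 1  # the index says nothing about substring matches
        if needle in subject.lower():
            hits.add(uid)
    return sorted(hits), scanned


subjects = {1: "meeting", 2: "Re: meeting", 3: "lunch"}
index = build_exact_index(subjects)
print(substring_search(index, subjects, "meeting"))
```

Here the hash resolves message 1 directly, yet messages 2 and 3 are both scanned, and 2 turns out to match.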
- the persistent "index" file per folder is mmap'd. it has a record per message, which contains a linked list of all headers
Hmm. Not really a linked list. I'm not sure if you're talking about CVS or 0.99.10. I don't think .10 had linked lists at all. In CVS it's a linked list of "cache records" which may contain multiple cached fields including some headers.
- to perform header search, it iterates through the headers in the mmap'd records for the folder.
Yep.
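As a rough picture of that layout, not the actual on-disk format: one record per message, each pointing at a chain of cache records that may hold several cached fields. The class names `CacheRecord` and `MessageRecord` and their fields are invented for this sketch, and nothing here is mmap'd.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional, Tuple


@dataclass
class CacheRecord:
    fields: Dict[str, str]                # several cached fields per record
    next: Optional["CacheRecord"] = None  # link to the next cache record


@dataclass
class MessageRecord:
    uid: int
    cache: Optional[CacheRecord] = None   # head of the message's chain


def cached_fields(msg: MessageRecord) -> Iterator[Tuple[str, str]]:
    """Walk the linked list of cache records, yielding (name, value) pairs."""
    rec = msg.cache
    while rec is not None:
        yield from rec.fields.items()
        rec = rec.next


def header_search(messages: List[MessageRecord], name: str, needle: str) -> List[int]:
    """Linear scan: UIDs whose cached header `name` contains needle."""
    name, needle = name.lower(), needle.lower()
    return [
        m.uid
        for m in messages
        if any(k.lower() == name and needle in v.lower() for k, v in cached_fields(m))
    ]
```

The search cost grows with the number of messages in the folder, since every record chain is visited once per search.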