[Dovecot] better body and text search?

Fri Aug 29 19:31:42 EEST 2003

On Friday, Aug 29, 2003, at 19:10 Europe/Helsinki, Mark Anderson wrote:

> Well, what it should do is in the RFC.
> How it is implemented is another matter; having text indexing is
> kind of a big implementation detail.

Yes, but an implementation can't really optimize it that much if it 
wants to stay RFC-compatible.

>> Only RFC-compatible index (you need _exact_ matching) would be what
>> Cyrus squat does. And not many people seem to use it, so it seems like
>> it's not worth the trouble either. I might try it some day though. One
>> problem with Cyrus code was that it didn't have incremental updates.
>
> [aside:
> Understanding cyrus squat also requires source examination, it seems :(.

I think the comment at the beginning of squat.c (or something) was 
enough to understand it. It basically takes every sequence of up to 4 
characters (IIRC) that occurs in the messages and stores a list of UIDs 
where each one is found. It's quite a large file, since each message 
contains a lot of 4-letter combinations.

I'm not sure how much this would really help searching. Maybe enough to 
make it useful.
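
To make the idea concrete, here is a minimal sketch of such an index. It is not the Cyrus squat code or its file format; the 4-character limit comes from the recollection above, and the names build_index and search are made up for illustration. Needles longer than 4 characters are handled by intersecting the UID lists of their 4-character pieces and then verifying the remaining candidates against the real text.

```python
from collections import defaultdict

N = 4  # assumed maximum n-gram length, per the "up to 4 characters (IIRC)" above


def build_index(messages):
    """Map every substring of length 1..N to the set of UIDs containing it."""
    index = defaultdict(set)
    for uid, text in messages.items():
        text = text.lower()  # IMAP SEARCH string matching is case-insensitive
        for size in range(1, N + 1):
            for i in range(len(text) - size + 1):
                index[text[i:i + size]].add(uid)
    return index


def search(index, messages, needle):
    """Return the sorted UIDs whose text contains needle as a substring."""
    needle = needle.lower()
    if not needle:
        return sorted(messages)
    if len(needle) <= N:
        # Short needles are answered directly from the index.
        return sorted(index.get(needle, set()))
    # Longer needles: intersect the UID sets of every N-gram in the needle,
    # then verify the survivors against the real text.
    candidates = None
    for i in range(len(needle) - N + 1):
        uids = index.get(needle[i:i + N], set())
        candidates = set(uids) if candidates is None else candidates & uids
        if not candidates:
            return []
    return sorted(u for u in candidates if needle in messages[u].lower())
```

The index only narrows the candidate set for long needles; the final substring check is still needed, because the pieces can all occur in a message without being adjacent.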

> it makes a big difference in user experience to get back search results
> in subsecond time, versus having to wait a minute or more, which
> is my experience with every IMAP server i've tried on my larger stores.

It'd be much nicer if I was able to create the text indexes with binary 
trees, so that you could search only from the beginning of a word, not 
from the middle of it. But IMAP's SEARCH command requires substring 
matching, so it doesn't allow that kind of index.
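
A small sketch of why a word-prefix index falls short of what SEARCH requires. A sorted list with binary search stands in for the binary tree here, and all function names are hypothetical. The prefix lookup is fast but misses matches that start in the middle of a word, which the substring scan finds.

```python
import bisect
import re


def build_word_index(messages):
    """Return a sorted list of (word, uid) pairs for every word in every message."""
    pairs = set()
    for uid, text in messages.items():
        for word in re.findall(r"\w+", text.lower()):
            pairs.add((word, uid))
    return sorted(pairs)


def prefix_search(word_index, prefix):
    """Return the sorted UIDs having some word that starts with prefix."""
    prefix = prefix.lower()
    # (prefix,) sorts before every (word, uid) pair whose word >= prefix.
    start = bisect.bisect_left(word_index, (prefix,))
    uids = set()
    for word, uid in word_index[start:]:
        if not word.startswith(prefix):
            break
        uids.add(uid)
    return sorted(uids)


def substring_search(messages, needle):
    """What IMAP SEARCH requires: a match anywhere, including mid-word."""
    needle = needle.lower()
    return sorted(uid for uid, text in messages.items() if needle in text.lower())
```

With the messages "searching the mailbox", "research notes" and "mail search", a prefix lookup for "search" finds the first and third, while the substring scan also finds "research".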

> ah, i missed the loop in message_body_search_ctx().
> it is not recursive, so it would deal only with one level of MIME
> nesting, but this would cover almost all MIME messages, for practical
> purposes.

Oh? I guess that's a bug then.
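
For comparison, a body search that handles arbitrary MIME nesting only needs to recurse into multipart parts. The sketch below uses Python's standard email package purely for illustration; it is not Dovecot's message_body_search_ctx() code, and body_contains is a made-up name.

```python
from email import message_from_string


def body_contains(part, needle):
    """Return True if any text part at any nesting depth contains needle."""
    if part.is_multipart():
        # Recurse into each child; children may themselves be multipart.
        return any(body_contains(child, needle) for child in part.get_payload())
    if part.get_content_maintype() != "text":
        return False
    payload = part.get_payload(decode=True) or b""
    charset = part.get_content_charset() or "ascii"
    try:
        text = payload.decode(charset, errors="replace")
    except LookupError:
        # Unknown charset name: fall back to ASCII with replacement characters.
        text = payload.decode("ascii", errors="replace")
    return needle.lower() in text.lower()
```

Calling body_contains(message_from_string(raw), "needle") on a multipart/mixed message that contains a multipart/alternative part checks the inner text parts as well as the top-level ones.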

> what sort of header field indexing does dovecot have, if any?

There's ENVELOPE caching, which contains several headers. The CVS 
version has real header caching. It caches whatever headers you 
specifically ask for with FETCH or SEARCH commands.

> glancing at index-mail.c and mail-index.c it appears that:
> - there is no header field indexing, in the sense of hashing or
>   database lookup.

Right. That wouldn't really be useful for IMAP's searching, since you 
have to be able to match substrings as well. It doesn't help much to be 
able to say quickly which messages definitely match if you still have 
to check the others the slow way.
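
A toy illustration of that point, assuming a hash index keyed on the full header value (all names here are hypothetical). The index identifies exact matches immediately, but every other message still has to be scanned for substring matches, so the slow path is barely shortened.

```python
from collections import defaultdict


def build_exact_index(headers):
    """Map each full, lowercased header value to the UIDs carrying it."""
    index = defaultdict(set)
    for uid, value in headers.items():
        index[value.lower()].add(uid)
    return index


def search_with_index(index, headers, needle):
    """Substring search helped by an exact-value index.

    Returns (matching UIDs, number of header values scanned the slow way).
    """
    needle = needle.lower()
    definite = set(index.get(needle, set()))  # fast: value equals the needle
    matches = set(definite)
    scanned = 0
    for uid, value in headers.items():
        if uid in definite:
            continue
        scanned += 1  # everything else still needs a substring check
        if needle in value.lower():
            matches.add(uid)
    return sorted(matches), scanned
```

For three subjects "lunch", "Re: lunch" and "status report", searching for "lunch" gets one hit from the index and still scans the other two values.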

> - the persistent "index" file per folder is mmap'd.
>   it has a record per message, which contains a linked list of all
>   headers

Hmm. Not really a linked list. I'm not sure if you're talking about CVS 
or 0.99.10. I don't think .10 had linked lists at all. In CVS it's a 
linked list of "cache records" which may contain multiple cached fields 
including some headers.

> - to perform header search, it iterates through the headers in the
>   mmap'd records for the folder.

Yep.