Re: [Dovecot] better body and text search?

29 Aug 2003

On Friday, Aug 29, 2003, at 19:10 Europe/Helsinki, Mark Anderson wrote:
...
Well, what it should do is in the RFC.
How it is implemented is another matter; having text indexing is
kind of a big implementation detail.
Yes, but the implementation can't really optimize it that much if it
wants to stay RFC-compatible..
...
...
The only RFC-compatible index (you need _exact_ matching) would be what
Cyrus squat does. Not many people seem to use it, so it seems like
it's not worth the trouble either. I might try it some day though. One
problem with the Cyrus code was that it didn't have incremental updates.
[aside:
Understanding Cyrus squat also requires source examination, it seems :(.
I think the comment at the beginning of squat.c (or something) was
enough to understand it. It basically generates all possible
combinations of letters up to 4 characters (IIRC) that are in the
messages and stores a list of UIDs where they're found. It's quite a
large file since each message has a lot of 4-letter combinations in it..
I'm not sure how much this would really help searching. Maybe enough to
make it useful..
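
To make the idea concrete, here is a minimal sketch of that kind of index in Python. It is not the Cyrus squat code or its file format: the fixed length N = 4 follows the "IIRC" above, and the names build_index and search and the sample messages are made up for illustration. The index maps each 4-character substring to the set of UIDs containing it, and a lookup intersects those sets and then re-checks each candidate so the result is still an exact substring match.

```python
from collections import defaultdict

N = 4  # n-gram length; assumption based on the "up to 4 characters (IIRC)" remark


def build_index(messages):
    """Map every N-character substring to the set of UIDs containing it."""
    index = defaultdict(set)
    for uid, text in messages.items():
        lowered = text.lower()
        for i in range(len(lowered) - N + 1):
            index[lowered[i:i + N]].add(uid)
    return index


def search(index, messages, needle):
    """Return the sorted UIDs whose text contains needle as a substring."""
    needle = needle.lower()
    candidates = set(messages)
    if len(needle) >= N:
        # Every N-gram of the needle must occur in a matching message.
        for i in range(len(needle) - N + 1):
            candidates &= index.get(needle[i:i + N], set())
    # Needles shorter than N cannot use the index, so all messages stay
    # candidates. Either way, confirm each candidate with a real match.
    return sorted(uid for uid in candidates if needle in messages[uid].lower())


# Hypothetical sample data, keyed by UID.
messages = {
    1: "Text indexing is a big implementation detail",
    2: "Cyrus squat stores lists of UIDs",
    3: "Incremental updates would be nice",
}
index = build_index(messages)
matches = search(index, messages, "ment")
```

The index only shrinks the set of messages that need the slow check, and it grows with the number of distinct 4-character substrings per message, which is presumably why the real file gets large.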
...
it makes a big difference in user experience to get back search results
in subsecond time, versus having to wait a minute or more, which
is my experience with every IMAP server i've tried on my larger stores.
It'd be much nicer if I could build the text indexes with binary
trees, so that you could only search from the beginning of a word,
not from the middle of it. But IMAP's SEARCH command doesn't allow that
kind of index.
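
A small sketch of why a word-prefix index falls short of substring semantics. It uses a sorted list with binary search (Python's bisect) as a stand-in for the binary tree mentioned above; build_word_index, prefix_search and substring_scan are hypothetical names, not anything from Dovecot.

```python
import bisect
import re


def build_word_index(messages):
    """Return a sorted list of (word, uid) pairs for every word in every message."""
    pairs = set()
    for uid, text in messages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            pairs.add((word, uid))
    return sorted(pairs)


def prefix_search(pairs, prefix):
    """Return the sorted UIDs that have at least one word starting with prefix."""
    prefix = prefix.lower()
    # (prefix,) sorts before every (word, uid) whose word starts with prefix.
    start = bisect.bisect_left(pairs, (prefix,))
    uids = set()
    for word, uid in pairs[start:]:
        if not word.startswith(prefix):
            break
        uids.add(uid)
    return sorted(uids)


def substring_scan(messages, needle):
    """Slow reference: case-insensitive substring match over the full text."""
    needle = needle.lower()
    return sorted(uid for uid, text in messages.items() if needle in text.lower())


# Hypothetical sample data, keyed by UID.
messages = {
    1: "better body and text search",
    2: "research on text indexing",
}
pairs = build_word_index(messages)
by_prefix = prefix_search(pairs, "sear")
by_scan = substring_scan(messages, "sear")
```

Here by_prefix finds only message 1, while the substring scan also finds message 2, because "sear" sits in the middle of "research". That gap is the part the prefix index cannot cover.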
...
ah, i missed the loop in message_body_search_ctx().
it is not recursive, so it would deal only with one level of MIME
nesting, but this would cover almost all MIME messages, for practical
purposes.
Oh? I guess that's a bug then.
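
For illustration only, this is the difference between a one-level walk and a recursive walk over MIME parts, sketched with Python's standard email package. It says nothing about what message_body_search_ctx() actually does; body_contains and body_contains_one_level are hypothetical names and the message is constructed for the example.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def body_contains(part, needle):
    """Return True if any text part at any nesting depth contains needle."""
    if part.is_multipart():
        # Recurse into every child, however deeply it is nested.
        return any(body_contains(child, needle) for child in part.get_payload())
    if part.get_content_maintype() != "text":
        return False
    raw = part.get_payload(decode=True) or b""
    text = raw.decode(part.get_content_charset() or "ascii", errors="replace")
    return needle.lower() in text.lower()


def body_contains_one_level(msg, needle):
    """Deliberately limited: checks only the message's direct children."""
    parts = msg.get_payload() if msg.is_multipart() else [msg]
    return any(not p.is_multipart() and body_contains(p, needle) for p in parts)


# multipart/mixed containing a text part and a nested multipart/alternative.
inner = MIMEMultipart("alternative")
inner.attach(MIMEText("plain greeting", "plain"))
inner.attach(MIMEText("<p>deeply nested needle</p>", "html"))
outer = MIMEMultipart("mixed")
outer.attach(MIMEText("top-level note", "plain"))
outer.attach(inner)

found_recursive = body_contains(outer, "needle")
found_one_level = body_contains_one_level(outer, "needle")
```

The one-level version misses text inside the nested multipart/alternative, which is the kind of message a non-recursive loop would search incompletely.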
...
what sort of header field indexing does dovecot have, if any?
There's ENVELOPE caching, which contains several headers. The CVS version
has real header caching: it caches whatever headers you specifically
ask for with FETCH or SEARCH commands.
...
glancing at index-mail.c and mail-index.c it appears that:

there is no header field indexing, in the sense of hashing or
database lookup.

Right. That wouldn't really be useful for IMAP's searching, since you
have to be able to match substrings as well. It doesn't help much to be
able to say quickly which messages definitely match if you still have to
check the others the slow way.
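
A minimal sketch of that point, using made-up Subject values keyed by UID. A hash index (a plain dict here) answers "which messages have exactly this header value" immediately, but a substring query still has to visit every record. build_exact_index, exact_lookup and substring_scan are hypothetical names.

```python
from collections import defaultdict


def build_exact_index(subjects):
    """Hash index: lowercased header value -> set of UIDs with exactly that value."""
    index = defaultdict(set)
    for uid, value in subjects.items():
        index[value.lower()].add(uid)
    return index


def exact_lookup(index, value):
    """Fast path: only finds headers whose whole value equals the query."""
    return sorted(index.get(value.lower(), set()))


def substring_scan(subjects, needle):
    """Slow path: visit every record and test for a case-insensitive substring."""
    needle = needle.lower()
    return sorted(uid for uid, value in subjects.items() if needle in value.lower())


# Hypothetical sample data, keyed by UID.
subjects = {
    1: "better body and text search",
    2: "Re: better body and text search",
    3: "squat index size",
}
index = build_exact_index(subjects)
exact = exact_lookup(index, "better body and text search")
full = substring_scan(subjects, "better body and text search")
```

The hash lookup returns only UID 1, while the correct substring answer is UIDs 1 and 2, so the scan over the remaining records cannot be skipped.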
...

the persistent "index" file per folder is mmap'd.
it has a record per message, which contains a linked list of all
headers

Hmm. Not really a linked list. I'm not sure if you're talking about CVS
or 0.99.10. I don't think .10 had linked lists at all. In CVS it's a
linked list of "cache records" which may contain multiple cached fields
including some headers.
...

to perform header search, it iterates through the headers in the
mmap'd records for the folder.

Yep.

Timo Sirainen