[Dovecot] better body and text search?

Mark Anderson mda at discerning.com
Fri Aug 29 19:10:14 EEST 2003


Timo Sirainen wrote:
> On Fri, 2003-08-29 at 08:24, Mark Anderson wrote:
> 
>>(this is a resend, sorry if duplicate but my first post didn't get through).
> 
> 
> Looks like it came just now :)

I think there might be a race condition regarding the mailman confirmation.
I sent my post the instant that i got the confirmation message.
Anyhow, not too important.

>>
>>This is one of those issues that *no* imap implementation ever
>>seems to document :(.
>>Not cyrus, courier, bincimap, or dovecot....
> 
> 
> Well, because it's already pretty strictly documented in IMAP4 RFC.

Well, what it should do is in the RFC.
How it is implemented is another matter; having text indexing is
kind of a big implementation detail.

>>i speed-read some of the sources and found src/lib-mail/message-body-search.c
>>After a quick scan, it seems:
>>- it uses no text index at all, but does a linear search over the folder.
> 
> 
> Only RFC-compatible index (you need _exact_ matching) would be what
> Cyrus squat does. And not many people seem to use it, so it seems like
> it's not worth the trouble either. I might try it some day though. One
> problem with Cyrus code was that it didn't have incremental updates.

[aside:
Understanding cyrus squat also requires source examination, it seems :(.
i must say i find your code a lot more readable than cyrus.
And of course uw-imap is impossible.]

exact matching is fine for me.

this feature is useful for using IMAP in front of large mailing list archives,
for example.

or those of us with INBOX folders containing every message we've ever
received in the past 10 years :).

quite a few email clients with local stores build text indexes
(the store is local either because on unix they have direct access to the
mailboxes, or because they keep copies of whatever has been read).
but those clients lose that feature when they talk to an IMAP server and are
configured not to keep local copies.

it makes a big difference in user experience to get back search results
in subsecond time, versus having to wait a minute or more, which
is my experience with every IMAP server i've tried on my larger stores.
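
to make the index-vs-linear-scan trade-off concrete, here is a rough sketch of
a substring index that still gives exact matches. it is a generic n-gram index,
not the Cyrus squat format and not anything in Dovecot; the names NgramIndex,
ngrams and N are made up for illustration, and keeping the bodies in memory is
a simplification. messages are added one at a time, so updates are incremental.

```python
from collections import defaultdict

N = 3  # n-gram length; an arbitrary choice for this sketch


def ngrams(text, n=N):
    """Return the set of all n-character substrings of text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


class NgramIndex:
    """Hypothetical in-memory substring index over message bodies."""

    def __init__(self):
        self.postings = defaultdict(set)  # n-gram -> ids of messages containing it
        self.bodies = {}                  # message id -> lowercased body text

    def add(self, msg_id, body):
        """Index one message; can be called as new mail arrives."""
        body = body.lower()
        self.bodies[msg_id] = body
        for gram in ngrams(body):
            self.postings[gram].add(msg_id)

    def search(self, needle):
        """Return the sorted ids of messages whose body contains needle."""
        needle = needle.lower()
        grams = ngrams(needle)
        if not grams:
            # Needle shorter than N: the index cannot narrow anything down.
            candidates = set(self.bodies)
        else:
            # A matching message must contain every n-gram of the needle.
            candidates = set.intersection(
                *(self.postings.get(g, set()) for g in grams)
            )
        # Sharing all n-grams does not prove a substring match, so verify.
        return sorted(m for m in candidates if needle in self.bodies[m])


if __name__ == "__main__":
    idx = NgramIndex()
    idx.add(1, "Dovecot does a linear search")
    idx.add(2, "a line near the edge")  # has every trigram of "linear"
    print(idx.search("linear"))
```

the final verification step is what keeps the result exact: message 2 shares
every trigram with "linear" but is filtered out because the substring itself
is absent. note that lower() here is only an approximation of the
case-insensitive comparison the RFC asks for.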


>>- it appears to exclude any mime messages, since it skips "multipart/mixed" for example.
> 
> 
> No, it shouldn't. It skips the multipart/mixed body itself (which
> doesn't really exist anyway), but not its children.

ah, i missed the loop in message_body_search_ctx().
it is not recursive, so it would deal only with one level of MIME
nesting, but this would cover almost all MIME messages, for practical
purposes.
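
for comparison, this is roughly what a body search that covers every nesting
level looks like. it is a sketch using Python's standard email package, not
Dovecot's C code; body_contains and build_nested_example are hypothetical
names, and the fallback charset handling is an assumption of mine.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def body_contains(raw_message, needle):
    """Return True if any text part, at any nesting depth, contains needle."""
    msg = message_from_string(raw_message)
    needle = needle.lower()
    # walk() yields the message itself and then every descendant part,
    # depth-first, so multiparts nested inside multiparts are covered.
    for part in msg.walk():
        if part.is_multipart():
            continue  # the container has no body text of its own
        if part.get_content_maintype() != "text":
            continue
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        charset = part.get_content_charset() or "us-ascii"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to a byte-preserving decode.
            text = payload.decode("latin-1")
        if needle in text.lower():
            return True
    return False


def build_nested_example():
    """Build multipart/mixed > multipart/alternative > text/plain (base64)."""
    outer = MIMEMultipart("mixed")
    inner = MIMEMultipart("alternative")
    inner.attach(MIMEText("Searching nested parts", "plain", "utf-8"))
    outer.attach(inner)
    return outer.as_string()


if __name__ == "__main__":
    print(body_contains(build_nested_example(), "nested parts"))
```

the text part in the example sits two levels down and is base64-encoded, so a
search that stops at the first level of children, or that skips content
decoding, would miss it.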


>>- it does deal with content encoding and charset.
> 
> 
> But it doesn't do case-insensitive matching for non-ASCII characters.
> I'd need utf8_toupper() function..

ah. probably more an issue for you guys on the other side of the pond :).
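
a small sketch of the difference, for the record. Python's str.casefold()
stands in here for the utf8_toupper() mentioned above; both function names
below are hypothetical, and the sketch assumes precomposed characters (no
Unicode normalization is done).

```python
import string

# Translation table that uppercases only a-z and leaves everything else alone.
_ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def contains_ascii_caseless(haystack_bytes, needle, charset="utf-8"):
    """Substring test that folds case for ASCII letters only."""
    text = haystack_bytes.decode(charset, errors="replace")
    return needle.translate(_ASCII_UPPER) in text.translate(_ASCII_UPPER)


def contains_caseless(haystack_bytes, needle, charset="utf-8"):
    """Substring test that folds case for non-ASCII letters as well."""
    text = haystack_bytes.decode(charset, errors="replace")
    return needle.casefold() in text.casefold()


if __name__ == "__main__":
    body = "Hyvää PÄIVÄÄ".encode("utf-8")
    print(contains_ascii_caseless(body, "päivää"))
    print(contains_caseless(body, "päivää"))
```

with ASCII-only folding the lowercase needle does not match the uppercase
"PÄIVÄÄ" in the body, because the Ä/ä pairs are never folded; the full
casefold version does match.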

> 
> 
>>There is a lucene-based text indexing utility for IMAP stores here:
>>http://www.tropo.com/techno/java/lucene/imap.html
> 
> 
> I'm not really sure how that is supposed to be useful.. There's a few
> header fields which Dovecot also has in its indexes. Then there's this
> "contents" string which looks like it's all the text data in the
> message? There's not much point in copying the whole mailbox data to
> index file.

well, for example a web-based email application might complement the
IMAP server with that external text index, for search operations.

what sort of header field indexing does dovecot have, if any?
glancing at index-mail.c and mail-index.c it appears that:
- there is no header field indexing, in the sense of hashing or database lookup.
- the persistent "index" file per folder is mmap'd.
   it has a record per message, which contains a linked list of all headers
   in the message, and some summary information about the message.
- to perform header search, it iterates through the headers in the mmap'd records
   for the folder.
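
in other words, my reading of the above amounts to something like the
following. this is only a model of what i described, not Dovecot's actual
on-disk layout or API: IndexRecord, its fields and search_header are all
invented names, and a plain Python list stands in for the mmap'd file.

```python
from dataclasses import dataclass, field


@dataclass
class IndexRecord:
    """Hypothetical per-message record: summary info plus the header list."""
    uid: int
    size: int
    headers: list = field(default_factory=list)  # (name, value) pairs in order


def search_header(records, name, needle):
    """Linear header search: visit every record and every header in it."""
    name = name.lower()
    needle = needle.lower()
    hits = []
    for rec in records:
        for hname, hvalue in rec.headers:
            if hname.lower() == name and needle in hvalue.lower():
                hits.append(rec.uid)
                break  # one matching header is enough for this message
    return hits


if __name__ == "__main__":
    folder = [
        IndexRecord(1, 2048, [("From", "timo@example.org"),
                              ("Subject", "Re: body search")]),
        IndexRecord(2, 512, [("From", "mda@example.org"),
                             ("Subject", "text indexing")]),
    ]
    print(search_header(folder, "subject", "search"))
```

the cost is proportional to the total number of headers in the folder, which
is cheaper than reading every message but still not a hashed or sorted lookup.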

-mda



