[Dovecot] Full text search indexing

Timo Sirainen tss at iki.fi
Wed Apr 12 13:39:16 EEST 2006


On Wed, 2006-04-12 at 11:18 +0200, Jens Laas wrote:
> >> I tested the above with indexing raw body-parts from a fairly large mailbox 
> >> (72 MB/13000 mails).
> >
> > Which part of the body? If indexing an HTML email you would presumably want 
> > to strip all the HTML tags before indexing, as they are not meaningful. What 
> > about a MIME multipart/alternative message with both a text and an HTML part: 
> > index both or just the text part?

Doing any of that would again mean losing IMAP RFC compatibility.

> > And rebuilding the index doesn't mean reprocessing all of the emails: just 
> > read in the old index, process new messages, discard deleted ones, then 
> > write the index back to disk.
> 
> I was thinking of the case when the whole dovecot-index had to be rebuilt 
> anyway, which currently occurs when dovecot decides the index is out of 
> sync with the mailbox (timestamp and size, I think).

No, Dovecot doesn't really rebuild the indexes even then. Assuming an mbox
file, it just goes through the whole mbox file and checks that
everything is still in place.

Full index rebuilds are done only when the index gets corrupted or when
the mailbox's UIDVALIDITY changes (practically never unless you go
manually doing something weird).

> >> I'm sorry for my incomplete IMAP knowledge. Is the server required to convert 
> >> the search string and/or MIME part to the same character set for string 
> >> searching?
> >
> > Probably this indexing method would be optimised for various character sets 
> > by different mappings from characters -> int 0-31. (I haven't thought this 
> > last comment through much ... does each 32*32 bit array want a character set 
> > id attached to it?)
> 
> That might be possible. Thinking of different character sets makes my head 
> ache :-).

Another problem is that with UTF-8 the two bytes of an indexed pair may
encode only a single character (or not even a whole one), which increases
the false positives a lot if the language uses a lot of non-ASCII.
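
To illustrate the point about UTF-8, here is a minimal sketch of a pair
filter of the kind discussed in this thread. It is not Dovecot code. The
names bucket, pair_bitmap and may_contain are hypothetical, and the mapping
"byte & 31" is only an assumed example of a characters -> int 0-31 mapping.
Each message gets a 32*32 bit array recording which adjacent byte pairs
occur; a search string can only match if all of its pairs are present.

```python
def bucket(byte: int) -> int:
    """Assumed mapping of a byte to 0..31 (folds ASCII case as a side effect)."""
    return byte & 31


def pair_bitmap(data: bytes) -> int:
    """Return a 1024-bit integer with one bit set per adjacent byte pair."""
    bits = 0
    for a, b in zip(data, data[1:]):
        bits |= 1 << (bucket(a) * 32 + bucket(b))
    return bits


def may_contain(bitmap: int, needle: bytes) -> bool:
    """False means the needle is definitely absent; True means 'scan the text'."""
    want = pair_bitmap(needle)
    return (bitmap & want) == want


# "café" is b'caf\xc3\xa9' in UTF-8: the single character "é" takes two bytes.
index = pair_bitmap("café".encode("utf-8"))

real_hit = may_contain(index, b"caf")   # pairs (c,a) and (a,f) are present
false_hit = may_contain(index, b"ci")   # "ci" never occurs in the text
miss = may_contain(index, b"xyz")       # none of these pairs are present
```

Under this assumed mapping the pair (0xC3, 0xA9), which is just the one
character "é", lands on the same bit as the pair ("c", "i"), so the filter
reports a possible match for "ci" and the message has to be scanned anyway.
With three-byte characters a pair does not even cover one whole character.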