[Dovecot] Full text search indexing

Timo Sirainen tss at iki.fi
Wed Apr 12 11:20:35 EEST 2006


On Wed, 2006-04-12 at 09:40 +0200, Jens Laas wrote:
> I made an even squatter index with just character couples (couplets).
> - Each character is recoded to an integer between 0-63.
> - A map of 64*64 bits is created for each mail.
> - For each character couple indexed from the mail the corresponding
>    couplet bit is set: map[C1][C2] = 1.
> - The search is then linear through all the maps generated for the
>    mailbox. Each map is tested for the presence of all the couplets
>    contained in the searchstring. This generates a number of candidate
>    mails that may match the searchstring (false positives occur).
> - Candidate mails are the bruteforce searched for searchstring.
> 
> I tested the above with indexing raw body-parts from a fairly large 
> mailbox (72 MB/13000 mails).
> The index is then ~6MB.
> Each map in the index takes 512 bytes: (64*64)/8. (Slightly larger if you 
> also store the UID).
> 
> Advantage with this type if index is that its simple and very fast to 
> build (rebuilding indexed will probably occur often if youre not using 
> dovecot-lda).
> 
> The number of false positives as candidates vary alot depending on the 
> searchstring.

Did you test what the rate was approximately with some somewhat sensible
search strings?

The problem I see with this is that each map still has to be searched,
so a 6MB index needs to be completely read and checked. So it could take
10x less time, but it's still O(N)..

But you're right that this is at least much simpler than the trie
approach.

> > Yes, but they have to be translated to unicode in any case, because
> > searching will have to work regardless of what character set is used
> > when doing the search. So the choices would be pretty much:
> >
> > 1) Use 4 characters long UTF-16 strings
> > 2) Use 4 bytes long UTF-8 strings
> > 3) Use 4 character long UTF-8 strings
> 
> Im sorry for my incomplete IMAP knowledge. Is the server required convert 
> the searchstring and/or mimepart to the same character set for string 
> searching?

Yes. Otherwise it's a bit difficult to search non-ASCII text in the
parts of world where multiple character sets are used :)

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 191 bytes
Desc: This is a digitally signed message part
Url : http://dovecot.org/pipermail/dovecot/attachments/20060412/01a411b3/attachment.pgp


More information about the dovecot mailing list