[Dovecot] Full text search indexing

Rob Middleton robm-dovecot at centenary.org.au
Wed Apr 12 11:48:51 EEST 2006


>> On Thu, 2006-04-06 at 09:33 +0200, Jens Laas wrote:
>>>> So, the one usable engine which I know of is Cyrus's squat indexing. It
>>>> works basically by indexing the mails in 4 byte blocks and storing them
>>>> in a trie. So if there exists a string "he llo", it indexes "he l",
>>>> "e ll" and " llo" strings. When searching it then gets the first 4 
>>>> bytes
>>>> of the search keyword, looks them up from the squat index and then once
>>>> it gets a list of matched mails, it reads the mails and does a proper
>>>> search for only them. So the squat index doesn't give exact results, 
>>>> but
>>>> only narrows down the number of mails which have to be searched.
> 
> This squat index thingy gave me an idea (which Ive also partly tested).
> 
> I made an even squatter index with just character couples (couplets).
> - Each character is recoded to an integer between 0-63.
> - A map of 64*64 bits is created for each mail.

I like this idea - and will be interested in statistics on real 
mailboxes and real searches carried out by users.

I would think that reducing the index to 32*32 bits thus reducing the IO 
  on the index by a factor of 4 would give fairly comparable usefulness. 
(ie - the index should be case insensitive as case is not likely to be 
nearly as discriminatory as the actual letter of alphabet ... but I am 
not a natural language expert so don't have statistics to back that 
belief up.)

> - For each character couple indexed from the mail the corresponding
>   couplet bit is set: map[C1][C2] = 1.
> - The search is then linear through all the maps generated for the
>   mailbox. Each map is tested for the presence of all the couplets
>   contained in the searchstring. This generates a number of candidate
>   mails that may match the searchstring (false positives occur).
> - Candidate mails are the bruteforce searched for searchstring.
> 
> I tested the above with indexing raw body-parts from a fairly large 
> mailbox (72 MB/13000 mails).

Which part of the body? If indexing an html email you would presumably 
want to strip all the html tags for indexing as that is not meaningful. 
What about if there is a mime multipart/alternative text or html part - 
index both or just the text part?

> The index is then ~6MB.
> Each map in the index takes 512 bytes: (64*64)/8. (Slightly larger if 
> you also store the UID).
> 
> Advantage with this type if index is that its simple and very fast to 
> build (rebuilding indexed will probably occur often if youre not using 
> dovecot-lda).

And rebuilding the index doesn't mean reprocessing all of the emails 
just reading in the old index processing new messages discarding deleted 
ones then writing the index back to disk.

> 
> The number of false positives as candidates vary alot depending on the 
> searchstring.
> 
> IO-wise this should be a clear improvement since the index is quite a 
> bit smaller than the mailbox. Our mailserver is very IO-bound. CPU 
> cycles are plenty.
> 
>> Yes, but they have to be translated to unicode in any case, because
>> searching will have to work regardless of what character set is used
>> when doing the search. So the choices would be pretty much:
>>
>> 1) Use 4 characters long UTF-16 strings
>> 2) Use 4 bytes long UTF-8 strings
>> 3) Use 4 character long UTF-8 strings
> 
> Im sorry for my incomplete IMAP knowledge. Is the server required 
> convert the searchstring and/or mimepart to the same character set for 
> string searching?

Probably this indexing method would be optimised for various character 
sets by different mappings from characters -> int 0-31. (I haven't 
thought this last comment through much ... does each 32*32 bit array 
want a character set id attached to it?)

Regards,
Rob.

> 
>>> So the index maps:
>>> suffix -> word -> list ( id1 .. idN )
>>
>> That just would create even larger indexes than the 4-byte-trie-index..
> 
> I tried it and it resulted in a huge index. Also costly to build. I agree.
> 
> Cheers
> Jens



More information about the dovecot mailing list