[Dovecot] Full text search indexing

Wed Apr 12 12:18:20 EEST 2006

(06.04.12 kl.18:48) Rob Middleton skrev följande till Jens Laas and...:

>> This squat index thingy gave me an idea (which Ive also partly tested).
>> 
>> I made an even squatter index with just character couples (couplets).
>> - Each character is recoded to an integer between 0-63.
>> - A map of 64*64 bits is created for each mail.
>
> I like this idea - and will be interested in statistics on real mailboxes and 
> real searches carried out by users.
>
> I would think that reducing the index to 32*32 bits thus reducing the IO  on 
> the index by a factor of 4 would give fairly comparable usefulness. (ie - the 
> index should be case insensitive as case is not likely to be nearly as 
> discriminatory as the actual letter of alphabet ... but I am not a natural 
> language expert so don't have statistics to back that belief up.)

I converted the (assumed ascii) data to all lowercase before indexing it.
With just 32 significant characters you cant even cover 'a-z' and '0-9'.
So the number of false positives will increase.

>> I tested the above with indexing raw body-parts from a fairly large mailbox 
>> (72 MB/13000 mails).
>
> Which part of the body? If indexing an html email you would presumably want 
> to strip all the html tags for indexing as that is not meaningful. What about 
> if there is a mime multipart/alternative text or html part - index both or 
> just the text part?

Maybe I described it badly. By body I meant everything in the mail but the 
rfc822 headers (the top header of the mail). I did no MIME processing at 
all. Just indexed everything as is. (This was simplest to just test the 
concept).

>
>> The index is then ~6MB.
>> Each map in the index takes 512 bytes: (64*64)/8. (Slightly larger if you 
>> also store the UID).
>> 
>> Advantage with this type if index is that its simple and very fast to build 
>> (rebuilding indexed will probably occur often if youre not using 
>> dovecot-lda).
>
> And rebuilding the index doesn't mean reprocessing all of the emails just 
> reading in the old index processing new messages discarding deleted ones then 
> writing the index back to disk.

I was thinking of the case when the whole dovecot-index had to be rebuilt 
anyway. Which currently occurs when dovecot decides the index is out of 
sync with the mailbox (timestamp and size i think).

But its true that theoretically you could update the index with only new 
or changed mails.

>> Im sorry for my incomplete IMAP knowledge. Is the server required convert 
>> the searchstring and/or mimepart to the same character set for string 
>> searching?
>
> Probably this indexing method would be optimised for various character sets 
> by different mappings from characters -> int 0-31. (I haven't thought this 
> last comment through much ... does each 32*32 bit array want a character set 
> id attached to it?)

That might be possibly. Thinking of different character sets makes my head 
ache :-).

Cheers,
Jens

>
> Regards,
> Rob.

-----------------------------------------------------------------------
     'Old C programmers don't die ... they're just cast into void*'
-----------------------------------------------------------------------
     Jens Låås                              Email: jens.laas at data.slu.se
     Department of Computer Services, SLU   Phone: +46 18 67 35 15
     Vindbrovägen 1
     P.O. Box 7079
     S-750 07 Uppsala
     SWEDEN
-----------------------------------------------------------------------