[Dovecot] Full text search indexing
Jens Laas
jens.laas at data.slu.se
Wed Apr 12 12:18:20 EEST 2006
(06.04.12 kl.18:48) Rob Middleton skrev följande till Jens Laas and...:
>> This squat index thingy gave me an idea (which Ive also partly tested).
>>
>> I made an even squatter index with just character couples (couplets).
>> - Each character is recoded to an integer between 0-63.
>> - A map of 64*64 bits is created for each mail.
>
> I like this idea - and will be interested in statistics on real mailboxes and
> real searches carried out by users.
>
> I would think that reducing the index to 32*32 bits thus reducing the IO on
> the index by a factor of 4 would give fairly comparable usefulness. (ie - the
> index should be case insensitive as case is not likely to be nearly as
> discriminatory as the actual letter of alphabet ... but I am not a natural
> language expert so don't have statistics to back that belief up.)
I converted the (assumed ascii) data to all lowercase before indexing it.
With just 32 significant characters you cant even cover 'a-z' and '0-9'.
So the number of false positives will increase.
>> I tested the above with indexing raw body-parts from a fairly large mailbox
>> (72 MB/13000 mails).
>
> Which part of the body? If indexing an html email you would presumably want
> to strip all the html tags for indexing as that is not meaningful. What about
> if there is a mime multipart/alternative text or html part - index both or
> just the text part?
Maybe I described it badly. By body I meant everything in the mail but the
rfc822 headers (the top header of the mail). I did no MIME processing at
all. Just indexed everything as is. (This was simplest to just test the
concept).
>
>> The index is then ~6MB.
>> Each map in the index takes 512 bytes: (64*64)/8. (Slightly larger if you
>> also store the UID).
>>
>> Advantage with this type if index is that its simple and very fast to build
>> (rebuilding indexed will probably occur often if youre not using
>> dovecot-lda).
>
> And rebuilding the index doesn't mean reprocessing all of the emails just
> reading in the old index processing new messages discarding deleted ones then
> writing the index back to disk.
I was thinking of the case when the whole dovecot-index had to be rebuilt
anyway. Which currently occurs when dovecot decides the index is out of
sync with the mailbox (timestamp and size i think).
But its true that theoretically you could update the index with only new
or changed mails.
>> Im sorry for my incomplete IMAP knowledge. Is the server required convert
>> the searchstring and/or mimepart to the same character set for string
>> searching?
>
> Probably this indexing method would be optimised for various character sets
> by different mappings from characters -> int 0-31. (I haven't thought this
> last comment through much ... does each 32*32 bit array want a character set
> id attached to it?)
That might be possibly. Thinking of different character sets makes my head
ache :-).
Cheers,
Jens
>
> Regards,
> Rob.
-----------------------------------------------------------------------
'Old C programmers don't die ... they're just cast into void*'
-----------------------------------------------------------------------
Jens Låås Email: jens.laas at data.slu.se
Department of Computer Services, SLU Phone: +46 18 67 35 15
Vindbrovägen 1
P.O. Box 7079
S-750 07 Uppsala
SWEDEN
-----------------------------------------------------------------------
More information about the dovecot
mailing list