Re: [Dovecot] Full text search indexing

12 Apr 2006

      (06.04.12 kl.18:48) Rob Middleton skrev följande till Jens Laas and...:
...
...
This squat index thingy gave me an idea (which Ive also partly tested).
I made an even squatter index with just character couples (couplets).

Each character is recoded to an integer between 0-63.
A map of 64*64 bits is created for each mail.

I like this idea - and will be interested in statistics on real mailboxes and
real searches carried out by users.
I would think that reducing the index to 32*32 bits thus reducing the IO  on
the index by a factor of 4 would give fairly comparable usefulness. (ie - the
index should be case insensitive as case is not likely to be nearly as
discriminatory as the actual letter of alphabet ... but I am not a natural
language expert so don't have statistics to back that belief up.)
I converted the (assumed ascii) data to all lowercase before indexing it.
With just 32 significant characters you cant even cover 'a-z' and '0-9'.
So the number of false positives will increase.
...
...
I tested the above with indexing raw body-parts from a fairly large mailbox
(72 MB/13000 mails).
Which part of the body? If indexing an html email you would presumably want
to strip all the html tags for indexing as that is not meaningful. What about
if there is a mime multipart/alternative text or html part - index both or
just the text part?
Maybe I described it badly. By body I meant everything in the mail but the
rfc822 headers (the top header of the mail). I did no MIME processing at
all. Just indexed everything as is. (This was simplest to just test the
concept).
...
...
The index is then ~6MB.
Each map in the index takes 512 bytes: (64*64)/8. (Slightly larger if you
also store the UID).
Advantage with this type if index is that its simple and very fast to build
(rebuilding indexed will probably occur often if youre not using
dovecot-lda).
And rebuilding the index doesn't mean reprocessing all of the emails just
reading in the old index processing new messages discarding deleted ones then
writing the index back to disk.
I was thinking of the case when the whole dovecot-index had to be rebuilt
anyway. Which currently occurs when dovecot decides the index is out of
sync with the mailbox (timestamp and size i think).
But its true that theoretically you could update the index with only new
or changed mails.
...
...
Im sorry for my incomplete IMAP knowledge. Is the server required convert
the searchstring and/or mimepart to the same character set for string
searching?
Probably this indexing method would be optimised for various character sets
by different mappings from characters -> int 0-31. (I haven't thought this
last comment through much ... does each 32*32 bit array want a character set
id attached to it?)
That might be possibly. Thinking of different character sets makes my head
ache :-).
Cheers,
Jens
...
Regards,
Rob.

 'Old C programmers don't die ... they're just cast into void*'

 Jens Låås                              Email: jens.laas@data.slu.se
 Department of Computer Services, SLU   Phone: +46 18 67 35 15
 Vindbrovägen 1
 P.O. Box 7079
 S-750 07 Uppsala
 SWEDEN