(06.04.12 kl.18:48) Rob Middleton skrev följande till Jens Laas and...:
This squat index thingy gave me an idea (which Ive also partly tested).
I made an even squatter index with just character couples (couplets).
- Each character is recoded to an integer between 0-63.
- A map of 64*64 bits is created for each mail.
I like this idea - and will be interested in statistics on real mailboxes and real searches carried out by users.
I would think that reducing the index to 32*32 bits thus reducing the IO on the index by a factor of 4 would give fairly comparable usefulness. (ie - the index should be case insensitive as case is not likely to be nearly as discriminatory as the actual letter of alphabet ... but I am not a natural language expert so don't have statistics to back that belief up.)
I converted the (assumed ascii) data to all lowercase before indexing it. With just 32 significant characters you cant even cover 'a-z' and '0-9'. So the number of false positives will increase.
I tested the above with indexing raw body-parts from a fairly large mailbox (72 MB/13000 mails).
Which part of the body? If indexing an html email you would presumably want to strip all the html tags for indexing as that is not meaningful. What about if there is a mime multipart/alternative text or html part - index both or just the text part?
Maybe I described it badly. By body I meant everything in the mail but the rfc822 headers (the top header of the mail). I did no MIME processing at all. Just indexed everything as is. (This was simplest to just test the concept).
The index is then ~6MB. Each map in the index takes 512 bytes: (64*64)/8. (Slightly larger if you also store the UID).
Advantage with this type if index is that its simple and very fast to build (rebuilding indexed will probably occur often if youre not using dovecot-lda).
And rebuilding the index doesn't mean reprocessing all of the emails just reading in the old index processing new messages discarding deleted ones then writing the index back to disk.
I was thinking of the case when the whole dovecot-index had to be rebuilt anyway. Which currently occurs when dovecot decides the index is out of sync with the mailbox (timestamp and size i think).
But its true that theoretically you could update the index with only new or changed mails.
Im sorry for my incomplete IMAP knowledge. Is the server required convert the searchstring and/or mimepart to the same character set for string searching?
Probably this indexing method would be optimised for various character sets by different mappings from characters -> int 0-31. (I haven't thought this last comment through much ... does each 32*32 bit array want a character set id attached to it?)
That might be possibly. Thinking of different character sets makes my head ache :-).
Cheers, Jens
Regards, Rob.
'Old C programmers don't die ... they're just cast into void*'
Jens Låås Email: jens.laas@data.slu.se
Department of Computer Services, SLU Phone: +46 18 67 35 15
Vindbrovägen 1
P.O. Box 7079
S-750 07 Uppsala
SWEDEN