[Dovecot] Full text search indexing

Jens Laas jens.laas at data.slu.se
Wed Apr 12 12:09:38 EEST 2006


(06.04.12 kl.11:20) Timo Sirainen skrev följande till Jens Laas:

> On Wed, 2006-04-12 at 09:40 +0200, Jens Laas wrote:
>> I made an even squatter index with just character couples (couplets).
>> - Each character is recoded to an integer between 0-63.
>> - A map of 64*64 bits is created for each mail.
>> - For each character couple indexed from the mail the corresponding
>>    couplet bit is set: map[C1][C2] = 1.
>> - The search is then linear through all the maps generated for the
>>    mailbox. Each map is tested for the presence of all the couplets
>>    contained in the searchstring. This generates a number of candidate
>>    mails that may match the searchstring (false positives occur).
>> - Candidate mails are the bruteforce searched for searchstring.
>>
>> I tested the above with indexing raw body-parts from a fairly large
>> mailbox (72 MB/13000 mails).
>> The index is then ~6MB.
>> Each map in the index takes 512 bytes: (64*64)/8. (Slightly larger if you
>> also store the UID).
>>
>> Advantage with this type if index is that its simple and very fast to
>> build (rebuilding indexed will probably occur often if youre not using
>> dovecot-lda).
>>
>> The number of false positives as candidates vary alot depending on the
>> searchstring.
>
> Did you test what the rate was approximately with some somewhat sensible
> search strings?

I'll try. Sensible searchstrings are not that easy to come up with though.
The data indexed are the bodyparts of an UML (linux) archive.

First line is nr of candidates. Second line is result of a grep -i in the 
rawdata. (Ouf of a total of almost 13000 maps).

jensl:~/project/jelindex> ./search.sh "jeff"
3652
3861
linux-gnu, Wed Apr 12 10:55:22 mobilus.data.slu.se
jensl:~/project/jelindex> ./search.sh "startx"
251
36
linux-gnu, Wed Apr 12 10:55:30 mobilus.data.slu.se
jensl:~/project/jelindex> ./search.sh "uml builder"
2415
77
linux-gnu, Wed Apr 12 10:55:38 mobilus.data.slu.se
jensl:~/project/jelindex> ./search.sh "xnest"
82
86
jensl:~/project/jelindex> ./search.sh "kernel panic"
4692
1100

Some worse cases:
jensl:~/project/jelindex> ./search.sh "management"
5098
341
jensl:~/project/jelindex> ./search.sh "Timo Sirainen"
494
0

Keep in mind:
The grep may report multiple counts from the same mail.
The indexed data contains a few fairly large base64 encoded attachements. 
I suspect the base64 data will fill those maps pretty well.

>
> The problem I see with this is that each map still has to be searched,
> so a 6MB index needs to be completely read and checked. So it could take
> 10x less time, but it's still O(N)..

Yes. But if it's faster, it's faster..
Generally I think it's true that the better and faster the index the more 
time it takes to build. Good algorithms mitigate this of course.

To improve accuracy while potentially sacrificing IO we could make the 
maps larger. More map-data to be read but maybe less bruteforce searching 
of mails.

>
> But you're right that this is at least much simpler than the trie
> approach.
>

I think something along these lines is well worth to keep in mind 
atleast.

-----------------------------------------------------------------------
     'Old C programmers don't die ... they're just cast into void*'
-----------------------------------------------------------------------
     Jens Låås                              Email: jens.laas at data.slu.se
     Department of Computer Services, SLU   Phone: +46 18 67 35 15
     Vindbrovägen 1
     P.O. Box 7079
     S-750 07 Uppsala
     SWEDEN
-----------------------------------------------------------------------


More information about the dovecot mailing list