[Dovecot] Full text search indexing

Jens Laas jens.laas at data.slu.se
Wed Apr 12 14:53:17 EEST 2006


(06.04.12 at 13:39) Timo Sirainen wrote the following to Jens Laas:

>>>> I'm sorry for my incomplete IMAP knowledge. Is the server required to
>>>> convert the search string and/or MIME part to the same character set for
>>>> string searching?
>>>
>>> Probably this indexing method would be optimised for various character sets
>>> by different mappings from characters -> int 0-31. (I haven't thought this
>>> last comment through much ... does each 32*32 bit array want a character set
>>> id attached to it?)
>>
>> That might be possible. Thinking of different character sets makes my head
>> ache :-).
>
> Another problem is that with UTF-8 the two characters may describe only
> a single character (or not even that), which increases the false
> positives a lot if the language uses a lot of non-ascii.

Hmm.
The map should then be for couplets of UTF-8 characters.
Then we just have to decide how we map the whole UTF-8 space to just 64 
instances.

The way I did it in the test was to first see if the character was in an 
array of the "most common" characters, and if so use that index directly. 
If it was not found in the array I just used the lower bits.
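
A minimal sketch of that lookup in C, for illustration only. The table
contents (common_chars) and the 5-bit index width are assumptions; the actual
table from the test is not reproduced here.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical "most common" table; the real one is not shown in this
 * thread. Its length must not exceed the index space (32 here). */
static const char common_chars[] = "etaoinshrdlu";

#define INDEX_BITS 5                          /* assumed: indices 0-31 */
#define INDEX_MASK ((1u << INDEX_BITS) - 1)

/* Map one byte to a small index: use the table position when the byte is
 * in the table, otherwise fall back to its lower bits. Fallback values
 * can collide with table positions, which costs false positives only. */
static unsigned int map_char(unsigned char c)
{
    /* strchr() would match the terminating NUL, so exclude c == 0 */
    const char *p = (c != 0) ? strchr(common_chars, c) : NULL;

    if (p != NULL)
        return (unsigned int)(p - common_chars);
    return c & INDEX_MASK;
}
```

With this table 'e' maps to 0 and 't' to 1, while 'z' is not in the table and
falls through to its lower five bits.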

We could decide which UTF-8 characters are most common. We could also 
figure out what part of the UTF-8 character is most significant and use it 
for the index (maybe the last byte?).
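
As a rough sketch of the "last byte" idea, assuming 64 index values: in a
multi-byte UTF-8 sequence the last byte is a continuation byte with six
payload bits, so masking it with 0x3F gives 0-63 directly. The function names
below are made up for illustration and the code does not validate its input.

```c
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence that starts with lead byte b.
 * Returns 1 for ASCII and also for invalid lead bytes, so a scan
 * always advances. */
static size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

/* Map the character starting at s (n >= 1 bytes available) to an index
 * 0-63 taken from the low six bits of its last byte. The number of
 * bytes consumed is stored in *used. */
static unsigned int utf8_last_byte_index(const unsigned char *s, size_t n,
                                         size_t *used)
{
    size_t len = utf8_seq_len(s[0]);

    if (len > n)
        len = n;            /* truncated sequence: use what is there */
    *used = len;
    return s[len - 1] & 0x3F;
}
```

For example, "å" is encoded as C3 A5, so it lands on index 0x25 (37). Nearby
code points share lead bytes but differ in the last byte, which is why the
last byte looks like the more useful part to index on.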

I think this would work pretty well, at least with Western languages.
I don't even want to think about Chinese, where one Unicode code point
is a whole word.

Any idea how big and costly the squat index is?

Cheers,
Jens

-----------------------------------------------------------------------
     'Old C programmers don't die ... they're just cast into void*'
-----------------------------------------------------------------------
     Jens Låås                              Email: jens.laas at data.slu.se
     Department of Computer Services, SLU   Phone: +46 18 67 35 15
     Vindbrovägen 1
     P.O. Box 7079
     S-750 07 Uppsala
     SWEDEN
-----------------------------------------------------------------------

