[Dovecot] Full text search indexing
Jens Laas
jens.laas at data.slu.se
Thu Apr 6 10:33:43 EEST 2006
(06.04.03 kl.10:44) Timo Sirainen skrev följande till dovecot at dovecot.org:
> SEARCH is probably the slowest command that Dovecot has, especially when
> searching texts from message bodies. There exists a lot of text search
> engines, but unfortunately as far as I know there exists only one that
> can be used with IMAP.
>
> The reason why IMAP is special is because it requires that when
> searching eg. "ello", it must match also mails which contain eg. word
> "hello". Most search engines support only matching either complete words
> or matching from the beginning of the words.
>
> So, the one usable engine which I know of is Cyrus's squat indexing. It
> works basically by indexing the mails in 4 byte blocks and storing them
> in a trie. So if there exists a string "he llo", it indexes "he l",
> "e ll" and " llo" strings. When searching it then gets the first 4 bytes
> of the search keyword, looks them up from the squat index and then once
> it gets a list of matched mails, it reads the mails and does a proper
> search for only them. So the squat index doesn't give exact results, but
> only narrows down the number of mails which have to be searched.
>
How would you search for '*lo' in 'hello' ? Only 'ello' would be indexed.
> Unicode characters might also be a problem. If the characters are stored
> in 16bit characters, it would waste space. If they're stored in UTF-8
> the 4 bytes might not be enough to give a useful index (as a single
> character might take even 5 bytes). One thing I was thinking was to not
> index in 4 byte blocks, but 4 character blocks in UTF-8, so the trie
> could be 4..20 levels deep depending on what unicode characters are used
> (in finnish text it would probably be max. 7 levels deep).
Using UTF-8 in the index sounds good, but how hard would it be in
practise? The non-ascii characters are already encoded someway in the
mails arent they?
>
> So, now would be a good time to tell if you can think of some better
> ideas how to do this, especially if you know about some new algorithms
> specifically designed for this kind of searching. :) Pretty much all the
> algorithms I've found about this were suffix trees/arrays and this basic
> trie thing.
>
I am very interested in this area.
The best full text search method I've yet to come across is the suffix
index.
You index every suffix of a word.
For 'hello':
'hello'
'ello'
'llo'
'lo'
'o'
So the index maps:
suffix -> word -> list ( id1 .. idN )
You can also map word to a phrase as an intermediate step.
I feel certain there must be better methods out there.
Cheers
Jens
-----------------------------------------------------------------------
'Old C programmers don't die ... they're just cast into void*'
-----------------------------------------------------------------------
Jens Låås Email: jens.laas at data.slu.se
Department of Computer Services, SLU Phone: +46 18 67 35 15
Vindbrovägen 1
P.O. Box 7079
S-750 07 Uppsala
SWEDEN
-----------------------------------------------------------------------
More information about the dovecot
mailing list