[Dovecot] Full text search indexing

Thu Apr 6 17:39:27 EEST 2006

On Thu, 2006-04-06 at 09:33 +0200, Jens Laas wrote:
> > So, the one usable engine which I know of is Cyrus's squat indexing. It
> > works basically by indexing the mails in 4 byte blocks and storing them
> > in a trie. So if there exists a string "he llo", it indexes "he l",
> > "e ll" and " llo" strings. When searching it then gets the first 4 bytes
> > of the search keyword, looks them up from the squat index and then once
> > it gets a list of matched mails, it reads the mails and does a proper
> > search for only them. So the squat index doesn't give exact results, but
> > only narrows down the number of mails which have to be searched.
> >
> 
> How would you search for '*lo' in 'hello' ? Only 'ello' would be indexed.

There are no wildcards in IMAP's search, but "lo" search would do the
same thing.

I think Cyrus simply fallbacks to searching the hard way if the search
key is less than 4 bytes. I was thinking about also keeping result lists
for smaller search keys, so the indexes could still be used.

So if the mail contains "hello world", then "lo w" is indexed and the
mail's UID is stored to all "lo w", "lo ", "lo" and "l" nodes. And if
the mail ends with "hello", then I'd index also "llo", "lo" and "o" for
it. Hopefully it doesn't become too big this way.. :) At least one
optimization is to allow the mail result list to contain "all mails" and
negaive entries, so eg. "lo" could exist in "*,-5" meaning in all mails
but mail with UID 5.

> > Unicode characters might also be a problem. If the characters are stored
> > in 16bit characters, it would waste space. If they're stored in UTF-8
> > the 4 bytes might not be enough to give a useful index (as a single
> > character might take even 5 bytes). One thing I was thinking was to not
> > index in 4 byte blocks, but 4 character blocks in UTF-8, so the trie
> > could be 4..20 levels deep depending on what unicode characters are used
> > (in finnish text it would probably be max. 7 levels deep).
> 
> Using UTF-8 in the index sounds good, but how hard would it be in 
> practise? The non-ascii characters are already encoded someway in the 
> mails arent they?

Yes, but they have to be translated to unicode in any case, because
searching will have to work regardless of what character set is used
when doing the search. So the choices would be pretty much:

1) Use 4 characters long UTF-16 strings
2) Use 4 bytes long UTF-8 strings
3) Use 4 character long UTF-8 strings

> > So, now would be a good time to tell if you can think of some better
> > ideas how to do this, especially if you know about some new algorithms
> > specifically designed for this kind of searching. :) Pretty much all the
> > algorithms I've found about this were suffix trees/arrays and this basic
> > trie thing.
> >
> 
> I am very interested in this area.
> The best full text search method I've yet to come across is the suffix 
> index.
> You index every suffix of a word.
> For 'hello':
> 'hello'
> 'ello'
> 'llo'
> 'lo'
> 'o'
> 
> So the index maps:
> suffix -> word -> list ( id1 .. idN )

That just would create even larger indexes than the 4-byte-trie-index..
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : http://dovecot.org/pipermail/dovecot/attachments/20060406/78b7b984/attachment.pgp