[Dovecot] Full text search indexing

10 Dec 2006

There are now two full text search backends in CVS HEAD:

squat: My own full text search index based on ideas from Cyrus Squat
indexes. Supports substring searches.
lucene: Uses the CLucene library.

It should be pretty easy to add support for more backends.

As I've explained a couple of times already, the IMAP RFC says that
searches are done by matching substrings, so the Lucene backend isn't
fully RFC compliant. For now it's still used to optimize standard SEARCH
BODY/TEXT rules, but I'll probably make that optional and instead add new
X-NONEXACT-BODY/TEXT rules. Then it will be possible to use both the
Lucene and Squat backends to optimize different searches (e.g. Lucene for
your modified webmail and Squat for standard IMAP clients).

In the end the Squat index worked better than I expected. It has two
limitations:

Search strings must be at least 4 characters long
Those 4 characters can't contain spaces

So e.g. searching for "ip add" doesn't work, but "ip addr" does. By
"doesn't work" I mean it falls back to the slow old way of going through
all the mails. The space limitation exists only because I thought that
such searches would be rare, and it reduces the index size by 20-50%.
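
A minimal sketch of how those two limitations could decide whether a query can use the index at all. This is only my reading of the rule as described above (the query needs at least one 4-character block without a space); the function names are made up for illustration and are not Dovecot's.

```python
BLOCK_LEN = 4  # block size described in the post


def indexable_blocks(query):
    """Return the 4-character blocks of the query that contain no space."""
    blocks = [query[i:i + BLOCK_LEN]
              for i in range(len(query) - BLOCK_LEN + 1)]
    return [b for b in blocks if " " not in b]


def can_use_index(query):
    """True if at least one block of the query could be looked up in the index."""
    return bool(indexable_blocks(query))


if __name__ == "__main__":
    for q in ("ip add", "ip addr", "abc"):
        print(repr(q), can_use_index(q))
```

With this reading, "ip addr" is usable only because of its last block, "addr"; every block of "ip add" contains a space.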

Squat indexing works by looking at 4-character blocks. When searching
for "address", it does the following:

Get a list of UIDs for "ress"
Filter out UIDs which don't appear for "dres"
The same for "ddre" and "addr"
Open all the messages that are left in the UID list and drop out the
messages where the whole "address" word doesn't exist.

I think going through the word's blocks backwards usually finds fewer
hits and so makes the search faster.
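
The lookup steps above can be sketched roughly like this. It is a toy model, not the actual Squat code: the real index is a trie with separate UID lists on disk, while this uses a plain in-memory dict of sets, compares case-sensitively, and all names (`build_index`, `search`, `blocks_of`) are hypothetical.

```python
from collections import defaultdict

BLOCK_LEN = 4  # block size described in the post


def blocks_of(text):
    """All 4-character blocks of text that contain no space."""
    out = []
    for i in range(len(text) - BLOCK_LEN + 1):
        block = text[i:i + BLOCK_LEN]
        if " " not in block:
            out.append(block)
    return out


def build_index(messages):
    """Map each block to the set of UIDs whose body contains it."""
    index = defaultdict(set)
    for uid, body in messages.items():
        for block in blocks_of(body):
            index[block].add(uid)
    return index


def search(index, messages, query):
    """Return the sorted UIDs of messages containing query as a substring."""
    blocks = blocks_of(query)
    if not blocks:
        # No usable block: fall back to scanning every message.
        return sorted(uid for uid, body in messages.items() if query in body)
    candidates = None
    # Walk the blocks backwards: "ress", "dres", "ddre", "addr".
    for block in reversed(blocks):
        uids = index.get(block, set())
        candidates = set(uids) if candidates is None else candidates & uids
        if not candidates:
            return []
    # Open the remaining messages and keep only real substring matches.
    return sorted(uid for uid in candidates if query in messages[uid])


if __name__ == "__main__":
    mails = {
        1: "my ip address is",
        2: "dress code and red dresses",
        3: "add ress ddre dres addr",
    }
    idx = build_index(mails)
    print(search(idx, mails, "address"))
```

Message 3 in the example contains all four blocks but not the word itself, which is why the final step of opening the remaining messages is still needed.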

As for the index sizes, indexing my dovecot mbox (7495 messages,
32479980 bytes) produced:
  dovecot.index.search : 2166526 bytes
  dovecot.index.search.uids : 6074155 bytes

So that's about 25% of the mailbox size. Last I checked, Cyrus Squat
took about 120% (with different messages). Anyway, as you can see, the
uids file is larger than the search trie file, so the number of messages
matters more for the index size than the number of bytes.
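
For anyone who wants to check the arithmetic, here it is spelled out using only the numbers quoted above (the variable names are just for illustration):

```python
# Figures quoted in the post.
mbox_bytes = 32479980
message_count = 7495
trie_bytes = 2166526   # dovecot.index.search
uids_bytes = 6074155   # dovecot.index.search.uids

index_bytes = trie_bytes + uids_bytes
ratio = index_bytes / mbox_bytes        # index size relative to the mailbox
uids_share = uids_bytes / index_bytes   # part of the index taken by the uids file
per_message = index_bytes / message_count

print(f"index total: {index_bytes} bytes")
print(f"index / mailbox: {ratio:.1%}")
print(f"uids file share of index: {uids_share:.1%}")
print(f"index bytes per message: {per_message:.0f}")
```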

The initial index build is pretty slow, but adding new messages to the
existing index should be pretty fast. New messages aren't indexed
until a client does a SEARCH BODY/TEXT. I'm not sure whether messages
should optionally be indexed immediately as they're seen (or even as
they're added by Dovecot LDA).

Oh, and another difference from Cyrus Squat is that Dovecot's version
indexes 16-bit characters rather than bytes, so non-English messages get
indexed much more nicely.

If you want to try this yourself, you'll need to add:

protocol imap {
  ..
  mail_plugins = fts fts_squat
}
plugin {
  fts = squat
}

Timo Sirainen