There are now two full text search backends in CVS HEAD:
- squat: My own full text search index based on ideas from Cyrus Squat indexes. Supports substring searches.
- lucene: Uses CLucene library.
It should be pretty easy to add support for more backends.
As I've explained a couple of times already, the IMAP RFC says that searches are done by matching substrings, so the Lucene backend isn't fully RFC compliant. For now it's used to optimize standard SEARCH BODY/TEXT rules anyway, but I'll probably make that optional and add new X-NONEXACT-BODY/TEXT rules instead. Then it will be possible to use both the Lucene and Squat backends to optimize different searches (e.g. Lucene for your modified webmail and Squat for standard IMAP clients).
The Squat index ended up working better than I expected. It has two limitations:
- Search strings must be at least 4 characters long
- Those 4 characters can't contain spaces
So e.g. searching for "ip add" doesn't work, but "ip addr" does. By "doesn't work" I mean it falls back to the slow old way of going through all the mails. The space limitation exists only because I thought such searches would be rare, and it reduces the index size by 20-50%.
Squat indexing works by looking at 4-character blocks. When searching for "address":
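To make the two limitations concrete, here is a minimal Python sketch (not Dovecot's actual code). It assumes the rule is "the search string must contain at least one 4-character block without a space", which is my reading of the "ip add" / "ip addr" example; the function names are made up for illustration.

```python
BLOCK_LEN = 4  # block size described in the post


def indexable_blocks(query):
    """Return the 4-character blocks of `query` that contain no space.

    Hypothetical helper: assumes only space-free blocks are stored in the index.
    """
    blocks = [query[i:i + BLOCK_LEN] for i in range(len(query) - BLOCK_LEN + 1)]
    return [b for b in blocks if " " not in b]


def can_use_index(query):
    """True if at least one block of the query could be looked up in the index."""
    return len(indexable_blocks(query)) > 0


print(can_use_index("ip add"))     # False: "ip a", "p ad", " add" all contain a space
print(can_use_index("ip addr"))    # True: "addr" is usable
print(indexable_blocks("ip addr"))
```

Under this reading, "ip addr" is searchable only because of its last block, so the index narrows the candidates by "addr" and the full string is checked against the remaining messages.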
- Get a list of UIDs for "ress"
- Filter out UIDs which don't appear for "dres"
- The same for "ddre" and "addr"
- Open all the messages that are left in the UID list and drop out the messages where the whole "address" word doesn't exist.
I think searching the words backwards usually finds fewer hits and makes the search faster.
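The steps above can be sketched in a few lines of Python. This is a toy model, not the real implementation: it keeps the index in a dict instead of an on-disk trie, ignores case folding and character sets, and the names `build_index` and `search` are hypothetical.

```python
from collections import defaultdict

BLOCK_LEN = 4  # block size described in the post


def build_index(messages):
    """Map each space-free 4-character block to the set of UIDs containing it."""
    index = defaultdict(set)
    for uid, text in messages.items():
        for i in range(len(text) - BLOCK_LEN + 1):
            block = text[i:i + BLOCK_LEN]
            if " " not in block:
                index[block].add(uid)
    return index


def search(index, messages, query):
    """Return sorted UIDs of messages that contain `query` as a substring."""
    blocks = [query[i:i + BLOCK_LEN] for i in range(len(query) - BLOCK_LEN + 1)]
    blocks = [b for b in blocks if " " not in b]
    if not blocks:
        # No usable block: fall back to scanning every message.
        return sorted(uid for uid, text in messages.items() if query in text)

    candidates = None
    for block in reversed(blocks):  # "ress", "dres", "ddre", "addr"
        uids = index.get(block, set())
        candidates = set(uids) if candidates is None else candidates & uids
        if not candidates:
            return []

    # The blocks may all occur without the whole word being there, so verify.
    return sorted(uid for uid in candidates if query in messages[uid])


messages = {
    1: "my ip address changed",
    2: "dress code and added stress",
    3: "addres stress",  # has all four blocks, but not "address" itself
}
index = build_index(messages)
print(search(index, messages, "address"))  # [1]
```

Message 2 is dropped during the block filtering (it has no "ddre"), while message 3 survives the filtering and is only dropped by the final check against the message text.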
As for the index sizes, indexing my dovecot mbox with 7495 messages and a size of 32479980 bytes produced:
dovecot.index.search      : 2166526 bytes
dovecot.index.search.uids : 6074155 bytes
So that's about 25% of the mailbox size. Last I checked, Cyrus Squat took about 120% (with different messages). Anyway as you can see the uids file is larger than the search trie file, so the number of messages is more important in determining the index size than the number of bytes.
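The 25% figure can be checked from the numbers above with a bit of arithmetic (Python used just as a calculator here; the variable names are mine):

```python
mbox_bytes = 32479980   # size of the mbox
messages = 7495         # number of messages in it
trie_bytes = 2166526    # dovecot.index.search
uids_bytes = 6074155    # dovecot.index.search.uids

total = trie_bytes + uids_bytes
ratio = total / mbox_bytes

print(f"index total: {total} bytes ({ratio:.1%} of the mailbox)")
print(f"uids file share of the index: {uids_bytes / total:.0%}")
print(f"uids file bytes per message: {uids_bytes / messages:.0f}")
```

The uids file makes up roughly three quarters of the total, which is why the message count matters more than the byte count.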
The initial index build is pretty slow, but adding new messages to the existing index should be pretty fast. The new messages aren't indexed until a client does a SEARCH BODY/TEXT. I'm not sure if the messages should optionally be indexed immediately as they're seen (or even when they're added by Dovecot LDA).
Oh, and another difference from Cyrus Squat is that Dovecot's version indexes 16-bit characters and not bytes. So non-English messages get indexed much more nicely.
If you want to try this yourself, you'll need to add:
protocol imap {
  ..
  mail_plugins = fts fts_squat
}
plugin {
  fts = squat
}