FTS indexing is something I hear about quite often nowadays. I’ve added some hacks to make it work better for some installations, but it’s about time to think about the whole design and how it could be improved for everyone in the future. Here are some of my initial thoughts.
Currently Dovecot supports 3 full text search engines: Solr, CLucene and Dovecot Squat. The CLucene plugin has various features built into it that should have been implemented generically so they would work with all the engines (although the Solr backend already has most of them built in). Squat was abandoned a few years ago in favor of Solr/CLucene, but perhaps it could be brought back to life, since it looks like its index sizes could be smaller than Lucene's.
Here's a list of things that should be added to generic Dovecot FTS code to improve all the backends:
Support for multiple languages. Use textcat while indexing to guess the language of the indexed data. (Perhaps run it separately for each paragraph to handle multi-language mails? At the very least many mails begin/end in a different language than the text in the middle, e.g. "Foo Bar wrote:" is often in various languages.) Index the data using the detected language's stemming and other features. Keep track of which languages have been used in the index, and when searching, stem the search words for all the used languages. Since each added language requires additional searches and there's the possibility of wrong detection, the list of allowed languages could be configurable. See also http://ntextcat.codeplex.com/, or at least change textcat to use UTF-8.
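As a reference point, language detection with libtextcat's C API is only a few calls. Here's a minimal standalone sketch; the header and configuration file paths vary by installation and none of this is existing Dovecot code:
    /* minimal language detection sketch using libtextcat; not Dovecot code */
    #include <stdio.h>
    #include <string.h>
    #include <textcat.h> /* may live under libtextcat/ depending on the distro */

    int main(void)
    {
        const char *text = "Foo Bar kirjoitti: kiitos viestista, vastaan huomenna.";
        void *handle;
        const char *result;

        /* conf.txt maps fingerprint files to language names; path is an example */
        handle = textcat_Init("/usr/share/libtextcat/conf.txt");
        if (handle == NULL)
            return 1;

        /* returns something like "[finnish]" or "[english][scots]", or
           "SHORT"/"UNKNOWN" when the text is too short to classify reliably */
        result = textcat_Classify(handle, text, strlen(text));
        printf("detected: %s\n", result);

        textcat_Done(handle);
        return 0;
    }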
Word stemming. This can be done for many languages with the Snowball library. Solr has also implemented stemmers for several other languages; perhaps that code could somehow be automatically translated to C(++) for use with Dovecot?
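For illustration, a minimal sketch of stemming a single word with the Snowball libstemmer C API (the language and word are arbitrary examples):
    /* minimal word stemming sketch using the Snowball libstemmer C API */
    #include <stdio.h>
    #include <string.h>
    #include <libstemmer.h>

    int main(void)
    {
        const char *word = "indexing";
        struct sb_stemmer *stemmer;
        const sb_symbol *stem;

        /* NULL character encoding means UTF-8 */
        stemmer = sb_stemmer_new("english", NULL);
        if (stemmer == NULL)
            return 1;

        stem = sb_stemmer_stem(stemmer, (const sb_symbol *)word, (int)strlen(word));
        /* prints "indexing -> index" */
        printf("%s -> %.*s\n", word, sb_stemmer_length(stemmer), (const char *)stem);

        sb_stemmer_delete(stemmer);
        return 0;
    }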
Don't index language-specific stopwords. We can get the word lists from e.g. Solr.
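A stopword filter is then just a per-language set lookup before a word is added to the index. A trivial sketch; the handful of words below are only common English examples, the real lists would be loaded from e.g. Solr's stopword files:
    /* trivial stopword check sketch; real lists would be loaded per language */
    #include <stdio.h>
    #include <string.h>

    static const char *stopwords_en[] = {
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it", "of", "on", "or",
        "the", "to", "was", "will", "with"
    };

    static int is_stopword(const char *word)
    {
        unsigned int i;

        for (i = 0; i < sizeof(stopwords_en)/sizeof(stopwords_en[0]); i++) {
            if (strcmp(word, stopwords_en[i]) == 0)
                return 1;
        }
        return 0;
    }

    int main(void)
    {
        /* prints "the: 1, dovecot: 0" */
        printf("the: %d, dovecot: %d\n",
               is_stopword("the"), is_stopword("dovecot"));
        return 0;
    }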
Try to detect compound words and index each part separately for languages that use them. http://wiki.apache.org/solr/LanguageAnalysis#Decompounding suggests two possible ways to do it.
Normalize words (e.g. drop diacritics). libicu can be used for this.
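One possible way to do this with libicu's C API is to decompose the text with NFKD and then drop the combining marks. A standalone sketch with minimal error handling, not tied to Dovecot's internal APIs:
    /* diacritic-dropping sketch with libicu: NFKD-decompose, then skip the
       combining marks; error handling is minimal to keep it short */
    #include <stdio.h>
    #include <unicode/uchar.h>
    #include <unicode/unorm2.h>
    #include <unicode/ustring.h>
    #include <unicode/utf16.h>

    int main(void)
    {
        const char *input = "Crème Brûlée";
        UErrorCode err = U_ZERO_ERROR;
        UChar utf16[128], decomposed[256], stripped[256];
        char output[256];
        int32_t len, dlen, i, out;
        UChar32 c;
        const UNormalizer2 *nfkd;

        u_strFromUTF8(utf16, 128, &len, input, -1, &err);
        nfkd = unorm2_getNFKDInstance(&err);
        dlen = unorm2_normalize(nfkd, utf16, len, decomposed, 256, &err);

        /* copy everything except the combining (non-spacing) marks */
        out = 0;
        for (i = 0; i < dlen; ) {
            U16_NEXT(decomposed, i, dlen, c);
            if (u_charType(c) == U_NON_SPACING_MARK)
                continue;
            U16_APPEND_UNSAFE(stripped, out, c);
        }

        u_strToUTF8(output, (int32_t)sizeof(output), NULL, stripped, out, &err);
        if (U_SUCCESS(err))
            printf("%s -> %s\n", input, output); /* Crème Brûlée -> Creme Brulee */
        return 0;
    }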
Drop (Unicode) characters that don't belong to the language? Or at least don't index most of the unusual Unicode characters. This would avoid filling the index with unnecessary garbage.
Don't index non-text data? For example if there is a large block of base64 data or something else that definitely doesn't look like text, it's pretty useless to index it. Then again, we do want to index all kinds of IDs that someone might want to search for. This could be a bit difficult to implement well.
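For the base64 case, one simple heuristic would be to look at how large a fraction of a block consists of base64 alphabet characters and how long its lines are. A rough sketch where the thresholds are arbitrary guesses:
    /* rough "does this block look like base64?" heuristic; the thresholds
       are arbitrary guesses and only meant to illustrate the idea */
    #include <stdio.h>
    #include <string.h>
    #include <stdbool.h>
    #include <ctype.h>

    static bool looks_like_base64_block(const char *data, size_t size)
    {
        size_t i, b64_chars = 0, line_len = 0, max_line_len = 0;

        for (i = 0; i < size; i++) {
            char c = data[i];

            if (c == '\n') {
                if (line_len > max_line_len)
                    max_line_len = line_len;
                line_len = 0;
                continue;
            }
            line_len++;
            if (isalnum((unsigned char)c) || c == '+' || c == '/' || c == '=')
                b64_chars++;
        }
        if (line_len > max_line_len)
            max_line_len = line_len;

        /* large block, long lines, (almost) nothing but base64 characters */
        return size >= 256 && max_line_len >= 60 && b64_chars * 100 / size >= 95;
    }

    int main(void)
    {
        const char *text = "Hello, this is a normal sentence about indexing.";

        /* prints "0" */
        printf("%d\n", looks_like_base64_block(text, strlen(text)) ? 1 : 0);
        return 0;
    }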
Index attachments separately, so it would be possible to search only attachments. (Should "SEARCH BODY word1 BODY word2" return matches if word1 and word2 are in different attachments?)
Attachments can already be converted to indexable UTF-8 text with the fts_decoder setting, which runs them through a conversion script. This could also support talking to an Apache Tika server directly.
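For reference, the current script-based setup looks roughly like this in the configuration (paths are installation-specific):
    plugin {
      fts_decoder = decode2text
    }
    service decode2text {
      executable = script /usr/libexec/dovecot/decode2text.sh
      user = dovecot
      unix_listener decode2text {
        mode = 0666
      }
    }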
It should be configurable which fields are indexed. The body and headers would always be indexed as separate fields. Optionally there could also be at least: attachments, From, To, Cc, Bcc and Subject. The From/To/Cc/Bcc fields could also be indexed together in one "addresses" field. The more fields there are, the larger the index, but also the better/faster the search results.
Each indexed mail should have some metadata: the mailbox GUID, the mail UID and the language the mail was indexed with. For attachments there should also be the MIME part number. When matching results, drop a result if its language doesn't match the query language.
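Roughly, the per-document metadata could look something like this (the struct and field names are just a sketch, not anything that exists in Dovecot):
    /* sketch of the per-document metadata; not an actual Dovecot struct */
    #include <stdint.h>

    struct fts_doc_metadata {
        uint8_t mailbox_guid[16]; /* GUID of the mailbox the mail is in */
        uint32_t uid;             /* IMAP UID within that mailbox */
        uint16_t language;        /* index into the list of languages seen so far */
        uint32_t mime_part;       /* MIME part number; 0 could mean the main body */
    };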
Squat
Currently a Squat index consists of a trie containing all the words and a separate file listing the message UIDs that contain them. Each node in the trie has a pointer to its UID list, so e.g. with "abc" the "a" node contains the UIDs of all mails that contain the letter "a" (e.g. 1,3-5,10). The "ab" node contains the mails that have the "ab" substring. Since "ab" is a subset of "a", the "ab" node doesn't contain UIDs directly; instead it contains indexes into the "a" list to get better compression (e.g. UIDs 3-5,10 map to indexes 2-5 in the "1,3-5,10" list). The "abc" node then similarly refers to the "ab" node's indexes.
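To make the indirection concrete, here's a small standalone sketch of resolving the "ab" node's index list against the "a" node's UID list (the arrays are invented purely for the example):
    /* resolving a child node's index list against its parent's UID list;
       the arrays are invented just to show the indirection */
    #include <stdio.h>

    int main(void)
    {
        /* "a" node: UIDs of all mails containing the letter "a" */
        const unsigned int a_uids[] = { 1, 3, 4, 5, 10 };
        /* "ab" node: 0-based indexes into a_uids[] instead of UIDs */
        const unsigned int ab_indexes[] = { 1, 2, 3, 4 };
        unsigned int i;

        /* prints: mails containing "ab": 3 4 5 10 */
        printf("mails containing \"ab\":");
        for (i = 0; i < sizeof(ab_indexes)/sizeof(ab_indexes[0]); i++)
            printf(" %u", a_uids[ab_indexes[i]]);
        printf("\n");
        return 0;
    }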
The maximum word length Squat indexes is configurable, and so is the substring match length. By default both are 4 letters, so words longer than 4 letters are split into 4-letter pieces which are indexed (e.g. "dovecot" -> "dove", "ovec", "veco", "ecot"). When searching, these pieces are looked up and the results are merged.
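The splitting itself is simple, something like the following (an illustration of the behavior only, not the actual Squat code, and it assumes one byte per letter):
    /* split a word into the overlapping 4-letter pieces Squat indexes;
       illustration only, and it assumes one byte per letter */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *word = "dovecot";
        size_t len = strlen(word), i;

        if (len <= 4) {
            printf("%s\n", word);
            return 0;
        }
        /* prints "dove", "ovec", "veco", "ecot" */
        for (i = 0; i + 4 <= len; i++)
            printf("%.4s\n", word + i);
        return 0;
    }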
It's pretty pointless to do a search for 1-2 letter substrings. Most likely the user wants to find a 1-2 letter word instead. Perhaps this is true for 3 letters as well? The Squat index could be changed so that the first 1-2 (or 1-3?) letter nodes only list full words, not word prefixes. This of course would mean that having "ab" refer to the "a" UID list would no longer work for the first nodes.
Substring searching likely wouldn't work very nicely for stemmed words, so Squat should probably index the full stemmed word and also index the unstemmed word as the small 4-letter pieces. It should also be possible to disable substring searching entirely.
Squat already attempts to reduce disk space by encoding common characters with fewer bits than other characters. This is hardcoded for English, though. Each index compression run could analyze the most common letters and use those dynamically. Perhaps different languages could even use separate Squat index files to get the most out of this? Although if there are a lot of languages it's a bit annoying to have many different index files.
Squat currently indexes each Unicode character as one letter, so there can be a huge number of different 4-letter words. This substring indexing should probably be disabled for words that contain letters not used by the current language.
The UID lists are not compressed very efficiently. See http://www2009.eprints.org/41/1/p401.pdf for alternatives.
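Most of the encodings discussed in papers like that build on the same basic idea: store the gaps between sorted UIDs and encode each gap with a variable-length code. A minimal delta + variable-byte sketch, which is neither Squat's current format nor the paper's more advanced schemes:
    /* delta + variable-byte encoding of a sorted UID list; a minimal
       illustration of the general idea, not Squat's current format */
    #include <stdio.h>
    #include <stdint.h>

    static unsigned int varint_encode(uint32_t value, uint8_t *out)
    {
        unsigned int len = 0;

        while (value >= 0x80) {
            out[len++] = (value & 0x7f) | 0x80;
            value >>= 7;
        }
        out[len++] = (uint8_t)value;
        return len;
    }

    int main(void)
    {
        const uint32_t uids[] = { 1, 3, 4, 5, 10, 1000, 1001 };
        uint8_t buf[64];
        uint32_t prev = 0;
        unsigned int i, pos = 0;

        for (i = 0; i < sizeof(uids)/sizeof(uids[0]); i++) {
            /* the list is sorted, so only the (small) gaps get encoded */
            pos += varint_encode(uids[i] - prev, buf + pos);
            prev = uids[i];
        }
        /* prints "7 UIDs encoded into 8 bytes" */
        printf("%u UIDs encoded into %u bytes\n",
               (unsigned int)(sizeof(uids)/sizeof(uids[0])), pos);
        return 0;
    }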
There is currently one Squat index per mailbox. There should be one per user. This requires adding a separate ID -> { mailbox GUID, UID, language, mime part } mapping file.
Squat indexes should be updated in two parts. For old mails there are the large trie+uidlist base files in an optimized format. For new mails there are small trie+uidlist files in a format that is simply appended to. When the trie or uidlist becomes large enough, it's first internally sorted/optimized and then merged with the base files by recreating new optimized base files. This fixes the scalability problem with the current Squat indexes, where sorting a large unsorted index could take forever.
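In pseudo-C, the update path could look something like this; everything here (names, the threshold, counting mails instead of bytes) is invented just to illustrate the idea:
    /* sketch of the two-part update idea: new mails are appended to a small
       delta index, which is sorted and merged into the optimized base once it
       grows large enough; names, threshold and counting are all invented */
    #include <stdio.h>
    #include <stdint.h>

    #define DELTA_MERGE_THRESHOLD 1000 /* arbitrary, for the demo only */

    struct squat_index {
        uint64_t base_mails;  /* mails in the large optimized trie+uidlist files */
        uint64_t delta_mails; /* mails in the small append-only trie+uidlist files */
    };

    static void squat_index_mail(struct squat_index *index)
    {
        /* appending is cheap: nothing is sorted on this path */
        index->delta_mails++;

        if (index->delta_mails >= DELTA_MERGE_THRESHOLD) {
            /* here the delta would first be sorted/optimized internally and
               then merged into freshly written base files */
            index->base_mails += index->delta_mails;
            index->delta_mails = 0;
        }
    }

    int main(void)
    {
        struct squat_index index = { 0, 0 };
        unsigned int i;

        for (i = 0; i < 2500; i++)
            squat_index_mail(&index);
        /* prints "base=2000 delta=500" */
        printf("base=%llu delta=%llu\n",
               (unsigned long long)index.base_mails,
               (unsigned long long)index.delta_mails);
        return 0;
    }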
Since substring searches in Squat produce only "maybe matches" UID lists, the matched mails still need to be opened and searched through. This step could probably also do the same language detection + stemming as is done during indexing, to improve the search results. (And this feature could be used even when no FTS index is enabled.)