[Dovecot] Full text search improvements
FTS indexing is something I hear about quite often nowadays. I’ve added some hacks to make it work better for some installations, but it’s about time to think about the whole design and how it could be improved for everyone in future. Here are some of my initial thoughts.
Currently Dovecot supports three full text search engines: Solr, CLucene and Dovecot Squat. The CLucene plugin has various features built into it that should instead have been implemented generically so they work with all the engines (although Solr already has most of them built in). Squat was abandoned a few years ago in favor of Solr/CLucene, but perhaps it could be brought back to life, since it looks like its index sizes could be smaller than Lucene's.
Here's a list of things that should be added to generic Dovecot FTS code to improve all the backends:
- Support for multiple languages. Use textcat while indexing to guess the language of the indexed data. (Perhaps run it separately for each paragraph to handle multi-language mails? Or at least many emails begin/end in a different language than the text in the middle, e.g. "Foo Bar wrote:" appears in various languages.) Index the data using the detected language's stemming and other features. Keep track of which languages have been used in the index, and when searching, stem the search words for all the used languages. Since each added language requires additional searches and there's the possibility of wrong detection, the list of allowed languages could be configurable. See also http://ntextcat.codeplex.com/ or at least change textcat to use UTF-8. (A rough sketch of this detection + stemming flow follows this list.)
- Word stemming. This can be done for many languages with the Snowball library. Solr has also implemented several other languages; perhaps its code could somehow be automatically translated to C(++) for use with Dovecot?
- Don't index language-specific stopwords. We can get the word lists from e.g. Solr.
- Try to detect compound words and index each part separately for languages that use them. http://wiki.apache.org/solr/LanguageAnalysis#Decompounding suggests two possible ways to do it.
- Normalize words (e.g. drop diacritics). libicu can be used for this.
- Drop (Unicode) characters that don't belong to the language? Or at least don't index most of the weird Unicode characters. This would avoid filling the index with unnecessary garbage.
- Don't index non-text data? For example, if there is a large block of base64 data or something else that definitely doesn't look like text, it's pretty useless to index it. Then again, we do want to index all kinds of IDs that someone might want to search for. This could be a bit difficult to implement well.
- Index attachments separately, so it would be possible to search only attachments. (Should "SEARCH BODY word1 BODY word2" return matches if word1 and word2 are in different attachments?)
- Attachments can already be translated to indexable UTF-8 text with the fts_decoder setting, which does it via a conversion script. This could also support an Apache Tika server directly.
- It should be configurable which fields are indexed. Body and header would always be indexed separately. Optionally there could also be at least: attachments, From, To, Cc, Bcc and Subject. The From/To/Cc/Bcc fields could also be indexed together in one "addresses" field. The more fields there are, the larger the index, but the better/faster the search results.
- Each indexed mail should have metadata: mailbox GUID, mail UID and the language the mail was indexed with. For attachments there should also be the MIME part number. When matching results, drop results if the returned language doesn't match the query language.
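[Editorial sketch] To make the first two items above concrete, here is a minimal sketch of the indexing flow, assuming libtextcat's C API (textcat_Init/textcat_Classify/textcat_Done) and Snowball's libstemmer. The config path, the hardcoded "english" stemmer and the single sample word are illustrative only; real code would parse textcat's output and map it to a Snowball algorithm name:

#include <stdio.h>
#include <string.h>
#include "libstemmer.h" /* Snowball: sb_stemmer_new()/sb_stemmer_stem() */
#include "textcat.h"    /* libtextcat: textcat_Init()/textcat_Classify() */

static void index_text_sketch(const char *text, size_t size)
{
    /* Guess the language from textcat's fingerprint database. The
       result looks like "[english]"; real code would parse out the
       candidate list and map it to Snowball algorithm names. */
    void *tc = textcat_Init("/etc/dovecot/fpdb.conf"); /* illustrative path */
    const char *guess = textcat_Classify(tc, text, size);

    /* Stem one word with the detected language's stemmer (hardcoded
       to "english" here instead of mapping from the guess). */
    struct sb_stemmer *stemmer = sb_stemmer_new("english", "UTF_8");
    const char *word = "searching"; /* one token from the word splitter */
    const sb_symbol *stem = sb_stemmer_stem(stemmer,
        (const sb_symbol *)word, (int)strlen(word));

    printf("language=%s stem=%.*s\n", guess,
           sb_stemmer_length(stemmer), (const char *)stem);
    /* ...feed the stemmed word to the FTS backend and remember the
       language in the index metadata... */

    sb_stemmer_delete(stemmer);
    textcat_Done(tc);
}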
Squat
Currently the Squat index consists of a trie containing all the words, plus a file listing all the message UIDs that contain them. Each node in the trie has a pointer into the UID list, so e.g. with "abc" the "a" node will contain the UIDs of all mails that contain the letter "a" (e.g. 1,3-5,10). The "ab" node will contain the mails that have the "ab" substring. Since "ab" is a subset of "a", the "ab" node won't contain UIDs directly; instead it contains indexes into the "a" list to get better compression (e.g. UIDs 3-5,10 -> indexes 2-4 in the "1,3-5,10" list). The "abc" node then similarly refers to the "ab" node's indexes.
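[Editorial sketch] As a rough in-memory illustration of that layout (hypothetical structs; the real Squat format is an on-disk structure, not these types):

#include <stdint.h>

struct squat_uidlist {
    /* For a first-level node ("a"): actual message UIDs, e.g. 1,3-5,10.
       For a deeper node ("ab"): indexes into the parent node's list,
       which compress better since they are smaller numbers. */
    uint32_t *entries;
    unsigned int count;
};

struct squat_trie_node {
    unsigned int chr;                  /* the character this node adds */
    struct squat_trie_node *parent;    /* NULL for first-level nodes */
    struct squat_trie_node **children;
    unsigned int children_count;
    struct squat_uidlist uids;         /* UIDs, or parent-list indexes */
};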
It's configurable how long words Squat will index. Also substring matching is configurable. By default both are 4 letters, so words longer than 4 letters will be split to 4 letter pieces which are indexed (e.g. "dovecot" -> "dove", "ovec", "veco", "ecot"). When searching these pieces are looked up and the results are merged.
It's pretty pointless to do a search for 1-2 letter substrings. Most likely the user wants to find 1-2 letter word instead. Perhaps this is true also for 3 letters? The Squat index could be changed to only add results for the first 1-2 (or 1-3?) letters only for full words, not to word prefixes. This of course would mean that the "ab" referring to "a" UID list would no longer work for the first nodes.
Substring searching likely wouldn't work very nicely for stemmed words. So Squat should probably index the full stemmed word and then also index the unstemmed word as the small 4-letter pieces. It should also be possible to disable substring searching entirely.
Squat already attempts to reduce disk space by encoding the common characters with fewer bits than other characters. This is hardcoded for the English language though. Each index compression run could analyze the most common letters and use those dynamically. Perhaps different languages could even use separate Squat index files to get the most advantage from this? Although if there are a lot of languages, it's a bit annoying to have many different index files.
Squat currently indexes each Unicode character as one letter, so there can be a huge number of different 4-letter words. This substring indexing should probably be disabled for words that contain letters not used by the current language.
The UID lists are not compressed very efficiently. See http://www2009.eprints.org/41/1/p401.pdf for alternatives.
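[Editorial sketch] The simplest of the schemes covered in that paper is delta + variable-byte coding, shown below as a baseline; the paper compares it against considerably faster and denser alternatives (e.g. PForDelta):

#include <stddef.h>
#include <stdint.h>

/* Encode a sorted UID list: store each UID as the difference from the
   previous one, 7 bits per byte with the high bit as a continuation
   marker. Small deltas (dense UID ranges) thus take only one byte. */
static size_t uidlist_encode(const uint32_t *uids, unsigned int count,
                             uint8_t *out /* at least 5*count bytes */)
{
    uint32_t prev = 0;
    size_t pos = 0;
    unsigned int i;

    for (i = 0; i < count; i++) {
        uint32_t delta = uids[i] - prev;
        prev = uids[i];
        while (delta >= 0x80) {
            out[pos++] = (uint8_t)(delta | 0x80);
            delta >>= 7;
        }
        out[pos++] = (uint8_t)delta;
    }
    return pos;
}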
There is currently one Squat index per mailbox. There should be one per user. This requires adding a separate ID -> { mailbox GUID, UID, language, MIME part } mapping file.
Squat indexes should be updated in two parts. For old mails there are the large trie+uidlist base files in an optimized format. For new mails there are small trie+uidlist files in a format where new data is simply appended. When the trie or uidlist becomes large enough, it's first internally sorted/optimized, and then merged with the base files by recreating new optimized base files. This fixes the scalability problem with the current Squat indexes, where sorting a large unsorted index could take forever.
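[Editorial sketch] A hypothetical outline of that update policy; every name and the threshold here are made up for illustration, none of them is a real Dovecot function:

#include <stdint.h>

struct squat_file;
struct indexed_mail;

void squat_file_append(struct squat_file *f, const struct indexed_mail *m);
uint64_t squat_file_size(const struct squat_file *f);
void squat_file_optimize(struct squat_file *f);
void squat_file_merge_into_base(struct squat_file *base,
                                struct squat_file *delta);
void squat_file_reset(struct squat_file *f);

#define DELTA_MERGE_THRESHOLD (8ULL * 1024 * 1024) /* illustrative */

struct squat_index {
    struct squat_file *base;  /* large, optimized trie+uidlist files */
    struct squat_file *delta; /* small, append-only trie+uidlist files */
};

static void squat_index_add(struct squat_index *index,
                            const struct indexed_mail *mail)
{
    squat_file_append(index->delta, mail);
    if (squat_file_size(index->delta) >= DELTA_MERGE_THRESHOLD) {
        /* sort/optimize only the small delta, then stream-merge it
           with the base into newly written optimized base files */
        squat_file_optimize(index->delta);
        squat_file_merge_into_base(index->base, index->delta);
        squat_file_reset(index->delta);
    }
}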
Since substring searches in Squat produce only "maybe matches" UID lists, the mails still need to be opened and searched through. This step could probably also do the same language detection + stemming as is done during indexing, to improve the search results. (And this feature could be enabled even if no FTS index is enabled.)
> how [FTS indexing] could be improved for everyone in future
For sites which set client_limit > 1, it would help performance not to stall for INDEXER_WAIT_MSECS when polling the indexer for input. Currently dovecot unwinds back out to the main command loop repeatedly to allow other clients to use the process, but it also stalls the whole process for INDEXER_WAIT_MSECS every time it finds no input from the indexer, which hurts responsiveness for those other clients. This can be avoided by removing the client's I/O from the main ioloop and adding the indexer's instead, or perhaps by leveraging CLIENT_COMMAND_STATE_WAIT_EXTERNAL.
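[Editorial sketch] The io-swapping idea might look roughly like this. io_add()/io_remove() are Dovecot's real ioloop calls, but the client struct and callbacks below are simplified stand-ins, not the actual imap client code:

#include "lib.h"
#include "ioloop.h"

struct client {
    int fd;
    struct io *io;          /* watches the IMAP client connection */
    struct io *indexer_io;  /* watches the indexer socket while waiting */
};

static void client_indexer_input(struct client *client)
{
    /* ...read the indexer notification, continue the pending SEARCH,
       then io_remove(&client->indexer_io) and re-add client->io... */
}

static void client_wait_indexer(struct client *client, int indexer_fd)
{
    /* Instead of stalling the whole process for INDEXER_WAIT_MSECS,
       stop watching this client and watch the indexer fd in the main
       ioloop, so the process keeps serving its other clients. */
    io_remove(&client->io);
    client->indexer_io = io_add(indexer_fd, IO_READ,
                                client_indexer_input, client);
}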
Third-party FTS implementations may benefit from having the NOT/AND/OR seq_range_array merging logic in squat_lookup_arg() generalized and made available to all.
It would also be helpful if FTS expunge were asynchronous, but this is not critical.
On 2.12.2013, at 20.50, Mike Abbott michael.abbott@apple.com wrote:
>> how [FTS indexing] could be improved for everyone in future
> For sites which set client_limit > 1, it would help performance not to stall for INDEXER_WAIT_MSECS when polling the indexer for input. Currently dovecot unwinds back out to the main command loop repeatedly to allow other clients to use the process, but it also stalls the whole process for INDEXER_WAIT_MSECS every time it finds no input from the indexer, which hurts responsiveness for those other clients. This can be avoided by removing the client's I/O from the main ioloop and adding the indexer's instead, or perhaps by leveraging CLIENT_COMMAND_STATE_WAIT_EXTERNAL.
Gets a bit tricky to implement, at least without changing the lib-storage API. I did have some plans for this earlier, where lib-storage could call a callback when there is more data available for search/fetch/mailbox_open/etc. functions. Currently I’m thinking that most of the reasons for client_limit>1 can be avoided just by moving IMAP IDLE connections to a separate imap-idle process where they wait until they have more work to do. Do you think that would work for you also?
On 12/02/2013 02:41 PM, Timo Sirainen wrote:
> Currently I’m thinking that most of the reasons for client_limit>1 can be avoided just by moving IMAP IDLE connections to a separate imap-idle process where they wait until they have more work to do. Do you think that would work for you also?
I was thinking about exactly the same thing. I wanted to request this feature but I guess I was too shy to write about it :D I think a special IDLE process would be a wonderful idea. I find that otherwise client_limit>1 doesn't really work. It gets especially annoying when a client with a large mailbox makes a process grow and it doesn't shrink back; is there some insight about that? And, after service_count is maxed out, you end up having lots of processes waiting for the last 1 or 2 IDLEing clients to quit, so your total number of processes is really much larger than total connections / client_limit.
> Do you think [moving IMAP IDLE connections to a separate imap-idle process] would work for you also?
Probably. It always depends on the details. Forking a new imap process every time there's a little input to read or output to send might perform poorly under load. Having a pool of ready imap processes could help that, when the configuration permits (e.g. all mail owned by one uid). It would be interesting to compare client_limit > 1 vs. an idle connection aggregator.
What's so evil about client_limit > 1 besides requiring one uid, the indexer polling I mentioned, and broken fcntl-style file locks? Or is that enough?
On 3.12.2013, at 0.09, Mike Abbott michael.abbott@apple.com wrote:
>> Do you think [moving IMAP IDLE connections to a separate imap-idle process] would work for you also?
> Probably. It always depends on the details. Forking a new imap process every time there's a little input to read or output to send might perform poorly under load. Having a pool of ready imap processes could help that, when the configuration permits (e.g. all mail owned by one uid). It would be interesting to compare client_limit > 1 vs. an idle connection aggregator.
I was thinking that you’d have a pool of imap processes waiting and being reused. Some state would be transferred between the imap-idle and imap processes. And it could also work for non-IDLEing idle connections. Then there needs to be some kind of good balance in figuring out when to move a connection to imap-idle, to maximize the amount of time it’s there but also to minimize unnecessary CPU-wasting transfers. Oh, and this would also be possible with multiple UIDs (although imap-idle might have to run as root then).
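[Editorial sketch] Handing a live connection between the imap-idle and imap processes would presumably use standard Unix fd passing over a Unix socket; a bare sketch of the sending side with sendmsg()/SCM_RIGHTS (any session state would travel in the normal data bytes alongside the fd):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

static int send_conn_fd(int unix_sock, int conn_fd)
{
    char dummy = '0';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    char cbuf[CMSG_SPACE(sizeof(int))];
    struct msghdr msg;
    struct cmsghdr *cmsg;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    /* attach the connection's fd as ancillary SCM_RIGHTS data; the
       kernel duplicates it into the receiving process */
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &conn_fd, sizeof(int));

    return sendmsg(unix_sock, &msg, 0) < 0 ? -1 : 0;
}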
> What's so evil about client_limit > 1 besides requiring one uid, the indexer polling I mentioned, and broken fcntl-style file locks? Or is that enough?
Mainly that there are so many possible reasons why an imap process might block. It’s not possible to make all of them asynchronous. I guess getting rid of the longest waits could help, but I still wouldn’t dare to run that in production.
Substring matching is important to us, so we'd love to see Squat reinstated with speed improvements. It seems like Solr can handle substrings as well ([Edge]NGramFilterFactory), but for small deployments, having the engine built right in is a plus.
Quoting Timo Sirainen tss@iki.fi:
> - Support for multiple languages. Use textcat while indexing to guess the language of the indexed data.
FWIW, you could probably use the Content-Language header (if it exists) to at least give a hint. No guarantee it is correct, but it's a better starting place than simply scanning all languages.
And, for that matter, you could also leverage Accept-Language (again, if it exists), which might be more useful, since it lists all the languages the user recognizes.
michael
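[Editorial sketch] That header-first, detect-as-fallback logic is simple; a sketch where both helper functions are hypothetical, just to show the shape:

#include <stddef.h>

const char *headers_get(const void *headers, const char *name);
const char *detect_language_textcat(const char *body, size_t size);

static const char *mail_guess_language(const void *headers,
                                       const char *body, size_t size)
{
    /* Trust Content-Language (or an Accept-Language hint) when
       present; it may still be wrong, so real code would treat it as
       a first candidate rather than the final answer. */
    const char *hint = headers_get(headers, "Content-Language");
    if (hint != NULL && *hint != '\0')
        return hint; /* e.g. "en" or "de-DE" */
    return detect_language_textcat(body, size);
}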
On Sat, 30 Nov 2013, Timo Sirainen wrote:
> - Don't index non-text data? For example, if there is a large block of base64 data or something else that definitely doesn't look like text, it's pretty useless to index it. Then again, we do want to index all kinds of IDs that someone might want to search for. This could be a bit difficult to implement well.
> - Attachments can already be translated to indexable UTF-8 text with the fts_decoder setting, which does it via a conversion script. This could also support an Apache Tika server directly.
This means some kind of MIME-type-based (or file type guessing) "... to UTF-8 text" converter script? Some users would find that very very very^n nice. There are already several such programs used in the CMS field.
Steffen Kaiser
On 5.12.2013, at 10.40, Steffen Kaiser skdovecot@smail.inf.fh-brs.de wrote:
>> - Attachments can already be translated to indexable UTF-8 text with the fts_decoder setting, which does it via a conversion script. This could also support an Apache Tika server directly.
> This means some kind of MIME-type-based (or file type guessing) "... to UTF-8 text" converter script? Some users would find that very very very^n nice. There are already several such programs used in the CMS field.
That’s already been possible since v2.1: http://hg.dovecot.org/dovecot-2.2/file/342f6962390e/src/plugins/fts/decode2t...
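[Editorial note] For reference, the configuration looks roughly like this, based on the decode2text example shipped with Dovecot; the script path varies by installation:

plugin {
  fts_decoder = decode2text
}
service decode2text {
  executable = script /usr/libexec/dovecot/decode2text.sh
  user = dovecot
  unix_listener decode2text {
    mode = 0666
  }
}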