I was writing this to beta9 announcement, but it grew so huge that I guess a separate mail is better. :)
I've spent quite a lot of time this last week redesigning/rewriting Squat indexer. I think the redesign is good enough now that I'll just replace the existing squat with this new one in next release. I don't think many people are using the old squat, and there are also some bugs in it that cause some searches not to find anything (haven't tried to fix it).
The old Squat indexes about 3MB/s on my machine while the new one indexes 8MB/s. Or it's actually 10MB/s, but at the end of indexing it does a compression step that merges all fragmented UID lists. That drops the index size something like 5-10%, but it's pretty slow because it has to go through all the fragmented lists. It could probably be optimized to be a lot faster, but I think I'll forget about it for now.
Besides being faster there are other improvements in it:
- Old squat could answer only 4 character long substring searches. New squat can answer 1..n characters long substring searches, where n can be configurable. The larger the n, the larger the index size. For example 4..10 index sizes are about 35% -> 46% -> 53% -> 59% -> 63% -> 67% -> 70% of the mailbox size (with 32MB of Dovecot mailing list archives).
With both old and new squat if the search keyword is longer than n, Dovecot looks up the list of all possible n char long combinations of the word, gets a list of their common UIDs and then reads those mails to verify that the word is really found from them.
- New squat can also be configured to index longer words for non-substring searches. This allows it to give a definite list of UIDs where the word is found, so Dovecot doesn't have to open those mails to verify it.
However it can't give a definite list of UIDs where the substring isn't found. So it still has to do the n-char-combination lookup described above, and for those returned UIDs which aren't also in the definite-UIDs list Dovecot still has to read them to check if the substring is found or not. This could be skipped for non-standard X-FAST-TEXT and X-FAST-BODY searches which don't try to search substrings.
Adding 255 char long non-substring searches to 4 char long substring searches grows the index about 7% (35%->42% for 32MB, 28%->36% for 200MB, 27%->33% for 1,4GB). Unfortunately it also grows memory usage pretty much. For a 200MB mailbox heap usage grows from 34MB to 163MB. For 1,4GB lkml mailbox heap usage grows from 80MB to 700MB. So I'm not sure if this should be enabled by default until something can be done about the memory usage.
The current list of indexed characters are: A-Z, 0-9, @.-+#$%_& and all 8bit chars. When trying to search for non-indexed characters Dovecot has to split the search word. For example "a,b" would search "a" and "b", combine the results and then open those mails to see where it really exists. The same thing happens with "a b". If the search word doesn't contain any indexable characters, Dovecot has to read all mails. This indexable character list could be changed runtime, so if a user keeps searching for some non-indexed character, it could be added to indexed characters and the index be rebuilt.
To be able to give definite replies for BODY searches squat has to keep track of whether a word was found from body or header. This also optimizes HEADER searches because then Dovecot has to search only those mails that have the word in their headers, instead of the mails that have it in either header or body. Unfortunately this makes the indexes somewhat larger again (22%->28% for 200MB), but I think this should be done by default.
The header+body is stored by storing UID 1=message 1 header, UID 2=message 1 body, UID 3=message 2 header, etc. I haven't yet tried how well the UID lists could be compressed if could support only-header or only-body ranges. For example now if word is found from all headers but from no bodies, it stores the UID list as 1,3,5,7,etc, while it could be stored as simple "only bodies, 1:7".
I guess the UID list compression could use all kinds of other algorithms as well. Currently it supports only the whole list being either a bitmask or UID range. A long UID list might compress better as a bitmask +range combination. UID lists take most of the Squat space, so even small improvements can give huge benefits. If you're interested in trying, you can download the current test code and modify uidlist_write_array() for writing and node_uidlist_get_at_offset() for reading.
http://dovecot.org/tmp/test.c http://dovecot.org/tmp/test-uidlist.c Place to src/plugins/fts-squat/, run with ./test some-mbox-file
Another somewhat different idea to substring indexing (suggested to me by someone else) would be to index the messages' words normally like most indexers do, and then use Squat indexes for those words. So for example when searching for "ord" it first looks up all words that contain "ord" substring in them. Then it looks up message UIDs for those words, and the resulting UID list is a definite list of messages that has the substring.
I'm not sure how well the above would work in practise. If the list of words would be mostly static it would work great. Unfortunately every mail has at least two unique strings: Message-ID and Received header's ESMTP ID. Dovecot must be able to search those as well. Also with some languages, like Finnish, a single word can be written in tens of different ways and combined with other words.
A quick preliminary testing for 32MB Dovecot mailing list shows:
- 142265 unique words takes 2,0MB, one per line
- indexing those words creates 6,7MB Squat indexes
So not including extra overhead for the unique word index, these use a total of 8,8MB. 4-char-substring Squat indexes for the same mailbox take 8,5MB and 4-char-substring+255-char-full-word Squat indexes take 11MB. So this doesn't look very promising.