[Dovecot] Squat full text search indexer redesign

26 Nov 2007

      I was writing this to beta9 announcement, but it grew so huge that I
guess a separate mail is better. :)
I've spent quite a lot of time this last week redesigning/rewriting
Squat indexer. I think the redesign is good enough now that I'll just
replace the existing squat with this new one in next release. I don't
think many people are using the old squat, and there are also some bugs
in it that cause some searches not to find anything (haven't tried to
fix it).
The old Squat indexes about 3MB/s on my machine while the new one
indexes 8MB/s. Or it's actually 10MB/s, but at the end of indexing it
does a compression step that merges all fragmented UID lists. That drops
the index size something like 5-10%, but it's pretty slow because it has
to go through all the fragmented lists. It could probably be optimized
to be a lot faster, but I think I'll forget about it for now.
Besides being faster there are other improvements in it:

Old squat could answer only 4 character long substring searches. New
squat can answer 1..n characters long substring searches, where n can be
configurable. The larger the n, the larger the index size. For example
4..10 index sizes are about 35% -> 46% -> 53% -> 59% -> 63% -> 67% ->
70% of the mailbox size (with 32MB of Dovecot mailing list archives).

With both old and new squat if the search keyword is longer than n,
Dovecot looks up the list of all possible n char long combinations of
the word, gets a list of their common UIDs and then reads those mails to
verify that the word is really found from them.

New squat can also be configured to index longer words for
non-substring searches. This allows it to give a definite list of UIDs
where the word is found, so Dovecot doesn't have to open those mails to
verify it.

However it can't give a definite list of UIDs where the substring isn't
found. So it still has to do the n-char-combination lookup described
above, and for those returned UIDs which aren't also in the
definite-UIDs list Dovecot still has to read them to check if the
substring is found or not. This could be skipped for non-standard
X-FAST-TEXT and X-FAST-BODY searches which don't try to search
substrings.
Adding 255 char long non-substring searches to 4 char long substring
searches grows the index about 7% (35%->42% for 32MB, 28%->36% for
200MB, 27%->33% for 1,4GB). Unfortunately it also grows memory usage
pretty much. For a 200MB mailbox heap usage grows from 34MB to 163MB.
For 1,4GB lkml mailbox heap usage grows from 80MB to 700MB. So I'm not
sure if this should be enabled by default until something can be done
about the memory usage.

The current list of indexed characters are: A-Z, 0-9, @.-+#$%_& and
all 8bit chars. When trying to search for non-indexed characters Dovecot
has to split the search word. For example "a,b" would search "a" and
"b", combine the results and then open those mails to see where it
really exists. The same thing happens with "a b". If the search word
doesn't contain any indexable characters, Dovecot has to read all mails.
This indexable character list could be changed runtime, so if a user
keeps searching for some non-indexed character, it could be added to
indexed characters and the index be rebuilt.

To be able to give definite replies for BODY searches squat has to
keep track of whether a word was found from body or header. This also
optimizes HEADER searches because then Dovecot has to search only those
mails that have the word in their headers, instead of the mails that
have it in either header or body. Unfortunately this makes the indexes
somewhat larger again (22%->28% for 200MB), but I think this should be
done by default.

The header+body is stored by storing UID 1=message 1 header, UID
2=message 1 body, UID 3=message 2 header, etc. I haven't yet tried how
well the UID lists could be compressed if could support only-header or
only-body ranges. For example now if word is found from all headers but
from no bodies, it stores the UID list as 1,3,5,7,etc, while it could be
stored as simple "only bodies, 1:7".
I guess the UID list compression could use all kinds of other algorithms
as well. Currently it supports only the whole list being either a
bitmask or UID range. A long UID list might compress better as a bitmask
+range combination. UID lists take most of the Squat space, so even
small improvements can give huge benefits. If you're interested in
trying, you can download the current test code and modify
uidlist_write_array() for writing and node_uidlist_get_at_offset() for
reading.
http://dovecot.org/tmp/test.c
http://dovecot.org/tmp/test-uidlist.c
Place to src/plugins/fts-squat/, run with ./test some-mbox-file
Another somewhat different idea to substring indexing (suggested to me
by someone else) would be to index the messages' words normally like
most indexers do, and then use Squat indexes for those words. So for
example when searching for "ord" it first looks up all words that
contain "ord" substring in them. Then it looks up message UIDs for those
words, and the resulting UID list is a definite list of messages that
has the substring.
I'm not sure how well the above would work in practise. If the list of
words would be mostly static it would work great. Unfortunately every
mail has at least two unique strings: Message-ID and Received header's
ESMTP ID. Dovecot must be able to search those as well. Also with some
languages, like Finnish, a single word can be written in tens of
different ways and combined with other words.
A quick preliminary testing for 32MB Dovecot mailing list shows:

142265 unique words takes 2,0MB, one per line
indexing those words creates 6,7MB Squat indexes

So not including extra overhead for the unique word index, these use a
total of 8,8MB. 4-char-substring Squat indexes for the same mailbox take
8,5MB and 4-char-substring+255-char-full-word Squat indexes take 11MB.
So this doesn't look very promising.

[Dovecot] Squat full text search indexer redesign

Timo Sirainen