[Dovecot] New full text search indexer

6 Apr 2007

      As described earlier
(http://dovecot.org/list/dovecot/2006-December/018055.html), Dovecot
nowadays has full text search indexing support in CVS HEAD.
Currently there are two backends: Lucene and Squat. Lucene's problem is
that standard IMAP SEARCH command can't be used with it without breaking
IMAP RFC. So Lucene can be used only with non-standard X-BODY-FAST and
X-TEXT-FAST search commands.
Squat's code then is quite complex and it's pretty slow. It works with 4
character blocks, so it can answer:

Search string < 4 characters: Can't answer anything
Search string = 4 characters: Definite yes/no
Search string > 4 characters: Probable yes / definite no

Since search strings are usually more than 4 characters, Dovecot uses
the index only to filter out messages which definitely won't match and
then does regular searching for the non-filtered messages. It works ok,
but it could work better.
The Squat indexes take maybe 30-50% of the mailbox's size. Most of the
space goes to UID lists (list of messages matching each 4 character
block). For example one UID list could be "2,4-10,20-30". This is
already stored in a compressed format by storing the UID numbers
relative to previous UID, which allows storing them usually with a
single byte per UID, even if the UIDs are large.
So, this morning I had an idea how this could maybe be improved with two
changes:

Don't use 4 character blocks. Instead allow variable sized paths, and
keep track of the UIDs for each node in the path. So for example adding
"abc" to UID 1 will place the UID 1 to node "a", "ab" and "abc".

Use UIDs relative to parent's UIDs. For example if node "a" has UIDs
1,3,5 and "ab" is added for UID 5, the UID is changed to 2 (3th in 1,3,5
list - 1). This can help the compression a lot. The best case scenario
for the above example is if "ab" contains also 1,3,5 UIDs. Then it can
be written with a single 0-2 range.

With these changes, the indexer can be configured to perform in
different ways. Text can be indexed in two ways: full words, and
substrings within those words (as required by IMAP). I think a useful
combination of these would be to index all 4 character substrings and
all full words up to 10 characters or so. This kind of an index can
answer:

Search string <= 4 characters: Definite yes/no
For X-BODY-FAST / X-TEXT-FAST:
Search string 5-10 characters: Definite yes/no
Search string 11+ characters: Probable yes / definite no

For standard IMAP SEARCH:
Search string 5-10 characters: Definite yes if message contains the
key as non-substring, otherwise same as below. This helps only if user
does a search that matches a lot of messages.
Search string 11+ characters: Probable yes / definite no

UTF-8 text is indexed as bytes, but the path depth can be counted as
characters instead of as bytes.
I've a test program finished, and unless there are some bugs left I
think it shows that this can work pretty well:
Squat-like 4 byte substrings (but can answer 1-3 char queries also):
Message count: 7495
Index time: 4.19 CPU seconds (7.39MB/CPUs)
Node memory: 3014072 bytes (2943 kB) in 90550 nodes
UID memory: 8277212 bytes (8083 kB) in 352778 lists
Total: 11291284 / 32474899 bytes = 34.77%
Indexing the same mailbox file with squat-test gives:

Index time: 9.55 CPU seconds, 9.63 real seconds (3.24MB/CPUs)
4145192 bytes in 102208 nodes (12.76%)
7286058 bytes in 305501 uid_lists (22.44%)
11431250 bytes total of 32474899 (35.20%)

So the new indexer is over twice as fast and uses slightly less space.
Indexing 4 char substrings and words up to 10 chars:
Index time: 5.13 CPU seconds (6.04MB/CPUs)
Node memory: 6002804 bytes (5862 kB) in 615501 nodes
UID memory: 9044610 bytes (8832 kB) in 519488 lists
Total: 15047414 / 32474899 bytes = 46.34%
So it takes somewhat more space, but definitely less than having both
Squat + Lucene.
No substring indexing, words up to 10 chars (Lucene replacement):
Index time: 1.39 CPU seconds (22.28MB/CPUs)
Node memory: 3297663 bytes (3220 kB) in 548540 nodes
UID memory: 1803735 bytes (1761 kB) in 213554 lists
Total: 5101398 / 32474899 bytes = 15.71%
Is there a need for Lucene anymore with this kind of a speed and disk
space usage? :)
The above tests are run with everything being indexed except space and
control chars (ie. practically space + tab). If only alphanumeric
characters are indexed, the total lines are (in same order as above):
Total: 7048664 / 32474899 bytes = 21.70% (-13.07%)
Total: 9610543 / 32474899 bytes = 29.59% (-16.75%)
Total: 3963335 / 32474899 bytes = 12.20% (-3.51%)
There are probably still a few optimizations that can be done to get the
disk space usage even lower. I did spent quite a lot of time making it
use less CPU though. Even tried a bit different coding style variations
in the critical functions to get gcc to create faster code..
One interesting thing I noticed was that binary searching a character
from a char array was slower than just sequentially going through the
characters. I tried adding checks to do it only if the array was big,
but it was always slower. Maybe something to do with CPU cache.
If you want to try yourself, the test program is in
http://dovecot.org/tmp/new-indexer.c

[Dovecot] New full text search indexer

Timo Sirainen