[Dovecot] Full text search indexing
Jens Laas
jens.laas at data.slu.se
Wed Apr 12 10:40:47 EEST 2006
(06.04.06 kl.17:39) Timo Sirainen skrev följande till dovecot at dovecot.org:
> On Thu, 2006-04-06 at 09:33 +0200, Jens Laas wrote:
>>> So, the one usable engine which I know of is Cyrus's squat indexing. It
>>> works basically by indexing the mails in 4 byte blocks and storing them
>>> in a trie. So if there exists a string "he llo", it indexes "he l",
>>> "e ll" and " llo" strings. When searching it then gets the first 4 bytes
>>> of the search keyword, looks them up from the squat index and then once
>>> it gets a list of matched mails, it reads the mails and does a proper
>>> search for only them. So the squat index doesn't give exact results, but
>>> only narrows down the number of mails which have to be searched.
This squat index thingy gave me an idea (which Ive also partly tested).
I made an even squatter index with just character couples (couplets).
- Each character is recoded to an integer between 0-63.
- A map of 64*64 bits is created for each mail.
- For each character couple indexed from the mail the corresponding
couplet bit is set: map[C1][C2] = 1.
- The search is then linear through all the maps generated for the
mailbox. Each map is tested for the presence of all the couplets
contained in the searchstring. This generates a number of candidate
mails that may match the searchstring (false positives occur).
- Candidate mails are the bruteforce searched for searchstring.
I tested the above with indexing raw body-parts from a fairly large
mailbox (72 MB/13000 mails).
The index is then ~6MB.
Each map in the index takes 512 bytes: (64*64)/8. (Slightly larger if you
also store the UID).
Advantage with this type if index is that its simple and very fast to
build (rebuilding indexed will probably occur often if youre not using
dovecot-lda).
The number of false positives as candidates vary alot depending on the
searchstring.
IO-wise this should be a clear improvement since the index is quite a bit
smaller than the mailbox. Our mailserver is very IO-bound. CPU cycles are
plenty.
> Yes, but they have to be translated to unicode in any case, because
> searching will have to work regardless of what character set is used
> when doing the search. So the choices would be pretty much:
>
> 1) Use 4 characters long UTF-16 strings
> 2) Use 4 bytes long UTF-8 strings
> 3) Use 4 character long UTF-8 strings
Im sorry for my incomplete IMAP knowledge. Is the server required convert
the searchstring and/or mimepart to the same character set for string
searching?
>> So the index maps:
>> suffix -> word -> list ( id1 .. idN )
>
> That just would create even larger indexes than the 4-byte-trie-index..
I tried it and it resulted in a huge index. Also costly to build. I agree.
Cheers
Jens
-----------------------------------------------------------------------
'Old C programmers don't die ... they're just cast into void*'
-----------------------------------------------------------------------
Jens Låås Email: jens.laas at data.slu.se
Department of Computer Services, SLU Phone: +46 18 67 35 15
Vindbrovägen 1
P.O. Box 7079
S-750 07 Uppsala
SWEDEN
-----------------------------------------------------------------------
More information about the dovecot
mailing list