Re: [Dovecot] Full text search indexing

12 Apr 2006

      On Wed, 2006-04-12 at 09:40 +0200, Jens Laas wrote:
...
I made an even squatter index with just character couples (couplets).

Each character is recoded to an integer between 0-63.
A map of 64*64 bits is created for each mail.
For each character couple indexed from the mail the corresponding
couplet bit is set: map[C1][C2] = 1.
The search is then linear through all the maps generated for the
mailbox. Each map is tested for the presence of all the couplets
contained in the searchstring. This generates a number of candidate
mails that may match the searchstring (false positives occur).
Candidate mails are the bruteforce searched for searchstring.

I tested the above with indexing raw body-parts from a fairly large
mailbox (72 MB/13000 mails).
The index is then ~6MB.
Each map in the index takes 512 bytes: (64*64)/8. (Slightly larger if you
also store the UID).
Advantage with this type if index is that its simple and very fast to
build (rebuilding indexed will probably occur often if youre not using
dovecot-lda).
The number of false positives as candidates vary alot depending on the
searchstring.
Did you test what the rate was approximately with some somewhat sensible
search strings?
The problem I see with this is that each map still has to be searched,
so a 6MB index needs to be completely read and checked. So it could take
10x less time, but it's still O(N)..
But you're right that this is at least much simpler than the trie
approach.
...
...
Yes, but they have to be translated to unicode in any case, because
searching will have to work regardless of what character set is used
when doing the search. So the choices would be pretty much:

Use 4 characters long UTF-16 strings
Use 4 bytes long UTF-8 strings
Use 4 character long UTF-8 strings

Im sorry for my incomplete IMAP knowledge. Is the server required convert
the searchstring and/or mimepart to the same character set for string
searching?
Yes. Otherwise it's a bit difficult to search non-ASCII text in the
parts of world where multiple character sets are used :)

Re: [Dovecot] Full text search indexing

Timo Sirainen