On Wed, 2006-04-12 at 09:40 +0200, Jens Laas wrote:
I made an even squatter index with just character couples (couplets).
- Each character is recoded to an integer between 0-63.
- A map of 64*64 bits is created for each mail.
- For each character couple indexed from the mail the corresponding couplet bit is set: map[C1][C2] = 1.
- The search is then linear through all the maps generated for the mailbox. Each map is tested for the presence of all the couplets contained in the searchstring. This generates a number of candidate mails that may match the searchstring (false positives occur).
- Candidate mails are the bruteforce searched for searchstring.
I tested the above with indexing raw body-parts from a fairly large mailbox (72 MB/13000 mails). The index is then ~6MB. Each map in the index takes 512 bytes: (64*64)/8. (Slightly larger if you also store the UID).
Advantage with this type if index is that its simple and very fast to build (rebuilding indexed will probably occur often if youre not using dovecot-lda).
The number of false positives as candidates vary alot depending on the searchstring.
Did you test what the rate was approximately with some somewhat sensible search strings?
The problem I see with this is that each map still has to be searched, so a 6MB index needs to be completely read and checked. So it could take 10x less time, but it's still O(N)..
But you're right that this is at least much simpler than the trie approach.
Yes, but they have to be translated to unicode in any case, because searching will have to work regardless of what character set is used when doing the search. So the choices would be pretty much:
- Use 4 characters long UTF-16 strings
- Use 4 bytes long UTF-8 strings
- Use 4 character long UTF-8 strings
Im sorry for my incomplete IMAP knowledge. Is the server required convert the searchstring and/or mimepart to the same character set for string searching?
Yes. Otherwise it's a bit difficult to search non-ASCII text in the parts of world where multiple character sets are used :)