[Dovecot] Full text search indexing

newer
[Dovecot] http://wiki.dovecot.org/...

Timo Sirainen

3 Apr 2006 3 Apr '06

10:44 a.m.

SEARCH is probably the slowest command that Dovecot has, especially when searching texts from message bodies. There exists a lot of text search engines, but unfortunately as far as I know there exists only one that can be used with IMAP.

The reason why IMAP is special is because it requires that when searching eg. "ello", it must match also mails which contain eg. word "hello". Most search engines support only matching either complete words or matching from the beginning of the words.

So, the one usable engine which I know of is Cyrus's squat indexing. It works basically by indexing the mails in 4 byte blocks and storing them in a trie. So if there exists a string "he llo", it indexes "he l", "e ll" and " llo" strings. When searching it then gets the first 4 bytes of the search keyword, looks them up from the squat index and then once it gets a list of matched mails, it reads the mails and does a proper search for only them. So the squat index doesn't give exact results, but only narrows down the number of mails which have to be searched.

The biggest problem with those squat indexes is that it doesn't support incremental updates, so there has to be some nightly runs to rebuild the indexes from scratch. And with lots of users that's not practical, so I think that's the reason why the squat indexes aren't all that commonly used with Cyrus.

I was thinking about taking the basic ideas from the Cyrus squat indexer but making it possible to update it incrementally so Dovecot could do it automatically. Also to reduce space I was thinking about sorting the 4 byte strings so instead of eg. "abcd", "dcba" and "abdc" all being indexed, they could all be stored as "abcd" (so the search key would also have to be sorted before use). But that might cause larger result sets making the search slower, so I'll have to test if it's worth it.

Unicode characters might also be a problem. If the characters are stored in 16bit characters, it would waste space. If they're stored in UTF-8 the 4 bytes might not be enough to give a useful index (as a single character might take even 5 bytes). One thing I was thinking was to not index in 4 byte blocks, but 4 character blocks in UTF-8, so the trie could be 4..20 levels deep depending on what unicode characters are used (in finnish text it would probably be max. 7 levels deep).

So, now would be a good time to tell if you can think of some better ideas how to do this, especially if you know about some new algorithms specifically designed for this kind of searching. :) Pretty much all the algorithms I've found about this were suffix trees/arrays and this basic trie thing.

Attachments:

signature.asc (application/pgp-signature — 189 bytes)

Show replies by date

Jens Laas

6 Apr 6 Apr

10:33 a.m.

(06.04.03 kl.10:44) Timo Sirainen skrev följande till dovecot@dovecot.org:

...

SEARCH is probably the slowest command that Dovecot has, especially when searching texts from message bodies. There exists a lot of text search engines, but unfortunately as far as I know there exists only one that can be used with IMAP.

The reason why IMAP is special is because it requires that when searching eg. "ello", it must match also mails which contain eg. word "hello". Most search engines support only matching either complete words or matching from the beginning of the words.

So, the one usable engine which I know of is Cyrus's squat indexing. It works basically by indexing the mails in 4 byte blocks and storing them in a trie. So if there exists a string "he llo", it indexes "he l", "e ll" and " llo" strings. When searching it then gets the first 4 bytes of the search keyword, looks them up from the squat index and then once it gets a list of matched mails, it reads the mails and does a proper search for only them. So the squat index doesn't give exact results, but only narrows down the number of mails which have to be searched.

How would you search for '*lo' in 'hello' ? Only 'ello' would be indexed.

...

Unicode characters might also be a problem. If the characters are stored in 16bit characters, it would waste space. If they're stored in UTF-8 the 4 bytes might not be enough to give a useful index (as a single character might take even 5 bytes). One thing I was thinking was to not index in 4 byte blocks, but 4 character blocks in UTF-8, so the trie could be 4..20 levels deep depending on what unicode characters are used (in finnish text it would probably be max. 7 levels deep).

Using UTF-8 in the index sounds good, but how hard would it be in practise? The non-ascii characters are already encoded someway in the mails arent they?

...

So, now would be a good time to tell if you can think of some better ideas how to do this, especially if you know about some new algorithms specifically designed for this kind of searching. :) Pretty much all the algorithms I've found about this were suffix trees/arrays and this basic trie thing.

I am very interested in this area. The best full text search method I've yet to come across is the suffix index. You index every suffix of a word. For 'hello': 'hello' 'ello' 'llo' 'lo' 'o'

So the index maps: suffix -> word -> list ( id1 .. idN )

You can also map word to a phrase as an intermediate step.

I feel certain there must be better methods out there.

Cheers Jens

 'Old C programmers don't die ... they're just cast into void*'

 Jens Låås                              Email: jens.laas@data.slu.se
 Department of Computer Services, SLU   Phone: +46 18 67 35 15
 Vindbrovägen 1
 P.O. Box 7079
 S-750 07 Uppsala
 SWEDEN

Steffen Kaiser

11:56 a.m.

On Thu, 6 Apr 2006, Jens Laas wrote:

Hmm, I'm not certain if yet another search engine is really good, I mean:

there are search engines - ok usually for the web, but indexed is indexed: namazu, swish, mnogosearch, htdig?

So, how about a search API one can call an external indexing service?

That reminds me of something: How about storing all the mails into a database and use its search capabilites?

Bye,

-- Steffen Kaiser

Bill Boebel

12:55 p.m.

On Thu, April 6, 2006 4:56, Steffen Kaiser said:

...

Hmm, I'm not certain if yet another search engine is really good, I mean:

So, how about a search API one can call an external indexing service?

I like this idea. Add a hook to call a plugin for: (1) indexing new messages, (2) moving/deleting messages, and (3) performing full text searches.

It seems reasonable these days with GB+ sized mailboxes, to violate this part of the IMAP RFC (2060) and use a powerful search engine like Lucene which doesn't support substring matches:

6.4.4. SEARCH Command .. In all search keys that use strings, a message matches the key if the string is a substring of the field.

Bill

Timo Sirainen

5:29 p.m.

On Thu, 2006-04-06 at 05:55 -0400, Bill Boebel wrote:

...

On Thu, April 6, 2006 4:56, Steffen Kaiser said:

...
Hmm, I'm not certain if yet another search engine is really good, I mean:

So, how about a search API one can call an external indexing service?

I like this idea. Add a hook to call a plugin for: (1) indexing new messages, (2) moving/deleting messages, and (3) performing full text searches.

Well, those are all already possible to do with plugins. :) It could be made easier though.

...

It seems reasonable these days with GB+ sized mailboxes, to violate this part of the IMAP RFC (2060) and use a powerful search engine like Lucene which doesn't support substring matches:

6.4.4. SEARCH Command .. In all search keys that use strings, a message matches the key if the string is a substring of the field.

I've also been thinking about an IMAP extension which would allow less strict searching as long as the client supports it. Possibly if you really want to you could also configure Dovecot to violate the RFC with standard SEARCH command..

Timo Sirainen

5:39 p.m.

On Thu, 2006-04-06 at 09:33 +0200, Jens Laas wrote:

...

...
So, the one usable engine which I know of is Cyrus's squat indexing. It works basically by indexing the mails in 4 byte blocks and storing them in a trie. So if there exists a string "he llo", it indexes "he l", "e ll" and " llo" strings. When searching it then gets the first 4 bytes of the search keyword, looks them up from the squat index and then once it gets a list of matched mails, it reads the mails and does a proper search for only them. So the squat index doesn't give exact results, but only narrows down the number of mails which have to be searched.

How would you search for '*lo' in 'hello' ? Only 'ello' would be indexed.

There are no wildcards in IMAP's search, but "lo" search would do the same thing.

I think Cyrus simply fallbacks to searching the hard way if the search key is less than 4 bytes. I was thinking about also keeping result lists for smaller search keys, so the indexes could still be used.

So if the mail contains "hello world", then "lo w" is indexed and the mail's UID is stored to all "lo w", "lo ", "lo" and "l" nodes. And if the mail ends with "hello", then I'd index also "llo", "lo" and "o" for it. Hopefully it doesn't become too big this way.. :) At least one optimization is to allow the mail result list to contain "all mails" and negaive entries, so eg. "lo" could exist in "*,-5" meaning in all mails but mail with UID 5.

...

...
Unicode characters might also be a problem. If the characters are stored in 16bit characters, it would waste space. If they're stored in UTF-8 the 4 bytes might not be enough to give a useful index (as a single character might take even 5 bytes). One thing I was thinking was to not index in 4 byte blocks, but 4 character blocks in UTF-8, so the trie could be 4..20 levels deep depending on what unicode characters are used (in finnish text it would probably be max. 7 levels deep).

Using UTF-8 in the index sounds good, but how hard would it be in practise? The non-ascii characters are already encoded someway in the mails arent they?

Yes, but they have to be translated to unicode in any case, because searching will have to work regardless of what character set is used when doing the search. So the choices would be pretty much:

Use 4 characters long UTF-16 strings
Use 4 bytes long UTF-8 strings
Use 4 character long UTF-8 strings

...

...
So, now would be a good time to tell if you can think of some better ideas how to do this, especially if you know about some new algorithms specifically designed for this kind of searching. :) Pretty much all the algorithms I've found about this were suffix trees/arrays and this basic trie thing.

I am very interested in this area. The best full text search method I've yet to come across is the suffix index. You index every suffix of a word. For 'hello': 'hello' 'ello' 'llo' 'lo' 'o'

So the index maps: suffix -> word -> list ( id1 .. idN )

That just would create even larger indexes than the 4-byte-trie-index..

Jens Laas

12 Apr 12 Apr

10:40 a.m.

(06.04.06 kl.17:39) Timo Sirainen skrev följande till dovecot@dovecot.org:

...

On Thu, 2006-04-06 at 09:33 +0200, Jens Laas wrote:

...
...
So, the one usable engine which I know of is Cyrus's squat indexing. It works basically by indexing the mails in 4 byte blocks and storing them in a trie. So if there exists a string "he llo", it indexes "he l", "e ll" and " llo" strings. When searching it then gets the first 4 bytes of the search keyword, looks them up from the squat index and then once it gets a list of matched mails, it reads the mails and does a proper search for only them. So the squat index doesn't give exact results, but only narrows down the number of mails which have to be searched.

This squat index thingy gave me an idea (which Ive also partly tested).

I made an even squatter index with just character couples (couplets).

Each character is recoded to an integer between 0-63.
A map of 64*64 bits is created for each mail.
For each character couple indexed from the mail the corresponding couplet bit is set: map[C1][C2] = 1.
The search is then linear through all the maps generated for the mailbox. Each map is tested for the presence of all the couplets contained in the searchstring. This generates a number of candidate mails that may match the searchstring (false positives occur).
Candidate mails are the bruteforce searched for searchstring.

I tested the above with indexing raw body-parts from a fairly large mailbox (72 MB/13000 mails). The index is then ~6MB. Each map in the index takes 512 bytes: (64*64)/8. (Slightly larger if you also store the UID).

Advantage with this type if index is that its simple and very fast to build (rebuilding indexed will probably occur often if youre not using dovecot-lda).

The number of false positives as candidates vary alot depending on the searchstring.

IO-wise this should be a clear improvement since the index is quite a bit smaller than the mailbox. Our mailserver is very IO-bound. CPU cycles are plenty.

...

Yes, but they have to be translated to unicode in any case, because searching will have to work regardless of what character set is used when doing the search. So the choices would be pretty much:

Use 4 characters long UTF-16 strings

Use 4 bytes long UTF-8 strings

Use 4 character long UTF-8 strings

Im sorry for my incomplete IMAP knowledge. Is the server required convert the searchstring and/or mimepart to the same character set for string searching?

...

...
So the index maps: suffix -> word -> list ( id1 .. idN )

That just would create even larger indexes than the 4-byte-trie-index..

I tried it and it resulted in a huge index. Also costly to build. I agree.

Cheers Jens

 'Old C programmers don't die ... they're just cast into void*'

 Jens Låås                              Email: jens.laas@data.slu.se
 Department of Computer Services, SLU   Phone: +46 18 67 35 15
 Vindbrovägen 1
 P.O. Box 7079
 S-750 07 Uppsala
 SWEDEN

Timo Sirainen

11:20 a.m.

On Wed, 2006-04-12 at 09:40 +0200, Jens Laas wrote:

...

I made an even squatter index with just character couples (couplets).

Each character is recoded to an integer between 0-63.

A map of 64*64 bits is created for each mail.

For each character couple indexed from the mail the corresponding couplet bit is set: map[C1][C2] = 1.

The search is then linear through all the maps generated for the mailbox. Each map is tested for the presence of all the couplets contained in the searchstring. This generates a number of candidate mails that may match the searchstring (false positives occur).

Candidate mails are the bruteforce searched for searchstring.

I tested the above with indexing raw body-parts from a fairly large mailbox (72 MB/13000 mails). The index is then ~6MB. Each map in the index takes 512 bytes: (64*64)/8. (Slightly larger if you also store the UID).

Advantage with this type if index is that its simple and very fast to build (rebuilding indexed will probably occur often if youre not using dovecot-lda).

The number of false positives as candidates vary alot depending on the searchstring.

Did you test what the rate was approximately with some somewhat sensible search strings?

The problem I see with this is that each map still has to be searched, so a 6MB index needs to be completely read and checked. So it could take 10x less time, but it's still O(N)..

But you're right that this is at least much simpler than the trie approach.

...

...
Yes, but they have to be translated to unicode in any case, because searching will have to work regardless of what character set is used when doing the search. So the choices would be pretty much:

Use 4 characters long UTF-16 strings

Use 4 bytes long UTF-8 strings

Use 4 character long UTF-8 strings

Im sorry for my incomplete IMAP knowledge. Is the server required convert the searchstring and/or mimepart to the same character set for string searching?

Yes. Otherwise it's a bit difficult to search non-ASCII text in the parts of world where multiple character sets are used :)

Jens Laas

12:09 p.m.

(06.04.12 kl.11:20) Timo Sirainen skrev följande till Jens Laas:

...

On Wed, 2006-04-12 at 09:40 +0200, Jens Laas wrote:

...
I made an even squatter index with just character couples (couplets).

Each character is recoded to an integer between 0-63.

A map of 64*64 bits is created for each mail.

For each character couple indexed from the mail the corresponding couplet bit is set: map[C1][C2] = 1.

The search is then linear through all the maps generated for the mailbox. Each map is tested for the presence of all the couplets contained in the searchstring. This generates a number of candidate mails that may match the searchstring (false positives occur).

Candidate mails are the bruteforce searched for searchstring.

I tested the above with indexing raw body-parts from a fairly large mailbox (72 MB/13000 mails). The index is then ~6MB. Each map in the index takes 512 bytes: (64*64)/8. (Slightly larger if you also store the UID).

Advantage with this type if index is that its simple and very fast to build (rebuilding indexed will probably occur often if youre not using dovecot-lda).

The number of false positives as candidates vary alot depending on the searchstring.

Did you test what the rate was approximately with some somewhat sensible search strings?

I'll try. Sensible searchstrings are not that easy to come up with though. The data indexed are the bodyparts of an UML (linux) archive.

First line is nr of candidates. Second line is result of a grep -i in the rawdata. (Ouf of a total of almost 13000 maps).

jensl:~/project/jelindex> ./search.sh "jeff" 3652 3861 linux-gnu, Wed Apr 12 10:55:22 mobilus.data.slu.se jensl:~/project/jelindex> ./search.sh "startx" 251 36 linux-gnu, Wed Apr 12 10:55:30 mobilus.data.slu.se jensl:~/project/jelindex> ./search.sh "uml builder" 2415 77 linux-gnu, Wed Apr 12 10:55:38 mobilus.data.slu.se jensl:~/project/jelindex> ./search.sh "xnest" 82 86 jensl:~/project/jelindex> ./search.sh "kernel panic" 4692 1100

Some worse cases: jensl:~/project/jelindex> ./search.sh "management" 5098 341 jensl:~/project/jelindex> ./search.sh "Timo Sirainen" 494 0

Keep in mind: The grep may report multiple counts from the same mail. The indexed data contains a few fairly large base64 encoded attachements. I suspect the base64 data will fill those maps pretty well.

...

The problem I see with this is that each map still has to be searched, so a 6MB index needs to be completely read and checked. So it could take 10x less time, but it's still O(N)..

Yes. But if it's faster, it's faster.. Generally I think it's true that the better and faster the index the more time it takes to build. Good algorithms mitigate this of course.

To improve accuracy while potentially sacrificing IO we could make the maps larger. More map-data to be read but maybe less bruteforce searching of mails.

...

But you're right that this is at least much simpler than the trie approach.

I think something along these lines is well worth to keep in mind atleast.

 'Old C programmers don't die ... they're just cast into void*'

 Jens Låås                              Email: jens.laas@data.slu.se
 Department of Computer Services, SLU   Phone: +46 18 67 35 15
 Vindbrovägen 1
 P.O. Box 7079
 S-750 07 Uppsala
 SWEDEN

Rob Middleton

12:19 p.m.

...

...
Did you test what the rate was approximately with some somewhat sensible search strings?

I'll try. Sensible searchstrings are not that easy to come up with though. The data indexed are the bodyparts of an UML (linux) archive.

First line is nr of candidates. Second line is result of a grep -i in the rawdata. (Ouf of a total of almost 13000 maps).

grep -c -i may be more useful. (then the two numbers are directly comparable)

I certainly think that orders of magnitude is more important than O() in the search case of mailboxes where messages come and go regularly ... this type of index is actually appendable.

...

Some worse cases: jensl:~/project/jelindex> ./search.sh "management" 5098 341 jensl:~/project/jelindex> ./search.sh "Timo Sirainen" 494 0

??? What Timo Sirainen requires all of the pairs: "ti", "im", "mo", "o ", " S", "Si" ... and so on. Longer strings should (all else being equal) result in greater accuracy.

I understand spaces are treated exactly the same way as characters in IMAP search.

Rob.

Rob Middleton

11:48 a.m.

...

...
On Thu, 2006-04-06 at 09:33 +0200, Jens Laas wrote:

...
...
So, the one usable engine which I know of is Cyrus's squat indexing. It works basically by indexing the mails in 4 byte blocks and storing them in a trie. So if there exists a string "he llo", it indexes "he l", "e ll" and " llo" strings. When searching it then gets the first 4 bytes of the search keyword, looks them up from the squat index and then once it gets a list of matched mails, it reads the mails and does a proper search for only them. So the squat index doesn't give exact results, but only narrows down the number of mails which have to be searched.

This squat index thingy gave me an idea (which Ive also partly tested).

I made an even squatter index with just character couples (couplets).

Each character is recoded to an integer between 0-63.

A map of 64*64 bits is created for each mail.

I like this idea - and will be interested in statistics on real mailboxes and real searches carried out by users.

I would think that reducing the index to 32*32 bits thus reducing the IO on the index by a factor of 4 would give fairly comparable usefulness. (ie - the index should be case insensitive as case is not likely to be nearly as discriminatory as the actual letter of alphabet ... but I am not a natural language expert so don't have statistics to back that belief up.)

...

For each character couple indexed from the mail the corresponding couplet bit is set: map[C1][C2] = 1.

The search is then linear through all the maps generated for the mailbox. Each map is tested for the presence of all the couplets contained in the searchstring. This generates a number of candidate mails that may match the searchstring (false positives occur).

Candidate mails are the bruteforce searched for searchstring.

I tested the above with indexing raw body-parts from a fairly large mailbox (72 MB/13000 mails).

Which part of the body? If indexing an html email you would presumably want to strip all the html tags for indexing as that is not meaningful. What about if there is a mime multipart/alternative text or html part - index both or just the text part?

...

The index is then ~6MB. Each map in the index takes 512 bytes: (64*64)/8. (Slightly larger if you also store the UID).

Advantage with this type if index is that its simple and very fast to build (rebuilding indexed will probably occur often if youre not using dovecot-lda).

And rebuilding the index doesn't mean reprocessing all of the emails just reading in the old index processing new messages discarding deleted ones then writing the index back to disk.

...

The number of false positives as candidates vary alot depending on the searchstring.

IO-wise this should be a clear improvement since the index is quite a bit smaller than the mailbox. Our mailserver is very IO-bound. CPU cycles are plenty.

...
Yes, but they have to be translated to unicode in any case, because searching will have to work regardless of what character set is used when doing the search. So the choices would be pretty much:

Use 4 characters long UTF-16 strings

Use 4 bytes long UTF-8 strings

Use 4 character long UTF-8 strings

Im sorry for my incomplete IMAP knowledge. Is the server required convert the searchstring and/or mimepart to the same character set for string searching?

Probably this indexing method would be optimised for various character sets by different mappings from characters -> int 0-31. (I haven't thought this last comment through much ... does each 32*32 bit array want a character set id attached to it?)

Regards, Rob.

...

...
...
So the index maps: suffix -> word -> list ( id1 .. idN )

That just would create even larger indexes than the 4-byte-trie-index..

I tried it and it resulted in a huge index. Also costly to build. I agree.

Cheers Jens

Jens Laas

12:18 p.m.

(06.04.12 kl.18:48) Rob Middleton skrev följande till Jens Laas and...:

...

...
This squat index thingy gave me an idea (which Ive also partly tested).

I made an even squatter index with just character couples (couplets).

Each character is recoded to an integer between 0-63.

A map of 64*64 bits is created for each mail.

I like this idea - and will be interested in statistics on real mailboxes and real searches carried out by users.

I would think that reducing the index to 32*32 bits thus reducing the IO on the index by a factor of 4 would give fairly comparable usefulness. (ie - the index should be case insensitive as case is not likely to be nearly as discriminatory as the actual letter of alphabet ... but I am not a natural language expert so don't have statistics to back that belief up.)

I converted the (assumed ascii) data to all lowercase before indexing it. With just 32 significant characters you cant even cover 'a-z' and '0-9'. So the number of false positives will increase.

...

...
I tested the above with indexing raw body-parts from a fairly large mailbox (72 MB/13000 mails).

Which part of the body? If indexing an html email you would presumably want to strip all the html tags for indexing as that is not meaningful. What about if there is a mime multipart/alternative text or html part - index both or just the text part?

Maybe I described it badly. By body I meant everything in the mail but the rfc822 headers (the top header of the mail). I did no MIME processing at all. Just indexed everything as is. (This was simplest to just test the concept).

...

...
The index is then ~6MB. Each map in the index takes 512 bytes: (64*64)/8. (Slightly larger if you also store the UID).

Advantage with this type if index is that its simple and very fast to build (rebuilding indexed will probably occur often if youre not using dovecot-lda).

And rebuilding the index doesn't mean reprocessing all of the emails just reading in the old index processing new messages discarding deleted ones then writing the index back to disk.

I was thinking of the case when the whole dovecot-index had to be rebuilt anyway. Which currently occurs when dovecot decides the index is out of sync with the mailbox (timestamp and size i think).

But its true that theoretically you could update the index with only new or changed mails.

...

...
Im sorry for my incomplete IMAP knowledge. Is the server required convert the searchstring and/or mimepart to the same character set for string searching?

Probably this indexing method would be optimised for various character sets by different mappings from characters -> int 0-31. (I haven't thought this last comment through much ... does each 32*32 bit array want a character set id attached to it?)

That might be possibly. Thinking of different character sets makes my head ache :-).

Cheers, Jens

...

Regards, Rob.

 'Old C programmers don't die ... they're just cast into void*'

 Jens Låås                              Email: jens.laas@data.slu.se
 Department of Computer Services, SLU   Phone: +46 18 67 35 15
 Vindbrovägen 1
 P.O. Box 7079
 S-750 07 Uppsala
 SWEDEN

Timo Sirainen

1:39 p.m.

On Wed, 2006-04-12 at 11:18 +0200, Jens Laas wrote:

...

...
...
I tested the above with indexing raw body-parts from a fairly large mailbox (72 MB/13000 mails).

Which part of the body? If indexing an html email you would presumably want to strip all the html tags for indexing as that is not meaningful. What about if there is a mime multipart/alternative text or html part - index both or just the text part?

Doing any of that would again mean losing IMAP RFC compatibility.

...

...
And rebuilding the index doesn't mean reprocessing all of the emails just reading in the old index processing new messages discarding deleted ones then writing the index back to disk.

I was thinking of the case when the whole dovecot-index had to be rebuilt anyway. Which currently occurs when dovecot decides the index is out of sync with the mailbox (timestamp and size i think).

No, Dovecot doesn't really rebuild the indexes even then. Assuming mbox file it just then goes through the whole mbox file and checks that everything is still in place.

Full index rebuilds are done only when the index gets corrupted or when the mailbox's UIDVALIDITY changes (practically never unless you go manually doing something weird).

...

...
...
Im sorry for my incomplete IMAP knowledge. Is the server required convert the searchstring and/or mimepart to the same character set for string searching?

Probably this indexing method would be optimised for various character sets by different mappings from characters -> int 0-31. (I haven't thought this last comment through much ... does each 32*32 bit array want a character set id attached to it?)

That might be possibly. Thinking of different character sets makes my head ache :-).

Another problem is that with UTF-8 the two characters may describe only a single character (or not even that), which increases the false positives a lot if the language uses a lot of non-ascii.

Jens Laas

2:53 p.m.

(06.04.12 kl.13:39) Timo Sirainen skrev följande till Jens Laas:

...

...
...
...
Im sorry for my incomplete IMAP knowledge. Is the server required convert the searchstring and/or mimepart to the same character set for string searching?

Probably this indexing method would be optimised for various character sets by different mappings from characters -> int 0-31. (I haven't thought this last comment through much ... does each 32*32 bit array want a character set id attached to it?)

That might be possibly. Thinking of different character sets makes my head ache :-).

Another problem is that with UTF-8 the two characters may describe only a single character (or not even that), which increases the false positives a lot if the language uses a lot of non-ascii.

Hmm. The map should then be for couplets of UTF-8 characters. Then we just have to decide how we map the whole UTF-8 space to just 64 instances.

The way I did it in the test was to first see if the character was in an array of the "most common" characters, and if so use that index directly. If it was not found in the array I just used the lower bits.

We could decide which UTF-8 characters are most common. We could also figure out what part of the UTF-8 character is most significant and use it for the index (maybe the last byte?).

I think this would work pretty well atleast with western languages. I dont even want to think about chinese languages where one Unicode code-point is a whole word.

Any idea of how big and costly the squat index is ?

Cheers, Jens

 'Old C programmers don't die ... they're just cast into void*'

 Jens Låås                              Email: jens.laas@data.slu.se
 Department of Computer Services, SLU   Phone: +46 18 67 35 15
 Vindbrovägen 1
 P.O. Box 7079
 S-750 07 Uppsala
 SWEDEN

Jan Kundrát

6:32 p.m.

Jens Laas wrote:

...

Each character is recoded to an integer between 0-63.

If I understand your concept correctly, this won't work, there's a lot more characters than 64. Or is it only some kind of hash?

-jkt

-- cd /local/pub && more beer > /dev/mouth

7066

Age (days ago)

7075

Last active (days ago)

List overview

14 comments

6 participants

participants (6)

Bill Boebel
Jan Kundrát
Jens Laas
Rob Middleton
Steffen Kaiser
Timo Sirainen