On Mon, 2007-12-10 at 23:29 +0800, Joe Wong wrote:
Hi Timo,
I just took your suggestion. I have another collection of emails, and running full text search on it did not hit any problems, whether the mailbox is on NFS or on local disk.
You mentioned that full text search only works for English-only mailboxes. What is the current limitation? Is there any plan to support non-English email (conversion to UTF-8?)
It should work with any UTF-8 input, and I've tested that it works with some mails containing non-ASCII characters. There's nothing in the design that prevents it, so I guess there's some bug that causes these problems. If you could send me a test mailbox where this happens, I could take a look at fixing it.
Although now that you mention it, I wonder if the current design could be optimized to work a bit differently with Chinese/Japanese/etc. Currently it works by indexing 4-character blocks, so with non-ASCII UTF-8 input it may end up indexing more than 4 bytes per block. How many bytes does a typical Chinese UTF-8 character take? How many characters does a typical Chinese word take? How many characters are in your typical search word?
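To make the byte growth concrete, here is a minimal Python sketch. It is not Dovecot's indexer: the helper names are hypothetical, and it assumes the 4-character blocks are taken as a sliding window over the text, which the message above does not state. It only shows how many UTF-8 bytes each 4-character block occupies for ASCII versus common Chinese characters.

```python
def char_blocks(text, size=4):
    """Return every run of `size` consecutive characters (sliding window).

    The sliding window is an assumption made for illustration only.
    """
    return [text[i:i + size] for i in range(len(text) - size + 1)]


def block_byte_lengths(text, size=4):
    """Pair each block with the number of bytes it takes in UTF-8."""
    return [(block, len(block.encode("utf-8"))) for block in char_blocks(text, size)]


# ASCII: one byte per character, so a 4-character block is 4 bytes.
print(block_byte_lengths("hello"))

# Common Chinese characters (U+4E00..U+9FFF) take 3 bytes each in UTF-8,
# so a 4-character block is 12 bytes.
print(block_byte_lengths("全文搜索测试"))
```

With this sample text, every Chinese block is three times the size of an ASCII block of the same character count.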
I was just wondering: if there are a lot of 1-3 character words, maybe the indexing could limit each block to something like min(4 characters, ~8 bytes). That would then take less space and memory.