Timo Sirainen wrote:
On Fri, 2007-04-06 at 00:34 +0300, Timo Sirainen wrote:
Squat-like 4 byte substrings (but can answer 1-3 char queries also):
Indexing a 1,4GB Linux kernel mailing list mbox with 367919 messages:
UID count: 367919 Index time: 129.86 CPU seconds (10.43MB/CPUs), 132.47 seconds (10.23MB/s) Memory: UID list 143162 kB, heap total 263996 kB Node space: 9182910 bytes (8967 kB) in 47989 nodes UID space: 357344706 bytes (348969 kB) in 7404211 lists Total space: 366527616 / 1420611590 bytes = 25.80%
So the process takes 260MB of memory in the end. I think that's OK. If I removed memory limits, it would use 1,4GB of memory and index file size would drop to 17,96%. That could be achieved also by post-processing the indexes.
gzip compression makes the uidlist still 25% smaller (total space 19,50%). It'd have to be used to compress the file in smaller blocks because zlib doesn't support quickly seeking inside the file. That would probably waste some space. I don't think it's worth the extra CPU time and complexity.
Currently the uidlists are stored either as "packed UID ranges" or bitmasks, whichever takes less space. I haven't yet tried how much space could be saved by using a combination of these within uidlists.
I'd still have to figure out how to do incremental updates without wasting a lot of space. And support the actual searching. :)
http://dovecot.org/tmp/new-indexer.c updated.
Timo - we were just having a conversation about how we might be able to provide full body indexed search for our clients and I realised it might be worth checking the Dovecot list to see if this has been done already...
And then I find this thread!
What is the latest? Searches in the headers work fantastically but any search in a sizeable folder (1000's messages) often just time out.
Would love to offer a full text search so let me know how far you have gotten.
Will the search be able to offer phrase searching? Eg "Search for this string" as an exact match?
Fantastic - very happy to see you're working on adding this.
Daniel