On Fri, 2007-04-06 at 00:34 +0300, Timo Sirainen wrote:
Squat-like 4 byte substrings (but can answer 1-3 char queries also):
Indexing a 1,4GB Linux kernel mailing list mbox with 367919 messages:
UID count: 367919 Index time: 129.86 CPU seconds (10.43MB/CPUs), 132.47 seconds (10.23MB/s) Memory: UID list 143162 kB, heap total 263996 kB Node space: 9182910 bytes (8967 kB) in 47989 nodes UID space: 357344706 bytes (348969 kB) in 7404211 lists Total space: 366527616 / 1420611590 bytes = 25.80%
So the process takes 260MB of memory in the end. I think that's OK. If I removed memory limits, it would use 1,4GB of memory and index file size would drop to 17,96%. That could be achieved also by post-processing the indexes.
gzip compression makes the uidlist still 25% smaller (total space 19,50%). It'd have to be used to compress the file in smaller blocks because zlib doesn't support quickly seeking inside the file. That would probably waste some space. I don't think it's worth the extra CPU time and complexity.
Currently the uidlists are stored either as "packed UID ranges" or bitmasks, whichever takes less space. I haven't yet tried how much space could be saved by using a combination of these within uidlists.
I'd still have to figure out how to do incremental updates without wasting a lot of space. And support the actual searching. :)
http://dovecot.org/tmp/new-indexer.c updated.