[Dovecot] New full text search indexer

Timo Sirainen tss at iki.fi
Fri Apr 13 14:07:58 EEST 2007


On Fri, 2007-04-06 at 00:34 +0300, Timo Sirainen wrote:
> Squat-like 4 byte substrings (but can answer 1-3 char queries also):

Indexing a 1,4GB Linux kernel mailing list mbox with 367919 messages:

UID count: 367919
Index time: 129.86 CPU seconds (10.43MB/CPUs), 132.47 seconds
(10.23MB/s)
Memory: UID list 143162 kB, heap total 263996 kB
Node space: 9182910 bytes (8967 kB) in 47989 nodes
UID space: 357344706 bytes (348969 kB) in 7404211 lists
Total space: 366527616 / 1420611590 bytes = 25.80%

So the process takes 260MB of memory in the end. I think that's OK. If I
removed memory limits, it would use 1,4GB of memory and index file size
would drop to 17,96%. That could be achieved also by post-processing the
indexes.

gzip compression makes the uidlist still 25% smaller (total space
19,50%). It'd have to be used to compress the file in smaller blocks
because zlib doesn't support quickly seeking inside the file. That would
probably waste some space. I don't think it's worth the extra CPU time
and complexity.

Currently the uidlists are stored either as "packed UID ranges" or
bitmasks, whichever takes less space. I haven't yet tried how much space
could be saved by using a combination of these within uidlists.

I'd still have to figure out how to do incremental updates without
wasting a lot of space. And support the actual searching. :)

http://dovecot.org/tmp/new-indexer.c updated.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : http://dovecot.org/pipermail/dovecot/attachments/20070413/fa2c8450/attachment.pgp 


More information about the dovecot mailing list