[Dovecot] Squat indexing a Maildir of over 600 GB?
Guys,
We have a very large maildir for email auditing purposes. It's currently at 600 GB and continues to grow.
Can dovecot handle this with squat indexing, or am I out of my mind?
Thanks! John
On Mon, 2008-07-21 at 12:37 -0400, John Wells wrote:
> Guys,
> We have a very large maildir for email auditing purposes. It's currently at 600 GB and continues to grow.
> Can dovecot handle this with squat indexing, or am I out of my mind?
You can try of course, but that might be a bit too much. :) I've only tested with a 1.4 GB mailbox, and memory usage went up to something like 700 MB, I think.
It would be nice if Squat were able to scale to infinitely large mailboxes, but currently I don't really see how that would be possible.
There are two issues here:
1. It needs to keep a trie in memory containing all the 4-character blocks of messages. If the input data doesn't contain all that many unique blocks, perhaps this doesn't grow too large even with 600 GB of data. Maybe this could somehow be changed so that rarely used trie branches are written to disk when memory usage gets too high.
2. Once the entire index is created, Dovecot goes through it again and defragments all the pieces. This reduces the index size and speeds up lookups, but if the index doesn't fit entirely in memory, this stage can take a really, really long time.
Originally I was thinking about dropping this stage since it seemed to take forever, but then I figured out that if I first sequentially read the entire index into memory before starting the defragmentation, it takes a lot less time (with the 1.5 GB mailbox it dropped from somewhere around 10 mins to 0.5 mins). But if your index is larger than what fits into memory, this sequential read is pointless.
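To make the memory concern in the first issue concrete, here is a minimal Python sketch of a trie keyed on 4-character blocks. It is an illustration only and not Squat's actual data structure or file format; the names BlockTrie, iter_blocks, index_message and candidates are invented for this example. Every distinct block adds nodes, so memory grows with the number of unique blocks rather than with the raw mailbox size.

```python
BLOCK_SIZE = 4  # block length mentioned in the thread


def iter_blocks(text, size=BLOCK_SIZE):
    """Yield every overlapping block of `size` characters in `text`."""
    for i in range(len(text) - size + 1):
        yield text[i:i + size]


class BlockTrie:
    """Hypothetical in-memory trie mapping 4-character blocks to message UIDs."""

    def __init__(self):
        self.root = {}
        self.node_count = 0  # rough proxy for memory use

    def add(self, block, uid):
        node = self.root
        for ch in block:
            child = node.get(ch)
            if child is None:
                child = {}
                node[ch] = child
                self.node_count += 1
            node = child
        # The key None holds the UID set at the end of a block.
        node.setdefault(None, set()).add(uid)

    def lookup(self, block):
        node = self.root
        for ch in block:
            node = node.get(ch)
            if node is None:
                return set()
        return set(node.get(None, ()))


def index_message(trie, uid, text):
    """Add every block of one message to the trie."""
    for block in iter_blocks(text):
        trie.add(block, uid)


def candidates(trie, query):
    """Return UIDs containing every block of `query`.

    These are only candidates: a real search would still have to
    verify that the blocks appear contiguously in the message.
    """
    result = None
    for block in iter_blocks(query):
        uids = trie.lookup(block)
        result = uids if result is None else result & uids
        if not result:
            break
    return result or set()
```

Watching node_count while feeding in a sample of the archive gives a rough feel for how quickly the number of unique blocks levels off, which is the open question for a 600 GB maildir.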
On Mon, Jul 21, 2008 at 12:50 PM, Timo Sirainen tss@iki.fi wrote:
> You can try of course, but that might be a bit too much. :) I've only tested with a 1.4 GB mailbox, and memory usage went up to something like 700 MB, I think.
Aha... I see... I was under the mistaken impression that this was a disk-based index.
Given that Squat seems infeasible, can anyone recommend another approach? I'll look at Lucene integration, but if anyone knows of a Dovecot way or of another tool that would do this effectively, commercial or open source, please let me know.
Thanks! John
On Mon, 2008-07-21 at 12:55 -0400, John Wells wrote:
> Aha... I see... I was under the mistaken impression that this was a disk-based index.
It's stored on disk, but when indexing it needs to keep parts of the index in memory.
> Given that Squat seems infeasible, can anyone recommend another approach? I'll look at Lucene integration, but if anyone knows of a Dovecot way or of another tool that would do this effectively, commercial or open source, please let me know.
v1.1.2 has Solr support. It might work: http://wiki.dovecot.org/Plugins/FTS/Solr
On Mon, Jul 21, 2008 at 1:20 PM, Timo Sirainen tss@iki.fi wrote:
> v1.1.2 has Solr support. It might work: http://wiki.dovecot.org/Plugins/FTS/Solr
Thanks Timo...from what I know of Solr, it can handle it. But I'm curious how the integration works...specifically:
- When are messages added to Solr? Is it only when new ones arrive, or can older messages be injected as well?
- How does searching work? Do you need a front-end search tool to Solr?
Thanks! John
On Wed, 2008-07-23 at 11:42 -0400, John Wells wrote:
> Thanks Timo...from what I know of Solr, it can handle it. But I'm curious how the integration works...specifically:
> - When are messages added to Solr? Is it only when new ones arrive, or can older messages be injected as well?
Currently new messages are indexed only when a search is started. So you might want to have some kind of cron job that runs a search once a day or so.
> - How does searching work? Do you need a front-end search tool to Solr?
Either use the X-TEXT-FAST or X-BODY-FAST search keys, or enable the break-imap-search setting, after which you can use standard IMAP clients (those that support server-side searches).
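As a rough illustration of the two answers above, the following Python sketch uses the standard imaplib module to log in and issue a search with one of the X-*-FAST keys named in this thread, which would also trigger indexing of new messages if run periodically. The host, user, password, mailbox and search term are placeholders, the helper names build_search_args and trigger_search are invented here, and whether a given server accepts these keys depends on its Dovecot version and FTS configuration.

```python
import imaplib

FAST_KEYS = ("X-TEXT-FAST", "X-BODY-FAST")  # search keys named in the thread


def build_search_args(term, key="X-BODY-FAST"):
    """Return the (key, quoted-term) pair to pass to IMAP SEARCH."""
    if key not in FAST_KEYS:
        raise ValueError("unsupported search key: %r" % (key,))
    # Quote the term as an IMAP quoted string.
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return (key, '"%s"' % escaped)


def trigger_search(host, user, password, mailbox="INBOX", term="audit"):
    """Run one search; host and credentials are caller-supplied placeholders."""
    conn = imaplib.IMAP4_SSL(host)
    try:
        conn.login(user, password)
        conn.select(mailbox, readonly=True)
        status, data = conn.search(None, *build_search_args(term))
        if status != "OK":
            return []
        # data[0] is a space-separated list of message numbers (bytes).
        return (data[0] or b"").split()
    finally:
        conn.logout()
```

A daily cron entry could call trigger_search from a small wrapper script; the exact schedule and script location are left to the administrator.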
participants (2)
- John Wells
- Timo Sirainen