[Dovecot] Multiple mailboxes vs one huge mailbox
Hello,
As I need to store a lot of messages on my IMAP server (on the order of 900K-1000K; this is an archive kept for some time, maybe a year or so), I see some "slowness" in dealing with such a huge amount. I mainly need to do searches like "get all messages from user1@domain1.com to user2@domain2.tld received between date1 and date2".
So I am really interested in whether it would be wise to:
a) split all messages into several smaller mailboxes (per-month, or per-day, or a 2-level structure like "month/day/")
b) use the dbox storage format (instead of the currently used mbox); I'm wary of mdbox, as I'm still not sure I'll be able to parse it with scripts later, "just in case"
Dovecot is the latest version (2.1.3). There is no compression on the Dovecot side, but the mail is stored on a ZFS volume with compression on. I ask this mainly because I do not fully understand how Dovecot indexes work.
I am also testing another approach: keeping my own index somewhere outside Dovecot that maps email addresses to UIDs and dates to UIDs, so I can simply query my index for the things I need. But then, that's exactly what the IMAP index can do, so I would simply slow my searches down, wouldn't I? The only reason I am thinking about my own index is that I won't use 'all headers' as the search scope; I only need to deal with From:, To:, Cc:, Bcc: (if any), Received (if I see the from/to info nowhere else), and the date field(s). I doubt IMAP will take care of that for me.
Yours, Alexander
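For illustration, the external index Alexander describes could be sketched roughly as below. This is only a minimal sketch under assumptions that are not from the thread: SQLite is used as the store, the table and function names (`open_index`, `add_message`, `find_uids`) are made up, and the Date: header (not the received date) is what gets indexed. Note also that IMAP UIDs are only meaningful within one mailbox and one UIDVALIDITY value, so a real index would probably have to record those too.

```python
import sqlite3
from email import message_from_bytes
from email.utils import getaddresses, parsedate_to_datetime


def open_index(path=":memory:"):
    """Create (or open) the hypothetical external index."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS msg (uid INTEGER PRIMARY KEY, sent_ts INTEGER)")
    db.execute("CREATE TABLE IF NOT EXISTS addr (uid INTEGER, role TEXT, address TEXT)")
    db.execute("CREATE INDEX IF NOT EXISTS addr_lookup ON addr (address, role, uid)")
    return db


def add_message(db, uid, raw):
    """Index the address headers and the Date: header of one raw message."""
    msg = message_from_bytes(raw)
    sent_ts = None
    date_hdr = msg.get("Date")
    if date_hdr:
        try:
            sent_ts = int(parsedate_to_datetime(date_hdr).timestamp())
        except (TypeError, ValueError):
            sent_ts = None  # unparsable Date: header, treat as missing
    db.execute("INSERT OR REPLACE INTO msg VALUES (?, ?)", (uid, sent_ts))
    db.execute("DELETE FROM addr WHERE uid = ?", (uid,))  # keep re-indexing idempotent
    for role in ("from", "to", "cc", "bcc"):
        for _name, address in getaddresses(msg.get_all(role, [])):
            if address:
                db.execute("INSERT INTO addr VALUES (?, ?, ?)", (uid, role, address.lower()))


def find_uids(db, sender, recipient, start_ts, end_ts):
    """Return UIDs sent by `sender` to `recipient` with start_ts <= Date < end_ts."""
    rows = db.execute(
        """SELECT DISTINCT m.uid FROM msg m
           JOIN addr f ON f.uid = m.uid AND f.role = 'from' AND f.address = ?
           JOIN addr t ON t.uid = m.uid AND t.role IN ('to', 'cc', 'bcc') AND t.address = ?
           WHERE m.sent_ts >= ? AND m.sent_ts < ?
           ORDER BY m.uid""",
        (sender.lower(), recipient.lower(), start_ts, end_ts),
    )
    return [row[0] for row in rows]
```

Messages without a usable Date: header get a NULL timestamp here and therefore never match a date range.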
On 9.4.2012, at 17.58, Alexander Chekalin wrote:
As I need to store a lot of messages on my IMAP server (on the order of 900K-1000K; this is an archive kept for some time, maybe a year or so), I see some "slowness" in dealing with such a huge amount. I mainly need to do searches like "get all messages from user1@domain1.com to user2@domain2.tld received between date1 and date2".
So by "received between date" you mean the IMAP INTERNALDATE as opposed to the Date: header? These kinds of searches are looked up from the index/cache files, and the performance should be exactly the same with all of the mailbox formats. It would be useful to figure out what exactly is causing the slowness. Is the SEARCH command slow? Something else? Is the slowness about user CPU, system CPU or disk IO?
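To make the INTERNALDATE vs Date: distinction concrete: in IMAP (RFC 3501) the SINCE/BEFORE search keys compare against the internal (received) date, while SENTSINCE/SENTBEFORE compare against the Date: header. A minimal sketch of building such a SEARCH criteria string follows; the helper names `imap_date` and `build_search` are made up for this example, and it assumes the addresses contain no double quotes or backslashes that would need escaping.

```python
from datetime import date

# IMAP month names are fixed English abbreviations, independent of locale.
_MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
           "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def imap_date(d):
    """Format a date the way IMAP SEARCH expects it, e.g. 9-Apr-2012."""
    return f"{d.day}-{_MONTHS[d.month - 1]}-{d.year}"


def build_search(sender, recipient, start, end, use_sent_date=False):
    """Build criteria for 'from sender to recipient, on/after start and before end'.

    use_sent_date=False -> SINCE/BEFORE (INTERNALDATE, i.e. received date)
    use_sent_date=True  -> SENTSINCE/SENTBEFORE (Date: header)
    """
    if use_sent_date:
        since_key, before_key = "SENTSINCE", "SENTBEFORE"
    else:
        since_key, before_key = "SINCE", "BEFORE"
    return (f'FROM "{sender}" TO "{recipient}" '
            f"{since_key} {imap_date(start)} {before_key} {imap_date(end)}")
```

The resulting string could then be handed to an IMAP client, for example as the criterion argument of Python's `imaplib.IMAP4.search(None, criteria)`. Timing that one call is also a simple way to check whether the SEARCH command itself is the slow part.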
Hello, Timo,
I'm a bit unsure about "which 'date' I mean", since I have always considered the Date: header to be the only date. But which value is used as INTERNALDATE then? Since I use (for now) the maildir storage type, all the metadata is stored in the messages. So I expect Dovecot somehow parses and uses the Date: field itself, or am I wrong about that? And what about messages without a Date header at all?
But the Date isn't the worst thing. To make my archive work, I set up a server-side filter that also redirects every message it processes to my archive mailbox. As a result, each message (after such a redirect) is addressed to 'archive@mydomain' instead of its original destination address. The only place I can find the original recipient is by parsing the 'Received' field(s).
As far as I understand, none of these headers (Date or Received) are used for SEARCH anyway, and this was the idea behind creating my own index. But wait, is there any way I can make Dovecot index additional fields as well (yes, I'm talking about 'Received')? That would be the best solution!
Thank you, Timo, for your work, yours, Alexander
On 9.4.2012, at 22.39, Alexander Chekalin wrote:
Hello, Timo,
I'm a bit unsure about "which 'date' I mean", since I have always considered the Date: header to be the only date. But which value is used as INTERNALDATE then? Since I use (for now) the maildir storage type, all the metadata is stored in the messages. So I expect Dovecot somehow parses and uses the Date: field itself, or am I wrong about that?
The INTERNALDATE means the same as "received date", while the Date: header is the "sent date". With the mbox format the received date is stored in the separating From-lines. IMAP supports searching and sorting messages by either INTERNALDATE or the Date: header.
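As a small illustration of where mbox keeps the received date, the sketch below pulls the timestamp out of a "From " separator line. It assumes the common asctime-style layout (`From sender Mon Apr  9 17:58:00 2012`) with no trailing timezone or extra fields, and an English/C locale for the day and month names; real mbox writers vary, and the function name is made up.

```python
from datetime import datetime


def received_date_from_separator(from_line):
    """Parse the timestamp of an mbox 'From ' separator line.

    Expected shape (an assumption): 'From <sender> <asctime timestamp>'.
    """
    parts = from_line.strip().split(None, 2)
    if len(parts) != 3 or parts[0] != "From":
        raise ValueError("not an mbox From_ separator line")
    # asctime pads single-digit days with a space; strptime tolerates that.
    return datetime.strptime(parts[2], "%a %b %d %H:%M:%S %Y")
```

For example, `received_date_from_separator("From sender@example.com Mon Apr  9 17:58:00 2012")` yields a naive `datetime` for 9 April 2012, 17:58:00.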
And what about messages without a Date header at all?
The searching just doesn't match those messages then. Sorting falls back to using received date.
But the Date isn't the worst thing. To make my archive work, I set up a server-side filter that also redirects every message it processes to my archive mailbox. As a result, each message (after such a redirect) is addressed to 'archive@mydomain' instead of its original destination address. The only place I can find the original recipient is by parsing the 'Received' field(s).
As far as I understand, none of these headers (Date or Received) are used for SEARCH anyway, and this was the idea behind creating my own index. But wait, is there any way I can make Dovecot index additional fields as well (yes, I'm talking about 'Received')? That would be the best solution!
If you do a SEARCH HEADER Received, then Dovecot adds the Received headers to the dovecot.index.cache file and the subsequent searches should be quite fast, although the Received headers increase the cache file's size quite a lot. Alternatively, you can enable full text search indexes (Lucene or Solr), and the search is then done from them.
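A rough sketch of both halves of this follows: building the `HEADER Received "<string>"` search key (standard IMAP, RFC 3501), and, on the client side, pulling the original recipient out of the optional `for <address>` clause of the Received headers. The function names and the regular expression are assumptions made for illustration. Not every MTA writes a `for` clause, and which hop's Received header carries the original recipient depends on how the redirect filter is set up.

```python
import re
from email import message_from_bytes

# Matches the optional "for <addr>" clause of a Received header (heuristic).
_FOR_CLAUSE = re.compile(r"\bfor\s+<?([^\s<>;]+@[^\s<>;]+)>?", re.IGNORECASE)


def received_search(address):
    """IMAP SEARCH key matching messages whose Received headers contain address."""
    return f'HEADER Received "{address}"'


def envelope_recipients(raw):
    """Return lower-cased addresses found in 'for' clauses, in header order."""
    msg = message_from_bytes(raw)
    found = []
    for value in msg.get_all("Received", []):
        match = _FOR_CLAUSE.search(str(value))
        if match:
            address = match.group(1).lower()
            if address not in found:
                found.append(address)
    return found
```

Because HEADER does a substring match on the header text, a search for `user2@domain2.tld` can also match Received headers that merely mention that string elsewhere, so checking the `for` clause afterwards may still be useful.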
Participants (2): Alexander Chekalin, Timo Sirainen