[Dovecot] index IO patterns
Hey all, we're in the process of checking out alternatives to our index storage. We're currently storing indexes on a NetApp Metrocluster which works fine, but is very expensive. We're planning a few different setups and doing some actual performance tests on them.
Does anyone know some of the IO patterns of the indexes? For instance:
- mostly random reads or linear reads/writes?
- average size of reads and writes?
- how many reads/writes on average for a given mailbox size?
Anyone do any measurements of this kind?
Alternatively, does anyone have any experience with other redundant storage options? I'm thinking of things like MooseFS, DRBD, etc.
regards,
Cor
Indexes are very random, mostly reads, with some writes if you're using dovecot-lda (e.g. with dbox). The average size is rather small, maybe 5 KB in our setup. Bandwidth is rather low, 20-30 MB/sec.
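If you want to sanity-check a candidate storage system against that kind of pattern, a tiny random-read microbenchmark gets you ballpark IOPS numbers. Below is a minimal sketch in C; the 4 KB read size, iteration count, and fixed seed are assumptions, and the test file is whatever you put on the storage under test.

/* Minimal random-read microbenchmark (a sketch, not a real load
 * generator): issues small pread()s at random aligned offsets in a
 * test file and reports IOPS. Note the page cache will inflate the
 * numbers unless the file is much larger than RAM (or you add
 * O_DIRECT handling). */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

#define READ_SIZE 4096          /* assumed typical index read size */
#define ITERATIONS 100000

int main(int argc, char **argv)
{
	char buf[READ_SIZE];
	struct stat st;
	struct timeval start, end;
	double secs;
	off_t blocks, offset;
	int fd, i;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <test-file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0) {
		perror(argv[1]);
		return 1;
	}
	blocks = st.st_size / READ_SIZE;
	if (blocks == 0) {
		fprintf(stderr, "test file is smaller than %d bytes\n",
			READ_SIZE);
		return 1;
	}
	srand(42);	/* fixed seed so runs are comparable */
	gettimeofday(&start, NULL);
	for (i = 0; i < ITERATIONS; i++) {
		/* random block-aligned offset within the file */
		offset = (off_t)(rand() % blocks) * READ_SIZE;
		if (pread(fd, buf, READ_SIZE, offset) != READ_SIZE) {
			perror("pread");
			return 1;
		}
	}
	gettimeofday(&end, NULL);
	secs = (end.tv_sec - start.tv_sec) +
		(end.tv_usec - start.tv_usec) / 1e6;
	printf("%d reads of %d bytes in %.2fs = %.0f IOPS\n",
	       ITERATIONS, READ_SIZE, secs, ITERATIONS / secs);
	close(fd);
	return 0;
}

Compile with e.g. gcc -O2 and point it at a multi-gigabyte test file so the page cache doesn't serve everything.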
We are using HP LeftHand for our replicated storage needs.
Regards
Javier
On 11/05/2012 08:41, Cor Bosman wrote:
> Hey all, we're in the process of checking out alternatives to our index storage. We're currently storing indexes on a NetApp Metrocluster which works fine, but is very expensive. We're planning a few different setups and doing some actual performance tests on them.
> Does anyone know some of the IO patterns of the indexes? For instance:
> - mostly random reads or linear reads/writes?
> - average size of reads and writes?
> - how many reads/writes on average for a given mailbox size?
> Anyone do any measurements of this kind?
> Alternatively, does anyone have any experience with other redundant storage options? I'm thinking of things like MooseFS, DRBD, etc.
> regards,
> Cor
Hi Javier,
> Indexes are very random, mostly reads, with some writes if you're using dovecot-lda (e.g. with dbox). The average size is rather small, maybe 5 KB in our setup. Bandwidth is rather low, 20-30 MB/sec.
Even without LDA/LMTP, dovecot-imap needs to write, right? It would need to update the index every time an IMAP connection happens and new mail is found in the mail store.
Cor
> Even without LDA/LMTP, dovecot-imap needs to write, right? It would need to update the index every time an IMAP connection happens and new mail is found in the mail store.
Well, of course. Indexes are also updated when flags are modified, a message is moved, a message is deleted, etc. But in my setup it is 65% reads and the rest writes.
Regards
Javier
On 11.5.2012, at 13.56, Javier de Miguel Rodríguez wrote:
>> Even without LDA/LMTP, dovecot-imap needs to write, right? It would need to update the index every time an IMAP connection happens and new mail is found in the mail store.
> Well, of course. Indexes are also updated when flags are modified, a message is moved, a message is deleted, etc. But in my setup it is 65% reads and the rest writes.
There are several hard-coded values related to read/write percentages. If you're interested, you could try whether changing them increases the read percentage:
mail-index-private.h:
/* Write to main index file when bytes-to-be-read-from-log is between
   these values. */
#define MAIL_INDEX_MIN_WRITE_BYTES (1024*8)
#define MAIL_INDEX_MAX_WRITE_BYTES (1024*128)
mail-cache-private.h:
/* Never compress the file if it's smaller than this */
#define MAIL_CACHE_COMPRESS_MIN_SIZE (1024*50)

/* Compress the file when deleted space reaches n% of total size */
#define MAIL_CACHE_COMPRESS_PERCENTAGE 20

/* Compress the file when n% of rows contain continued rows.
   200% means that there's 2 continued rows per record. */
#define MAIL_CACHE_COMPRESS_CONTINUED_PERCENTAGE 200
Increasing this might also improve read performance. compat.h:
/* Try to keep IO operations at least this size */
#ifndef IO_BLOCK_SIZE
#  define IO_BLOCK_SIZE 8192
#endif
All of these are just runtime checks (not saved anywhere), so there's no danger in changing them.
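To make the first pair of constants concrete, here is a simplified sketch of the decision they drive (this is not Dovecot's actual code, and want_index_rewrite is a made-up name): changes accumulate in dovecot.index.log, and the main dovecot.index is rewritten only while the amount of unread log data sits between the two thresholds.

#include <stddef.h>

#define MAIL_INDEX_MIN_WRITE_BYTES (1024*8)
#define MAIL_INDEX_MAX_WRITE_BYTES (1024*128)

/* Hypothetical illustration: rewrite the main index file when the
   bytes a reader would still have to replay from dovecot.index.log
   fall between the min and max write thresholds. */
static int want_index_rewrite(size_t log_bytes_unread)
{
	return log_bytes_unread >= MAIL_INDEX_MIN_WRITE_BYTES &&
	       log_bytes_unread <= MAIL_INDEX_MAX_WRITE_BYTES;
}

Raising MAIL_INDEX_MIN_WRITE_BYTES should therefore shift the mix toward fewer index writes, at the cost of longer log replays when a mailbox is opened.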
On 5/11/2012 1:41 AM, Cor Bosman wrote:
> Hey all, we're in the process of checking out alternatives to our index storage. We're currently storing indexes on a NetApp Metrocluster which works fine, but is very expensive. We're planning a few different setups and doing some actual performance tests on them.
Hi Cor,
> Does anyone know some of the IO patterns of the indexes? For instance:
> - mostly random reads or linear reads/writes?
> - average size of reads and writes?
> - how many reads/writes on average for a given mailbox size?
> Anyone do any measurements of this kind?
Mail is always a random IO workload, unless your mailbox count is 1, whether accessing indexes or mail files. Regarding the other two questions, you'll likely need to take your own measurements.
> Alternatively, does anyone have any experience with other redundant storage options? I'm thinking of things like MooseFS, DRBD, etc.
You seem to be interested in multi-site clustering/failover solutions, not simply redundant storage. These two are clustering software solutions but DRBD is not suitable for multi-site use, and MooseFS doesn't seem to be either. MooseFS is based heavily on FUSE, so performance will be far less than optimal. MooseFS is a distributed filesystem, and as with all other distributed/cluster filesystems its metadata performance will suffer, eliminating maildir as a mail store option.
Can you provide more specifics on your actual storage architecture needs?
-- Stan
>> Alternatively, does anyone have any experience with other redundant storage options? I'm thinking of things like MooseFS, DRBD, etc.
> You seem to be interested in multi-site clustering/failover solutions, not simply redundant storage. These two are clustering software solutions but DRBD is not suitable for multi-site use, and MooseFS doesn't seem to be either. MooseFS is based heavily on FUSE, so performance will be far less than optimal. MooseFS is a distributed filesystem, and as with all other distributed/cluster filesystems its metadata performance will suffer, eliminating maildir as a mail store option.
> Can you provide more specifics on your actual storage architecture needs?
There are some people in our company who like MooseFS, so I'll just include it in the tests and let that speak for itself :) We are not looking for multi-site solutions; then we may as well stay with the Metrocluster. I don't even care if it has to be in the same rack. It's only for the indexes, not the mail store itself, which will stay on the Metrocluster. In the very worst case, when the whole site explodes, I can always tell Dovecot to use memory for indexes temporarily :)
The indexes are doing a lot of IOPS on the Metrocluster, and it's a bit of an expensive option for something it's not even that good at.
I'm aiming for something with 2 servers, each with a 12-disk enclosure of SSDs for fast random IO, 10G network interfaces, 24 cores, and 48 GB of memory.
I just want to test some IO patterns on different hardware/software solutions, including the Metrocluster itself, before we commit to a specific solution. I'm slightly leaning towards DRBD right now.
Cor
On 5/12/2012 2:26 AM, Cor Bosman wrote:
> The indexes are doing a lot of IOPS on the Metrocluster, and it's a bit of an expensive option for something it's not even that good at.
This clears things up a bit.
> I'm aiming for something with 2 servers, each with a 12-disk enclosure of SSDs for fast random IO, 10G network interfaces, 24 cores, and 48 GB of memory.
AMD is a great platform and I laud your preference for it.
> I just want to test some IO patterns on different hardware/software solutions, including the Metrocluster itself, before we commit to a specific solution. I'm slightly leaning towards DRBD right now.
A DRBD cluster simply doubles your costs: twice the disks/enclosures, twice the servers, plus another layer of redundancy software in the storage stack. It can be even more if one decides to cluster 3-6 or more DRBD servers.
Have you considered something like a Nexsan E18? In 2U it gives you dual PSUs, dual active/active RAID controllers each w/ 2GB BBWC, 2x8Gb FC and 2x1GbE iSCSI ports per controller. Optionally you can replace the FC ports with the same number of 10GbE iSCSI ports. It offers up to 18 100/200/400GB SLC SSDs, or up to 36/78 of these SSDs w/the E18X or E60X expansion chassis.
http://www.nexsan.com/en/products/e-series/~/media/Nexsan/Files/products/e-s...
http://www.nexsan.com/en/products/e-series/tech-specs.aspx
http://www.nexsan.com/products/e-series.aspx
You'd simply create a single RAID1+0 array of all 18 SSDs, export it as a LUN on each iSCSI port, configure SCSI multipath and the iSCSI initiator on each Dovecot host, install GFS2/OCFS2, format the LUN and go. With 18x200GB SSDs you'll get 1.8T of net capacity and well north of 100K sustained real world random r/w block IOPS. And without needing two beefy dual socket AMD server chassis mirrored with DRBD. And of course you'll still want to use Dovecot Director to avoid locking issues.
Contact the Nexsan European office to see about an evaluation unit: http://www.nexsan.com/about/contact/locations.aspx
Disclaimer: I've never worked for Nexsan nor any affiliate. I'm simply a past customer very satisfied with their products and philosophy/strategy.
-- Stan
> Mail is always a random IO workload, unless your mailbox count is 1, whether accessing indexes or mail files. Regarding the other two questions, you'll likely need to take your own measurements.
Wait, maybe there is a misunderstanding. I mean the IO inside one index file, not across the different mailboxes. So within one index file that covers a mailbox with, say, 10,000 emails, how does the IO occur? I would guess pretty randomly as well, but on the other hand I guess in some ways it could be pretty linear too, if Dovecot keeps most changes in memory and writes them all back in one go.
Cor
On 12.5.2012, at 10.32, Cor Bosman wrote:
>> Mail is always a random IO workload, unless your mailbox count is 1, whether accessing indexes or mail files. Regarding the other two questions, you'll likely need to take your own measurements.
> Wait, maybe there is a misunderstanding. I mean the IO inside one index file, not across the different mailboxes. So within one index file that covers a mailbox with, say, 10,000 emails, how does the IO occur? I would guess pretty randomly as well, but on the other hand I guess in some ways it could be pretty linear too, if Dovecot keeps most changes in memory and writes them all back in one go.
Usually the index files are small enough that I think the OS reads the whole file into memory anyway. Anyway:
dovecot.index: The header is always accessed first. After that it's accessed as necessary. Many IMAP clients fetch all message flags when selecting a mailbox, so this causes a sequential read of the entire file. Also, with mmap_disable=yes the whole file is always read into memory.
dovecot.index.log: Usually the last few kilobytes of the file are read into memory when the mailbox is opened, and after that data is appended and read from it. In some situations the reader might seek to older data (e.g. to the beginning) and read the rest of the file sequentially.
dovecot.index.cache: Accessed randomly, depending on what data needs to be looked up. Typically clients fetch only the last few messages, so the end of the file is accessed sequentially.
Writes are typically appends + rewrites, but currently there are also a few more complex things which I want to get rid of (perhaps for v2.2).
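For modelling that dovecot.index.log behaviour in a benchmark, a sketch like the following approximates it; TAIL_BYTES and the dummy record are placeholders, not Dovecot's real transaction format.

/* Sketch emulating the dovecot.index.log access pattern described
 * above: on "mailbox open", read only the last few kilobytes of the
 * file; afterwards, changes are appended at the end. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define TAIL_BYTES (8*1024)	/* assumed "last few kilobytes" */

int main(int argc, char **argv)
{
	char tail[TAIL_BYTES];
	static const char record[] = "dummy-transaction-record";
	struct stat st;
	off_t offset;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <log-file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDWR | O_APPEND);
	if (fd < 0 || fstat(fd, &st) < 0) {
		perror(argv[1]);
		return 1;
	}
	/* "Mailbox open": read only the tail of the log. */
	offset = st.st_size > TAIL_BYTES ? st.st_size - TAIL_BYTES : 0;
	if (pread(fd, tail, TAIL_BYTES, offset) < 0) {
		perror("pread");
		return 1;
	}
	/* Later changes are plain appends at the end of the file. */
	if (write(fd, record, sizeof(record) - 1) < 0) {
		perror("write");
		return 1;
	}
	close(fd);
	return 0;
}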
On 5/12/2012 2:32 AM, Cor Bosman wrote:
>> Mail is always a random IO workload, unless your mailbox count is 1, whether accessing indexes or mail files. Regarding the other two questions, you'll likely need to take your own measurements.
> Wait, maybe there is a misunderstanding. I mean the IO inside one index file, not across the different mailboxes. So within one index file that covers a mailbox with, say, 10,000 emails, how does the IO occur? I would guess pretty randomly as well, but on the other hand I guess in some ways it could be pretty linear too, if Dovecot keeps most changes in memory and writes them all back in one go.
I don't see how this is relevant to designing an index storage system. Whether index file updates are sequential or random, they become random at the 2nd user and more so from there. So either way, your storage system will see a random IO pattern, and that's what you need to engineer the system for, not the single user index file update pattern. You've already expressed interest in SSD, which takes care of this concern.
-- Stan
participants (5)
- Cor Bosman
- Javier de Miguel Rodríguez
- Javier Miguel Rodríguez
- Stan Hoeppner
- Timo Sirainen