[Dovecot] RFC: grouped (f)sync
Hello,
Currently Dovecot (and any other application that cares about e-mail delivery) does at least one fsync per mail delivery. Given that hard disk drives have very limited IOPS, this effectively limits the maximum mail delivery performance to a very low value, underutilizing the available storage I/O capacity. Calculating with an average mail size of 50 kB and an average consumer HDD with 120 IOPS, the theoretical mail delivery performance will be 50 kB * 120 IOPS = 6000 kB/s, or about 5.86 MB/s (counting 1 MB as 1024 kB). But if we could write 500 kB with every transaction, the delivery speed would be nearly 10 times as high.
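The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name is made up for illustration, it assumes exactly one fsync (one I/O operation) per transaction, and it counts 1 MB as 1024 kB.

```python
def throughput_mb_per_s(kb_per_transaction, iops):
    """Upper bound on delivery throughput when each transaction costs one I/O.

    Assumption: 1 MB is counted as 1024 kB.
    """
    return kb_per_transaction * iops / 1024


# One 50 kB mail per fsync on a 120 IOPS disk.
print(f"{throughput_mb_per_s(50, 120):.2f} MB/s")   # 5.86 MB/s

# Ten mails (500 kB) grouped into one fsync on the same disk.
print(f"{throughput_mb_per_s(500, 120):.2f} MB/s")  # 58.59 MB/s
```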
Dovecot has two process models: separate processes for each client connection and an async in-process multiplexing method. The idea works for both, although the timing is somewhat different.
So here's the idea: instead of fsyncing immediately in the LDA (lmtpd) every time the client says "\r\n.\r\n" after the DATA phase, let's introduce a user-settable timer (let's call it sync_delay from now on) and only sync every sync_delay seconds. This would introduce a delay of up to sync_delay seconds before lmtpd returns "250 Ok" to the client, but that's generally not a problem, because in high-traffic setups there is a great amount of concurrency, so you could easily use a lot of client connections.
Take an example setting of sync_delay = 100 ms. With this, 10 syncs would happen every second from Dovecot LDA processes, meaning if a client connects at t=0 it will immediately get the response 250, if a client connects at t=0.05, it will get the response in 50 ms (in an ideal world, where syncing does not take time), and the committed blocks could accumulate for a maximum of 100 ms. In a busy system (where this setting would make sense), it means it would be possible to write more data with fewer IOPS.
I can see two problems:
- there is no call for committing a lot of file descriptors in one transaction, so instead of fsync() for each of the modified FDs, a sync() would be needed. sync() writes all buffers to stable storage, which is bad if you have a mixed workload, where there is a lot of non-fsynced data, or other heavy fsync users. But modern file systems, like ZFS, will write those back too, so there an fsync(fd) is (AFAIK) mostly equivalent to a sync of the pool the fd is on. sync() of course is system wide, so if you have other file systems, those will be synced as well. (this setting isn't for everybody)
- in a multiprocess environment this would need coordination, so instead of doing fsyncs in distinct processes, one process would be needed which does the sync and returns OK to the others, so they can notify the client about the commit to stable storage.
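To make the timing concrete, here is a minimal sketch of the grouped sync idea. It is not Dovecot code: GroupSyncer, wait_durable and sync_fn are hypothetical names, threads stand in for the LDA processes, and os.sync() is assumed to be an acceptable flush. A waiter is only released by a sync that started after it registered, so data written before wait_durable() is covered by that sync.

```python
import os
import threading
import time


class GroupSyncer:
    """Runs one sync per sync_delay tick and releases every waiter it covers."""

    def __init__(self, sync_delay=0.1, sync_fn=os.sync):
        self._delay = sync_delay
        self._sync_fn = sync_fn      # assumption: flushes all pending writes
        self._cond = threading.Condition()
        self._started = 0            # number of syncs started so far
        self._completed = 0          # number of the last completed sync
        self._waiting = 0            # deliveries waiting for a sync
        self._stop = False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def wait_durable(self):
        """Call after write(); returns once a later-started sync has finished."""
        with self._cond:
            target = self._started + 1   # first sync that starts after now
            self._waiting += 1
            self._cond.wait_for(lambda: self._completed >= target)
            self._waiting -= 1

    def stop(self):
        """Stop the timer thread; do not call wait_durable() afterwards."""
        with self._cond:
            self._stop = True
        self._thread.join()

    def _run(self):
        while True:
            time.sleep(self._delay)
            with self._cond:
                if self._stop:
                    return
                if self._waiting == 0:
                    continue             # nothing to commit on this tick
                self._started += 1
                batch = self._started
            self._sync_fn()              # one flush for the whole batch
            with self._cond:
                self._completed = batch
                self._cond.notify_all()
```

In this sketch a delivery thread would write the mail, call wait_durable(), and only then answer "250 Ok". With sync_delay = 0.1, twenty concurrent deliveries typically share one or two syncs instead of issuing twenty.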
Any opinions on this?
On 6.1.2011, at 0.27, Attila Nagy wrote:
With this, 10 syncs would happen every second from Dovecot LDA processes, meaning if a client connects at t=0 it will immediately get the response 250, if a client connects at t=0.05, it will get the response in 50 ms (in an ideal world, where syncing does not take time), and the committed blocks could accumulate for a maximum of 100 ms. In a busy system (where this setting would make sense), it means it would be possible to write more data with fewer IOPS.
I guess this could work. Although earlier I thought about delaying fsyncs so that when saving/copying multiple mails in a single transaction with maildir it would delay about 10 or so close()s and then fsync() them all at the same time at the end. This ended up being slower (but I only tested it with a single user - maybe in real world setups it might have worked better).
I can see two problems:
- there is no call for committing a lot of file descriptors in one transaction, so instead of fsync() for each of the modified FDs, a sync() would be needed. sync() writes all buffers to stable storage, which is bad if you have a mixed workload, where there are a lot of non-fsynced data, or other heavy fsync users. But modern file systems, like ZFS will write those back too, so there an fsync(fd) is -AFAIK- mostly equivalent with a sync(pool on which fd is). sync() of course is system wide, so if you have other file systems, those will be synced as well. (this setting isn't for everybody)
- in a multiprocess environment this would need coordination, so instead of doing fsyncs in distinct processes, there would be one process needed, which does the sync and returns OK for the others, so they can notify the client about the commit to the stable storage.
It's possible for you to send the fds to another process via UNIX socket that does fsync() on them. I was also hoping for using lib-fs for at least some mailbox formats at some point (either some modified dbox, or a new one), and for that it would be even easier to add an fsync plugin that does this kind of fsync-transfer to another process.
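A rough sketch of that fd-passing approach, using Python's socket.send_fds() and socket.recv_fds() (Python 3.9+, UNIX only), which wrap SCM_RIGHTS ancillary messages. The function names and the one-byte "S"/"K" request/ack protocol are invented for illustration and are not Dovecot's.

```python
import os
import socket


def request_fsync(sock, fds):
    """Delivery side: pass open fds to the syncer and wait for its ack."""
    socket.send_fds(sock, [b"S"], list(fds))
    return sock.recv(1) == b"K"


def serve_one_request(sock, maxfds=64):
    """Syncer side: receive a batch of fds, fsync each, then acknowledge.

    Returns the number of file descriptors that were synced.
    """
    _data, fds, _flags, _addr = socket.recv_fds(sock, 1, maxfds)
    try:
        for fd in fds:
            os.fsync(fd)
    finally:
        for fd in fds:
            os.close(fd)   # received fds are duplicates owned by this process
    sock.send(b"K")
    return len(fds)
```

The two ends could be connected with socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM) or a named UNIX socket. The sender keeps its own copies of the descriptors and closes them as usual after the ack arrives.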
On 01/05/2011 11:38 PM, Timo Sirainen wrote:
On 6.1.2011, at 0.27, Attila Nagy wrote:
With this, 10 syncs would happen every second from Dovecot LDA processes, meaning if a client connects at t=0 it will immediately get the response 250, if a client connects at t=0.05, it will get the response in 50 ms (in an ideal world, where syncing does not take time), and the committed blocks could accumulate for a maximum of 100 ms. In a busy system (where this setting would make sense), it means it would be possible to write more data with fewer IOPS.
I guess this could work. Although earlier I thought about delaying fsyncs so that when saving/copying multiple mails in a single transaction with maildir it would delay about 10 or so close()s and then fsync() them all at the same time at the end. This ended up being slower (but I only tested it with a single user - maybe in real world setups it might have worked better).
What filesystem was used for this test? If it writes only the involved FD's data on an fsync, the effect is pretty much the same whether you issue the fsyncs in real time or serialize them to nearly the same moment: the file system will write small amounts of data and issue a flush after each fsync. On a file system that writes all the dirty data for an fsync (like ZFS does), it may work better, although only the first fsync would be necessary; with the others you only risk that other data got into the caches in the meantime, which makes the solution useless.
That's why I wrote that in this case you would need to use sync() instead of fsync(), which would make this file system independent.
Many sync() man pages say this:
FreeBSD:
BUGS
The sync() system call may return before the buffers are completely flushed.
Linux:
BUGS
According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, since version 1.3.20 Linux does actually wait. (This still does not guarantee data integrity: modern disks have large caches.)
But I think the same warning applies to fsync too. Otherwise, I guess a little experimenting and reading would be needed here.
I think for this setting it would be OK to assume some technical knowledge on the part of the users and say: you should only turn this on if you have a file system that flushes all dirty buffers of the entire file system for a single fsync. Then you would delay the fsyncs for a list of FDs and issue only one for the whole list, instead of one for each list element. Or just issue a single sync().
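The "delay the fsyncs for a list of FDs" variant could look roughly like the sketch below. DeferredSync and its fs_flushes_everything flag are hypothetical names, and the single-fsync shortcut is only safe under the assumption stated above, namely that the file system flushes all of its dirty buffers on any fsync. When that is not known to hold, the sketch falls back to one fsync per descriptor.

```python
import os


class DeferredSync:
    """Collects written fds and commits them together."""

    def __init__(self, fs_flushes_everything, fsync_fn=os.fsync):
        # ASSUMPTION: when fs_flushes_everything is True, one fsync() on any
        # fd makes all dirty data on that file system durable.
        self._fs_flushes_everything = fs_flushes_everything
        self._fsync_fn = fsync_fn
        self._fds = []

    def add(self, fd):
        """Remember an fd whose data has been written but not yet synced."""
        self._fds.append(fd)

    def commit(self):
        """Sync the pending fds; returns how many fsync calls were issued."""
        fds, self._fds = self._fds, []
        if not fds:
            return 0
        targets = fds[:1] if self._fs_flushes_everything else fds
        for fd in targets:
            self._fsync_fn(fd)
        return len(targets)
```

All pending fds must live on the same file system for the shortcut to make sense; calling os.sync() instead would drop that requirement at the cost of flushing everything system wide.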
I can see two problems:
- there is no call for committing a lot of file descriptors in one transaction, so instead of fsync() for each of the modified FDs, a sync() would be needed. sync() writes all buffers to stable storage, which is bad if you have a mixed workload, where there are a lot of non-fsynced data, or other heavy fsync users. But modern file systems, like ZFS will write those back too, so there an fsync(fd) is -AFAIK- mostly equivalent with a sync(pool on which fd is). sync() of course is system wide, so if you have other file systems, those will be synced as well. (this setting isn't for everybody)
- in a multiprocess environment this would need coordination, so instead of doing fsyncs in distinct processes, there would be one process needed, which does the sync and returns OK for the others, so they can notify the client about the commit to the stable storage.
It's possible for you to send the fds to another process via UNIX socket that does fsync() on them. I was also hoping for using lib-fs for at least some mailbox formats at some point (either some modified dbox, or a new one), and for that it would be even easier to add an fsync plugin that does this kind of fsync-transfer to another process.
Yes, but as stated above, I don't think it will help: on a file system that writes only the given FD's data it's the same, nothing gained, and on a file system that flushes all dirty buffers you will possibly have some dirty buffers between those fsyncs, so it will be the same, and IOPS will be the limiting factor.
Participants (2): Attila Nagy, Timo Sirainen