Hello,
Currently Dovecot (and any other application that cares about e-mail delivery) does at least one fsync per mail delivery. Given that hard disk drives have very limited IOPS, this effectively caps the maximum mail delivery rate at a very low value, underutilizing the available storage I/O capacity. Calculating with an average mail size of 50 kB and an average consumer HDD doing 120 IOPS, the theoretical mail delivery throughput is 50 kB * 120 IOPS = 6000 kB/s ≈ 5.9 MB/s. But if we could write 500 kB with every transaction, the delivery throughput would be nearly ten times higher.
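To spell out the numbers behind that claim (assuming ten 50 kB mails are accumulated and committed together per I/O operation):

  one fsync per mail:     50 kB/commit * 120 commits/s =  6 000 kB/s ≈  5.9 MB/s
  ten mails per commit:  500 kB/commit * 120 commits/s = 60 000 kB/s ≈ 59 MB/s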
Dovecot has two process models: a separate process for each client connection, and an async in-process multiplexing model. The idea works with either one, although the timing is somewhat different. So here's the idea: instead of fsyncing immediately in the LDA (lmtpd) every time the client says "\r\n.\r\n" after the DATA phase, let's introduce a user-settable timer (let's call it sync_delay from now on) and only sync once every sync_delay seconds. This would introduce a delay of up to sync_delay seconds before lmtpd returns "250 Ok" to the client, but that's generally not a problem, because high-traffic setups have a great deal of concurrency, so you can easily use a lot of client connections. Take an example setting of sync_delay = 100 ms. With this, the Dovecot LDA processes would perform 10 syncs per second, meaning that a client whose delivery finishes right at a sync point (say t=0) gets its 250 response immediately, a client finishing at t=0.05 gets it 50 ms later at the next sync (in an ideal world where syncing itself takes no time), and the blocks to be committed can accumulate for at most 100 ms. In a busy system (where this setting would make sense), this makes it possible to write more data while needing fewer IOPS. (A rough sketch of what such a delayed-sync loop could look like follows the list below.) I can see two problems:
1. There is no system call for committing a lot of file descriptors in one transaction, so instead of an fsync() for each of the modified FDs, a sync() would be needed. sync() writes all dirty buffers to stable storage, which is bad if you have a mixed workload with a lot of data that was never meant to be fsynced, or other heavy fsync users. But modern file systems like ZFS write those back too, so there an fsync(fd) is -AFAIK- mostly equivalent to a sync() of the pool the fd lives on. sync() is of course system-wide, so if you have other file systems, those will be synced as well. (This setting isn't for everybody.)
2. In a multiprocess environment this would need coordination: instead of doing fsyncs in distinct processes, a single process would perform the sync and then report OK to the others, so they can notify their clients that the mail has been committed to stable storage.
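To make the idea a bit more concrete, here is a rough, self-contained sketch in plain C with pthreads (not Dovecot code; names like sync_delay_ms, syncer and deliver_mail are made up for illustration). Delivery workers write their mail without fsyncing and then block until a single syncer has done the next sync(), at which point each of them could answer "250 Ok":

/*
 * Rough illustration only, not Dovecot code. One syncer thread calls
 * sync() every sync_delay_ms milliseconds; delivery threads write a mail,
 * skip the per-mail fsync() and instead wait for the next completed sync()
 * before they would answer "250 Ok".
 */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NMAILS 10

static const long sync_delay_ms = 100;      /* the proposed sync_delay */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  synced = PTHREAD_COND_INITIALIZER;
static unsigned long sync_generation = 0;   /* bumped after every sync() */
static int shutting_down = 0;

/* The single entity that commits everybody's data at once. */
static void *syncer(void *arg)
{
    (void)arg;
    for (;;) {
        usleep(sync_delay_ms * 1000);
        sync();                             /* one flush for all pending mails */
        pthread_mutex_lock(&lock);
        sync_generation++;
        pthread_cond_broadcast(&synced);    /* wake every waiting delivery */
        int stop = shutting_down;
        pthread_mutex_unlock(&lock);
        if (stop)
            return NULL;
    }
}

/* One delivery: write the mail, then wait (up to ~sync_delay) for a sync. */
static void *deliver_mail(void *arg)
{
    int id = *(int *)arg;
    char path[64];
    static char msg[50 * 1024];             /* stand-in for a 50 kB message */

    snprintf(path, sizeof(path), "mail-%d.eml", id);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0 || write(fd, msg, sizeof(msg)) != (ssize_t)sizeof(msg)) {
        perror(path);
        exit(1);
    }
    close(fd);                              /* note: no fsync() here */

    pthread_mutex_lock(&lock);
    unsigned long seen = sync_generation;
    while (sync_generation == seen)         /* block until the next sync() */
        pthread_cond_wait(&synced, &lock);
    unsigned long batch = sync_generation;
    pthread_mutex_unlock(&lock);

    printf("mail %d: 250 Ok (committed in batch %lu)\n", id, batch);
    return NULL;
}

int main(void)
{
    pthread_t s, workers[NMAILS];
    int ids[NMAILS];

    pthread_create(&s, NULL, syncer, NULL);
    for (int i = 0; i < NMAILS; i++) {
        ids[i] = i;
        pthread_create(&workers[i], NULL, deliver_mail, &ids[i]);
    }
    for (int i = 0; i < NMAILS; i++)
        pthread_join(workers[i], NULL);

    pthread_mutex_lock(&lock);
    shutting_down = 1;                      /* let the syncer exit after its next round */
    pthread_mutex_unlock(&lock);
    pthread_join(s, NULL);
    return 0;
}

In real Dovecot the "workers" would be separate LMTP processes rather than threads, so the condition variable would have to be replaced with some IPC towards a dedicated syncer process (problem 2 above), but the control flow would be the same.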
Any opinions on this?