[Dovecot] RFC: grouped (f)sync

Thu Jan 6 00:27:32 EET 2011

  Hello,

Currently Dovecot (and any other application that cares about e-mail 
delivery) does at least one fsync per mail delivery. Given that hard 
disk drives have very limited IOPS, this effectively limits the 
maximum mail delivery performance to a very low value, underutilizing 
the available storage IO capacity.
Calculating with an average mail size of 50 kB and an average consumer 
HDD with 120 IOPS, and assuming one IO operation per delivery, the 
theoretical mail delivery performance will be 50 kB * 120 IOPS = 6000 
kB/s (about 5.86 MiB/s). But if we could write 500 kB with every 
transaction, the delivery speed would be nearly 10 times as high.
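
To make the arithmetic above easy to replay with other numbers, here is a 
small sketch. It assumes (as the estimate does) that every sync costs 
exactly one IO operation; the function name is made up for illustration.

```python
def delivery_throughput_kb_per_s(kb_per_sync, iops):
    """Upper bound on delivery throughput, assuming one IO operation per sync."""
    return kb_per_sync * iops


per_mail = delivery_throughput_kb_per_s(50, 120)    # one 50 kB mail per sync
batched = delivery_throughput_kb_per_s(500, 120)    # ten mails per sync

print(per_mail, "kB/s =", round(per_mail / 1024, 2), "MiB/s")
print(batched, "kB/s =", round(batched / 1024, 2), "MiB/s")
print("speedup:", batched / per_mail)
```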

Dovecot has two process models: separate processes for each client 
connection and an async in-process multiplexing method. The proposal 
below works with either one, although the timing differs somewhat.
So here's the idea: instead of fsyncing immediately in the LDA (lmtpd) 
every time the client says "\r\n.\r\n" after the DATA phase, let's 
introduce a user-settable timer (let's call it sync_delay from now on) 
and only sync once every sync_delay seconds.
This would introduce a delay of up to sync_delay seconds before lmtpd 
returns "250 Ok" to the client, but that's generally not a problem, 
because in high-traffic setups there is a great amount of concurrency, 
so you could easily use a lot of client connections.
Take an example setting of sync_delay = 100 ms.
With this, 10 syncs would happen every second from Dovecot LDA 
processes, meaning that if a client connects at t=0 it will immediately 
get the response 250, if a client connects at t=0.05, it will get the 
response in 50 ms (in an ideal world, where syncing does not take time), 
and the uncommitted blocks could accumulate for a maximum of 100 ms.
In a busy system (where this setting would make sense), it means it 
would be possible to write more data with fewer IOPS.
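
The timing in this example can be sketched as follows. This assumes the 
syncs fire on a fixed grid (t = 0, sync_delay, 2*sync_delay, ...) and 
take no time themselves, as in the ideal case above; the function name 
is made up for illustration, and times are in whole milliseconds to 
avoid floating point rounding.

```python
def ack_delay_ms(arrival_ms, sync_delay_ms):
    """Time a finished delivery waits for the next sync tick.

    Ticks are assumed to happen at 0, sync_delay_ms, 2 * sync_delay_ms, ...
    and the sync itself is assumed to be instantaneous.
    """
    return (-arrival_ms) % sync_delay_ms


for arrival in (0, 50, 130):
    print("arrival at", arrival, "ms -> 250 after", ack_delay_ms(arrival, 100), "ms")
```

The wait is always below sync_delay, so the worst case added latency is 
bounded by the setting itself.
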
I can see two problems:
1. there is no system call for committing many file descriptors in one 
transaction, so instead of an fsync() for each of the modified FDs, a 
sync() would be needed. sync() writes all buffers to stable storage, 
which is bad if you have a mixed workload with a lot of non-fsynced 
data, or with other heavy fsync users. But modern file systems, like 
ZFS, will write those back too, so there an fsync(fd) is, AFAIK, mostly 
equivalent to a sync of the pool the fd is on. sync() of course is 
system wide, so if you have other file systems, those will be synced as 
well. (this setting isn't for everybody)
2. in a multiprocess environment this would need coordination: instead 
of doing fsyncs in distinct processes, one process would be needed that 
does the sync and reports OK to the others, so they can notify their 
clients about the commit to stable storage.
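
To illustrate the coordination in point 2, here is a minimal sketch, not 
Dovecot code. It uses threads inside one process in place of separate 
lmtpd processes (a real multiprocess version would need some IPC between 
the workers and the syncing process). GroupSyncer, commit, deliver and 
whole_system are hypothetical names. By default it issues the per-FD 
fsync() calls together at each tick; setting whole_system=True uses a 
single system-wide sync() instead, as discussed in point 1.

```python
import os
import threading


class GroupSyncer:
    """Collects pending file descriptors and flushes them once per tick."""

    def __init__(self, sync_delay=0.1, whole_system=False):
        self.sync_delay = sync_delay        # seconds between flush rounds
        self.whole_system = whole_system    # True: one os.sync() per round
        self.flushes = 0                    # flush rounds that had work to do
        self._lock = threading.Lock()
        self._pending = []                  # (fd, event) pairs awaiting a flush
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def commit(self, fd):
        """Block the caller until fd has been flushed by the syncer thread."""
        done = threading.Event()
        with self._lock:
            self._pending.append((fd, done))
        done.wait()

    def _run(self):
        # Wake up every sync_delay seconds until close() is called.
        while not self._stop.wait(self.sync_delay):
            self._flush()
        self._flush()                       # final round for late arrivals

    def _flush(self):
        with self._lock:
            batch, self._pending = self._pending, []
        if not batch:
            return
        if self.whole_system:
            os.sync()                       # one system-wide flush for the batch
        else:
            for fd, _ in batch:
                os.fsync(fd)                # per-FD flushes, issued back to back
        self.flushes += 1
        for _, done in batch:
            done.set()                      # workers may now answer "250 Ok"

    def close(self):
        """Stop the syncer; commit() must not be called after this."""
        self._stop.set()
        self._thread.join()


def deliver(syncer, directory, name, body):
    """Write one mail, wait for the grouped sync, then acknowledge it."""
    path = os.path.join(directory, name)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, body)
        syncer.commit(fd)                   # returns only after the flush
    finally:
        os.close(fd)
    return "250 Ok"
```

Each worker calls deliver() and only answers its client after commit() 
returns, so the durability guarantee is unchanged; only the moment of 
the flush is shared between concurrent deliveries.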

Any opinions on this?