[Dovecot] RFC: grouped (f)sync

Thu Jan 6 10:37:02 EET 2011

  On 01/05/2011 11:38 PM, Timo Sirainen wrote:
> On 6.1.2011, at 0.27, Attila Nagy wrote:
>
>> With this, 10 syncs would happen every second from Dovecot LDA processes, meaning if a client connects at t=0 it will immediately get the response 250, if a client connects at t=0.05, it will get the response in 50 ms (in an ideal world, where syncing does not take time), and the committed blocks could accumulate for a maximum of 100 ms.
>> In a busy system (where this setting would make sense), it means it would be possible to write more data with less IOPS needed.
> I guess this could work. Although earlier I thought about delaying fsyncs so that when saving/copying multiple mails in a single transaction with maildir it would delay about 10 or so close()s and then fsync() them all at the same time at the end. This ended up being slower (but I only tested it with a single user - maybe in real world setups it might have worked better).
Which file system was used for this test? If it writes only the given 
FD's data on an fsync, the effect is much the same whether you issue 
the fsyncs as the data arrives or batch them to run at nearly the same 
time: the file system writes a small amount of data and issues a flush 
after each fsync.
On a file system that writes all dirty data on an fsync (as ZFS 
does), it may work better, although only the first fsync would be 
necessary. Each later fsync only risks that other data has reached the 
caches in the meantime and must be flushed too, which defeats the 
purpose.
That's why I wrote that in this case you would need to use sync() 
instead of fsync(), which would make this independent of the file system.
Many sync() man pages carry a warning like this:
FreeBSD:
BUGS
      The sync() system call may return before the buffers are completely
      flushed.
Linux:
BUGS
        According to the standard specification (e.g., POSIX.1-2001), sync()
        schedules the writes, but may return before the actual writing is done.
        However, since version 1.3.20 Linux does actually wait. (This still
        does not guarantee data integrity: modern disks have large caches.)

But I think the same warning applies to fsync too.
Otherwise, I guess a little experimenting and reading would be needed 
here. I think for this setting it would be OK to assume some technical 
knowledge on the user's part and say: you should only turn this on if 
your file system flushes all of its dirty buffers on a single fsync.
Then you would delay the fsyncs for a list of FDs and issue only one 
for the whole list, instead of one for each element.
Or just issue a single sync().
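
To make the variants concrete, here is a rough sketch (in Python for 
brevity, not Dovecot code). The names save_batch, write_all and the 
msg.N file naming are made up for illustration. The "one" mode is only 
safe under the assumption discussed above, that a single fsync flushes 
every dirty buffer of the file system; "sync" relies on sync() actually 
waiting, which the man pages quoted above do not promise everywhere.

```python
import os

def write_all(fd, data):
    """Write every byte of data, retrying on short writes."""
    view = memoryview(data)
    while view:
        view = view[os.write(fd, view):]

def save_batch(directory, messages, mode="each"):
    """Write each message to its own file and flush only at the end.

    Hypothetical helper for illustration only.
    mode "each": fsync() every descriptor (the portable baseline)
    mode "one":  fsync() only the last descriptor (ASSUMES the file
                 system flushes all dirty data on any fsync)
    mode "sync": one system-wide sync()
    Returns the number of flush calls issued.
    """
    if mode not in ("each", "one", "sync"):
        raise ValueError("unknown mode: %r" % (mode,))
    fds = []
    try:
        for i, body in enumerate(messages):
            path = os.path.join(directory, "msg.%d" % i)
            flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
            fds.append(os.open(path, flags, 0o600))
            write_all(fds[-1], body)
        if mode == "sync":
            os.sync()  # flushes every file system, not just ours
            return 1
        targets = fds if mode == "each" else fds[-1:]
        for fd in targets:
            os.fsync(fd)
        return len(targets)
    finally:
        # Descriptors stay open until after the flush, then get closed.
        for fd in fds:
            os.close(fd)
```

For three messages, "each" issues three flush calls while "one" and 
"sync" issue one. Whether that single call really covers the other 
files is exactly the file system dependent part.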

>> I can see two problems:
>> 1. there is no call for committing a lot of file descriptors in one transaction, so instead of fsync() for each of the modified FDs, a sync() would be needed. sync() writes all buffers to stable storage, which is bad if you have a mixed workload with a lot of non-fsynced data, or other heavy fsync users. But modern file systems like ZFS will write those back too, so there an fsync(fd) is, AFAIK, mostly equivalent to a sync() of the pool the fd is on. sync() is of course system-wide, so if you have other file systems, those will be synced as well. (This setting isn't for everybody.)
>> 2. in a multiprocess environment this would need coordination, so instead of doing fsyncs in distinct processes, there would be one process needed, which does the sync and returns OK for the others, so they can notify the client about the commit to the stable storage.
> It's possible for you to send the fds to another process via UNIX socket that does fsync() on them. I was also hoping for using lib-fs for at least some mailbox formats at some point (either some modified dbox, or a new one), and for that it would be even easier to add an fsync plugin that does this kind of fsync-transfer to another process.
Yes, but as stated above, I don't think it will help. On a file 
system that writes only the given FD's data, it's the same: nothing is 
gained. On a file system that flushes all dirty buffers, you will 
possibly have new dirty buffers between those fsyncs, so again it will 
be the same: IOPS will be the limiting factor.
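
For reference, the mechanics of handing fds to a separate syncing 
process could look roughly like the sketch below. It is Python rather 
than C, the names syncer and commit_via_syncer are hypothetical, and 
socket.send_fds()/socket.recv_fds() need Python 3.9 or later on a Unix 
system. The helper here fsyncs each descriptor as it arrives, so it 
only shows the transfer and the acknowledgement, not any batching.

```python
import os
import socket

def syncer(sock):
    """Receive fds over a UNIX socket, fsync them, acknowledge each request.

    Hypothetical helper; returns when the peer closes the socket.
    """
    while True:
        data, fds, _flags, _addr = socket.recv_fds(sock, 16, 8)
        if not data:
            break  # peer closed the connection
        for fd in fds:
            os.fsync(fd)
            os.close(fd)  # close our received duplicate only
        sock.sendall(b"K")  # one-byte acknowledgement

def commit_via_syncer(sock, fd):
    """Pass fd to the syncer and wait until it reports the fsync done."""
    socket.send_fds(sock, [b"S"], [fd])
    return sock.recv(1) == b"K"
```

A delivery process would call commit_via_syncer() on its end of the 
socket and send the 250 response to the client only once it returns 
True.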