[Dovecot] RFC: grouped (f)sync
Attila Nagy
bra at fsn.hu
Thu Jan 6 10:37:02 EET 2011
On 01/05/2011 11:38 PM, Timo Sirainen wrote:
> On 6.1.2011, at 0.27, Attila Nagy wrote:
>
>> With this, 10 syncs would happen in every second from Dovecot LDA processes, meaning if a client connects in t=0 it will immediately got the response 250, if a client connects in t=0.05, it will get the response in 50 ms (in an ideal world, where syncing does not take time), and the committed blocks could accumulate for a maximum of 100 ms.
>> In a busy system (where this setting would make sense), it means it would be possible to write more data with less IOPS needed.
> I guess this could work. Although earlier I thought about delaying fsyncs so that when saving/copying multiple mails in a single transaction with maildir it would delay about 10 or so close()s and then fsync() them all at the same time at the end. This ended up being slower (but I only tested it with a single user - maybe in real world setups it might have worked better).
What file system was used for this test? If it writes only the involved
FD's data on an fsync, the effect is pretty much the same whether you
issue the fsyncs in real time or batch them to happen at nearly the same
time: the file system will write small amounts of data and issue a flush
after each fsync.
On a file system which writes all the dirty data on an fsync (like ZFS
does), it may work better, although only the first fsync would be
necessary. With the others you only risk that other data got into the
caches in the meantime, which makes the solution useless.
That's why I wrote that in this case you would need to use sync() instead
of fsync(), which would make this file system independent.
Many sync() man pages say something like this:
FreeBSD:
  BUGS
    The sync() system call may return before the buffers are completely
    flushed.
Linux:
  BUGS
    According to the standard specification (e.g., POSIX.1-2001), sync()
    schedules the writes, but may return before the actual writing is
    done. However, since version 1.3.20 Linux does actually wait. (This
    still does not guarantee data integrity: modern disks have large
    caches.)
But I think the same warning applies to fsync too.
Otherwise, I guess a little experimenting and reading would be needed
here. I think for this setting it would be OK to assume some technical
knowledge on the users' part and say: you should only turn this on if
your file system flushes all dirty buffers of the entire file system on
a single fsync.
Then you would delay the fsyncs for a list of FDs and issue only one for
the whole list, instead of one for each list element.
Or just issue a single sync().
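As a rough sketch of the delayed-fsync idea above (this is not Dovecot
code; fsync_batch, save_unsynced, batch_add, batch_commit and the
flush_mode values are hypothetical names made up for illustration): the
FDs are kept open after writing, collected in a list, and then flushed
with one fsync() per FD, a single fsync() on the first FD, or a single
sync(). The single-fsync mode is only safe under the assumption stated
above, that the file system flushes all of its dirty buffers on any
fsync.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

#define BATCH_MAX 64

/* Hypothetical flush strategies, matching the three options discussed. */
enum flush_mode {
    FLUSH_EACH_FSYNC, /* one fsync() per FD (file system independent) */
    FLUSH_ONE_FSYNC,  /* one fsync() total; ASSUMES the file system
                         flushes all dirty data on any fsync */
    FLUSH_SYNC        /* one system-wide sync() */
};

/* Hypothetical helper type: FDs that were written but not yet synced. */
struct fsync_batch {
    int fds[BATCH_MAX];
    size_t count;
};

void batch_init(struct fsync_batch *b)
{
    assert(b != NULL);
    b->count = 0;
}

/* Write data to path and return the still-open FD, or -1 on error.
   The close() is deliberately delayed until batch_commit(). */
int save_unsynced(const char *path, const void *data, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Queue an FD for the next commit. Returns -1 if the batch is full. */
int batch_add(struct fsync_batch *b, int fd)
{
    if (b->count == BATCH_MAX)
        return -1;
    b->fds[b->count++] = fd;
    return 0;
}

/* Flush according to mode, then close every queued FD.
   Returns 0 on success, -1 if any fsync() or close() failed. */
int batch_commit(struct fsync_batch *b, enum flush_mode mode)
{
    size_t i, nsync = 0;
    int ret = 0;

    if (mode == FLUSH_EACH_FSYNC)
        nsync = b->count;
    else if (mode == FLUSH_ONE_FSYNC && b->count > 0)
        nsync = 1;
    else if (mode == FLUSH_SYNC)
        sync(); /* returns void, so errors cannot be detected here */

    for (i = 0; i < nsync; i++) {
        if (fsync(b->fds[i]) < 0)
            ret = -1;
    }
    for (i = 0; i < b->count; i++) {
        if (close(b->fds[i]) < 0)
            ret = -1;
    }
    b->count = 0;
    return ret;
}
```

Note that with FLUSH_SYNC the caller gets no error indication at all,
and on systems where sync() may return early (see the man page excerpts
above) it gives no durability guarantee either.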
>> I can see two problems:
>> 1. there is no call for committing a lot of file descriptors in one transaction, so instead of fsync() for each of the modified FDs, a sync() would be needed. sync() writes all buffers to stable storage, which is bad if you have a mixed workload, where there are a lot of non-fsynced data, or other heavy fsync users. But modern file systems, like ZFS will write those back too, so there an fsync(fd) is -AFAIK- mostly equivalent with a sync(pool on which fd is). sync() of course is system wide, so if you have other file systems, those will be synced as well. (this setting isn't for everybody)
>> 2. in a multiprocess environment this would need coordination, so instead of doing fsyncs in distinct processes, there would be one process needed, which does the sync and returns OK for the others, so they can notify the client about the commit to the stable storage.
> It's possible for you to send the fds to another process via UNIX socket that does fsync() on them. I was also hoping for using lib-fs for at least some mailbox formats at some point (either some modified dbox, or a new one), and for that it would be even easier to add an fsync plugin that does this kind of fsync-transfer to another process.
Yes, but as stated above, I don't think it will help. On a file system
which writes only the given FD's data, it's the same, nothing is gained.
On a file system which flushes all dirty buffers, you will possibly have
some dirty buffers accumulating between those fsyncs, so it will be the
same again: IOPS will be the limiting factor.
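For reference, the fd transfer mentioned in the quoted text is normally
done with SCM_RIGHTS ancillary data on an AF_UNIX socket. A minimal
sketch follows, assuming a connected UNIX socket between the two
processes; send_fd and recv_fd are hypothetical helper names, not
Dovecot functions. The receiving process would then call fsync() on the
FD it gets back from recv_fd().

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one int (one FD). */
union fd_ctl {
    struct cmsghdr hdr;
    char buf[CMSG_SPACE(sizeof(int))];
};

/* Send fd over the UNIX socket sock. Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char byte = 0; /* at least one byte of real data must be sent */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    assert(fd >= 0);
    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one FD from sock. Returns the new FD, or -1 on error. */
int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_ctl ctl;
    struct msghdr msg;
    struct cmsghdr *cmsg;
    int fd;

    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;

    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS ||
        cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
        return -1;

    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

This only moves the fsync() call to another process; by itself it does
not reduce the number of flushes the file system has to perform.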