[Dovecot] Best filesystem?

Mon Jan 31 19:23:31 EET 2011

Ron Leach wrote:
> Finally, and I do apologise for all the questions, we're wishing to
> move to NFS.  (At the moment we have a 'one box' Dovecot solution,
> but this makes upgrade of OS, upgrade of Dovecot, or upgrade of
> storage always a problem.  We have already exported the new XFS
> filestore over NFS - but Dovecot is not (yet) using it, that's the
> next step for us.) Does the fsync solution we've been discussing
> work just as well when the XFS filestore is exported over NFS?
> 

I realise replying to self is not the best thing to do, but I do not
want to waste people's time.  NFS is handled with quite different
system calls, and has quite different 'sequential behaviour' (for
want of a better word).  Deep in the comments on Ted Ts'o's post,
referred to previously by Timo, are these remarks:

Delayed allocation and the zero-length file problem | Thoughts by Ted

> "In comment #49, Ted says:
> 
> First of all, that’s not NFS’s model; NFS v2/v3 requires that each
> write RPC call not return until the data has hit stable storage. So
> in fact it’s a stronger requirement than alloc on close.
> 
> This statement is misleading.
> 
> Firstly, it accurately describes the NFSv2 semantics, but no sane
> person deliberately uses NFSv2 anymore so the statement is of no
> help in the real world.
> 
> The NFSv3 semantics are more flexible. The v3 WRITE RPC adds a flag
> which allows the client to say whether it wants the data on stable
> storage before the RPC returns, i.e. whether to do the old slow
> thing that was the only way with NFSv2. This flag is hardly ever
> used by clients (O_SYNC or a “sync” mount enable it, as does
> O_DIRECT in most circumstances). Instead clients will typically
> send a bunch of WRITE calls with data, and then a COMMIT call which
> does the actual forcing of data to server-side stable storage. This
> is significantly faster than the NFSv2 model.
> 
> NFSv4 behaves like NFSv3 in this regard, but adds a further feature
> called file delegations which complicate the picture even further.
> 
> Now to actually answer Frank’s question. But first some background.
> 
> 
> The NFS protocol doesn’t know about any block-level behaviour like
> allocation, it works entirely on files. When the allocation occurs
> is a server-side implementation detail entirely. Having the data on
> stable storage is indeed a stronger requirement than forcing
> allocation, but the server could choose to do the allocation at any
> time from the start of unstable WRITE RPC to the end of the COMMIT
> RPC, which can be a window of several seconds.
> 
> NFS however does keep data in the client which has been written by
> applications on the client but not yet sent to the server. If the
> lifetime of the application is short and the file is small, this
> could include the entire data of the file.
> 
> NFS clients practice a behaviour known as CTO or “Close-To-Open
> consistency”, which is a very weak form of inter-client file cache
> consistency. This means that when the application close()s the last
> fd the client will perform the equivalent of an fsync(), i.e. issue
> WRITEs to the server for any dirty data remaining in the client
> and a COMMIT to force that data to stable storage on the server. In
> other words, when close() returns to the app, the data is safe on
> the server. This is the behaviour on close() that Frank refers to
> above. Note that this is a much tighter constraint than POSIX
> requires."
> 

http://thunk.org/tytso/blog/2009/03/12/delayed-allocation-and-the-zero-length-file-problem/
   [Comment # 84]
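
To make the quoted behaviour concrete, here is a minimal sketch of
what an application on the NFS client can do to be sure its data has
reached the server.  This is illustrative Python, not Dovecot's code;
the function name durable_write is hypothetical.  The assumption,
taken from the comment above, is that fsync() on the client results
in WRITEs followed by a COMMIT, so a successful return means the data
is on the server's stable storage.

```python
import os

def durable_write(path, data):
    """Write data (bytes) to path; return only after fsync() and close() succeed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        offset = 0
        while offset < len(data):
            # os.write() may write fewer bytes than requested, so loop.
            offset += os.write(fd, data[offset:])
        # Ask the kernel to push dirty data to stable storage.  On an NFS
        # client this is where deferred write errors are likely to surface.
        os.fsync(fd)
    finally:
        # close() can also report an error (raised as OSError), so let it
        # propagate rather than ignoring it.
        os.close(fd)
```

Any OSError from os.fsync() or os.close() should be treated as "the
message may not be safe", not silently dropped.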

Further, the NFS shares can be mounted on the client with the 'sync'
option, which makes each write reach the server's stable storage
before the call returns to the application.  Though this would be
horrifically slow under any high load (network transmission times,
disc I/O queues, etc.), in our low-load situation we could consider
using this option to minimise the potential for email loss due to a
crash or power failure.
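
The quoted comment also mentions O_SYNC as a per-file way of getting
the same stable-write behaviour as a 'sync' mount.  A minimal sketch
of that alternative follows, assuming a Unix system where os.O_SYNC
is available; the function name write_with_o_sync is hypothetical,
and I am not claiming that Dovecot opens its files this way.

```python
import os

def write_with_o_sync(path, data):
    """Write data (bytes) to path with O_SYNC, so each write() is synchronous."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o600)
    try:
        offset = 0
        while offset < len(data):
            # With O_SYNC each write() should return only once the data
            # is on stable storage (per the comment above, on NFSv3 the
            # client then requests stable WRITEs from the server).
            offset += os.write(fd, data[offset:])
    finally:
        os.close(fd)
```

This confines the performance cost to the files that need it, at the
price of having to control how the application opens them.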

One further optimization, not strictly about Dovecot or email, but
worth mentioning in the (unlikely) event that anyone is really this
interested: if we were to split our XFS export into two shares, one
for email and the other for general data storage, then we could mount
only the email share with 'sync' (hence ensuring immediate writes)
and leave the general storage share without it.
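
If we did split the shares, it would be worth checking that the
'sync' option really is in effect on the mail mount and not on the
other.  A rough sketch of such a check is below.  It parses text in
the format of Linux's /proc/mounts; the server name, export paths and
mount points in SAMPLE_MOUNTS are made up for illustration, and the
function names are hypothetical.

```python
# Hypothetical example of two NFS mounts, in /proc/mounts format:
# device, mount point, fs type, comma-separated options, dump, pass.
SAMPLE_MOUNTS = """\
nfsserver:/export/mail /srv/mail nfs rw,sync,vers=3,hard 0 0
nfsserver:/export/data /srv/data nfs rw,vers=3,hard 0 0
"""

def mount_options(mounts_text, mount_point):
    """Return the set of options for mount_point, or None if not mounted."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Keep the last matching entry: a later mount on the same
            # point hides an earlier one.
            found = set(fields[3].split(","))
    return found

def is_sync_mounted(mounts_text, mount_point):
    """True only if mount_point is mounted and lists the 'sync' option."""
    opts = mount_options(mounts_text, mount_point)
    return opts is not None and "sync" in opts
```

On a live Linux client the text would come from reading /proc/mounts,
e.g. is_sync_mounted(open("/proc/mounts").read(), "/srv/mail").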

Unless I'm wrong about something here, I think this closes the 
NFS-related concern about XFS and Dovecot and loss of email.

regards, Ron