[Dovecot] minimize mbox mdbox fragmentation
Hi Timo,
Any chance the mbox/mdbox writer code could be modified to do physical preallocation on files to help avoid file(system) fragmentation? Constantly appending a file is the prime recipe for causing fragmentation, and mbox is notorious for this--not a fault of Dovecot but the nature of the mbox beast. Obviously maildir doesn't have such a problem, but some (many?) of us still prefer mbox for many reasons, fast full body search being one, portability being another. mdbox file fragmentation could benefit from such a change as well.
I was having a discussion on the XFS list about this, trying to tweak XFS mount options to mitigate the fragmentation effects. Alas, there is no way to do this purely at the filesystem level. From Dave Chinner, one of the lead XFS devs:
"What you want is _physical_ preallocation, not speculative preallocation. i.e. look up XFS_IOC_RESVSP or FIEMAP so your application does _permanent_ preallocate past EOF. Alternatively, the filesystem will avoid the truncation on close() if the file has the APPEND attribute set and the application is writing via O_APPEND...
The filesystem cannot do everything for you. Sometimes the application has to help...."
How great would the effort need to be to implement something like this? Would the return on investment be sufficient to justify doing this, in your eyes?
-- Stan
On Tue, 2010-10-19 at 21:55 -0500, Stan Hoeppner wrote:
Any chance the mbox/mdbox writer code could be modified to do physical preallocation on files to help avoid file(system) fragmentation?
I've been thinking about that before.
"What you want is _physical_ preallocation, not speculative preallocation. i.e. look up XFS_IOC_RESVSP or FIEMAP so your application does _permanent_ preallocate past EOF.
Oh, interesting. I didn't know that was possible. And even better: Linux has fallocate() that can do it for other filesystems than just XFS. Or looks like it's only XFS and ext4 (ext3 doesn't support it). I don't know if other OSes support this. Maybe in future I could make mdbox support writing to files whose size has been preallocated by actually writing NUL bytes, but that requires some extra code.
http://hg.dovecot.org/dovecot-2.0/rev/22c81f884032 http://hg.dovecot.org/dovecot-2.0/rev/b884441a713f
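For readers unfamiliar with the call, here is a minimal sketch of what preallocating past EOF with Linux fallocate() can look like. It is not taken from the changesets linked above; the helper name reserve_past_eof() and its parameters are made up for illustration, and the use of FALLOC_FL_KEEP_SIZE (reserve blocks without changing the visible file size) is an assumption about what an appending writer would want.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* glibc exposes fallocate() and FALLOC_FL_* only with this */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical helper (not Dovecot code): reserve `extra` bytes of
 * physical space starting at offset `eof` without changing the visible
 * file size. Returns 0 on success, or -1 with errno set, e.g.
 * EOPNOTSUPP when the filesystem has no fallocate support. */
int reserve_past_eof(int fd, off_t eof, off_t extra)
{
    int ret;

    do {
        /* FALLOC_FL_KEEP_SIZE: allocate blocks but leave st_size alone. */
        ret = fallocate(fd, FALLOC_FL_KEEP_SIZE, eof, extra);
    } while (ret == -1 && errno == EINTR);

    return ret;
}
```

A caller would treat EOPNOTSUPP as "carry on without preallocation" rather than as a fatal error, since the thread notes that several filesystems lack support.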
On 2010-10-20 12:53 PM, Timo Sirainen wrote:
Oh, interesting. I didn't know that was possible. And even better: Linux has fallocate() that can do it for other filesystems than just XFS. Or looks like it's only XFS and ext4 (ext3 doesn't support it).
How about reiserfs (3, not 4)?
--
Best regards,
Charles
On Wed, 2010-10-20 at 13:32 -0400, Charles Marcus wrote:
How about reiserfs (3, not 4)?
Doesn't support.
On Wed, Oct 20, 2010 at 06:45:17PM +0100, Timo Sirainen wrote:
Doesn't support.
Is it possible with UFS and ZFS?
-- Denny Lin
On 21.10.2010, at 3.12, Denny Lin wrote:
Is it possible with UFS and ZFS?
Linux doesn't support either and my googling didn't find any FreeBSD or Solaris interface for this feature, so I don't know.
On Oct 20, 2010, at 10:14 PM, Timo Sirainen wrote:
Linux doesn't support either and my googling didn't find any FreeBSD or Solaris interface for this feature, so I don't know.
AIX supports posix_fallocate, but only for JFS2. For GPFS (what we use), the function is gpfs_prealloc.
On 20/10/2010 18:32, Charles Marcus wrote:
How about reiserfs (3, not 4)?
Though in the case of the "small number of large files" (i.e. the opposite of ReiserFS's strength), which you would get with mbox and mdbox, one would have to ask what upside ReiserFS would bring to the party which would outweigh the downside that Namesys is kaput and Hans Reiser is currently serving 15 years to life for second degree murder in a California State Prison...
Bill
On 2010-10-21 6:58 AM, William Blunn wrote:
Though in the case of the "small number of large files" (i.e. the opposite of ReiserFS's strength), which you would get with mbox and mdbox,
Good point...
one would have to ask what upside ReiserFS would bring to the party which would outweigh the downside that Namesys is kaput and Hans Reiser is currently serving 15 years to life for second degree murder in a California State Prison...
Please don't politicize the issue.
Reiserfs is not 'kaput', it is still being maintained in the linux kernel (both v3 and work is ongoing for v4), and will be for the foreseeable future.
--
Best regards,
Charles
On 21/10/2010 14:25, Charles Marcus wrote:
Reiserfs is not 'kaput', it is still being maintained in the linux kernel (both v3 and work is ongoing for v4), and will be for the foreseeable future.
For the benefit of anyone reading this and wondering "Well is it kaput or not?": Charles and I are both right.
I said that *Namesys* was kaput. Namesys was the company which developed ReiserFS.
Charles stated correctly (but perhaps confusingly when done as a reply to my comment) that *ReiserFS* is not kaput.
We were talking about two different things.
The company is dead, but the software (being open source) lives on.
Bill
On 2010-10-21 10:40 AM, William Blunn wrote:
For the benefit of anyone reading this and wondering "Well is it kaput or not?": Charles and I are both right.
I said that *Namesys* was kaput. Namesys was the company which developed ReiserFS.
Charles stated correctly (but perhaps confusingly when done as a reply to my comment) that *ReiserFS* is not kaput.
We were talking about two different things.
The company is dead, but the software (being open source) lives on.
That was my point... the way it was worded, your comment could be interpreted as saying that reiserfs itself is kaput. I merely pointed out that the demise of the original programmers' company/website, for an open-source project whose code has been included in the mainline kernel for a very long time, is far from the same as the filesystem itself being kaput - which was the last part of your comment.
--
Best regards,
Charles
Timo Sirainen put forth on 10/20/2010 11:53 AM:
On Tue, 2010-10-19 at 21:55 -0500, Stan Hoeppner wrote:
Any chance the mbox/mdbox writer code could be modified to do physical preallocation on files to help avoid file(system) fragmentation?
I've been thinking about that before.
"What you want is _physical_ preallocation, not speculative preallocation. i.e. look up XFS_IOC_RESVSP or FIEMAP so your application does _permanent_ preallocate past EOF.
Oh, interesting. I didn't know that was possible. And even better: Linux has fallocate() that can do it for other filesystems than just XFS. Or looks like it's only XFS and ext4 (ext3 doesn't support it). I don't know if other OSes support this. Maybe in future I could make mdbox support writing to files whose size has been preallocated by actually writing NUL bytes, but that requires some extra code.
http://hg.dovecot.org/dovecot-2.0/rev/22c81f884032 http://hg.dovecot.org/dovecot-2.0/rev/b884441a713f
There exists posix_fallocate() which would widen the platforms that would support this Timo. You may also want to look at posix_fadvise() as well (if you're not using it already) which might increase Dovecot's overall disk performance a bit.
NOTE: I don't believe fallocate() in either posix or linux only form will actually accomplish decreased m[d]box file fragmentation. I don't believe it actually increases the file size on disk, i.e. physically allocating additional free extents tailing the end of the file. fallocate() is _speculative_ preallocation, which isn't what you want. mbox and mdbox file _will_ grow, so you'd want _physical_ preallocation. I'm not sure if physical preallocation requires writing a bunch of zeros to the end of the file or not. I don't "think" it does. I think you can extend the size of the file past EOF to grow the file and the remainder is just left at nulls or something. Again, I'm not a dev. I know just barely enough about this stuff to get myself into real trouble. ;)
See these comments:
On Tue, Oct 19, 2010 at 10:03:19PM -0500, Stan Hoeppner wrote:
Dave Chinner put forth on 10/19/2010 6:42 PM:
I've explained how allocsize works, and that speculative allocation gets truncated away when the file is closed. Hence if the application is doing:
open() seek(EOF) write() close()
I don't know if it changes anything in the sequence above, but Dovecot uses mmap i/o. As I've said, I'm not a dev. Just thought this could/might be relevant. Would using mmap be compatible with physical preallocation?
mmap() can't write beyond EOF or extend the file. Hence it would have to be:
open()
mmap()
ftruncate(new_size)
<write via mmap>
In this method, there is no speculative preallocation because there is never a delayed allocation that extends the file size. It simply doesn't matter where the close() occurs. Hence if you use mmap() writes like this, the only way you can avoid fragmentation is to use physical preallocation beyond EOF before you start any writes....
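The mmap write pattern described above can be sketched roughly as follows. This is an illustration only, not Dovecot code: append_via_mmap() and its arguments are hypothetical. The sketch calls ftruncate() before mmap() so that the mapping already covers the new size, since stores to mapped pages beyond EOF would raise SIGBUS. The descriptor is assumed to be open read-write.

```c
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: append `len` bytes at offset `old_size` through a
 * shared mapping. The file is grown with ftruncate() first, because a
 * mapping cannot extend the file by itself.
 * Returns 0 on success, -1 on error (errno set). */
int append_via_mmap(int fd, off_t old_size, const void *data, size_t len)
{
    off_t new_size = old_size + (off_t)len;
    char *map;

    if (len == 0)
        return 0;
    if (ftruncate(fd, new_size) == -1)
        return -1;

    /* Map from offset 0 so no page-alignment arithmetic is needed. */
    map = mmap(NULL, (size_t)new_size, PROT_READ | PROT_WRITE,
               MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -1;

    memcpy(map + old_size, data, len);
    return munmap(map, (size_t)new_size);
}
```

Because the file size is set explicitly and no delayed allocation extends it, any fragmentation control in this pattern has to come from a physical preallocation call made before the writes, as the quoted explanation says.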
It would be beneficial I think if you'd sub to the xfs list Timo and pick some brains. All the devs there are Linux devs but have experience with many platforms including IRIX and other UNIX variants. Most if not all of them have been developing on UNIX systems their entire careers, and only UNIX. They could answer any question you have about the Linux IO subsystem, not just XFS specific stuff. Some are current SGI employees some former, some Redhat, etc. They could probably answer any posix call questions you might have as well.
http://oss.sgi.com/mailman/listinfo/xfs
-- Stan
On Wed, 2010-10-20 at 23:50 -0500, Stan Hoeppner wrote:
Oh, interesting. I didn't know that was possible. And even better: Linux has fallocate() that can do it for other filesystems than just XFS. Or looks like it's only XFS and ext4 (ext3 doesn't support it). I don't know if other OSes support this. There exists posix_fallocate() which would widen the platforms that would support this Timo.
I know about posix_fallocate(), but that does a different thing. It would work only when this works:
Maybe in future I could make mdbox support writing to files whose size has been preallocated by actually writing NUL bytes, but that requires some extra code.
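To illustrate the difference being described, here is a small sketch of posix_fallocate(). Unlike Linux fallocate() with FALLOC_FL_KEEP_SIZE, it extends the file's visible size, so the preallocated tail reads back as NUL bytes and a writer has to track separately where the real data ends. The helper name grow_with_posix_fallocate() is hypothetical. On glibc, when the filesystem lacks native support, the call may fall back to writing into each block, which is slower.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: make sure the file has storage for at least
 * `want` bytes. posix_fallocate() grows st_size when the range extends
 * past EOF. Returns 0 on success or an errno value (posix_fallocate()
 * reports errors through its return value, not through errno). */
int grow_with_posix_fallocate(int fd, off_t want)
{
    struct stat st;

    if (fstat(fd, &st) == -1)
        return errno;
    if (st.st_size >= want)
        return 0; /* already large enough, nothing to do */

    return posix_fallocate(fd, st.st_size, want - st.st_size);
}
```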
You may also want to look at posix_fadvise() as well (if you're not using it already) which might increase Dovecot's overall disk performance a bit.
I once wrote a patch for that and asked if anyone could confirm if it made things worse or better, but no one ever did. http://dovecot.org/list/dovecot/2007-November/026819.html
NOTE: I don't believe fallocate() in either posix or linux only form will actually accomplish decreased m[d]box file fragmentation. I don't believe it actually increases the file size on disk, i.e. physically allocating additional free extents tailing the end of the file. fallocate() is _speculative_ preallocation, which isn't what you want.
Why do you believe that? The man pages and everything tell me that it's about physical preallocation:
fallocate is used to preallocate blocks to a file. For filesystems
which support the fallocate system call, this is done quickly by
allocating blocks and marking them as uninitialized, requiring no IO to the
data blocks. This is much faster than creating a file by filling it
with zeros.
Also "du" shows that the file's size is actually the preallocated size. That's how I checked which filesystems supported it.
And I think Linux fallocate() with XFS internally does exactly what XFS_IOC_RESVSP does, which your quoted mail mentioned being about physical preallocation.
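The "du" check mentioned above can also be done from code by comparing a file's allocated blocks with its logical size. A minimal sketch, assuming Linux, where st_blocks is counted in 512-byte units; the helper name allocated_bytes() is made up for illustration.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper: report how many bytes are physically allocated
 * to the file, which is roughly what "du" prints. Assumes the Linux
 * convention that st_blocks counts 512-byte units regardless of the
 * filesystem block size. Returns -1 on error (errno set). */
off_t allocated_bytes(int fd)
{
    struct stat st;

    if (fstat(fd, &st) == -1)
        return -1;

    return (off_t)st.st_blocks * 512;
}
```

If the returned value is clearly larger than st_size after a preallocation call that keeps the file size unchanged, the filesystem has reserved real blocks past EOF; if it stays the same, the call was unsupported or treated as a no-op.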
I don't know if it changes anything in the sequence above, but Dovecot uses mmap i/o.
mmap is used only for index files, not for mdbox files.
It would be beneficial I think if you'd sub to the xfs list Timo and pick some brains.
Maybe, but there is still so much other stuff to do, like bug fixing. :)
Timo Sirainen put forth on 10/21/2010 9:52 AM:
On Wed, 2010-10-20 at 23:50 -0500, Stan Hoeppner wrote:
Oh, interesting. I didn't know that was possible. And even better: Linux has fallocate() that can do it for other filesystems than just XFS. Or looks like it's only XFS and ext4 (ext3 doesn't support it). I don't know if other OSes support this. There exists posix_fallocate() which would widen the platforms that would support this Timo.
I know about posix_fallocate(), but that does a different thing. It would work only when this works:
Ahh, ok, didn't realize that it was different than the Linux specific call.
Maybe in future I could make mdbox support writing to files whose size has been preallocated by actually writing NUL bytes, but that requires some extra code.
You may also want to look at posix_fadvise() as well (if you're not using it already) which might increase Dovecot's overall disk performance a bit.
I once wrote a patch for that and asked if anyone could confirm if it made things worse or better, but no one ever did. http://dovecot.org/list/dovecot/2007-November/026819.html
I wasn't aware of this. I don't really have the resources, skill set, or time to test patches, or I'd do so. :(
NOTE: I don't believe fallocate() in either posix or linux only form will actually accomplish decreased m[d]box file fragmentation. I don't believe it actually increases the file size on disk, i.e. physically allocating additional free extents tailing the end of the file. fallocate() is _speculative_ preallocation, which isn't what you want.
Why do you believe that? The man pages and everything tell me that it's about physical preallocation:
fallocate is used to preallocate blocks to a file. For filesystems which support the fallocate system call, this is done quickly by allocating blocks and marking them as uninitialized, requiring no IO to the data blocks. This is much faster than creating a file by filling it with zeros.
Also "du" shows that the file's size is actually the preallocated size. That's how I checked which filesystems supported it.
And I think Linux fallocate() with XFS internally does exactly what XFS_IOC_RESVSP does, which your quoted mail mentioned being about physical preallocation.
Well that's cool then.
I don't know if it changes anything in the sequence above, but Dovecot uses mmap i/o.
mmap is used only for index files, not for mdbox files.
Ahh, that's really good to know. IIRC the comments for enabling mmap in dovecot.conf don't state this. I thus assumed enabling mmap was for all files.
It would be beneficial I think if you'd sub to the xfs list Timo and pick some brains.
Maybe, but there is still so much other stuff to do, like bug fixing. :)
Looks like you've got a handle on it. Like I said, I know just enough about this stuff to make me dangerous. ;)
-- Stan
On 20.10.2010, at 17.53, Timo Sirainen wrote:
"What you want is _physical_ preallocation, not speculative preallocation. i.e. look up XFS_IOC_RESVSP or FIEMAP so your application does _permanent_ preallocate past EOF.
Oh, interesting. I didn't know that was possible. And even better: Linux has fallocate() that can do it for other filesystems than just XFS. Or looks like it's only XFS and ext4 (ext3 doesn't support it). I don't know if other OSes support this. Maybe in future I could make mdbox support writing to files whose size has been preallocated by actually writing NUL bytes, but that requires some extra code.
Looks like OS X has fcntl(F_PREALLOCATE), although it doesn't seem to produce any visible results, other than giving ENOSPC error if I give too large of a size (100 MB or so). Disk usage doesn't shrink though, so maybe it's more of a hint?..
Mike, do you think this code is actually working/useful with OSX? :)
On 21.10.2010, at 22.58, Timo Sirainen wrote:
Looks like OS X has fcntl(F_PREALLOCATE), although it doesn't seem to produce any visible results, other than giving ENOSPC error if I give too large of a size (100 MB or so). Disk usage doesn't shrink though, so maybe it's more of a hint?..
Mike, do you think this code is actually working/useful with OSX? :)
Oh, forgot to give the link:
participants (6)
- Charles Marcus
- Denny Lin
- Jonathan Siegle
- Stan Hoeppner
- Timo Sirainen
- William Blunn