[Dovecot] dsync redesign
In case anyone is interested in reading (and maybe helping!) with a dsync redesign that's intended to fix all of its current problems, here are some possibly incoherent ramblings about it:
http://dovecot.org/tmp/dsync-redesign.txt
and even if you don't understand that, here's another document disguised as an algorithm class problem :) If anyone has thoughts on how to solve it, that would be great:
http://dovecot.org/tmp/dsync-redesign-problem.txt
It only deals with saving new messages, not expunges/flag changes/etc, but those should be much simpler.
On 03/23/12 22:25, Timo Sirainen wrote:
In case anyone is interested in reading (and maybe helping!) with a dsync redesign that's intended to fix all of its current problems, here are some possibly incoherent ramblings about it:
http://dovecot.org/tmp/dsync-redesign.txt
and even if you don't understand that, here's another document disguised as an algorithm class problem :) If anyone has thoughts on how to solve it, that would be great:
http://dovecot.org/tmp/dsync-redesign-problem.txt
It only deals with saving new messages, not expunges/flag changes/etc, but those should be much simpler.
Well, dsync is a very useful tool, but with continuous replication it tries to solve a problem which should be handled -at least partially- elsewhere. Storing stuff in plain file systems and duplicating them to another one just doesn't scale.
I personally think that Dovecot could gain much more if the amount of work going into fixing or improving dsync went into making Dovecot able to use a high-scale, distributed storage backend. I know it's much harder, because there are several major differences compared to the "low latency" and consistency-problem-free local file systems, but its fruits are also sweeter for the long term. :)
It would bring Dovecot into the class of open source mail servers where there are currently no contenders.
BTW, for the previous question in this topic (are there any nosql dbs supporting application-level conflict resolution?), there are similar solutions (like CouchDB, but having some experience with it, I wouldn't recommend it for massive mail storage -at least the plain CouchDB product), but I guess you would be better off designing a schema which doesn't need it in the first place. For example, messages are immutable, so you won't face this issue in that area.
And for metadata, maybe the solution is not to store "digested" snapshots of the current metadata (folders, flags, message links for folders etc), but to store the changes happening on the user's mailbox and occasionally aggregate them into a last known good and consistent state.
Also, there are other interesting ideas, maybe with real single instance store (splitting MIME parts? Storing attachments in plain binary form? This always brings up the question of whether the mail server should modify the mails, which can be pretty bad for encrypted/signed stuff).
And of course there is always the problem of designing a good, consistent method which is also efficient.
On Sat, Mar 24, 2012 at 08:19:48AM +0100, Attila Nagy wrote:
On 03/23/12 22:25, Timo Sirainen wrote:
Well, dsync is a very useful tool, but with continuous replication it tries to solve a problem which should be handled -at least partially- elsewhere. Storing stuff in plain file systems and duplicating them to another one just doesn't scale.
I don't see why this shouldn't scale. Mailboxes are, after all, changed relatively infrequently. One idea for making it more scalable might be to treat indexes/metadata and messages differently. Make index/metadata updates synchronous over the clusters/locations (with re-sync capability in case of lost synchronisation), while messages are stored in one "altstorage" per cluster/location.
For a two-location solution, message-data should be stored in:
mail_location = mdbox:~/mdbox
ALTcache=mdbox:~/mdbox-remoteip-cache
ALT=dfetch://remoteip/ <-- new protocol
If a message is in the index, look for it in that order:
local mdbox
ALTcache
ALT
if it finds the message in ALT, make a copy into ALTcache (or local mdbox?).
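To make the lookup order concrete, here's a minimal toy sketch in C of the fallback idea (the tier layout and all names are invented for the example; this is not Dovecot's storage API):

/* Toy model of the tiered lookup: local mdbox -> ALTcache -> remote ALT,
 * copying remote hits into ALTcache.  Invented names, not Dovecot code. */
#include <stdio.h>

#define TIER_SIZE 8

struct tier {
    const char *name;
    unsigned int uids[TIER_SIZE];
    int count;
};

static int tier_has(const struct tier *t, unsigned int uid)
{
    for (int i = 0; i < t->count; i++)
        if (t->uids[i] == uid)
            return 1;
    return 0;
}

static void tier_add(struct tier *t, unsigned int uid)
{
    if (t->count < TIER_SIZE)
        t->uids[t->count++] = uid;
}

/* Look up uid in the order local -> ALTcache -> ALT; cache remote hits. */
static const char *fetch(struct tier *local, struct tier *cache,
                         struct tier *alt, unsigned int uid)
{
    if (tier_has(local, uid))
        return local->name;
    if (tier_has(cache, uid))
        return cache->name;
    if (tier_has(alt, uid)) {
        tier_add(cache, uid);  /* fill ALTcache so the next fetch stays local */
        return alt->name;
    }
    return NULL;
}

int main(void)
{
    struct tier local = { "local mdbox", { 1, 2 }, 2 };
    struct tier cache = { "ALTcache", { 3 }, 1 };
    struct tier alt   = { "remote ALT", { 4 }, 1 };

    for (unsigned int uid = 1; uid <= 5; uid++) {
        const char *where = fetch(&local, &cache, &alt, uid);
        printf("uid %u: %s\n", uid, where != NULL ? where : "not found");
    }
    printf("uid 4 now cached locally: %s\n", tier_has(&cache, 4) ? "yes" : "no");
    return 0;
}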
Synchronizing messages could be a very low frequency job, and could be handled by a simple rsync of ALT to ALTcache. No need for a specialized tool for this job. Synchronizing ALTcache to local mdbox could be done with a reversed doveadm-altmove, but might not be necessary.
Of course this is probably all very naive.. but you get the idea :-)
-jf
On 24.3.2012, at 9.19, Attila Nagy wrote:
Well, dsync is a very useful tool, but with continuous replication it tries to solve a problem which should be handled -at least partially- elsewhere. Storing stuff in plain file systems and duplicating them to another one just doesn't scale.
dsync solves several other problems besides replication. Even if Dovecot had a super efficient replicated storage, dsync would still exist for doing things like:
- migrating between mailbox formats
- migrating from other imap/pop3 servers
- creating (incremental) backups
- the redesign works great for super-high latency replication (USB sticks, cross-planet replication :)
- and when you really just don't want any kind of a complex replicated database, just something simple
So I'll need to get this working well in any case. And with the redesign the replication should be efficient enough to scale pretty well.
I personally think that Dovecot could gain much more if the amount of work going into fixing or improving dsync went into making Dovecot able to use a high-scale, distributed storage backend. I know it's much harder, because there are several major differences compared to the "low latency" and consistency-problem-free local file systems, but its fruits are also sweeter for the long term. :)
Yes, I'm also planning on implementing that, but not yet.
It would bring Dovecot into the class of open source mail servers where there are currently no contenders.
BTW, for the previous question in this topic (are there any nosql dbs supporting application-level conflict resolution?), there are similar solutions (like CouchDB, but having some experience with it, I wouldn't recommend it for massive mail storage -at least the plain CouchDB product), but I guess you would be better off designing a schema which doesn't need it in the first place. For example, messages are immutable, so you won't face this issue in that area. And for metadata, maybe the solution is not to store "digested" snapshots of the current metadata (folders, flags, message links for folders etc), but to store the changes happening on the user's mailbox and occasionally aggregate them into a last known good and consistent state.
My plan was to create index files similar to those that currently exist in the filesystem. It would work pretty much the same as you described: There's a "log" where changes are appended, and once in a while the changes are written into an "index" snapshot. When reading, you first read the snapshot and then apply the new changes from the log. The conflict resolution, if the DB supports it, would work by reading the two logs in parallel and figuring out a way to merge them consistently, similar to how dsync does pretty much the same thing. Hmm. Perhaps the metadata log could exist exactly as the dsync data format and have dsync code do the merging?..
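As a rough illustration of the snapshot + change log idea (toy code only, nothing like the real index file format):

/* Toy snapshot + append-only change log.  Readers load the snapshot and
 * replay the log; compaction folds the log back into a new snapshot. */
#include <stdio.h>

#define MAX_MSGS 16
#define FLAG_SEEN 0x01

enum change_type { CH_ADD_FLAG, CH_REMOVE_FLAG };

struct change {
    enum change_type type;
    unsigned int uid;
    unsigned int flag;
};

struct snapshot {
    unsigned int flags[MAX_MSGS];  /* per-UID flag bitmask */
};

/* Replay log entries on top of a snapshot to get the current state. */
static void apply_log(struct snapshot *snap,
                      const struct change *log, int log_count)
{
    for (int i = 0; i < log_count; i++) {
        if (log[i].type == CH_ADD_FLAG)
            snap->flags[log[i].uid] |= log[i].flag;
        else
            snap->flags[log[i].uid] &= ~log[i].flag;
    }
}

/* "Compaction": fold the log into the snapshot, after which the log can be emptied. */
static int compact(struct snapshot *snap, const struct change *log, int log_count)
{
    apply_log(snap, log, log_count);
    return 0;  /* new log length */
}

int main(void)
{
    struct snapshot snap = { { 0 } };
    struct change log[] = {
        { CH_ADD_FLAG, 1, FLAG_SEEN },
        { CH_ADD_FLAG, 2, FLAG_SEEN },
        { CH_REMOVE_FLAG, 1, FLAG_SEEN },
    };
    int log_count = 3;

    /* Read path: copy the snapshot and replay the pending changes. */
    struct snapshot view = snap;
    apply_log(&view, log, log_count);
    printf("uid 1 seen: %u, uid 2 seen: %u\n",
           view.flags[1] & FLAG_SEEN, view.flags[2] & FLAG_SEEN);

    log_count = compact(&snap, log, log_count);
    printf("after compaction: log entries %d, uid 2 seen: %u\n",
           log_count, snap.flags[2] & FLAG_SEEN);
    return 0;
}

Conflict resolution would then be a matter of merging two such logs deterministically, which is essentially what dsync already does for mailbox changes.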
Also, there are other interesting ideas, maybe with real single instance store (splitting MIME parts? Storing attachments in plain binary form? This always brings up the question of whether the mail server should modify the mails, which can be pretty bad for encrypted/signed stuff).
This is already optionally done in v2.0+dbox. MIME attachments can be stored in plain binary form if they can be reconstructed back into their original form. It doesn't break any signed stuff.
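The decisive check there is that re-encoding the decoded attachment must reproduce the original bytes exactly; otherwise the message is left in its original form. A toy sketch of that round-trip test, using hex as a stand-in encoding instead of the real base64 handling:

/* Toy round-trip check: store an attachment body decoded only if
 * re-encoding the decoded bytes reproduces the original text exactly.
 * Hex is used here as a stand-in for the real base64 encoding. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *encode_hex(const unsigned char *data, size_t len)
{
    char *out = malloc(len * 2 + 1);
    for (size_t i = 0; i < len; i++)
        sprintf(out + i * 2, "%02x", (unsigned int)data[i]);
    out[len * 2] = '\0';
    return out;
}

static unsigned char *decode_hex(const char *text, size_t *len_r)
{
    size_t len = strlen(text) / 2;
    unsigned char *out = malloc(len > 0 ? len : 1);
    for (size_t i = 0; i < len; i++) {
        unsigned int byte;
        sscanf(text + i * 2, "%2x", &byte);
        out[i] = (unsigned char)byte;
    }
    *len_r = len;
    return out;
}

/* Returns 1 if it is safe to store the body in decoded (binary) form. */
static int can_store_decoded(const char *encoded_body)
{
    size_t len;
    unsigned char *decoded = decode_hex(encoded_body, &len);
    char *reencoded = encode_hex(decoded, len);
    int ok = strcmp(encoded_body, reencoded) == 0;
    free(decoded);
    free(reencoded);
    return ok;
}

int main(void)
{
    /* "48656c6c6f" round-trips exactly; the uppercase variant does not,
     * so it would be kept in its original encoded form. */
    printf("canonical body:     %d\n", can_store_decoded("48656c6c6f"));
    printf("non-canonical body: %d\n", can_store_decoded("48656C6C6F"));
    return 0;
}

Because the stored form can always be reconstructed bit-for-bit, signatures over the original message stay valid.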
On Sat, 2012-03-24 at 08:19 +0100, Attila Nagy wrote:
I personally think that Dovecot could gain much more if the amount of work going into fixing or improving dsync went into making Dovecot able to use a high-scale, distributed storage backend. I know it's much harder, because there are several major differences compared to the "low latency" and consistency-problem-free local file systems, but its fruits are also sweeter for the long term. :)
Do you have any suggestions for a distributed replicated filesystem
that works well with dovecot? I've looked into glusterfs, but the latency is way too high for lots of small files. They claim this problem is fixed in glusterfs 3.3. NFS is too slow for my installation, so I don't see how any of the distributed filesystems would help me. I've also tried out ZFS, but it appears to have issues with metadata lookups in directories that have tens or hundreds of thousands of files in them. For me, the best filesystem is straight up ext4 running on locally attached storage. I think a solid, fast dsync implementation would be very useful for a large installation.
...Jeff
On 3/26/2012 2:34 PM, Jeff Gustafson wrote:
Do you have any suggestions for a distributed replicated filesystem that works well with dovecot? I've looked into glusterfs, but the latency is way too high for lots of small files. They claim this problem is fixed in glusterfs 3.3. NFS is too slow for my installation, so I don't see how any of the distributed filesystems would help me. I've also tried out ZFS, but it appears to have issues with metadata lookups in directories that have tens or hundreds of thousands of files in them. For me, the best filesystem is straight up ext4 running on locally attached storage. I think a solid, fast dsync implementation would be very useful for a large installation.
It sounds like you're in need of a more robust and capable storage/backup solution, such as an FC/iSCSI SAN array with PIT and/or incremental snapshot capability.
Also, you speak of a very large maildir store, with hundreds of thousands of directories, obviously many millions of files, of 1TB total size. Thus I would assume you have many thousands of users, if not 10s of thousands.
It's a bit hard to believe you're not running XFS on your storage, given your level of parallelism. You'd get much better performance using XFS vs EXT4. Especially with kernel 2.6.39 or later which includes the delayed logging patch. This patch increases metadata write throughput by a factor of 2-50+ depending on thread count, and decreases IOPS and MB/s hitting the storage by about the same factor, depending on thread count.
Before this patch XFS sucked at the write portion of the maildir workload due to the extremely high IOPS and MB/s hitting just the log journal, not including the actual file writes. Its parallel maildir read performance was better than any other, but the write was so bad it bogged down the storage, producing high latency for everything. With the delaylog patch, XFS now trounces every filesystem at medium to high parallelism levels. Delaylog was included in 2.6.35 as experimental, and is the default in 2.6.39 and later. If you're a Red Hat or CentOS user it's included in 6.2.
This one patch, which was 5+ years in development, dramatically changed the character of XFS with this class of metadata intensive parallel workloads. Many people with such a workload who ran from XFS in the past, as if it were the Fukushima reactor, are now adopting it in droves.
What a difference a few hundred lines of very creative code can make...
-- Stan
On Tue, 2012-03-27 at 15:09 -0500, Stan Hoeppner wrote:
On 3/26/2012 2:34 PM, Jeff Gustafson wrote:
Do you have any suggestions for a distributed replicated filesystem that works well with dovecot? I've looked into glusterfs, but the latency is way too high for lots of small files. They claim this problem is fixed in glusterfs 3.3. NFS is too slow for my installation, so I don't see how any of the distributed filesystems would help me. I've also tried out ZFS, but it appears to have issues with metadata lookups in directories that have tens or hundreds of thousands of files in them. For me, the best filesystem is straight up ext4 running on locally attached storage.
It sounds like you're in need of a more robust and capable storage/backup solution, such as an FC/iSCSI SAN array with PIT and/or incremental snapshot capability.
We do have a FC system that another department is using. The company
dropped quite a bit of cash on it for a specific purpose. Our department does not have access to it. People are somewhat afraid of iSCSI around here because they believe it will add too much latency to the overall IO performance. They're big believers in locally attached disks. Fewer features, but very good performance. We thought ZFS would provide us with a nice snapshot and backup system (with zfs send). We never got that far once we discovered that ZFS doesn't work very well in this context. Running rsync on it gave us terrible performance.
Also, you speak of a very large maildir store, with hundreds of thousands of directories, obviously many millions of files, of 1TB total size. Thus I would assume you have many thousands of users, if not 10s of thousands.
It's a bit hard to believe you're not running XFS on your storage, given your level of parallelism. You'd get much better performance using XFS vs EXT4. Especially with kernel 2.6.39 or later which includes the delayed logging patch. This patch increases metadata write throughput by a factor of 2-50+ depending on thread count, and decreases IOPS and MB/s hitting the storage by about the same factor, depending on thread count.
I'm relatively new here, but I'll ask around about XFS and see if
anyone had tested it in the development environment.
...Jeff
On 3/27/2012 3:57 PM, Jeff Gustafson wrote:
We do have a FC system that another department is using. The company dropped quite a bit of cash on it for a specific purpose. Our department does not have access to it. People are somewhat afraid of iSCSI around here because they believe it will add too much latency to the overall IO performance. They're big believers in locally attached disks. Fewer features, but very good performance.
If you use a software iSCSI initiator with standard GbE ports, block IO latency can become a problem, but basically in only 3 scenarios:
1. Slow CPUs or not enough CPUs/cores. This is unlikely to be a problem in 2012, given the throughput of today's multi-core CPUs. Low CPU throughput hasn't generally been the cause of software iSCSI initiator latency problems since pre-2007/8 with most applications. I'm sure some science/sim apps that tax both CPU and IO may have still had issues. Those would be prime candidates for iSCSI HBAs.
2. An old OS kernel that doesn't thread IP stack, SCSI encapsulation, and/or hardware interrupt processing amongst all cores. Recent Linux kernels do this rather well, especially with MSI-X enabled, older ones not so well. I don't know about FreeBSD, Solaris, AIX, HP-UX, Windows, etc.
3. System under sufficiently high CPU load to slow IP stack and iSCSI encapsulation processing, and/or interrupt handling. Again, with today's multi-core fast CPUs this probably isn't going to be an issue, especially given that POP/IMAP are IO latency bound, not CPU bound. Most people running Dovecot today are going to have plenty of idle CPU cycles to perform the additional iSCSI initiator and TCP stack processing without introducing undue block IO latency effects.
As always, YMMV. The simple path is to acquire your iSCSI SAN array and use software initiators on the client hosts. In the unlikely event you do run into block IO latency issues, you simply drop an iSCSI HBA into each host suffering the latency. They run ~$700-900 USD each for single port models, and they eliminate block IO latency completely, which is one reason they cost so much. They have an onboard RISC chip and memory doing the TCP and SCSI encapsulation processing. They also give you the ability to boot diskless servers from LUNs on the SAN array. This is very popular with blade server systems, and I've done this many times myself, albeit with fibre channel HBAs/SANs, not iSCSI.
Locally attached/internal/JBOD storage typically offers the best application performance per dollar spent, until you get to things like backup scenarios, where off node network throughput is very low, and your backup software may suffer performance deficiencies, as is the issue titling this thread. Shipping full or incremental file backups across ethernet is extremely inefficient, especially with very large filesystems. This is where SAN arrays with snapshot capability come in really handy.
The snap takes place wholly within the array and is very fast, without the problems you see with host based snapshots such as with Linux LVM, where you must first freeze the filesystem, wait for the snapshot to complete, which could be a very long time with a 1TB FS. While this occurs your clients must wait or timeout while trying to access mailboxes. With a SAN array snapshot system this isn't an issue as the snap is transparent to hosts with little or no performance degradation during the snap. Two relatively inexpensive units that have such snapshot capability are:
http://www.equallogic.com/products/default.aspx?id=10613
http://h10010.www1.hp.com/wwpc/us/en/sm/WF04a/12169-304616-241493-241493-241...
The Equallogic units are 1/10 GbE iSCSI only IIRC, whereas the HP can be had in 8Gb FC, 1/10Gb iSCSI, or 6Gb direct attach SAS. Each offers 4 or more host/network connection ports when equipped with dual controllers. There are many other vendors with similar models/capabilities. I mention these simply because Dell/HP are very popular and many OPs are already familiar with their servers and other products.
We thought ZFS would provide us with a nice snapshot and backup system (with zfs send). We never got that far once we discovered that ZFS doesn't work very well in this context. Running rsync on it gave us terrible performance.
There are 3 flavors of ZFS: native Oracle Solaris, native FreeBSD, Linux FUSE. Which were you using? If the last, that would fully explain the suck.
Also, you speak of a very large maildir store, with hundreds of thousands of directories, obviously many millions of files, of 1TB total size. Thus I would assume you have many thousands of users, if not 10s of thousands.
It's a bit hard to believe you're not running XFS on your storage, given your level of parallelism. You'd get much better performance using XFS vs EXT4. Especially with kernel 2.6.39 or later which includes the delayed logging patch. This patch increases metadata write throughput by a factor of 2-50+ depending on thread count, and decreases IOPS and MB/s hitting the storage by about the same factor, depending on thread count.
I'm relatively new here, but I'll ask around about XFS and see if anyone had tested it in the development environment.
If they'd tested it properly, and relatively recently, I would think they'd have already replaced EXT4 on your Dovecot server. Unless other factors prevented such a migration. Or unless I've misunderstood the size of your maildir workload.
-- Stan
On Wed, 2012-03-28 at 11:07 -0500, Stan Hoeppner wrote:
Locally attached/internal/JBOD storage typically offers the best application performance per dollar spent, until you get to things like backup scenarios, where off node network throughput is very low, and your backup software may suffer performance deficiencies, as is the issue titling this thread. Shipping full or incremental file backups across ethernet is extremely inefficient, especially with very large filesystems. This is where SAN arrays with snapshot capability come in really handy.
I'm a new employee at the company. I was a bit surprised they were not
using iSCSI. They claim they just can't risk the extra latency. I believe that you are right. It seems to me that offloading snapshots and backups to an iSCSI SAN would improve things. The problem is that this company has been burned on storage solutions more than once and they are a little skeptical that a product can scale to what they need. There are some SAN vendor names that are a four letter word here. So far, their newest FC SAN is performing well. I think having more, small, iSCSI boxes would be a good solution. One problem I've seen with smaller iSCSI products is that feature sets like snapshotting are not the best implementation. It works, but doing any sort of automation can be painful.
The snap takes place wholly within the array and is very fast, without the problems you see with host based snapshots such as with Linux LVM, where you must first freeze the filesystem, wait for the snapshot to complete, which could be a very long time with a 1TB FS. While this occurs your clients must wait or timeout while trying to access mailboxes. With a SAN array snapshot system this isn't an issue as the snap is transparent to hosts with little or no performance degradation during the snap. Two relatively inexpensive units that have such snapshot capability are:
How does this work? I've always had Linux create a snapshot. Would the
SAN doing a snapshot without any OS buy-in cause the filesystem to be saved in an inconsistent state? I know that ext4 is pretty good at logging, but still, wouldn't this be a problem?
http://www.equallogic.com/products/default.aspx?id=10613
http://h10010.www1.hp.com/wwpc/us/en/sm/WF04a/12169-304616-241493-241493-241...
The Equallogic units are 1/10 GbE iSCSI only IIRC, whereas the HP can be had in 8Gb FC, 1/10Gb iSCSI, or 6Gb direct attach SAS. Each offers 4 or more host/network connection ports when equipped with dual controllers. There are many other vendors with similar models/capabilities. I mention these simply because Dell/HP are very popular and many OPs are already familiar with their servers and other products.
I will take a look. I might have some convincing to do.
There are 3 flavors of ZFS: native Oracle Solaris, native FreeBSD, Linux FUSE. Which were you using? If the last, that would fully explain the suck.
There is one more that I had never used before coming on board here:
ZFSonLinux. ZFSonLinux is a real kernel level fs plugin. My understanding is that they were using it on the backup machines with the front end dovecot machines using ext4. I'm told the metadata issue is a ZFS thing and they have the same problem on Solaris/Nexenta.
I'm relatively new here, but I'll ask around about XFS and see if anyone had tested it in the development environment.
If they'd tested it properly, and relatively recently, I would think they'd have already replaced EXT4 on your Dovecot server. Unless other factors prevented such a migration. Or unless I've misunderstood the size of your maildir workload.
I don't know the entire history of things. I think they really wanted
to use ZFS for everything and then fell back to ext4 because it performed well enough in the cluster. Performance becomes an issue with backups using rsync. Rsync is faster than Dovecot's native dsync by a very large margin. I know that dsync is doing more than rsync, but still, seconds compared to over five minutes? That is a significant difference. The problem is that rsync can't get a perfect backup.
...Jeff
On 3/28/2012 3:54 PM, Jeff Gustafson wrote:
On Wed, 2012-03-28 at 11:07 -0500, Stan Hoeppner wrote:
Locally attached/internal/JBOD storage typically offers the best application performance per dollar spent, until you get to things like backup scenarios, where off node network throughput is very low, and your backup software may suffer performance deficiencies, as is the issue titling this thread. Shipping full or incremental file backups across ethernet is extremely inefficient, especially with very large filesystems. This is where SAN arrays with snapshot capability come in really handy.
I'm a new employee at the company. I was a bit surprised they were not using iSCSI. They claim they just can't risk the extra latency. I
The tiny amount of extra latency using a software initiator is a non argument for a mail server workload, unless the server is undersized for the workload--high CPU load and low memory constantly. As I said, in that case you drop in an iSCSI HBA and eliminate any possibility of block latency.
believe that you are right. It seems to me that offloading snapshots and backups to an iSCSI SAN would improve things.
If you get the right unit you won't understand how you ever lived without it. The snaps complete transparently, and the data is on the snap LUN within a few minutes, depending on the model and on the priority you give to internal operations (snaps/rebuilds/etc) vs external IO requests.
The problem is that this company has been burned on storage solutions more than once and they are a little skeptical that a product can scale to what they need. There are
More than once? More than once?? Hmm...
some SAN vendor names that are a four letter word here. So far, their newest FC SAN is performing well.
Interesting. Care to name them (off list)?
I think having more, small, iSCSI boxes would be a good solution. One problem I've seen with smaller iSCSI products is that feature sets like snapshotting are not the best implementation. It works, but doing any sort of automation can be painful.
As is most often the case, you get what you pay for.
The snap takes place wholly within the array and is very fast, without the problems you see with host based snapshots such as with Linux LVM, where you must first freeze the filesystem, wait for the snapshot to complete, which could be a very long time with a 1TB FS. While this occurs your clients must wait or timeout while trying to access mailboxes. With a SAN array snapshot system this isn't an issue as the snap is transparent to hosts with little or no performance degradation during the snap. Two relatively inexpensive units that have such snapshot capability are:
How does this work? I've always had Linux create a snapshot. Would the SAN doing a snapshot without any OS buy-in cause the filesystem to be saved in an inconsistent state? I know that ext4 is pretty good at logging, but still, wouldn't this be a problem?
Instead of using "SAN" as a generic term for a "box", which it is not, please use the terms "SAN" for "storage area network", "SAN array" or "SAN controller" when talking about a box with or without disks that performs the block IO shipping and other storage functions, "SAN switch" for a fiber channel switch, or ethernet switch dedicated to the SAN infrastructure. The acronym "SAN" is an umbrella covering many different types of hardware and network topologies. It drives me nuts when people call a fiber channel or iSCSI disk array a "SAN". These can be part of a SAN, but are not themselves, a SAN. If they are direct connected to a single host they are simple disk arrays, and the word "SAN" isn't relevant. Only uneducated people, or those who simply don't care to be technically correct, call a single intelligent disk box a "SAN". Ok, end rant on "SAN".
Read this primer from Dell: http://files.accord.com.au/EQL/Docs/CB109_Snapshot_Basic.pdf
The snapshots occur entirely at the controller/disk level inside the box. This is true of all SAN units that offer snap ability. No host OS involvement at all in the snap. As I previously said, it's transparent. Snaps are filesystem independent, and are point-in-time, or PIT, copies of one LUN to another. Read up on "LUN" if you're not familiar with the term. Everything in SAN storage is based on LUNs.
Now, as the document above will tell you, array based snapshots may or may not be a total backup solution for your environment. You need to educate yourself and see if this technology is a feature that fits your file backup and disaster avoidance and recovery needs.
http://www.equallogic.com/products/default.aspx?id=10613
http://h10010.www1.hp.com/wwpc/us/en/sm/WF04a/12169-304616-241493-241493-241...
The Equallogic units are 1/10 GbE iSCSI only IIRC, whereas the HP can be had in 8Gb FC, 1/10Gb iSCSI, or 6Gb direct attach SAS. Each offers 4 or more host/network connection ports when equipped with dual controllers. There are many other vendors with similar models/capabilities. I mention these simply because Dell/HP are very popular and many OPs are already familiar with their servers and other products.
I will take a look. I might have some convincing to do.
SAN array features/performance are an easy sell. Price not so much. Each fully loaded ~24 drive SAN array is going to run you between $15k-30k USD depending on the vendor and how many spindles you need for IOPS, disk size for total storage, snap/replication features you need, expandability, etc.
There are 3 flavors of ZFS: native Oracle Solaris, native FreeBSD, Linux FUSE. Which were you using? If the last, that would fully explain the suck.
There is one more that I had never used before coming on board here: ZFSonLinux. ZFSonLinux is a real kernel level fs plugin. My
It's a "roll your own" patch set not in mainline and not supported by any Linux distro/vendor, AFAIK. Which is why I didn't include it.
understanding is that they were using it on the backup machines with the front end dovecot machines using ext4. I'm told the metadata issue is a ZFS thing and they have the same problem on Solaris/Nexenta.
I've never used ZFS, and don't plan to, so I can't really comment on this. That and I have no technical details of the problem.
I'm relatively new here, but I'll ask around about XFS and see if anyone had tested it in the development environment.
If they'd tested it properly, and relatively recently, I would think they'd have already replaced EXT4 on your Dovecot server. Unless other factors prevented such a migration. Or unless I've misunderstood the size of your maildir workload.
I don't know the entire history of things. I think they really wanted to use ZFS for everything and then fell back to ext4 because it performed well enough in the cluster. Performance becomes an issue with backups using rsync. Rsync is faster than Dovecot's native dsync by a very large margin. I know that dsync is doing more than rsync, but still, seconds compared to over five minutes? That is a significant difference. The problem is that rsync can't get a perfect backup.
This happens with a lot of "fan boys". There was so much hype surrounding ZFS that even many logically thinking people were frothing at the mouth waiting to get their hands on it. Then, as with many/most things in the tech world, the goods didn't live up to the hype.
XFS has been around since 1994, has never had hype surrounding it, has simply been steadily, substantially improved over time. It has been since day 1 the highest performance filesystem with parallel workloads, and finally overcame its last barrier preventing it from being suitable for just about any workload: metadata write performance. Which makes it faster than any FS with the maildir workload when sufficient parallelism/concurrency is present. Meaning servers with a few thousand active users will benefit. Those with 7 users won't.
-- Stan
On 3/29/2012 5:24 AM, Stan Hoeppner wrote:
This happens with a lot of "fan boys". There was so much hype surrounding ZFS that even many logically thinking people were frothing at the mouth waiting to get their hands on it. Then, as with many/most things in the tech world, the goods didn't live up to the hype.
The problem with zfs especially is that there are so many different implementations, with only the commercial Sun, er, Oracle paid Solaris having ALL of the promised features and the bug-fixes to make them safely usable. For those users, with very large RAM-backed Sun, er, Oracle, hardware, it probably works well.
FreeBSD and even the last versions of OpenSolaris lack fixes for some wickedly nasty box-bricking bugs in de-dup, as well as many of the "sexy" features in zpool that had people flocking to it in the first place.
The bug database that used to be on Sun's OpenSolaris portal has gone dark, but you may have some luck through archive.org. I know when I tried it out for myself using the "Community Edition" of Solaris, I did feel annoyed by the bait-and-switch, and the RAM requirements to run de-dupe with merely adequate performance were staggering if I wanted to have plenty of spare block cache left over for improving performance overall.
Sun left some of the FOSS operating systems a poison pill with its CDDL licence, which is the main reason why the implementations of zfs on Linux are immature and why it is being "re-implemented" with US DOE sponsorship, ostensibly under a GNU-compatible licence.
zfs reminds me a great deal of TIFF - lots of great ideas in the "White Paper", but an elusive (or very very costly) white elephant to acquire. "Rapidly changing", "bleeding edge", and "hot & new" are not descriptors for filesystems I want to trust more than a token amount of data to.
=R=
Hello Timo,
Thank you very much for planning a redesign of the dsycn and for opening this discussion.
As I can see from the replies so far, everybody misses the main point of IMAP: IMAP has been designed to work as a disconnected, high-latency data store.
To make this clearer: once an IMAP client finishes the synchronization with the server, both the client and the server have a consistent state of the mailbox. After this, both the "client" and the "server" act as masters for their own local copy (on the "server" new emails get created etc, on the "client" existing emails get changed (flags) and moved, and new emails appear (sent items)).
So the protocol was originally designed to handle master-master replication. As such, it makes sense for a global-wide deployment where servers work independently and from time to time "merge" the changes.
This being said and acknowledged here are my 2 cents:
I think that the current '1 brain / 2 workers' model seems to be the correct one. The "client" connects to the "server", pushes the local changes and then retrieves the updated/new items from the "server". The "brain" considers the first server as the "local storage" and the second server as the "server storage".
For the split design, "come to the same conclusion of the state" is very race-condition prone.
As long as the algorithm is kept as you described it in the original document then the backups should really be incremental (because you only do the changes since last sync).
As most changes are "metadata-only", the sync can be pretty fast by merging indexes.
Thank you, Andrei
In case anyone is interested in reading (and maybe helping!) with a dsync redesign that's intended to fix all of its current problems, here are some possibly incoherent ramblings about it:
http://dovecot.org/tmp/dsync-redesign.txt
and even if you don't understand that, here's another document disguised as an algorithm class problem :) If anyone has thoughts on how to solve it, that would be great:
http://dovecot.org/tmp/dsync-redesign-problem.txt
It only deals with saving new messages, not expunges/flag changes/etc, but those should be much simpler.
On 27.3.2012, at 1.14, Michescu Andrei wrote:
This being said and acknowledged here are my 2 cents:
I think that the current '1 brain / 2 workers' model seems to be the correct one. The "client" connects to the "server", pushes the local changes and then retrieves the updated/new items from the "server". The "brain" considers the first server as the "local storage" and the second server as the "server storage".
This design makes it too easy to design it in a way that adds extra roundtrips = extra latency. It also kind of hides other problems as well. For example now dsync can way too easily just fail if something unexpected happens during dsync (e.g. mailbox gets renamed/deleted). And there are of course some bugs that I don't really understand why some people are seeing them at all.
For the split design, "come to the same conclusion of the state" is very race-condition prone.
It's race-condition prone with the brain design as well. dsync can't just lock the mailbox during its sync, since the sync can take a long time. With a "brainless" design it's clear from the beginning that there are race conditions and they need to be dealt with.
Timo Sirainen tss@iki.fi writes:
In case anyone is interested in reading (and maybe helping!) with a dsync redesign that's intended to fix all of its current problems, here are some possibly incoherent ramblings about it:
thank you for opening this discussion about dsync!
besides the problems I've encountered with dsync, there are a couple things that I think would be great to build into the new vision of the protocol.
One would be the ability to perform *intelligent* incremental/rotated backups. I can do this now by running a dsync backup operation and then doing manual hardlinking or moving of the backup directories (daily.1, daily.2, weekly.1, monthly.1, etc.), but it would be more intelligent if this were baked into the backup process.
Secondly, being able to filter out mailboxes could result in much more efficient syncing. Right now there is the capability to operate only on specific mailboxes, but this doesn't scale well when I am trying to back up thousands of users and I want to omit the Spam and Trash folders from the sync. I would have to get a mailbox list for each user, then iterate over each mailbox, skipping the Spam and Trash folders, and fork a new 'dsync backup' for each of that user's remaining mailboxes.
Lastly, there isn't a good method for restoring backups. I can reverse the backup process, onto the user's "live" mailbox, but that brings the user into an undesirable state (eg. their mailbox state one day ago). Better would be if their backup could be restored in such a way that the user can resolve the missing pieces manually, as they know best.
thanks again for your work on this, from my position dovecot is an amazing piece of software, the only part that seems to have some issues is dsync and I applaud the effort to redesign to fix things!
micah
On 2012-03-27 11:47 AM, Micah Anderson micah@riseup.net wrote:
One would be the ability to perform *intelligent* incremental / rotated backups. I can do this now by running a dsync backup operation and then doing manual hardlinking or moving of the backup directories (daily.1, daily.2, weekly.1, monthly.1, etc.), but it would be more intelligent if this were baked into the backup process.
There are already numerous tools that do this flawlessly - I've been using rsnapshot (which uses rsync) for this for years.
I don't know if Timo should be spending his time reinventing the wheel.
I'm much more interested in dsync working flawlessly to keep one or more secondary servers in sync, and leave backups to backup software.
Lastly, there isn't a good method for restoring backups. I can reverse the backup process, onto the user's "live" mailbox, but that brings the user into an undesirable state (eg. their mailbox state one day ago). Better would be if their backup could be restored in such a way that the user can resolve the missing pieces manually, as they know best.
Again, best left to the backup software I think?
Although, one interesting piece that I am hopeful I'll be able to implement soon (with Timo's professional help) is the ability to easily and automatically map my rsnapshot snapshots directory to a read-only 'Backups' namespace that automatically shows the snapshots by date and time as they are produced. This way users could 'go back in time' anytime they wanted without having to call me... :)
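Roughly what I have in mind, as an untested sketch (the snapshot path and layout are hypothetical, and something would still be needed to expose each daily.N snapshot automatically):

namespace {
  prefix = Backups/
  separator = /
  list = yes
  hidden = no
  # point at one rsnapshot snapshot of the user's maildir (path is made up);
  # keep indexes in memory since the snapshot is read-only
  location = maildir:/backup/rsnapshot/daily.0/home/%u/Maildir:INDEX=MEMORY
}

That only covers a single snapshot; mapping all the daily.N snapshots by date and time is the part I'd want help with.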
thanks again for your work on this, from my position dovecot is an amazing piece of software, the only part that seems to have some issues is dsync and I applaud the effort to redesign to fix things!
Ditto all of that! :)
--
Best regards,
Charles
Charles Marcus CMarcus@Media-Brokers.com writes:
On 2012-03-27 11:47 AM, Micah Anderson micah@riseup.net wrote:
One would be the ability to perform *intelligent* incremental / rotated backups. I can do this now by running a dsync backup operation and then doing manual hardlinking or moving of the backup directories (daily.1, daily.2, weekly.1, monthly.1, etc.), but it would be more intelligent if this were baked into the backup process.
There are already numerous tools that do this flawlessly - I've been using rsnapshot (which uses rsync) for this for years.
Are you snapshotting your filesystem (using LVM, or SAN, or similar) before doing rsnapshot? Because if you aren't then rsync will not assuredly get everything in a consistent state.
I don't know if Timo should be spending his time reinventing the wheel.
dsync backup is already here, and it is quite useful.
I'm much more interested in dsync working flawlessly to keep one or more secondary servers in sync, and leave backups to backup software.
I'm not against that idea, I just have not yet found a good way to use any backup software in such a way as to handle large numbers of users' mail.
Although, one interesting piece that I am hopeful I'll be able to implement soon (with Timo's professional help) is the ability to easily and automatically map my rsnapshot snapshots directory to a read-only 'Backups' namespace that automatically shows the snapshots by date and time as they are produced. This way users could 'go back in time' anytime they wanted without having to call me... :)
Interesting idea, would be a great one to share with the community if you decide to do so.
micah
On 2012-04-02 7:15 PM, Micah Anderson micah@riseup.net wrote:
Charles Marcus CMarcus@Media-Brokers.com writes:
On 2012-03-27 11:47 AM, Micah Anderson micah@riseup.net wrote:
One would be the ability to perform *intelligent* incremental / rotated backups. I can do this now by running a dsync backup operation and then doing manual hardlinking or moving of the backup directories (daily.1, daily.2, weekly.1, monthly.1, etc.), but it would be more intelligent if this were baked into the backup process.
There are already numerous tools that do this flawlessly - I've been using rsnapshot (which uses rsync) for this for years.
Are you snapshotting your filesystem (using LVM, or SAN, or similar) before doing rsnapshot? Because if you aren't then rsync will not assuredly get everything in a consistent state.
No, and you are correct... but I run it in the middle of the night, and the system is only barely utilized at the time, so the very minor inconsistencies are not a problem overall.
I will, however, be changing this to using FS snapshots once I get my mailserver virtualized (already being planned for when our new office location comes online), so that will allow me to perform snapshots multiple times during the day (I'm thinking 4 times per day will be enough).
I don't know if Timo should be spending his time reinventing the wheel.
dsync backup is already here, and it is quite useful.
I'm not saying it isn't, I'm just saying that there are already *plenty* of different backup tools, and I don't see the sense in Timo spending lots of time on creating a new one just for dovecot. His time would be better spent just making it easier for any other backup tool to work better.
Although, one interesting piece that I am hopeful I'll be able to implement soon (with Timo's professional help) is the ability to easily and automatically map my rsnapshot snapshots directory to a read-only 'Backups' namespace that automatically shows the snapshots by date and time as they are produced. This way users could 'go back in time' anytime they wanted without having to call me... :)
Interesting idea, would be a great one to share with the community if you decide to do so.
Absolutely - that is already on my list for when I pay Timo's company to do this - document it on the wiki. Hopefully if any code changes are needed to make it work right, they will be minor.
--
Best regards,
Charles
On 23.3.2012, at 23.25, Timo Sirainen wrote:
and even if you don't understand that, here's another document disguised as an algorithm class problem :) If anyone has thoughts on how to solve it, that would be great:
http://dovecot.org/tmp/dsync-redesign-problem.txt
It only deals with saving new messages, not expunges/flag changes/etc, but those should be much simpler.
Step #3 was more difficult than I first realized. I spent the last two days figuring out a way to make it work, and it looks like I finally did. I didn't update the document yet, but I wrote a test program: http://dovecot.org/tmp/test-dsync.c
Step #2 should be easy enough.
Step #4 I think I'll forget about and just implement a per-mailbox dsync lock. The main reason I wanted to get rid of locks was because a per-user lock can't work with shared mailboxes. But a per-mailbox lock is okay enough. Note that #3 allows the two dsyncs to run in parallel and send duplicate changes, just not modifying the same mailbox at the same time (which would duplicate mails due to two transactions adding the same mails).
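A per-mailbox lock can be as simple as an exclusive lock on a per-mailbox lock file; a rough sketch (hypothetical file name and location, not the actual dsync locking code):

/* Toy per-mailbox dsync lock using a lock file plus flock(), so two
 * dsyncs can run for the same user in parallel as long as they do not
 * modify the same mailbox at the same time. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/file.h>

/* Returns the lock fd on success, -1 if another dsync holds the lock. */
static int mailbox_dsync_lock(const char *lock_path)
{
    int fd = open(lock_path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (flock(fd, LOCK_EX | LOCK_NB) < 0) {
        close(fd);
        return -1;  /* busy: another dsync is syncing this mailbox */
    }
    return fd;
}

static void mailbox_dsync_unlock(int fd)
{
    flock(fd, LOCK_UN);
    close(fd);
}

int main(void)
{
    const char *path = "/tmp/INBOX.dsync.lock";  /* hypothetical path */
    int fd = mailbox_dsync_lock(path);
    if (fd < 0) {
        fprintf(stderr, "mailbox is being synced by another dsync\n");
        return 1;
    }
    puts("lock held, syncing mailbox...");
    mailbox_dsync_unlock(fd);
    return 0;
}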
participants (9)
- Attila Nagy
- Charles Marcus
- Jan-Frode Myklebust
- Jeff Gustafson
- Micah Anderson
- Michescu Andrei
- Robin
- Stan Hoeppner
- Timo Sirainen