[Dovecot] Best Cluster Storage
Hi Everyone,
I wish to create a Postfix/Dovecot active-active cluster (each node will run Postfix *and* Dovecot), which will obviously have to use central storage. I'm looking for ideas to see what's the best out there. All of this will be running on multiple Xen hosts, however I don't think that matters as long as I make sure that the cluster nodes are on different physical boxes.
Here are my ideas so far for the central storage:
NFS Server using DRBD+LinuxHA. Export the same NFS share to each mail server. While this seems easy, how well does Dovecot work with NFS? I've read the wiki page, and it doesn't sound promising. But it may be outdated...
Export block storage using iSCSI from targets which have GFS2 on DRBD+LinuxHA. This is tricky to get working well, and it's only a theory.
GlusterFS. Easy to set up, but apparently very slow to run.
So what's everybody using? I know that Postfix runs well on NFS (according to their docs). I intend to use Maildir
Thanks
Jonathan Tripathy put forth on 1/13/2011 1:22 AM:
I wish to create a Postfix/Dovecot active-active cluster (each node will run Postfix *and* Dovecot), which will obviously have to use central storage. I'm looking for ideas to see what's the best out there. All of this will be running on multiple Xen hosts, however I don't think that matters as long as I make sure that the cluster nodes are on different physical boxes.
I've never used Xen. Doesn't it abstract the physical storage layer in the same manner as VMWare ESX? If so, everything relating to HA below is pretty much meaningless except for locking.
Here are my ideas so far for the central storage:
NFS Server using DRBD+LinuxHA. Export the same NFS share to each mail server. While this seems easy, how well does Dovecot work with NFS? I've read the wiki page, and it doesn't sound promising. But it may be outdated...
Export block storage using iSCSI from targets which have GFS2 on DRBD+LinuxHA. This is tricky to get working well, and it's only a theory.
GlusterFS. Easy to set up, but apparently very slow to run.
So what's everybody using? I know that Postfix runs well on NFS (according to their docs). I intend to use Maildir
In this Xen setup, I think the best way to accomplish your goals is to create 6 guests:
2 x Linux Postfix
2 x Linux Dovecot
1 x Linux NFS server
1 x Linux Dovecot director
Each of these can be painfully small stripped down Linux instances. Configure each Postfix and Dovecot server to access the same NFS export. Configure Postfix to use native local delivery to NFS/maildir. Don't use LDA (deliver).
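For the Postfix side that's basically nothing more than (a rough sketch from memory, paths are examples, adjust for your user layout):

    # main.cf
    home_mailbox = Maildir/        # local(8) writes the maildir itself
    # leave mailbox_command unset so Postfix does NOT hand off to dovecot deliver

If you're doing virtual users instead of system users, the equivalent is virtual_mailbox_base/virtual_mailbox_maps with maildir-style (trailing slash) destinations.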
With Postfix, HA is automatic: you simply set up both servers with the same DNS MX priority. DNS automatically takes care of HA for MX mail by design. If a remote SMTP client can't reach one MX it'll try the other automatically. Of course, you already knew this (or should have).
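I.e. something along these lines in the zone (names and addresses obviously made up):

    example.com.   IN  MX 10 mx1.example.com.
    example.com.   IN  MX 10 mx2.example.com.
    mx1            IN  A  192.0.2.10
    mx2            IN  A  192.0.2.11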
Configure each Dovecot instance to use the NFS/maildir export. Disable indexing unless or until you've confirmed that director is working sufficiently well to keep each client hitting the same Dovecot server.
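The usual NFS-related knobs look roughly like this (exact option names vary a bit between Dovecot 1.x and 2.0, so treat this as a sketch and check the NFS wiki page):

    mail_location = maildir:~/Maildir:INDEX=MEMORY   # keep indexes off NFS until director is proven
    mmap_disable = yes
    mail_nfs_storage = yes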
Have Xen run Postfix+Dovecot paired on two different hosts and have the NFS server and director on a third Xen host. This ordering will obviously change if hosts fail and your Xen scripts auto restart the guests on other hosts.
Now, all of the above assumes that since you are running a Xen cluster you are using shared fiber channel or iSCSI storage arrays on the back end, and that each Xen host has a direct (or switched) connection to such storage and thus has block level access to the LUNs on each SAN array. If you do not have shared storage for the cluster, disregard everything above, and ponder why you asked any of this in the first place.
For any meaningful use of virtualized clusters with Xen, ESX, etc, a prerequisite is shared storage. If you don't have it, get it. The hypervisor is what gives you fault tolerance. This requires shared storage. If you do not intend to install shared storage, and intend to use things like drbd between guests to get your storage redundancy, then you really need to simply throw out your hypervisor, in this case Xen, and do direct bare metal host clustering with drbd, gfs2, NFS, etc.
-- Stan
In this Xen setup, I think the best way to accomplish your goals is to create 6 guests:
2 x Linux Postfix
2 x Linux Dovecot
1 x Linux NFS server
1 x Linux Dovecot director
Each of these can be painfully small stripped down Linux instances. Configure each Postfix and Dovecot server to access the same NFS export. Configure Postfix to use native local delivery to NFS/maildir. Don't use LDA (deliver).
Ok so this is interesting. As long as I use Postfix native delivery, along with Dovecot director, NFS should work ok?
For any meaningful use of virtualized clusters with Xen, ESX, etc, a prerequisite is shared storage. If you don't have it, get it. The hypervisor is what gives you fault tolerance. This requires shared storage. If you do not intend to install shared storage, and intend to use things like drbd between guests to get your storage redundancy, then you really need to simply throw out your hypervisor, in this case Xen, and do direct bare metal host clustering with drbd, gfs2, NFS, etc.
Why is this the case? Apart from the fact that virtualisation becomes "more useful" with shared storage (which I agree with), is there anything wrong with doing DRBD between guests? We don't have shared storage set up yet for the location this email system is going. We will get one in time though.
Jonathan Tripathy put forth on 1/13/2011 2:24 AM:
Ok so this is interesting. As long as I use Postfix native delivery, along with Dovecot director, NFS should work ok?
One has nothing to do with the other. Director doesn't touch smtp (afaik), only imap. The reason for having Postfix use its native local(8) delivery agent for writing into the maildir, instead of Dovecot deliver, is to avoid Dovecot index locking/corruption issues with a back end NFS mail store. So if you want to do sorting you'll have to use something other than sieve, such as maildrop or procmail. These don't touch Dovecot's index files, while deliver (the LDA) does write to them during message delivery into the maildir.
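With procmail via local(8) that's roughly (a sketch, recipe and folder names made up; procmail treats a destination ending in "/" as a maildir, and you may need to pre-create the folder):

    # main.cf
    mailbox_command = /usr/bin/procmail -a "$EXTENSION"

    # ~/.procmailrc
    MAILDIR=$HOME/Maildir
    DEFAULT=$MAILDIR/
    :0
    * ^List-Id:.*dovecot
    .Lists.dovecot/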
For any meaningful use of virtualized clusters with Xen, ESX, etc, a prerequisite is shared storage. If you don't have it, get it. The hypervisor is what gives you fault tolerance. This requires shared storage. If you do not intend to install shared storage, and intend to use things like drbd between guests to get your storage redundancy, then you really need to simply throw out your hypervisor, in this case Xen, and do direct bare metal host clustering with drbd, gfs2, NFS, etc.
Why is this the case? Apart from the fact that virtualisation becomes "more useful" with shared storage (which I agree with), is there anything wrong with doing DRBD between guests? We don't have shared storage set up yet for the location this email system is going. We will get one in time though.
I argue that datacenter virtualization is useless without shared storage. This is easy to say for those of us who have done it both ways. You haven't yet. Your eyes will be opened after you do Xen or ESX atop a SAN. If you're going to do drbd replication between two guests on two physical Xen hosts then you may as well not use Xen at all. It's pointless.
What you need to do right now is build the justification case for installing the SAN storage as part of the initial build out and set up your virtual architecture around shared SAN storage. Don't waste your time on this other nonsense of replication from one guest to another, with an isolated storage pool attached to each physical Xen server. That's just nonsense. Do it right or don't do it at all.
Don't take my word for it. Hit Novell's website and VMWare's and pull up the recommended architecture and best practices docs.
One last thing. I thought I read something quite some time ago about Xen working on adding storage layer abstraction which would allow any Xen server to access directly connected storage on another Xen server, creating a sort of quasi shared SAN storage over ethernet without the cost of the FC SAN. Did anything ever come of that?
-- Stan
On Thu, Jan 13, 2011 at 04:57:20AM -0600, Stan Hoeppner wrote:
One has nothing to do with the other. Director doesn't touch smtp (afaik), only imap.
The director can do lmtp proxying, but I haven't seen much documentation on it except the few lines at:
http://wiki2.dovecot.org/Director
-jf
I use OCFS2 with 3 Dovecots; one is only for mailman.
We have problems with IO. We have about 4k active users.
We are now testing more OCFS2 clusters, because one of our theories is that if all mail resides in only one OCFS2 cluster, it takes too long to find a file. OCFS2, I guess, does not support indexes. We are using OCFS2 1.4.
So now we are getting smaller LUNs from our storage and mounting 3 OCFS2 clusters; that way we think the DLM will work better.
Sorry if I did not answer your question.
Anyway, we ran some tests with NFS and it wasn't good either. We prefer to stick with OCFS2.
We are balancing with IPVS, not using Dovecot director.
How many users will you have?
Oh, we also use Xen. Every host in the mail system is a Xen VM.
[]'sf.rique
On Thu, Jan 13, 2011 at 9:18 AM, Jan-Frode Myklebust <janfrode@tanso.net>wrote:
On Thu, Jan 13, 2011 at 04:57:20AM -0600, Stan Hoeppner wrote:
One has nothing to do with the other. Director doesn't touch smtp
(afaik), only
imap.
The director can do lmtp proxying, but I haven't seen much documentation on it except the few lines at:
http://wiki2.dovecot.org/Director
-jf
On Thu, 13 Jan 2011 10:33:34 -0200, Henrique Fernandes <sf.rique@gmail.com> wrote:
I use OCFS2 with 3 Dovecots; one is only for mailman.
We have problems with IO. We have about 4k active users.
We are now testing more OCFS2 clusters, because one of our theories is that if all mail resides in only one OCFS2 cluster, it takes too long to find a file. OCFS2, I guess, does not support indexes. We are using OCFS2 1.4.
My last production environment using OCFS2 was with quite recent ocfs2/dovecot - Linux 2.6.35 and Dovecot 1.2.15 with dbox mail storage. We had a lot of problems - high IO, fragmentation, exponential growth of access times, etc. We also tested with directory indexes but that didn't help much.
Finally we scrapped the OCFS2 setup and moved to a less advanced setup: we created distinct volumes for every worker on the SAN and formatted them with XFS. The volumes got mounted on different mountpoints on the workers. We set up Pacemaker as the cluster manager on the workers, so if a worker dies its volume gets mounted on another worker and its service IP is brought up there.
As a result we are using a fraction of the IO compared with OCFS2; the wait times on the workers dropped significantly and the service got better.
You have different options to distribute mailboxes through the workers. In our setup the load is distributed by domain, because we are servicing hundreds of domains. So every domain's MX/POP3/IMAP was changed to the service IP of its worker. If there are a lot of mailboxes in one domain you should put in a balancer that knows which server a mailbox is located on and forwards the requests there.
So now we are getting smaller LUNs from our storage and mounting 3 OCFS2 clusters; that way we think the DLM will work better.
Sorry if I did not answer your question.
Anyway, we ran some tests with NFS and it wasn't good either. We prefer to stick with OCFS2.
My tests with NFSv3/NFSv4 were not good either, so it was not considered an option.
We are balancing with IPVS, not using Dovecot director.
With IPVS you could not stick the same mailbox to the same server - this is important with the OCFS2 setup because of filesystem caches and locks. We were using nginx as a proxy/balancer that could stick the same mailbox to the same backend - we did this before there was a director service in Dovecot, but now you could use the director.
Best regards
-- Luben Karavelov
Luben Karavelov put forth on 1/26/2011 1:21 PM:
Finally we scrapped the OCFS2 setup and moved to a less advanced setup: we created distinct volumes for every worker on the SAN and formatted them with XFS. The volumes got mounted on different mountpoints on the workers. We set up Pacemaker as the cluster manager on the workers, so if a worker dies its volume gets mounted on another worker and its service IP is brought up there.
As a result we are using a fraction of the IO compared with OCFS2; the wait times on the workers dropped significantly and the service got better.
That's obviously not a "perfectly" load balanced system, but it is an intriguing solution nonetheless. XFS will obviously be much faster than OCFS2.
Are you using the "-o delaylog" mount option? Are you seeing an increase in metadata performance due to it?
-- Stan
Ldirectord and IPVS can "stick" the same IP to the same server, so the OCFS2 cache is still good.
We are trying to separate the DLM network to see if it makes any performance difference!
[]'sf.rique
On Wed, Jan 26, 2011 at 6:42 PM, Stan Hoeppner <stan@hardwarefreak.com>wrote:
Luben Karavelov put forth on 1/26/2011 1:21 PM:
Finally we scrapped the OCFS2 setup and moved to a less advanced setup: we created distinct volumes for every worker on the SAN and formatted them with XFS. The volumes got mounted on different mountpoints on the workers. We set up Pacemaker as the cluster manager on the workers, so if a worker dies its volume gets mounted on another worker and its service IP is brought up there.
As a result we are using a fraction of the IO compared with OCFS2; the wait times on the workers dropped significantly and the service got better.
That's obviously not a "perfectly" load balanced system, but it is an intriguing solution nonetheless. XFS will obviously be much faster than OCFS2.
Are you using the "-o delaylog" mount option? Are you seeing an increase in metadata performance due to it?
-- Stan
On 13/01/11 10:57, Stan Hoeppner wrote:
Jonathan Tripathy put forth on 1/13/2011 2:24 AM:
Ok so this is interesting. As long as I use Postfix native delivery, along with Dovecot director, NFS should work ok?
One has nothing to do with the other. Director doesn't touch smtp (afaik), only imap. The reason for having Postfix use its native local(8) delivery agent for writing into the maildir, instead of Dovecot deliver, is to avoid Dovecot index locking/corruption issues with a back end NFS mail store. So if you want to do sorting you'll have to use something other than sieve, such as maildrop or procmail. These don't touch Dovecot's index files, while deliver (the LDA) does write to them during message delivery into the maildir.
Yes, I thought it had something to do with that.
For any meaningful use of virtualized clusters with Xen, ESX, etc, a prerequisite is shared storage. If you don't have it, get it. The hypervisor is what gives you fault tolerance. This requires shared storage. If you do not intend to install shared storage, and intend to use things like drbd between guests to get your storage redundancy, then you really need to simply throw out your hypervisor, in this case Xen, and do direct bare metal host clustering with drbd, gfs2, NFS, etc.
Why is this the case? Apart from the fact that virtualisation becomes "more useful" with shared storage (which I agree with), is there anything wrong with doing DRBD between guests? We don't have shared storage set up yet for the location this email system is going. We will get one in time though.
I argue that datacenter virtualization is useless without shared storage. This is easy to say for those of us who have done it both ways. You haven't yet. Your eyes will be opened after you do Xen or ESX atop a SAN. If you're going to do drbd replication between two guests on two physical Xen hosts then you may as well not use Xen at all. It's pointless.
Where did I say I haven't done that yet? I have indeed worked with VM infrastructures using SAN storage, and yes, it's fantastic. Just this particular location doesn't have a SAN box installed. And we will have to agree to disagree, as I personally do see the benefit of using VMs with local storage.
What you need to do right now is build the justification case for installing the SAN storage as part of the initial build out and setup your virtual architecture around shared SAN storage. Don't waste your time on this other nonsense of replication from one guest to another, with an isolated storage pool attached to each physical Xen server. That's just nonsense. Do it right or don't do it at all.
Don't take my word for it. Hit Novell's website and VMWare's and pull up the recommended architecture and best practices docs.
You don't need to tell me :) I already know how great it is.
One last thing. I thought I read something quite some time ago about Xen working on adding storage layer abstraction which would allow any Xen server to access directly connected storage on another Xen server, creating a sort of quasi shared SAN storage over ethernet without the cost of the FC SAN. Did anything ever come of that?
I haven’t really been following how the 4.x branch is going as it wasn't stable enough for our needs. Random lockups would always occur. The 3.x branch is rock solid. There have been no crashes (yet!)
Would DRBD + GFS2 work better than NFS? While NFS is simple, I don't mind experimenting with DRBD and GFS2 if it means fewer problems?
Jonathan Tripathy put forth on 1/13/2011 7:11 AM:
Would DRBD + GFS2 work better than NFS? While NFS is simple, I don't mind experimenting with DRBD and GFS2 if it means fewer problems?
Depends on your definition of "better". If you do two dovecot+drbd nodes you have only two nodes. If you do NFS you have 3 including the NFS server. Performance would be very similar between the two.
Now, when you move to 3 dovecot nodes or more you're going to run into network scaling problems with the drbd traffic, because it increases logarithmically (or is it exponentially?) with node count. If using GFS2 atop drbd across all nodes, each time a node writes to GFS, the disk block gets encapsulated by the drbd driver and transmitted to all other drbd nodes. With each new mail that's written by each server, or each flag is updated, it gets written 4 times, once locally, and 3 times via drbd.
With NFS, each of these writes occurs over the network only once. With drbd it's always a good idea to dedicate a small high performance GbE switch to the cluster nodes just for drbd traffic. This may not be necessary in a low volume environment, but it's absolutely necessary in high traffic setups. Beyond a certain number of nodes even in a moderately busy mail network, drbd mirroring just doesn't work. The bandwidth requirements become too high, and nodes bog down from processing all of the drbd packets. Without actually using it myself, and just using some logical reasoning based on the technology, I'd say the ROI of drbd mirroring starts decreasing rapidly between 2 and 4 nodes, and beyond four nodes...
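To put rough numbers on that: taking the 4-node figure above at face value and assuming an average message of around 8 KB, each delivery costs roughly 3 x 8 KB = 24 KB of replication traffic on top of the local write, versus a single ~8 KB write over the wire to an NFS server. Multiply by deliveries per second plus every flag update and the drbd mesh traffic adds up quickly.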
You'd be much better off with an NFS server, or GFS2 directly on a SAN LUN. CXFS would be far better, but it's not free. In fact it's rather expensive, and it requires a dedicated metadata server(s), which is one of the reasons it's so #@! damn fast compared to most clustered filesystems.
Another option is a hybrid setup, with dual NFS servers each running GFS2 accessing the shared SAN LUN(s). This eliminates the one NFS server as a potential single point of failure, but also increases costs significantly as you have to spend about $15K USD minimum for low end SAN array, and another NFS server box, although the latter need not be expensive.
-- Stan
On 13/01/11 21:34, Stan Hoeppner wrote:
Jonathan Tripathy put forth on 1/13/2011 7:11 AM:
Would DRBD + GFS2 work better than NFS? While NFS is simple, I don't mind experimenting with DRBD and GFS2 if it means fewer problems?
Depends on your definition of "better". If you do two dovecot+drbd nodes you have only two nodes. If you do NFS you have 3 including the NFS server. Performance would be very similar between the two.
Now, when you move to 3 dovecot nodes or more you're going to run into network scaling problems with the drbd traffic, because it increases logarithmically (or is it exponentially?) with node count. If using GFS2 atop drbd across all nodes, each time a node writes to GFS, the disk block gets encapsulated by the drbd driver and transmitted to all other drbd nodes. With each new mail that's written by each server, or each flag is updated, it gets written 4 times, once locally, and 3 times via drbd.
With NFS, each of these writes occurs over the network only once. With drbd it's always a good idea to dedicate a small high performance GbE switch to the cluster nodes just for drbd traffic. This may not be necessary in a low volume environment, but it's absolutely necessary in high traffic setups. Beyond a certain number of nodes even in a moderately busy mail network, drbd mirroring just doesn't work. The bandwidth requirements become too high, and nodes bog down from processing all of the drbd packets. Without actually using it myself, and just using some logical reasoning based on the technology, I'd say the ROI of drbd mirroring starts decreasing rapidly between 2 and 4 nodes, and beyond four nodes...
You'd be much better off with an NFS server, or GFS2 directly on a SAN LUN. CXFS would be far better, but it's not free. In fact it's rather expensive, and it requires a dedicated metadata server(s), which is one of the reasons it's so #@! damn fast compared to most clustered filesystems.
Another option is a hybrid setup, with dual NFS servers each running GFS2 accessing the shared SAN LUN(s). This eliminates the one NFS server as a potential single point of failure, but also increases costs significantly as you have to spend about $15K USD minimum for low end SAN array, and another NFS server box, although the latter need not be expensive.
Hi Stan,
The problem is that we do not have the budget at the minute to buy a SAN box, which is why I'm just looking to set up a Linux environment to substitute for now.
Regarding the servers, I was thinking of having a 2 node drbd cluster (in active+standby), which would export a single iSCSI LUN. Then, I would have a 2 node dovecot+postfix cluster (in active-active), where each node would mount the same LUN (With GFS2 on top). This is 4 servers in total (Well, 4 VMs running on 4 physically separate servers).
I'm hearing different things on whether dovecot works well or not with GFS2. Of course, I could simply replace the iSCSI LUN above with an nfs server running on each DRBD node, if you feel NFS would work better than GFS2. Either way, I would probably use a crossover cable for the DRBD cluster. Could maybe even bond 2 cables together if I'm feeling adventurous!
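For concreteness, the DRBD resource for that active/standby pair might look roughly like this (hostnames, disks and IPs are placeholders, and it's untested as written; "allow-two-primaries" would only come into play if GFS2 sat directly on drbd on both nodes rather than behind the iSCSI target):

    resource r0 {
        protocol C;
        on stor1 {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   10.0.0.1:7788;
            meta-disk internal;
        }
        on stor2 {
            device    /dev/drbd0;
            disk      /dev/sdb1;
            address   10.0.0.2:7788;
            meta-disk internal;
        }
    }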
The way I see it, is that there are 2 issues to deal with:
- Which "Shared Disk" technology is best (GFS2 over LUN or a simple NFS server) and
- What is the best method of HA for the storage system
Any advice is appreciated.
Quoting Jonathan Tripathy <jonnyt@abpni.co.uk>:
I'm hearing different things on whether dovecot works well or not with GFS2.
Dovecot works fine with GFS2. The question is performance of Dovecot on GFS2. I do dovecot on GFS2 (with mbox instead of maildir) and it works fine for my user load... Your userload may vary, and using maildir may make your results different than mine.
Of course, I could simply replace the iSCSI LUN above with an nfs
server running on each DRBD node, if you feel NFS would work better
than GFS2.
Either should work. I'd use GFS2 myself, unless you have some compelling reason not to...
Either way, I would probably use a crossover cable for the DRBD cluster.
I use 2 1Gb links bonded together, over crossover cables...
Could maybe even bond 2 cables together if I'm feeling adventurous!
Yes, recommended. That is what I do on all my clusters.
The way I see it, is that there are 2 issues to deal with:
- Which "Shared Disk" technology is best (GFS2 over LUN or a simple
NFS server) and- What is the best method of HA for the storage system
Any advice is appreciated.
Best is relative to workload, budget, expectations, environment, etc. And sometimes, it is just a "religious" thing. So I don't think you will get much of a consensus as to which is "best" since it really depends...
-- Eric Rostetter The Department of Physics The University of Texas at Austin
Go Longhorns!
Either way, I would probably use a crossover cable for the DRBD cluster.
I use 2 1Gb links bonded together, over crossover cables...
Could maybe even bond 2 cables together if I'm feeling adventurous!
Yes, recommended. That is what I do on all my clusters.
How do you bond the connections? Do you just use Linux kernel bonding? Or some driver level stuff?
Quoting Jonathan Tripathy <jonnyt@abpni.co.uk>:
Either way, I would probably use a crossover cable for the DRBD cluster.
I use 2 1Gb links bonded together, over crossover cables...
Could maybe even bond 2 cables together if I'm feeling adventurous!
Yes, recommended. That is what I do on all my clusters. How do you bond the connections? Do you just use Linux kernel
bonding? Or some driver level stuff?
Linux kernel bonding, mode=4 (IEEE 802.3ad Dynamic link aggregation).
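On RHEL/CentOS-style systems the config is roughly the following (a sketch; IPs and interface names are placeholders, and Debian does it in /etc/network/interfaces instead):

    # /etc/sysconfig/network-scripts/ifcfg-bond0
    DEVICE=bond0
    BONDING_OPTS="mode=4 miimon=100"
    IPADDR=192.168.100.1
    NETMASK=255.255.255.0
    ONBOOT=yes
    BOOTPROTO=none

    # /etc/sysconfig/network-scripts/ifcfg-eth1  (and likewise for eth2)
    DEVICE=eth1
    MASTER=bond0
    SLAVE=yes
    ONBOOT=yes
    BOOTPROTO=none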
-- Eric Rostetter The Department of Physics The University of Texas at Austin
Go Longhorns!
On 14/01/11 03:26, Eric Rostetter wrote:
Quoting Jonathan Tripathy <jonnyt@abpni.co.uk>:
Either way, I would probably use a crossover cable for the DRBD cluster.
I use 2 1Gb links bonded together, over crossover cables...
Could maybe even bond 2 cables together if I'm feeling adventurous!
Yes, recommended. That is what I do on all my clusters. How do you bond the connections? Do you just use Linux kernel bonding? Or some driver level stuff?
Linux kernel bonding, mode=4 (IEEE 802.3ad Dynamic link aggregation).
I'm guessing that since you're using a cross over cable, by just setting up the bond0 interfaces as usual (As per this article http://www.cyberciti.biz/tips/linux-bond-or-team-multiple-network-interfaces...), you didn't need to do anything else, since there is no switch?
Quoting Jonathan Tripathy <jonnyt@abpni.co.uk>:
Linux kernel bonding, mode=4 (IEEE 802.3ad Dynamic link aggregation).
I'm guessing that since you're using a cross over cable, by just
setting up the bond0 interfaces as usual (As per this article
http://www.cyberciti.biz/tips/linux-bond-or-team-multiple-network-interfaces...), you didn't need to do anything else, since there is no
switch?
I use it via cross-over for my DRBD replication, I also use it to a switch for my public interfaces. For the cross-over, just configure on each linux node. For the public interface to the switch, just configure it the same way on each linux box, then also configure the switch for bonding. Not rocket science either way on the linux end. Depending on your switch, it _might_ seem like rocket science on the switch end, if using a switch. ;)
So, in answer to your question, no, I don't need to do anything else with the crossover cable implementation.
-- Eric Rostetter The Department of Physics The University of Texas at Austin
Go Longhorns!
As you are thinking, you will have 2 servers with DRBD active/standby; you could test both setups, exporting over NFS or over iSCSI + GFS2.
Does GFS2 guarantee integrity without a fencing device?
Where I work, I guess we chose OCFS2 because of this little problem; we could not have a fencing device in Xen.
In our tests, OCFS2 showed itself to be better than NFS. But we did not test as thoroughly as we wished, because it is already in production.
[]'sf.rique
On Thu, Jan 13, 2011 at 8:17 PM, Jonathan Tripathy <jonnyt@abpni.co.uk>wrote:
On 13/01/11 21:34, Stan Hoeppner wrote:
Jonathan Tripathy put forth on 1/13/2011 7:11 AM:
Would DRBD + GFS2 work better than NFS? While NFS is simple, I don't mind
experimenting with DRBD and GFS2 if it means fewer problems?
Depends on your definition of "better". If you do two dovecot+drbd nodes you have only two nodes. If you do NFS you have 3 including the NFS server. Performance would be very similar between the two.
Now, when you move to 3 dovecot nodes or more you're going to run into network scaling problems with the drbd traffic, because it increases logarithmically (or is it exponentially?) with node count. If using GFS2 atop drbd across all nodes, each time a node writes to GFS, the disk block gets encapsulated by the drbd driver and transmitted to all other drbd nodes. With each new mail that's written by each server, or each flag is updated, it gets written 4 times, once locally, and 3 times via drbd.
With NFS, each of these writes occurs over the network only once. With drbd it's always a good idea to dedicate a small high performance GbE switch to the cluster nodes just for drbd traffic. This may not be necessary in a low volume environment, but it's absolutely necessary in high traffic setups. Beyond a certain number of nodes even in a moderately busy mail network, drbd mirroring just doesn't work. The bandwidth requirements become too high, and nodes bog down from processing all of the drbd packets. Without actually using it myself, and just using some logical reasoning based on the technology, I'd say the ROI of drbd mirroring starts decreasing rapidly between 2 and 4 nodes, and beyond four nodes...
You'd be much better off with an NFS server, or GFS2 directly on a SAN LUN. CXFS would be far better, but it's not free. In fact it's rather expensive, and it requires a dedicated metadata server(s), which is one of the reasons it's so #@! damn fast compared to most clustered filesystems.
Another option is a hybrid setup, with dual NFS servers each running GFS2 accessing the shared SAN LUN(s). This eliminates the one NFS server as a potential single point of failure, but also increases costs significantly as you have to spend about $15K USD minimum for low end SAN array, and another NFS server box, although the latter need not be expensive.
Hi Stan,
The problem is that we do not have the budget at the minute to buy a SAN box, which is why I'm just looking to set up a Linux environment to substitute for now.
Regarding the servers, I was thinking of having a 2 node drbd cluster (in active+standby), which would export a single iSCSI LUN. Then, I would have a 2 node dovecot+postfix cluster (in active-active), where each node would mount the same LUN (With GFS2 on top). This is 4 servers in total (Well, 4 VMs running on 4 physically separate servers).
I'm hearing different things on whether dovecot works well or not with GFS2. Of course, I could simply replace the iSCSI LUN above with an nfs server running on each DRBD node, if you feel NFS would work better than GFS2. Either way, I would probably use a crossover cable for the DRBD cluster. Could maybe even bond 2 cables together if I'm feeling adventurous!
The way I see it, is that there are 2 issues to deal with:
- Which "Shared Disk" technology is best (GFS2 over LUN or a simple NFS server) and
- What is the best method of HA for the storage system
Any advice is appreciated.
For DRBD you only need a heartbeat, I guess.
But to use GFS2 you need a fence device; OCFS2 does not require one, since the OCFS2 driver takes care of it - it reboots if it thinks it is desynchronized.
[]'sf.rique
On Thu, Jan 13, 2011 at 9:04 PM, Jonathan Tripathy <jonnyt@abpni.co.uk>wrote:
Does GFS2 guarantee integrity without a fencing device?
You make a fair point. Would I need any hardware fencing for DRBD (and GFS2)?
Quoting Henrique Fernandes <sf.rique@gmail.com>:
For DRBD you only need a heartbeat, I guess.
Fencing is not needed for drbd, though recommended.
But to use GFS2 you need a fence device; OCFS2 does not require one, since the OCFS2 driver takes care of it - it reboots if it thinks it is desynchronized.
gfs2 technically requires fencing, since it technically requires a cluster, and red hat clustering requires fencing. Some people "get around this" by using "manual" fencing, though this is "not recommended for production" as it could result in a machine staying down until manual intervention, which usually conflicts with the "uptime" desire for a cluster... But that is up to the implementor to decide on...
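For illustration only (written from memory, so check the cluster docs; node names are placeholders, and as said above manual fencing is not recommended for production), the "manual" variant in cluster.conf looks something like:

    <?xml version="1.0"?>
    <cluster name="mailcluster" config_version="1">
      <clusternodes>
        <clusternode name="node1" nodeid="1">
          <fence><method name="1"><device name="human" nodename="node1"/></method></fence>
        </clusternode>
        <clusternode name="node2" nodeid="2">
          <fence><method name="1"><device name="human" nodename="node2"/></method></fence>
        </clusternode>
      </clusternodes>
      <fencedevices>
        <fencedevice agent="fence_manual" name="human"/>
      </fencedevices>
    </cluster>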
[]'sf.rique
-- Eric Rostetter The Department of Physics The University of Texas at Austin
Go Longhorns!
On 14/01/11 03:39, Eric Rostetter wrote:
Quoting Henrique Fernandes <sf.rique@gmail.com>:
For DRBD you only need a heartbeat, I guess.
Fencing is not needed for drbd, though recommended.
But to use GFS2 you need a fence device; OCFS2 does not require one, since the OCFS2 driver takes care of it - it reboots if it thinks it is desynchronized.
gfs2 technically requires fencing, since it technically requires a cluster, and red hat clustering requires fencing. Some people "get around this" by using "manual" fencing, though this is "not recommended for production" as it could result in a machine staying down until manual intervention, which usually conflicts with the "uptime" desire for a cluster... But that is up to the implementor to decide on...
[]'sf.rique
I've actually been reading on ocfs2 and it looks quite promising. According to this presentation:
http://www.gpaterno.com/publications/2010/dublin_ossbarcamp_2010_fs_comparis...
ocfs2 seems to work quite well with lots of small files (typical of maildir). I'm guessing that since ocfs2 reboots a system automatically, it doesn't require any additional fencing?
I was thinking of following this article:
http://wiki.virtastic.com/display/howto/Clustered+Filesystem+with+DRBD+and+O...
with the only difference being that I'm going to export the drbd device via iSCSI to my active-active mail servers.
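For what it's worth, the OCFS2 cluster membership side is just a small /etc/ocfs2/cluster.conf on each node, something like this (names and IPs are placeholders; the self-fencing/heartbeat timeouts are tuned separately in the o2cb config):

    cluster:
            node_count = 2
            name = mailcluster

    node:
            ip_port = 7777
            ip_address = 10.0.0.1
            number = 0
            name = mail1
            cluster = mailcluster

    node:
            ip_port = 7777
            ip_address = 10.0.0.2
            number = 1
            name = mail2
            cluster = mailcluster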
On Fri, 2011-01-14 at 03:48 +0000, Jonathan Tripathy wrote:
ocfs2 seems to work quite well with lots of small files (typical of maildir). I'm guessing that since ocfs2 reboots a system automatically, it doesn't require any additional fencing?
We have a two-node active-active DRBD+OCFS2 Dovecot cluster. We're currently unable to fully use it due to (what I believe is) an OCFS2 bug:
http://oss.oracle.com/bugzilla/show_bug.cgi?id=1297
so while DRBD is in a dual-primary setup and the dovecot volumes are mounted read/write on both cluster nodes, I had to remove one of them from the load balancer, and thus only one of them handles connections while the other is sitting there as a failover node.
HTH, Andre
Am 13.01.2011 23:17, schrieb Jonathan Tripathy:
On 13/01/11 21:34, Stan Hoeppner wrote:
Jonathan Tripathy put forth on 1/13/2011 7:11 AM:
Would DRBD + GFS2 work better than NFS? While NFS is simple, I don't mind experimenting with DRBD and GFS2 if it means fewer problems?
Depends on your definition of "better". If you do two dovecot+drbd nodes you have only two nodes. If you do NFS you have 3 including the NFS server. Performance would be very similar between the two.
Now, when you move to 3 dovecot nodes or more you're going to run into network scaling problems with the drbd traffic, because it increases logarithmically (or is it exponentially?) with node count. If using GFS2 atop drbd across all nodes, each time a node writes to GFS, the disk block gets encapsulated by the drbd driver and transmitted to all other drbd nodes. With each new mail that's written by each server, or each flag is updated, it gets written 4 times, once locally, and 3 times via drbd.
With NFS, each of these writes occurs over the network only once. With drbd it's always a good idea to dedicate a small high performance GbE switch to the cluster nodes just for drbd traffic. This may not be necessary in a low volume environment, but it's absolutely necessary in high traffic setups. Beyond a certain number of nodes even in a moderately busy mail network, drbd mirroring just doesn't work. The bandwidth requirements become too high, and nodes bog down from processing all of the drbd packets. Without actually using it myself, and just using some logical reasoning based on the technology, I'd say the ROI of drbd mirroring starts decreasing rapidly between 2 and 4 nodes, and beyond four nodes...
You'd be much better off with an NFS server, or GFS2 directly on a SAN LUN. CXFS would be far better, but it's not free. In fact it's rather expensive, and it requires a dedicated metadata server(s), which is one of the reasons it's so #@! damn fast compared to most clustered filesystems.
Another option is a hybrid setup, with dual NFS servers each running GFS2 accessing the shared SAN LUN(s). This eliminates the one NFS server as a potential single point of failure, but also increases costs significantly as you have to spend about $15K USD minimum for low end SAN array, and another NFS server box, although the latter need not be expensive.
Hi Stan,
The problem is that we do not have the budget at the minute to buy a SAN box, which is why I'm just looking to set up a Linux environment to substitute for now.
Regarding the servers, I was thinking of having a 2 node drbd cluster (in active+standby), which would export a single iSCSI LUN. Then, I would have a 2 node dovecot+postfix cluster (in active-active), where each node would mount the same LUN (With GFS2 on top). This is 4 servers in total (Well, 4 VMs running on 4 physically separate servers).
I'm hearing different things on whether dovecot works well or not with GFS2. Of course, I could simply replace the iSCSI LUN above with an nfs server running on each DRBD node, if you feel NFS would work better than GFS2. Either way, I would probably use a crossover cable for the DRBD cluster. Could maybe even bond 2 cables together if I'm feeling adventurous!
The way I see it, is that there are 2 issues to deal with:
- Which "Shared Disk" technology is best (GFS2 over LUN or a simple NFS server) and
- What is the best method of HA for the storage system
Any advice is appreciated.
Hi Jonathan, the explanations from Stan were good enough to choose what fits your needs (thanks Stan for explaining drbd so deeply). So what number of mailboxes and what traffic volume are you expecting? At minimum you should have 2 drbd nodes bound to a separate GbE interface via a crossover cable (this might not be needed with virtual machines, but check before you set up, and don't forget that for real HA you always need 2 VM master machines, so for your setup this may increase the budget), and 2 load balancers with e.g. keepalived. If this isn't enough for you, you should follow Stan's advice and use a SAN or equivalent. After all, this is the real world: the budget must always be high enough to solve your task; you can't press an elephant through a mouse hole... So there is no best solution, there is only the solution that fits your needs best.
-- Best Regards
MfG Robert Schetterer
Germany/Munich/Bavaria
Jonathan Tripathy put forth on 1/13/2011 4:17 PM:
Regarding the servers, I was thinking of having a 2 node drbd cluster (in active+standby), which would export a single iSCSI LUN. Then, I would have a 2 node dovecot+postfix cluster (in active-active), where each node would mount the same LUN (With GFS2 on top). This is 4 servers in total (Well, 4 VMs running on 4 physically separate servers).
Something you need to consider very carefully:
drbd is a kernel block storage driver. You run it ON a PHYSICAL cluster node, and never inside a virtual machine guest. drbd is RAID 1 over a network instead of a SCSI cable. It is meant to protect against storage and node failures. This is how you need to look at drbd. Again, DO NOT run DRBD inside of a VM guest. If you have a decent background in hardware and operating systems, it won't take you 30 seconds to understand what I'm saying here. If it takes you longer, then consider this case:
You have a consolidated Xen cluster of two 24 core AMD Magny Cours servers each with 128GB RAM, an LSI MegaRAID SAS controller with dual SFF8087 ports backed by 32 SAS drives in external jbod enclosures setup as a single hardware RAID 10. You spread your entire load of 97 virtual machine guests across this two node farm. Within this set of 97 guests, 12 of them are clustered network applications, and two of these 12 are your Dovecot/Postfix guests.
If you use drbd in the way you currently have in your head, you are mirroring virtual disk partitions with drbd _SIX times_ instead of once. Here, where you'd want to run drbd is within the Xen hypervisor kernel. drbd works at the BLOCK DEVICE level, not the application layer.
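In Xen terms that means something like the following, run once in dom0 rather than in every guest (a sketch; resource and guest names are made up):

    # dom0: bring up the resource and promote it on the active node
    drbdadm up r0
    drbdadm primary r0

    # /etc/xen/mail1.cfg - the guest just sees an ordinary block device
    disk = [ 'phy:/dev/drbd0,xvda,w' ]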
Eric already mentioned this once. Apparently you weren't paying attention.
-- Stan
Stan Hoeppner put forth on 1/14/2011 1:00 PM:
You have a consolidated Xen cluster of two 24 core AMD Magny Cours servers each with 128GB RAM, an LSI MegaRAID SAS controller with dual SFF8087 ports backed by 32 SAS drives in external jbod enclosures setup as a single hardware RAID 10. You spread your entire load of 97 virtual machine guests across this two node farm. Within this set of 97 guests, 12 of them are clustered network applications, and two of these 12 are your Dovecot/Postfix guests.
I forgot the drbd interfaces in my example. This setup would include two PCIe X8 dual port 10 GbE RJ45 copper adapters connected via x-over cables and link aggregated, yielding 2 GB/s of full duplex b/w.
Also, as the filesystem runs at the guest level, you'd still need gfs2 running on each cluster guest OS to handle file level locking. The underlying disk device would be a Xen virtual disk, which sits atop the drbd driver.
Although, to be quite honest, this isn't the best example, as with that much server and disk hardware involved, the ROI of FC SAN storage would have already kicked in and you'd be using gfs2 directly on SAN LUNs instead of drbd.
The technical point is properly illustrated nonetheless.
-- Stan
Quoting Patrick Westenberg <pw@wk-serv.de>:
just to get it right: DRBD for shared storage replication is OK?
Yes, but only if done correctly. ;) There is some concern on Stan's part (and mine) that you might do it wrong (e.g., in a vm guest rather than at the vm host, etc).
-- Eric Rostetter The Department of Physics The University of Texas at Austin
Go Longhorns!
Eric Rostetter schrieb:
Quoting Patrick Westenberg <pw@wk-serv.de>:
just to get it right: DRBD for shared storage replication is OK?
Yes, but only if done correctly. ;) There is some concern on Stan's part (and mine) that you might do it wrong (e.g., in a vm guest rather than at the vm host, etc).
My storage _hosts_ will be dedicated systems of course :)
Quoting Patrick Westenberg <pw@wk-serv.de>:
Eric Rostetter schrieb:
Quoting Patrick Westenberg <pw@wk-serv.de>:
just to get it right: DRBD for shared storage replication is OK?
Yes, but only if done correctly. ;) There is some concern on Stan's part (and mine) that you might do it wrong (e.g., in a vm guest rather than at the vm host, etc).
My storage _hosts_ will be dedicated systems of course :)
No problem then... I run dovecot off drbd+gfs2 now without problems (no virtual machines involved though, just physical machines).
-- Eric Rostetter The Department of Physics The University of Texas at Austin
Go Longhorns!
On 14/01/11 20:07, Eric Rostetter wrote:
Quoting Patrick Westenberg <pw@wk-serv.de>:
just to get it right: DRBD for shared storage replication is OK?
Yes, but only if done correctly. ;) There is some concern on Stan's part (and mine) that you might do it wrong (e.g., in a vm guest rather than at the vm host, etc).
What is actually wrong with doing in it VM guests? I appreciate that there will be a slight performance hit, but not too much as Xen PV guests have excellent disk and network performance.
On 14/01/11 19:00, Stan Hoeppner wrote:
Jonathan Tripathy put forth on 1/13/2011 4:17 PM:
Regarding the servers, I was thinking of having a 2 node drbd cluster (in active+standby), which would export a single iSCSI LUN. Then, I would have a 2 node dovecot+postfix cluster (in active-active), where each node would mount the same LUN (With GFS2 on top). This is 4 servers in total (Well, 4 VMs running on 4 physically separate servers). Something you need to consider very carefully:
drbd is a kernel block storage driver. You run it ON a PHYSICAL cluster node, and never inside a virtual machine guest. drbd is RAID 1 over a network instead of a SCSI cable. It is meant to protect against storage and node failures. This is how you need to look at drbd. Again, DO NOT run DRBD inside of a VM guest. If you have a decent background in hardware and operating systems, it won't take you 30 seconds to understand what I'm saying here. If it takes you longer, then consider this case:
You have a consolidated Xen cluster of two 24 core AMD Magny Cours servers each with 128GB RAM, an LSI MegaRAID SAS controller with dual SFF8087 ports backed by 32 SAS drives in external jbod enclosures setup as a single hardware RAID 10. You spread your entire load of 97 virtual machine guests across this two node farm. Within this set of 97 guests, 12 of them are clustered network applications, and two of these 12 are your Dovecot/Postfix guests.
If you use drbd in the way you currently have in your head, you are mirroring virtual disk partitions with drbd _SIX times_ instead of once. Here, where you'd want to run drbd is within the Xen hypervisor kernel. drbd works at the BLOCK DEVICE level, not the application layer.
Eric already mentioned this once. Apparently you weren't paying attention.
I'm sorry I don't follow this. It would be appreciated if you could include a simpler example. The way I see it, a VM disk is just a small chunk (an LVM LV in my case) of a real disk.
Jonathan Tripathy put forth on 1/14/2011 4:58 PM:
I'm sorry I don't follow this. It would be appreciated if you could include a simpler example. The way I see it, a VM disk is just a small chunck "LVM LV in my case" of a real disk.
We can't teach you everything on a mailing list. You need to actually go out and do some research. What you actually need is for someone to forcibly rip you from that chair, drag you outside and sit you down on the patio/side walk with some children's blocks, and visually demonstrate to you how this works. This is meant to be humorous not mean. :)
You are stuck in an improper mindset that is physically impossible to fix with typed words (at least for me). You need to _see_ conceptualization of what Eric and I are talking about. I can't provide that with text. It must be visual. Which means you're going to have to do your own research to find those conceptual answers. A really good place for that might be the drbd home page. :) There are some very good pictures there that explain the concepts very well.
DRBD is for mirroring physical devices over a network. You might be able to do DRBD inside a VM guest, but to what end? What sense does it make to do so? It doesn't. Instead of asking we who know to tell you "why not", you need to tell us "why yes" WRT DRBD inside the VM guests.
-- Stan
Quoting Stan Hoeppner <stan@hardwarefreak.com>:
DRBD is for mirroring physical devices over a network. You might be
able to do DRBD inside a VM guest, but to what end? What sense does it make to do so?
It doesn't really make sense, and it can cause problems... What problems depends on your VM implementation (Xen, KVM, VMWare, VirtualBox, etc).
Some things might be mount table propagation between hosts/guests causing problems unmounting drbd partitions or even shutting down the VM, running lots of drbd instances instead of only one (as Stan mentioned) which can mean more processes and more (buffer) memory being used than is needed and more configuration files needed, performance issues of all kinds, and so on.
Think about it: you are increasing the number of processes in the VM guests, you are increasing the amount of memory used in the VM guests, you are increasing traffic to the virtual switch/bridge, you are potentially increasing the complexity of your configuration, possibly taking a big performance hit (depending on VM type and config), limiting your flexibility, increasing the difficulty of debugging and performance tuning, and so on. Is it worth it?
Plus, if you run DRBD in the VM, then the VM must run DRBD. If you run DRBD in the physical host, you can then export it to any VM, even a VM that doesn't support DRBD. Things like this can impact VM flexibility (migrations, OS support, backups, etc).
Can you do it? Yes. Can you get away with it? Probably. Should you do it? No. Would I do it? Never...
-- Eric Rostetter The Department of Physics The University of Texas at Austin
Go Longhorns!
On 01/14/2011 03:58 PM, Jonathan Tripathy wrote:
On 14/01/11 19:00, Stan Hoeppner wrote:
Jonathan Tripathy put forth on 1/13/2011 4:17 PM:
Regarding the servers, I was thinking of having a 2 node drbd cluster (in active+standby), which would export a single iSCSI LUN. Then, I would have a 2 node dovecot+postfix cluster (in active-active), where each node would mount the same LUN (With GFS2 on top). This is 4 servers in total (Well, 4 VMs running on 4 physically separate servers). Something you need to consider very carefully:
drbd is a kernel block storage driver. You run it ON a PHYSICAL cluster node, and never inside a virtual machine guest. drbd is RAID 1 over a network instead of a SCSI cable. It is meant to protect against storage and node failures. This is how you need to look at drbd. Again, DO NOT run DRBD inside of a VM guest. If you have a decent background in hardware and operating systems, it won't take you 30 seconds to understand what I'm saying here. If it takes you longer, then consider this case:
You have a consolidated Xen cluster of two 24 core AMD Magny Cours servers each with 128GB RAM, an LSI MegaRAID SAS controller with dual SFF8087 ports backed by 32 SAS drives in external jbod enclosures setup as a single hardware RAID 10. You spread your entire load of 97 virtual machine guests across this two node farm. Within this set of 97 guests, 12 of them are clustered network applications, and two of these 12 are your Dovecot/Postfix guests.
If you use drbd in the way you currently have in your head, you are mirroring virtual disk partitions with drbd _SIX times_ instead of once. Here, where you'd want to run drbd is within the Xen hypervisor kernel. drbd works at the BLOCK DEVICE level, not the application layer.
Eric already mentioned this once. Apparently you weren't paying attention.
I'm sorry I don't follow this. It would be appreciated if you could include a simpler example. The way I see it, a VM disk is just a small chunk (an LVM LV in my case) of a real disk.
Perhaps if you were to compare and contrast a virtual disk to a raw disk, that would help. If you wanted to use drbd with a raw disk being accessed via a VM guest, that would probably be all right. Might not be "supported" though.
-- -Eric 'shubes'
On 01/14/2011 03:58 PM, Jonathan Tripathy wrote:
On 14/01/11 19:00, Stan Hoeppner wrote:
Jonathan Tripathy put forth on 1/13/2011 4:17 PM:
Regarding the servers, I was thinking of having a 2 node drbd cluster (in active+standby), which would export a single iSCSI LUN. Then, I would have a 2 node dovecot+postfix cluster (in active-active), where each node would mount the same LUN (With GFS2 on top). This is 4 servers in total (Well, 4 VMs running on 4 physically separate servers). Something you need to consider very carefully:
drbd is a kernel block storage driver. You run it ON a PHYSICAL cluster node, and never inside a virtual machine guest. drbd is RAID 1 over a network instead of a SCSI cable. It is meant to protect against storage and node failures. This is how you need to look at drbd. Again, DO NOT run DRBD inside of a VM guest. If you have a decent background in hardware and operating systems, it won't take you 30 seconds to understand what I'm saying here. If it takes you longer, then consider this case:
You have a consolidated Xen cluster of two 24 core AMD Magny Cours servers each with 128GB RAM, an LSI MegaRAID SAS controller with dual SFF8087 ports backed by 32 SAS drives in external jbod enclosures setup as a single hardware RAID 10. You spread your entire load of 97 virtual machine guests across this two node farm. Within this set of 97 guests, 12 of them are clustered network applications, and two of these 12 are your Dovecot/Postfix guests.
If you use drbd in the way you currently have in your head, you are mirroring virtual disk partitions with drbd _SIX times_ instead of once. Here, where you'd want to run drbd is within the Xen hypervisor kernel. drbd works at the BLOCK DEVICE level, not the application layer.
Eric already mentioned this once. Apparently you weren't paying attention.
I'm sorry I don't follow this. It would be appreciated if you could include a simpler example. The way I see it, a VM disk is just a small chunk (an LVM LV in my case) of a real disk.
On 15/01/11 00:59, Eric Shubert wrote:
Perhaps if you were to compare and contrast a virtual disk to a raw disk, that would help. If you wanted to use drbd with a raw disk being accessed via a VM guest, that would probably be all right. Might not be "supported" though.
Thanks Eric. Now I understand where you are coming from: it's not the fact that DRBD is running in a VM that is the problem, it's the fact that DRBD should be replicating a raw physical disk, which of course is still possible from within a Xen VM.
Also thanks to Stan and everyone else for the helpful comments.
I still haven’t decided between GFS2 or OCFS2 yet. I guess I'll have to try both and see what works the best.
I really wish NFS didn't have the caching issue, as it's the most simple to set up
Jonathan,
-----Original Message-----
I really wish NFS didn't have the caching issue, as it's the most simple to set up
Don't give up on the simplest solution too easily - lots of us run NFS with quite large installs. As a matter of fact, I think all of the large installs run NFS; hence the need for the Director in 2.0.
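The core of the 2.0 director setup looks roughly like this (addresses are placeholders; see http://wiki2.dovecot.org/Director for the complete service definitions, which include a few more listeners than shown here):

    # conf.d/10-director.conf
    director_servers = 192.168.10.5 192.168.10.6
    director_mail_servers = 192.168.10.21 192.168.10.22

    service director {
      unix_listener login/director {
        mode = 0666
      }
      inet_listener {
        port = 9090
      }
    }
    service imap-login {
      executable = imap-login director
    }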
-Brad
On Fri, Jan 14, 2011 at 05:16:50PM -0800, Brad Davidson wrote:
Don't give up on the simplest solution too easily - lots of us run NFS with quite large installs. As a matter of fact, I think all of the large installs run NFS; hence the need for the Director in 2.0.
Not all, if this counts as large:
Filesystem Size Used Avail Use% Mounted on
/dev/gpfsmail 9.9T 8.7T 1.2T 88% /maildirs
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/gpfsmail 105279488 90286634 14992854 86% /maildirs
Running GPFS, with 7 nodes all active against the same filesystem. But the Director should still be useful..
-jf
On Mon, Jan 17, 2011 at 7:32 AM, Jan-Frode Myklebust <janfrode@tanso.net> wrote:
On Fri, Jan 14, 2011 at 05:16:50PM -0800, Brad Davidson wrote:
Don't give up on the simplest solution too easily - lots of us run NFS with quite large installs. As a matter of fact, I think all of the large installs run NFS; hence the need for the Director in 2.0.
Not all, if this counts as large:
Filesystem Size Used Avail Use% Mounted on
/dev/gpfsmail 9.9T 8.7T 1.2T 88% /maildirs
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/gpfsmail 105279488 90286634 14992854 86% /maildirs
how do you backup that data? :)
-ah
[]'sf.rique
On Thu, Jan 20, 2011 at 12:10 PM, alex handle <alex.handle@gmail.com> wrote:
On Mon, Jan 17, 2011 at 7:32 AM, Jan-Frode Myklebust <janfrode@tanso.net> wrote:
On Fri, Jan 14, 2011 at 05:16:50PM -0800, Brad Davidson wrote:
Don't give up on the simplest solution too easily - lots of us run NFS with quite large installs. As a matter of fact, I think all of the large installs run NFS; hence the need for the Director in 2.0.
Not all, if this counts as large:
Filesystem Size Used Avail Use% Mounted on
/dev/gpfsmail 9.9T 8.7T 1.2T 88% /maildirs
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/gpfsmail 105279488 90286634 14992854 86% /maildirs
how do you backup that data? :)
Same question!
I have about 1TB used and it takes 22 hrs to backup maildirs!
I have problems with ocfs2 in finding the files!
-ah
On Thu, Jan 20, 2011 at 5:20 PM, Henrique Fernandes <sf.rique@gmail.com> wrote:
Not all, if this counts as large:
Filesystem Size Used Avail Use% Mounted on
/dev/gpfsmail 9.9T 8.7T 1.2T 88% /maildirs
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/gpfsmail 105279488 90286634 14992854 86% /maildirs
how do you backup that data? :)
Same question!
I have about 1TB used and it takes 22 hrs to backup maildirs!
Our maildirs are spread in subfolders under /maildirs/[a-z0-9], where mail addresses starting with a are stored under /maildirs/a/, b in /maildirs/b, etc., and then we have distributed these top-level directories about evenly for backup by each host. So the 7 servers all run backups of different parts of the filesystem. The backups go to Tivoli Storage Manager, with its default incremental-forever policy, so there's not much data to back up. The problem is that it's very slow to traverse all the directories and compare against what was already backed up. I believe we're also using around 20-24 hours for the daily incremental backups... so we will soon have to start looking at alternative ways of doing it (or get rid of the non-dovecot accesses to the maildirs, which are probably stealing quite a bit of performance from the file scans).
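In practice each node runs something like this (a rough sketch; the bucket assignment and the TSM invocation are illustrative, adjust to your own layout):

    #!/bin/sh
    # this node's share of the /maildirs/[a-z0-9] buckets
    BUCKETS="a b c d 0 1"
    for b in $BUCKETS; do
        dsmc incremental "/maildirs/$b/" -subdir=yes
    done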
One alternative is the "mmbackup"-utility, which is supposed to use a much faster inode scan interface in GPFS:
http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=%2Fco...
but last time we tested it, it was too fragile...
-jf
Stan!
Sorry, I did not explain it well!
FULL
Spool to disk: ~24h TransferRate: 6MB/s
Despool to tape: ~7h TransferRate: 16MB/s
INCREMENTAL
Spool to disk: ~11h TransferRate: 300KB/s
Despool to tape: ~12m TransferRate: 16MB/s
When doing a backup, we turn on another machine in the OCFS2 cluster and from there spool to disk; afterwards it goes from the disk to tape.
There is no SAN fabric; everything goes through D-Link switches at 1 Gbit.
Also, the storage system is not a Sun; those OCFS2 servers connect to the storage via iSCSI, with OCFS2 running in the virtual machines.
Sorry, my English is poor and that makes it harder to explain!
[]'sf.rique
Henrique Fernandes put forth on 1/20/2011 11:55 AM:
Also, the storage system is not a Sun; those OCFS2 servers connect to the storage via iSCSI, with OCFS2 running in the virtual machines.
Please provide a web link to the iSCSI storage array product you are using, and tell us how many 1GbE ports you are link aggregating to the switch.
Also, are you using Oracle OCFS2 or IBM GPFS? You mentioned both. Considering you are experiencing severe performance issues with metadata operations due to the distributed lock manager...
Have you considered SGI CXFS? It's the fastest cluster FS on the planet by an order of magnitude. It uses dedicated metadata servers instead of a DLM, which is why it's so fast. Directory traversal operations would be orders of magnitude faster than what you have now.
http://en.wikipedia.org/wiki/CXFS http://www.sgi.com/products/storage/software/cxfs.html
-- Stan
[]'sf.rique
On Fri, Jan 21, 2011 at 2:14 AM, Stan Hoeppner <stan@hardwarefreak.com>wrote:
Henrique Fernandes put forth on 1/20/2011 11:55 AM:
Also, the storage system is not a Sun; those OCFS2 servers connect to the storage via iSCSI, with OCFS2 running in the virtual machines.
The storage is an EMC CX4 (I don't have all the info about it right now).
Please provide a web link to the iSCSI storage array product you are using, and tell us how many 1GbE ports you are link aggregating to the switch.
Just one; the EMC has one interface and we are exporting iSCSI over it. Right now that is all we can do; we are waiting until we can buy a Sun to make everything work.
Also, are you using Oracle OCFS2 or IBM GPFS? You mentioned both. Considering you are experiencing severe performance issues with metadata operations due to the distributed lock manager...
OCFS2 1.4, because it is free. We have no money for anything else; we have considered switching to NFS.
Have you considered SGI CXFS? It's the fastest cluster FS on the planet by an order of magnitude. It uses dedicated metadata servers instead of a DLM, which is why it's so fast. Directory traversal operations would be orders of magnitude faster than what you have now.
http://en.wikipedia.org/wiki/CXFS http://www.sgi.com/products/storage/software/cxfs.html
We haven't considered buying a clustered filesystem.
We are out of ideas for making it faster. The only thing we came up with was making more OCFS2 clusters with smaller disks; with this we are getting better performance. We now have 2 clusters, one with 4 TB and the other with 1 TB, and we are migrating some of the email from the 4 TB one to the 1 TB one, and we already have another 1 TB cluster ready. So we have 3 machines, and those 3 mount 3 disks each from the storage and form 3 OCFS2 clusters. We think each DLM then gets less work. Are we right?
Thanks!
-- Stan
Henrique Fernandes put forth on 1/21/2011 1:38 AM:
We are out of ideas for making it faster. The only thing we came up with was making more OCFS2 clusters with smaller disks; with this we are getting better performance. We now have 2 clusters, one with 4 TB and the other with 1 TB, and we are migrating some of the email from the 4 TB one to the 1 TB one, and we already have another 1 TB cluster ready. So we have 3 machines, and those 3 mount 3 disks each from the storage and form 3 OCFS2 clusters. We think each DLM then gets less work. Are we right?
That's impossible to say without me having an understanding of how this is actually setup. From your description I'm unable to understand what you have.
-- Stan
[]'sf.rique
On Fri, Jan 21, 2011 at 5:59 AM, Stan Hoeppner <stan@hardwarefreak.com>wrote:
Henrique Fernandes put forth on 1/21/2011 1:38 AM:
We are out of ideas for making it faster. The only thing we came up with was making more OCFS2 clusters with smaller disks; with this we are getting better performance. We now have 2 clusters, one with 4 TB and the other with 1 TB, and we are migrating some of the email from the 4 TB one to the 1 TB one, and we already have another 1 TB cluster ready. So we have 3 machines, and those 3 mount 3 disks each from the storage and form 3 OCFS2 clusters. We think each DLM then gets less work. Are we right?
That's impossible to say without me having an understanding of how this is actually setup. From your description I'm unable to understand what you have.
Let me try to explain better.
We have 3 virtual machines with this setup:
/dev/sda1 3.6T 2.4T 1.3T 66% /A
/dev/sdb1 1.0T 36G 989G 4% /B
/dev/sdc1 1.0T 3.3G 1021G 1% /C
/dev/sda1 on /A type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdb1 on /B type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdc1 on /C type ocfs2 (rw,_netdev,heartbeat=local)
My question is: what is faster? Configuring just one big disk with OCFS2 (sda1), or using more, smaller disks like sdb1 and sdc1?
Is it clearer now?
All our email is on sda1 and we are having many performance problems, so we are migrating some of the email to sdb1 and eventually to sdc1. Right now sdb1 seems to perform much better than sda1, but we are not sure whether that is because it holds far less email and concurrency or because it is actually better.
-- Stan
Henrique Fernandes put forth on 1/21/2011 9:50 AM:
Let me try to explain better.
We have 3 virtual machines with this setup:
/dev/sda1 3.6T 2.4T 1.3T 66% /A
/dev/sdb1 1.0T 36G 989G 4% /B
/dev/sdc1 1.0T 3.3G 1021G 1% /C
/dev/sda1 on /A type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdb1 on /B type ocfs2 (rw,_netdev,heartbeat=local)
/dev/sdc1 on /C type ocfs2 (rw,_netdev,heartbeat=local)
My question is: what is faster? Configuring just one big disk with OCFS2 (sda1), or using more, smaller disks like sdb1 and sdc1?
Is it clearer now?
All our email is on sda1 and we are having many performance problems, so we are migrating some of the email to sdb1 and eventually to sdc1. Right now sdb1 seems to perform much better than sda1, but we are not sure whether that is because it holds far less email and concurrency or because it is actually better.
None of this means much in absence of an accurate ESX host hardware and iSCSI network layout description. You haven't stated how /dev/sd[abc]1 are physically connected to the ESX hosts. You haven't given a _physical hardware description_ of /dev/sd[abc]1 or the connections to the EMC CX4.
For instance, if /dev/sda1 is in a 10 disk RAID5 group in the CX4, but /dev/sdb1 is a 24 disk RAID10 group in the CX4, *AND*
/dev/sda1 is LUN mapped out of an iSCSI port on the CX4 along with many many other LUNS which are under constant heavy use, *AND* /dev/sdb1 is LUN mapped out of an iSCSI port that shares no other LUNs, *then*
I would say the reason /dev/sdb1 is much faster is due to:
A. 24 drive RAID10 vs 10 drive RAID6 will yield ~10x increase in random IOPS
B. Zero congestion on the /dev/sdb1 iSCSI port will decrease latency
We need to know the physical characteristics of the hardware. SAN performance issues are not going to be related (most of the time) to how you have Dovecot setup.
Do you have any iostat data to share with us? Any data/graphs from the EMC controller showing utilization per port and per array?
If you're unable to gather such performance metric data it will be difficult to assist you.
-- Stan
On Thu, Jan 20, 2011 at 10:14:42PM -0600, Stan Hoeppner wrote:
Have you considered SGI CXFS? It's the fastest cluster FS on the planet by an order of magnitude. It uses dedicated metadata servers instead of a DLM, which is why it's so fast. Directory traversal operations would be orders of magnitude faster than what you have now.
That sounds quite impressive. Order of magnitude improvements would be very welcome. Do you have any data to back up that statement ? Are you talking streaming performance, IOPS or both ?
I've read that CXFS has bad metadata performance, and that the metadata-server can become a bottleneck.. Is the metadata-server function only possible to run on one node (with passive standby node for availability) ?
Do you know anything about the pricing of CXFS? I'm quite satisfied with GPFS, but I know I might be a bit biased since I work for IBM :-) If CXFS really is that good for maildir-type storage, I probably should have another look...
-jf
Jan-Frode Myklebust put forth on 1/21/2011 5:49 AM:
On Thu, Jan 20, 2011 at 10:14:42PM -0600, Stan Hoeppner wrote:
Have you considered SGI CXFS? It's the fastest cluster FS on the planet by an order of magnitude. It uses dedicated metadata servers instead of a DLM, which is why it's so fast. Directory traversal operations would be orders of magnitude faster than what you have now.
That sounds quite impressive. Order of magnitude improvements would be very welcome. Do you have any data to back up that statement ? Are you talking streaming performance, IOPS or both ?
Both.
I've read that CXFS has bad metadata performance, and that the metadata-server can become a bottleneck.. Is the metadata-server function only possible to run on one node (with passive standby node for availability) ?
Where did you read this? I'd like to take a look. The reason CXFS is faster than other cluster filesystems is _because of_ the metadata broker. It is much faster than distributed lock manager schemes at high loads, and equally fast at low loads. There is one active metadata broker server _per filesystem_ with as many standby backup servers per filesystem as you want. So for a filesystem seeing heavy IOPS you'd want a dedicated metadata broker. For filesystems storing large amounts of data but with low metadata IOPS you would use one broker server for multiple such filesystems.
Using GbE for the metadata network yields excellent performance. Using Infiniband is even better, especially with large CXFS client node counts under high loads, due to the dramatically lower packet latency through the switches, and a typical 20 or 40 Gbit signaling rate for 4x DDR/QDR. Using Infiniband for the metadata network actually helps DLM cluster filesystems more than those with metadata brokers.
Do you know anything about the pricing of CXFS? I'm quite satisfied with GPFS, but know I might be a bit biased since I work for IBM :-) If CXFS really is that good for maildir-type storage, I probably should have another look..
Given the financial situation SGI has found itself in the last few years, I have no idea how they're pricing CXFS or the SAN arrays. One acquisition downside to CXFS is that you have to deploy the CXFS metadata brokers on SGI hardware only, and their servers are more expensive than most nearly identical competing products.
Typically, they only sell CXFS as an add-on to their fiber channel SAN products. So it's not an inexpensive solution. It's extremely high performance, but you pay for it. Honestly, for most organizations doing mail clusters, unless you have a _huge_ user base and lots of budget, you might not be able to afford an SGI solution for mail cluster data storage. It never hurts to ask though, and sales people's time is free to potential customers. If your current cluster filesystem+SAN isn't cutting it, it can't hurt to ask an SGI salesperson.
At minimum you're probably looking at the cost of an Altix UV10 for the metadata broker server, an SGI InfiniteStorage 4100 Array, and the CXFS licenses for each cluster node you connect. Obviously you'll need other things such as a fiber channel switch, HBAs, etc, but that's the same for any other fiber channel cluster setup.
Even though you may pay a small price premium, SGI's fiber channel arrays are truly some of the best available. The specs on their lowest end model, the 4100, are pretty darn impressive for the _bottom_ of the line card: http://www.sgi.com/pdfs/4180.pdf
If/when deploying such a solution, it really pays to use fewer fat Dovecot nodes instead of lots of thin nodes. Fewer big core count boxes with lots of memory and a single FC HBA cost less in the long run than many lower core count boxes with low memory and an HBA. The cost of a single port FC HBA is typically more than a white box 1U single socket quad core server with 4GB RAM. Add the FC HBA and CXFS license to each node and you should see why fewer larger nodes is better.
-- Stan
--- On Thu, 20/1/11, Henrique Fernandes <sf.rique@gmail.com> wrote:
Yeah! Same here. How do you back up all this?
s.
"I merely function as a channel that filters music through the chaos of noise"
- Vangelis
On 20/01/2011 16:20, Henrique Fernandes wrote:
Same question!
I have about 1TB used and it takes 22 hrs to backup maildirs!
I have problems with OCFS2 finding files!
Just an idea, but have you evaluated performance of mdbox (new dovecot format) on your storage devices? It appears to be a gentle hybrid of mbox and maildir, with many mails packed into a single file (which might increase your performance due to fewer stat calls), but there is more than one file per folder, so some of the mbox limitations are avoided?
I haven't personally tried it, but I think you can see the theoretical appeal?
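For reference, I believe the switch is mostly a mail_location change in Dovecot 2.0; a rough, untested sketch (the path and rotate size are just examples):

  # dovecot.conf (sketch)
  mail_location = mdbox:~/mdbox
  mdbox_rotate_size = 10M   # roughly how large each storage file grows before a new one is started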
Good luck
Ed W
I have considered the idea, but we just changed from mbox to maildir about 4 months ago, and we had many problems with some accounts. We were using dsync to migrate.
But once we choose mdbox we are stuck with Dovecot, or will have to migrate all users again if we choose to use another IMAP server.
But thanks!
[]'sf.rique
Hi
I have considered the idea, but we just changed from mbox to maildir about 4 months ago, and we had many problems with some accounts. We were using dsync to migrate.
Out of curiosity - how did the backup times change between mbox vs maildir? I would suggest that this gives you a baseline for how much performance you could recover by switching back to something which is kind of an mbox/maildir hybrid?
But once we choose mdbox we are stuck with Dovecot, or will have to migrate all users again if we choose to use another IMAP server.
True, but seriously, what are your options these days? Dovecot, Cyrus and ...? If you switch to Cyrus then I think you need to plan your migration carefully due to its own custom indexes (so maildir buys you little). If you move to MS Exchange then you still can't use raw maildir. Actually apart from Courier is there another big name IMAP server using raw maildir?
With that in mind perhaps you just bite the bullet and assume that future migration will need dsync again? It's likely to only get easier as dsync matures?
Good luck
Ed W
[]'sf.rique
On Fri, Jan 21, 2011 at 3:29 PM, Ed W <lists@wildgooses.com> wrote:
Hi
I have considered the idea, but we just changed from mbox to maildir about 4 months ago, and we had many problems with some accounts. We were using dsync to migrate.
Out of curiosity - how did the backup times change between mbox vs maildir? I would suggest that this gives you a baseline for how much performance you could recover by switching back to something which is kind of an mbox/maildir hybrid?
I don't know if I got your question right, but before, while using mbox, we had fewer users and much less quota; it was only 200 MB and now it is about 1 GB. And before we did not have a good backup system and had many problems. We pretty much changed to maildir to make incremental backups and such easier.
And we are considering testing mdbox or sdbox, but it is still too early to make another big change like this.
But once we choose mdbox we are stuck with Dovecot, or will have to migrate all users again if we choose to use another IMAP server.
True, but seriously, what are your options these days? Dovecot, Cyrus and ...? If you switch to Cyrus then I think you need to plan your migration carefully due to its own custom indexes (so maildir buys you little). If you move to MS Exchange then you still can't use raw maildir. Actually apart from Courier is there another big name IMAP server using raw maildir?
With that in mind perhaps you just bite the bullet and assume that future migration will need dsync again? It's likely to only get easier as dsync matures?
Yeah, I know there are no better choices, but it is still on my mind. I had problems with dsync on accounts that were being written to by Dovecot. I am studying Dovecot's dbox!
Still an alternative.
Good luck
Ed W
On 21/01/2011 17:50, Henrique Fernandes wrote:
I don't know if I got your question right, but before, while using mbox, we had fewer users and much less quota; it was only 200 MB and now it is about 1 GB. And before we did not have a good backup system and had many problems. We pretty much changed to maildir to make incremental backups and such easier.
Sorry, the point of the question was simply whether you could use your old setup to help estimate whether there is actually any point switching from maildir? Sounds like you didn't have the same backup service back then, so you can't compare though?
Just pointing out that it's completely unproven whether moving mdbox will actually make a difference anyway...
And we are considering testing mdbox or sdbox, but it is still too early to make another big change like this.
Sure - by the way I believe you can mix mailbox storage formats to a large extent? I'm not using this stuff so please check the docs before believing me, but I believe you can mix storage formats even down to the folder level under some conditions? I dare say you did exactly this during your migration so I doubt I'm telling you anything new...?
The only point of mentioning that is that you could do something as simple as duplicating some proportion of the mailboxes to new "dummy" accounts, simply for the purpose of padding out some new format directories - users wouldn't really access them. Then you could try and compare the backup times of the original mailboxes (that the users actually use) with the duplicated ones in whatever format you are testing?
Just an idea?
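If you did want to try something like that, I believe dsync can populate the copy in a different format; a very rough sketch, with an invented test account and path (check the 2.0 dsync man page before trusting the exact syntax):

  # one-way copy of a user's mail into an mdbox-format test location
  dsync -u someuser@example.com backup mdbox:/srv/mdbox-test/someuser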
Good luck
Ed W
[]'sf.rique
On Fri, Jan 21, 2011 at 4:31 PM, Ed W <lists@wildgooses.com> wrote:
On 21/01/2011 17:50, Henrique Fernandes wrote:
I don't know if I got your question right, but before, while using mbox, we had fewer users and much less quota; it was only 200 MB and now it is about 1 GB. And before we did not have a good backup system and had many problems. We pretty much changed to maildir to make incremental backups and such easier.
Sorry, the point of the question was simply whether you could use your old setup to help estimate whether there is actually any point switching from maildir? Sounds like you didn't have the same backup service back then, so you can't compare though?
I am not comparing anything, because we redid the entire email system; before, it was only one machine with a local disk. Then we bought the EMC and started using it for the new mail system, with virtual machines, iSCSI, etc.
Just pointing out that it's completely unproven whether moving mdbox will actually make a difference anyway...
And we are considering testing mdbox or sdbox, but it is still too early to make another big change like this.
Sure - by the way I believe you can mix mailbox storage formats to a large extent? I'm not using this stuff so please check the docs before believing me, but I believe you can mix storage formats even down to the folder level under some conditions? I dare say you did exactly this during your migration so I doubt I'm telling you anything new...?
Yeah, I did like you said: a mix of mbox and maildir. Actually only active users have maildir; inactive users are still on mbox.
The only point of mentioning that is that you could do something as simple as duplicating some proportion of the mailboxes to new "dummy" accounts, simply for the purpose of padding out some new format directories - users wouldn't really access them. Then you could try and compare the backup times of the original mailboxes (that the users actually use) with the duplicated ones in whatever format you are testing?
Just an idea?
We usually use one domain per test, like this other sdb1 volume we are testing.
Good luck
Ed W
But you asked before about hardware.
It is an EMC CX4, linked with ONE 1 GbE connection to ONE D-Link switch (I am not sure, but I guess it is full gigabit), and from that D-Link it connects to 4 Xen machines at 1 Gbit, and the virtual machines reach the EMC over iSCSI.
About the disks: sda is 8 disks in RAID 1+0, and I guess sdb and sdc are RAID 5 with 12 disks (those are tests).
Sorry, I don't know the specs of the disks.
We think it is OCFS2 and the size of the partition, because we can write a big file at an acceptable speed, but if we try to delete, create, or read lots of small files the speed is horrible. We think it is a DLM problem in propagating the locks.
Do you have any idea how to test the storage for maildir usage? We made a bash script that writes some directories and lots of files and afterwards removes them, etc.
Any better ideas?
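Our script is roughly like this (a simplified sketch; the directory and the counts are made up):

  #!/bin/bash
  # crude maildir-style test: create, stat and delete lots of small files on the OCFS2 volume
  base=/A/benchtest            # made-up test directory
  mkdir -p "$base"
  for d in $(seq 1 50); do
      mkdir -p "$base/dir$d"
      for f in $(seq 1 200); do
          head -c 4096 /dev/urandom > "$base/dir$d/msg$f"
      done
  done
  find "$base" -type f | xargs stat > /dev/null    # metadata-heavy read pass
  rm -rf "$base"                                   # the delete pass is what hurts most here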
Appreciate your help!
Henrique Fernandes put forth on 1/21/2011 12:53 PM:
But you asked before about hardware.
I asked about the hardware.
It is an EMC CX4, linked with ONE 1 GbE connection to ONE D-Link switch (I am not sure, but I guess it is full gigabit), and from that D-Link it connects to 4 Xen machines at 1 Gbit, and the virtual machines reach the EMC over iSCSI.
OMG!? A DLink switch? Is it one of their higher end managed models or consumer grade? Which model is it? Do you currently dedicate this DLink GbE switch to *only* iSCSI SAN traffic? What network/switch do you currently run OCFS metadata traffic over? Same as the client network? If so, that's bad.
You *NEED* a *QUALITY* managed dedicated GbE switch for iSCSI and OCFS metadata traffic. You *NEED* to get a decent GbE managed switch if that DLink isn't one of their top of line models. You will setup link aggregation between the two GbE ports on the CX4 and the managed switch. Program the switch and HBAs, and the ports on the CX4 for jumbo frame support. Read the documentation that comes with each product, and read the Linux ethernet docs to learn how to do link aggregation. You will need 3 GbE ports on each Xen host. One will plug into the network switch that carries client traffic. Two will plug into the SAN dedicated managed switch, one for OCFS metadata traffic and the other for iSCSI SAN traffic. If you don't separate these 3 types of traffic onto dedicated 3 GbE links your performance will always be low to horrible.
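As a sketch only (interface names, addresses and the Debian/Ubuntu ifenslave style are assumptions, not a recipe for your exact distro), a bonded, jumbo-frame SAN-facing interface on a Xen host could look like:

  # /etc/network/interfaces (sketch)
  auto bond0
  iface bond0 inet static
      address 172.16.1.11          # iSCSI SAN subnet (example)
      netmask 255.255.255.0
      mtu 9000                     # jumbo frames; the switch and CX4 ports must match
      bond-slaves eth1 eth2
      bond-mode 802.3ad            # LACP; the switch ports must be configured for it
      bond-miimon 100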
About the disks: sda is 8 disks in RAID 1+0, and I guess sdb and sdc are RAID 5 with 12 disks (those are tests).
RAID 10 (1+0) is EXCELLENT for maildir. Any parity RAID (5/6) will have less than *half* the random write IOPs of RAID 10. Currently you only have a stripe width of *only 4* with your current RAID 10 which is a big part of your problem. You *NEED* to redo the CX4. The maximum member count for RAID 10 on the CX4 is 16 drives. That is your target.
Assign two spares. If you still have 16 drives remaining, create a single RAID 10 array of those 16 drives with a stripe depth of 64. If you have 14 drives remaining, do it with 14. You *NEED* to maximize the RAID 10 with as many drives as you can. Then, slice appropriately sized LUNs, one for maildir use, one for testing, etc. Export each as a separate LUN.
The reason for this is that you are currently spindle stripe starved. You need to use RAID 10, but your current stripe width of 4 doesn't yield enough IOPS to keep up with your maildir data write load. Moving to a stripe with of 7 (14/2) or 8 (16/2) will double your sustained IOPs over what you have now.
Sorry, I don't know the specs of the disks.
That's ok as it's not critical information.
We think it is OCFS2 and the size of the partition, because... <snip>
With only 4 OCFS clients I'm pretty sure this is not the cause of your problems. The issues appear all hardware and network design related. I've identified what seem to be the problem areas and presented you the solutions above. Thankfully none of them will be expensive, as all you need is one good quality managed switch, if you don't already have one.
*BUT*, you will have a day, maybe two, of horrible user performance as you move all the maildir data off the CX4 and reconfigure it for a 14 or 16 drive RAID 10. Put a couple of fast disks in one of the Xen servers or a fast spare bare metal server and run Dovecot on it while you're fixing the CX4. You'll also have to schedule an outage while you install the new switch and reconfigure all the ports. Sure, performance will suck for your users for a day or two, but better that it sucks only one or two more days than for months into the future if you don't take the necessary steps to solve the problem permanently.
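The data move itself is mostly an rsync exercise; a sketch with made-up paths (stop delivery and Dovecot first so nothing changes underneath it):

  rsync -aH --numeric-ids /A/maildirs/ /tempstore/maildirs/
  # reconfigure the CX4, rebuild the filesystem, then rsync back and restart the services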
Do you have any idea how to test the storage for maildir usage? We made a bash script that writes some directories and lots of files and afterwards removes them, etc.
I'm pretty sure I've already identified your problems without need for testing, thanks to the information you provided about your hardware. Here's an example of a suitable managed switch with link aggregation and jumbo frame support, if you don't already have one:
http://h10144.www1.hp.com/products/switches/HP_ProCurve_Switch_2810_Series/o... http://www.newegg.com/Product/Product.aspx?Item=N82E16833316041
This switch has plenty of processing power to handle your iSCSI and metadata traffic on just one switch. But remember, you need two GbE network links into this switch from each Xen host--one for OCFS metadata and one for iSCSI. You should use distinct RFC1918 IP subnets for each, if you aren't already, such as 192.168.1.0/24 for the metadata network, and 172.16.1.0/24 for the iSCSI network. You'll need a third GbE connection to your user traffic network. Again, keep metadata/iSCSI traffic on a separate physical network infrastructure from client traffic.
Hope this helps. I know you're going to cringe at the idea of reconfiguring the CX4 for a single large RAID 10, but it *must* be done if you're going to get the performance you need. Either that or you need to expand it with another 16 drives, configure those as RAID 10, stop all Dovecot services, copy the mailstore over, and point Dovecot to the new location. This method would prevent downtime, but at significant cost.
-- Stan
Henrique Fernandes put forth on 1/21/2011 12:53 PM:
We think it is OCFS2 and the size of the partition, because we can write a big file at an acceptable speed, but if we try to delete, create, or read lots of small files the speed is horrible. We think it is a DLM problem in propagating the locks.
It's not the size of the filesystem that's the problem. But it is an issue with the DLM, and with the small RAID 10 set. This is why I recommended putting DLM on its own dedicated network segment, same with the iSCSI traffic, and making sure you're running full duplex GbE all round. DLM doesn't require GbE bandwidth, but the latency of GbE is less than fast ethernet. I'm also assuming, since you didn't say, that you were running all your ethernet traffic over a single GbE port on each Xen host. That just doesn't scale when doing filesystem clustering. The traffic load is too great, unless you're idling all the time, in which case, why did you go OCFS? :)
Do you have any idea how to test the storage for maildir usage? We made a bash script that writes some directories and lots of files and afterwards removes them, etc.
This only does you any good if you have instrumentation set up to capture metrics while you run your test. You'll need to run iostat on the host running the script tests, along with iftop, and any OCFS monitoring tools. You'll need to use the EMC software to gather IOPS and bandwidth metrics from the CX4 during the test. You'll also need to make sure your aggregate test data size is greater than 6GB which is 2x the size of the cache in the CX4. You need to hit the disks, hard, not the cache.
The best "test" is to simply instrument your normal user load and collect the performance data I mentioned.
Any better ideas?
Ditch iSCSI and move to fiber channel. A Qlogic 14 port 4Gb FC switch with all SFPs included is less than $2500 USD. You already have the FC ports in your CX4. You'd instantly quadruple the bandwidth of the CX4 and that of each Xen host, from 200 to 800 MB/s and 100 to 400 MB/s respectively. Four single port 4Gb FC HBAs, one for each server, will run you $2500-3000 USD. So for about $5k USD you can quadruple your bandwidth, and lower your latency.
I don't recall if you ever told us what your user load is. How many concurrent Dovecot user sessions are you supporting on average?
Appreciate your help!
No problem. SANs are one of my passions. :)
-- Stan
So I read all the emails and stuff.
But I am sorry to say, many of the things you said we are not able to do.
About changing the EMC to RAID 10: we cannot do it, because other people are using it, so we cannot change anything on the storage. Those other LUNs I talked about are meant for the web servers, but as we are testing we are allowed to use them.
Much of what you said we are trying to do, but we don't have the hardware.
Our Xen hosts have just two 1 GbE interfaces, and we are using one for external traffic and the other for the storage (our VMs are also on OCFS2, to be able to migrate to any host, etc.).
Anyway, we are seriously considering the idea of putting DLM on a dedicated network. We are going to study some way to do it with our hardware.
Appreciate your help; I guess I learned a lot. (I did forward these emails to some of my bosses; I hope it will change something about the hardware, but who knows.)
Another thing to say: the email system is not very tuned yet, but as we keep improving it we start to get more money to buy more stuff for the service. That is why we try to make it better with poor hardware. As I said, before everything was on a physical desktop with a 500 GB disk, so right now this is already a really big improvement.
Thanks a lot to all; I still appreciate any help. I am reading it all and trying to take the best from it!
[]'sf.rique
Henrique Fernandes put forth on 1/22/2011 2:59 PM:
About changing the EMC to RAID 10: we cannot do it, because other people are using it, so we cannot change anything on the storage. Those other LUNs I talked about are meant for the web servers, but as we are testing we are allowed to use them.
You need to look at this: http://www.hardwarefreak.com/thin-provisioning.jpg
The exported LUNS need to be configured as independent of the underlying physical disk RAID level. You can reconfigure up to 16 disks of the CX4 as RAID 10, and export as many LUNS of any size as you like. I.e. you can keep the same exported LUNS you have now, although you may have to adjust their sizes a bit. What you gain by doing this is a doubling of random IOPS.
You will need to backup any real/live data on the current LUNS. You only have 16 disks in the CX4 correct? So reconfigure with 2 spares and 14 disks in a RAID 10. Then create the same LUNS you had before. I don't use EMC products so I don't know their terminology. But, this is usually called something like "virtual disks" or "virtual LUNS". The industry calls this "thin provisioning". You *need* to do this if you're going to support hundreds or more concurrent users. An effective stripe width of only 4 spindles, which is what you currently have with an 8 disk RAID10, isn't enough. That gives you only 600-1200 IOPS depending on the RPM of the disks in the array: 600 for 7.2k disks, and 1200 for 15k disks. With a 14 disk RAID10 you'll have IOPS of 1050 to 2100.
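The arithmetic behind those figures appears to be per-drive random write IOPS times the number of mirror pairs, roughly:

  8-drive RAID 10  -> 4 pairs:  4 x ~150 IOPS (7.2k) = ~600,   4 x ~300 IOPS (15k) = ~1200
  14-drive RAID 10 -> 7 pairs:  7 x ~150 = ~1050,              7 x ~300 = ~2100

(The per-drive numbers are rough assumptions; check the actual drive models in the CX4.)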
Much of what you said we are trying to do, but we don't have the hardware.
Our Xen hosts have just two 1 GbE interfaces, and we are using one for external traffic and the other for the storage (our VMs are also on OCFS2, to be able to migrate to any host, etc.).
In that case you _really really_ need more IOPS and throughput from the CX4.
You didn't answer my previous question regarding the physical connections from the CX4 to your ethernet switch. Do you have both ports connected, and do you have them link aggregated? If not, this is probably the single most important change you could make at this point. I seriously doubt that 100 MB/s is sufficient for the load you're putting on the CX4.
Anyway, we are seriously considering the idea of putting DLM on a dedicated network. We are going to study some way to do it with our hardware.
As I said a single decent quality 24 port GbE managed switch will carry the DLM and iSCSI traffic just fine, especially if all the devices can support jumbo frames. Cheap desktop/soho switches such as DLink are going to cripple your operation.
Appreciate your help; I guess I learned a lot. (I did forward these emails to some of my bosses; I hope it will change something about the hardware, but who knows.)
Glad to be of assistance. Hope you get all the problems worked out. The only money you really need to spend is on a decent GbE managed switch with 16-24 ports.
Another thing to say: the email system is not very tuned yet, but as we keep improving it we start to get more money to buy more stuff for the service. That is why we try to make it better with poor hardware. As I said, before everything was on a physical desktop with a 500 GB disk, so right now this is already a really big improvement.
It's still better than before, even with all the current performance problems? Well at least you're making some progress. :)
Thanks a lot to all; I still appreciate any help. I am reading it all and trying to take the best from it!
Are you running OCFS in both the Xen guests and the Xen hosts? If so that may also be part of the performance problem. You need to look at ways to optimize (i.e. decrease) the OCFS metadata load. Can Xen export a filesystem up to the guest via some virtual mechanism, such as ESX presents virtual disks to a guest? If so you should do that.
-- Stan
[]'sf.rique
On Sun, Jan 23, 2011 at 1:20 AM, Stan Hoeppner <stan@hardwarefreak.com>wrote:
Henrique Fernandes put forth on 1/22/2011 2:59 PM:
About changing the EMC to RAID 10: we cannot do it, because other people are using it, so we cannot change anything on the storage. Those other LUNs I talked about are meant for the web servers, but as we are testing we are allowed to use them.
You need to look at this: http://www.hardwarefreak.com/thin-provisioning.jpg
The exported LUNS need to be configured as independent of the underlying physical disk RAID level. You can reconfigure up to 16 disks of the CX4 as RAID 10, and export as many LUNS of any size as you like. I.e. you can keep the same exported LUNS you have now, although you may have to adjust their sizes a bit. What you gain by doing this is a doubling of random IOPS.
It will not be that easy, because the people who actually own the storage need space, not performance; we are just using it because we needed to, it wasn't bought for us! (But that is not your problem.)
You will need to backup any real/live data on the current LUNS. You only have 16 disks in the CX4 correct? So reconfigure with 2 spares and 14 disks in a RAID 10. Then create the same LUNS you had before. I don't use EMC products so I don't know their terminology. But, this is usually called something like "virtual disks" or "virtual LUNS". The industry calls this "thin provisioning". You *need* to do this if you're going to support hundreds or more concurrent users. An effective stripe width of only 4 spindles, which is what you currently have with an 8 disk RAID10, isn't enough. That gives you only 600-1200 IOPS depending on the RPM of the disks in the array: 600 for 7.2k disks, and 1200 for 15k disks. With a 14 disk RAID10 you'll have IOPS of 1050 to 2100.
We have more, I guess; I need to talk with the person who knows about the storage. I will look into thin provisioning.
Much of what you said we are trying to do, but we don't have the hardware.
Our Xen hosts have just two 1 GbE interfaces, and we are using one for external traffic and the other for the storage (our VMs are also on OCFS2, to be able to migrate to any host, etc.).
In that case you _really really_ need more IOPS and throughput from the CX4.
The VM disks are on another LUN, but they don't suffer from performance issues; this is one of the reasons we don't think it is a storage problem but a filesystem one.
You didn't answer my previous question regarding the physical connections from the CX4 to your ethernet switch. Do you have both ports connected, and do you have them link aggregated? If not, this is probably the single most important change you could make at this point. I seriously doubt that 100 MB/s is sufficient for the load you're putting on the CX4.
Only one cable! Nothing aggregated. There are 2 connections to the CX4, I guess, one for each SP (storage processor); it is something about how the storage separates the LUNs, I am not sure why.
We have been analyzing the throughput and it is not that high, as I said; we are having some problems with monitoring right now (that is handled by another department where I work, none of this configuration is mine). But we did not see any port on the switch at more than half utilization!
Anyway, we are seriously considering the idea of putting DLM on a dedicated network. We are going to study some way to do it with our hardware.
As I said a single decent quality 24 port GbE managed switch will carry the DLM and iSCSI traffic just fine, especially if all the devices can support jumbo frames. Cheap desktop/soho switches such as DLink are going to cripple your operation.
About jumbo frames: the person responsible had some problems configuring them, so now I am not sure whether we are using jumbo frames or not. Would it be much better if we were?
I might look at the switch, but it is not that bad, I guess.
Appreciate your help; I guess I learned a lot. (I did forward these emails to some of my bosses; I hope it will change something about the hardware, but who knows.)
Glad to be of assistance. Hope you get all the problems worked out. The only money you really need to spend is on a decent GbE managed switch with 16-24 ports.
Another thing to say: the email system is not very tuned yet, but as we keep improving it we start to get more money to buy more stuff for the service. That is why we try to make it better with poor hardware. As I said, before everything was on a physical desktop with a 500 GB disk, so right now this is already a really big improvement.
It's still better than before, even with all the current performance problems? Well at least you're making some progress. :)
It is better, because now we have a decent webmail (Horde with DIMP enabled; before it was just IMP), and most people used to have POP configured because of the 200 MB quota, and few users used the webmail. Now many more people use the webmail and IMAP since the quota is 1 GB. Any better free webmail you could point us to for testing?
Thanks a lot to all; I still appreciate any help. I am reading it all and trying to take the best from it!
Are you running OCFS in both the Xen guests and the Xen hosts? If so that may also be part of the performance problem. You need to look at ways to optimize (i.e. decrease) the OCFS metadata load. Can Xen export a filesystem up to the guest via some virtual mechanism, such as ESX presents virtual disks to a guest? If so you should do that.
Let me try to explain how it is configured.
Our 4 Xen hosts mount one disk exported over iSCSI from the CX4. Each virtual machine has a disk image (created with dd from /dev/zero) placed on the mounted CX4 volume, and this image is exported to Xen as the virtual machine's disk. Each host has one Xen interface (eth0) to itself; the other interface, eth1, carries several VLANs. This same eth0 interface is also exported to the virtual machines that mount the CX4.
We are thinking of some way to improve performance by doing something with the Dovecot index files: a different LUN, or local disk on each host, or something like that. We don't know which one is best.
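For example, something like the following (paths made up; this is just the idea, not a tested config):

  # dovecot.conf (sketch): maildirs stay on OCFS2, indexes go to fast local disk on each node
  mail_location = maildir:/A/maildirs/%d/%n:INDEX=/var/lib/dovecot/indexes/%d/%n

Though with more than one node serving the same user, local indexes can diverge, which is exactly what the director is meant to prevent.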
Right now we don't know what is causing the performance problems.
Oh, I forgot to tell you before: we have 3 nodes in the OCFS2 cluster; 2 function as IMAP/POP/LDA, and the other one is just for Mailman. Before, when an email was sent to a list containing all addresses, the other 2 servers just stopped working and hit about 90% I/O wait. The same problem happens if I go to any host and simply do a big rm -rf on a large email account. Another thing that makes us think it is an OCFS2 problem.
Glad for your help!
-- Stan
On Sun, Jan 23, 2011 at 02:01:49AM -0200, Henrique Fernandes wrote:
It is better, because now we have a decent webmail (Horde with DIMP enabled; before it was just IMP), and most people used to have POP configured because of the 200 MB quota, and few users used the webmail. Now many more people use the webmail and IMAP since the quota is 1 GB. Any better free webmail you could point us to for testing?
We were considering Horde, and the upcoming Horde 4, but IMHO the interface is too old-fashioned. I stumbled over SOGo a few months ago, and IMHO it looks great, and they are doing (almost) everything right.
http://www.sogo.nu/
http://www.sogo.nu/english/tour/online_demo.html
-jf
On 23 Jan 2011, at 08:51, Jan-Frode Myklebust wrote:
On Sun, Jan 23, 2011 at 02:01:49AM -0200, Henrique Fernandes wrote:
It is better, because now we have a decent webmail (Horde with DIMP enabled; before it was just IMP), and most people used to have POP configured because of the 200 MB quota, and few users used the webmail. Now many more people use the webmail and IMAP since the quota is 1 GB. Any better free webmail you could point us to for testing?
We were considering Horde, and the upcoming Horde 4, but IMHO the interface is too old-fashioned. I stumbled over SOGo a few months ago, and IMHO it looks great, and they are doing (almost) everything right.
http://www.sogo.nu/ http://www.sogo.nu/english/tour/online_demo.html
Have a look at roundcube
On Sun, Jan 23, 2011 at 09:54:56AM +0000, John Moorhouse wrote:
http://www.sogo.nu/ http://www.sogo.nu/english/tour/online_demo.html
Have a look at roundcube
Yes, roundcube is looking good, but AFAIK it's missing an integrated calendar.
-jf
- Jan-Frode Myklebust <janfrode@tanso.net>:
On Sun, Jan 23, 2011 at 02:01:49AM -0200, Henrique Fernandes wrote:
It is better, because now we have a decent webmail (Horde with DIMP enabled; before it was just IMP), and most people used to have POP configured because of the 200 MB quota, and few users used the webmail. Now many more people use the webmail and IMAP since the quota is 1 GB. Any better free webmail you could point us to for testing?
We were considering Horde, and the upcoming Horde 4, but IMHO the interface is too old-fashioned. I stumbled over SOGo a few months ago, and IMHO it looks great, and they are doing (almost) everything right.
+1
p@rick
-- state of mind Digitale Kommunikation
Franziskanerstraße 15 Telefon +49 89 3090 4664 81669 München Telefax +49 89 3090 4666
Amtsgericht München Partnerschaftsregister PR 563
Thanks to all!
I am already considering SOGo. Going to test it some day!
[]'sf.rique
[]'sf.rique
On Fri, Jan 21, 2011 at 8:06 PM, Stan Hoeppner <stan@hardwarefreak.com>wrote:
Henrique Fernandes put forth on 1/21/2011 12:53 PM:
We think it is OCFS2 and the size of the partition, because we can write a big file at an acceptable speed, but if we try to delete, create, or read lots of small files the speed is horrible. We think it is a DLM problem in propagating the locks.
It's not the size of the filesystem that's the problem. But it is an issue with the DLM, and with the small RAID 10 set. This is why I recommended putting DLM on its own dedicated network segment, same with the iSCSI traffic, and making sure you're running full duplex GbE all round. DLM doesn't require GbE bandwidth, but the latency of GbE is less than fast ethernet. I'm also assuming, since you didn't say, that you were running all your ethernet traffic over a single GbE port on each Xen host. That just doesn't scale when doing filesystem clustering. The traffic load is too great, unless you're idling all the time, in which case, why did you go OCFS? :)
Yeah, pretty much all traffic is on the 2 GbE ports on the Xen hosts. The traffic is not high; on the Xen host it is about 30% of the link at peak times. I did not get your question, why did I go to OCFS? We only needed a clustered filesystem so that my hosts could help with HA and performance.
Do you have any idea how to test the storage for maildir usage? We made a bash script that writes some directories and lots of files and afterwards removes them, etc.
This only does you any good if you have instrumentation set up to capture metrics while you run your test. You'll need to run iostat on the host running the script tests, along with iftop, and any OCFS monitoring tools. You'll need to use the EMC software to gather IOPS and bandwidth metrics from the CX4 during the test. You'll also need to make sure your aggregate test data size is greater than 6GB which is 2x the size of the cache in the CX4. You need to hit the disks, hard, not the cache.
EMC told us that once you start the analyzer it makes the performance MUCH WORSE, so we are not considering using it right now. But thanks for the tips.
The best "test" is to simply instrument your normal user load and collect the performance data I mentioned.
Any better ideas?
Ditch iSCSI and move to fiber channel. A Qlogic 14 port 4Gb FC switch with all SFPs included is less than $2500 USD. You already have the FC ports in your CX4. You'd instantly quadruple the bandwidth of the CX4 and that of each Xen host, from 200 to 800 MB/s and 100 to 400 MB/s respectively. Four single port 4Gb FC HBAs, one for each server, will run you $2500-3000 USD. So for about $5k USD you can quadruple your bandwidth, and lower your latency.
I am from Brazil; stuff here is a little more expensive than that. And still, where I work it is not easy to get money to buy hardware.
I don't recall if you ever told us what your user load is. How many concurrent Dovecot user sessions are you supporting on average?
Last time I checked on my IPVS server (the one that balances between the nodes with OCFS2),
it was something like 35 active connections per host and about 180 inactive connections (InActConn) per host as well.
If I run doveadm on each server it gives me about 25 users per host, but as most of the system is webmail, clients connect and disconnect pretty quickly, as the IMAP protocol allows, so the doveadm numbers keep changing a lot.
Appreciate your help!
No problem. SANs are one of my passions. :)
-- Stan
-----Original Message-----
I'm sorry I don't follow this. It would be appreciated if you could include a simpler example. The way I see it, a VM disk is just a small chunk "LVM LV in my case" of a real disk.
Perhaps if you were to compare and contrast a virtual disk to a raw disk, that would help. If you wanted to use drbd with a raw disk being accessed via a VM guest, that would probably be all right. Might not be "supported" though.
Depending on your virtualization method, raw device passthrough would probably be OK. Otherwise, think about what you're doing - putting a filesystem - on a replicated block device - that's presented through a virtualization layer - that's on a filesystem - that's on a block device. If you're running GFS/GlusterFS/etc on the DRBD disk, and the VM is on VMFS, then you're actually using two clustered filesystems!
Each layer adds a bit of overhead, and each block-on-filesystem layering adds the potential for block misalignments and other issues that will affect your overall performance and throughput. It's just hard to do right.
-Brad
On 15/01/11 01:14, Brad Davidson wrote:
I'm sorry I don't follow this. It would be appreciated if you could include a simpler example. The way I see it, a VM disk is just a small chunk "LVM LV in my case" of a real disk. Perhaps if you were to compare and contrast a virtual disk to a raw disk, that would help. If you wanted to use drbd with a raw disk being accessed via a VM guest, that would probably be all right. Might not be "supported" though.
Depending on your virtualization method, raw device passthrough would probably be OK. Otherwise, think about what you're doing - putting a filesystem - on a replicated block device - that's presented through a virtualization layer - that's on a filesystem - that's on a block device. If you're running GFS/GlusterFS/etc on the DRBD disk, and the VM is on VMFS, then you're actually using two clustered filesystems!
Each layer adds a bit of overhead, and each block-on-filesystem layering adds the potential for block misalignments and other issues that will affect your overall performance and throughput. It's just hard to do right.
-Brad
Generally, I would give an LVM LV to each of my Xen guests, which according to the DRBD site, is ok:
http://www.drbd.org/users-guide/s-lvm-lv-as-drbd-backing-dev.html
I do not use img files with loopback devices
Is this a bit better now?
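For illustration, the resource definition with an LV as backing device would look roughly like this (hostnames, addresses and the LV name are made up):

  # /etc/drbd.d/mailstore.res (sketch, DRBD 8.3-style)
  resource mailstore {
      protocol C;
      on node-a {
          device    /dev/drbd0;
          disk      /dev/vg0/mailstore;    # the LVM LV used as backing device
          address   10.0.0.1:7789;
          meta-disk internal;
      }
      on node-b {
          device    /dev/drbd0;
          disk      /dev/vg0/mailstore;
          address   10.0.0.2:7789;
          meta-disk internal;
      }
  }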
Quoting Jonathan Tripathy <jonnyt@abpni.co.uk>:
Generally, I would give an LVM LV to each of my Xen guests, which
according to the DRBD site, is ok: http://www.drbd.org/users-guide/s-lvm-lv-as-drbd-backing-dev.html
I do not use img files with loopback devices
Is this a bit better now?
There are implications of whether you do drbd+lvm or lvm+drbd when it comes to things like lvm snapshots, growing/shrinking lvm volumes, etc. Some thought may be needed to make sure you configure it in such a way as to meet your needs...
-- Eric Rostetter The Department of Physics The University of Texas at Austin
Go Longhorns!
Am 13.01.2011 08:22, schrieb Jonathan Tripathy:
Hi Everyone,
I wish to create a Postfix/Dovecot active-active cluster (each node will run Postfix *and* Dovecot), which will obviously have to use central storage. I'm looking for ideas to see what's the best out there. All of this will be running on multiple Xen hosts, however I don't think that matters as long as I make sure that the cluster nodes are on different physical boxes.
Here are my ideas so far for the central storage:
NFS Server using DRBD+LinuxHA. Export the same NFS share to each mail server. Which this seems easy, how well does Dovecot work with NFS? I've read the wiki page, and it doesn't sound promising. But it may be outdated..
Export block storage using iSCSI from targets which have GFS2 on DRBD+LinuxHA. This is tricky to get working well, and it's only a theory.
GlusterFS. Easy to set up, but apparently very slow to run.
So what's everybody using? I know that Postfix runs well on NFS (according to their docs). I intend to use Maildir
Thanks
I have DRBD and OCFS2 with keepalived on Ubuntu Lucid: 2 load balancers, 2 mail servers with Postfix and Dovecot 2 (maildirs), clamav-milter, spamass-milter, sqlgrey, master-master MySQL, plus Horde webmail on Apache on both servers. No problems so far, but for now I only have about 100 mailboxes.
I wouldn't recommend NFS for the mail store. If you want to use GFS you might be better off with Red Hat (or a clone); last time I tested it on Ubuntu I couldn't get it running as I expected (this may have changed now...).
I don't think there is a single best solution; it depends on your hardware, financial resources, number of mailboxes wanted, etc.
-- Best Regards
MfG Robert Schetterer
Germany/Munich/Bavaria
Where are all the glusterfs users in this thread... There are at least a couple of folks here using such a system? Any comments on how it's working out for you?
Ed W
On 14 January 2011 17:06, Ed W <lists@wildgooses.com> wrote:
Where are all the glusterfs users in this thread... There are at least a couple of folks here using such a system? Any comments on how it's working out for you?
Not on production systems yet. Still getting bugs fixed.
-Naresh V.
I've actually been reading up on OCFS2 and it looks quite promising. According to this presentation:
http://www.gpaterno.com/publications/2010/dublin_ossbarcamp_2010_fs_comparis...
OCFS2 seems to work quite well with lots of small files (typical of maildir). I'm guessing that since OCFS2 reboots a system automatically, it doesn't require any additional fencing?
I did not see that in my tests; chmod takes really long, and rm on files takes even longer.
How many users are we talking about?
And we are not using any fencing device with OCFS2, because it reboots when anything goes wrong.
[]'sf.rique
On Fri, Jan 14, 2011 at 9:36 AM, Ed W <lists@wildgooses.com> wrote:
Where are all the glusterfs users in this thread... There are at least a couple of folks here using such a system? Any comments on how it's working out for you?
Ed W
participants (17)
- alex handle
- Andre Nathan
- Brad Davidson
- Ed W
- Eric Rostetter
- Eric Shubert
- Henrique Fernandes
- Jan-Frode Myklebust
- John Moorhouse
- Jonathan Tripathy
- Luben Karavelov
- Naresh V
- Patrick Ben Koetter
- Patrick Westenberg
- Robert Schetterer
- Spyros Tsiolis
- Stan Hoeppner