[Dovecot] Providing shared folders with multiple backend servers
Hi *,
I am currently in the planning stage for a "new and improved" mail system at my university.
Right now, everything is on one big backend server but this is causing me increasing amounts of pain, beginning with the time a full backup takes.
So naturally, I want to split this big server into smaller ones.
To keep things simple, I want to pin each user to a server so I can avoid things like NFS or cluster-aware filesystems. The mapping for each account is then stored in the LDAP object for each user, and the frontend proxy (perdition at the moment) uses this information to route each access to the correct backend storage server running Dovecot.
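Roughly, each user's entry would just carry a routing attribute the proxy can query; an illustrative LDIF sketch (attribute and host names are made up, not my actual schema):
,----
| # illustrative only -- attribute and host names are invented for this example
| dn: uid=jdoe,ou=people,dc=example,dc=de
| objectClass: inetOrgPerson
| uid: jdoe
| mail: jdoe@example.de
| # the proxy looks up this attribute and routes jdoe's IMAP/POP sessions accordingly
| mailHost: backend-02.example.de
`----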
So far this has been working nicely with my test setup.
But: I also have to provide shared folders for users. Thankfully users don't have the right to share their own folders, which makes things easier (I hope).
Right now, the setup works like this, using Courier:
- complete virtual mail setup
- global shared folders configured in /etc/courier/shared/index
- inside /home/shared-folder-name/Maildir/courierimapacl, specific users get access to a folder
- each folder a user has access to is mapped to the namespace #shared, like #shared.shared-folder-name
Now, if I split my backend storage server into multiple ones and user-A is on server-1 and user-B is on server-2, but both need to access the same shared folder, I have a problem.
I could of course move all users needing access to a shared folder to the same server, but in the end this would be a nightmare for me, because I foresee having to move users around on a daily basis.
Right now, I am pondering using an additional server with just the shared folders on it and using NFS (or a cluster FS) to mount the shared folder filesystem on each backend storage server, so each user has potential access to a shared folder's data.
Ideas? Suggestions? Nudges in the right direction?
Grüße, Sven.
-- Sigmentation fault. Core dumped.
On 1/7/2012 4:20 PM, Sven Hartge wrote:
Hi *,
I am currently in the planning stage for a "new and improved" mail system at my university.
Right now, everything is on one big backend server but this is causing me increasing amounts of pain, beginning with the time a full backup takes.
You failed to mention your analysis and diagnosis identifying the source of the slow backup, and other issues you alluded to but didn't mention specifically. You also didn't mention how you're doing this full backup (tar, IMAP; D2D or tape), where the backup bottleneck is, what mailbox storage format you're using, total mailbox count and filesystem space occupied. What is your disk storage configuration? Direct attach? Hardware or software RAID? What RAID level? How many disks? SAS or SATA?
It's highly likely your problems can be solved without the drastic architecture change, and new problems it will introduce, that you describe below.
So naturally, I want to split this big server into smaller ones.
Naturally? Many OPs spend significant time, money, and effort trying to avoid the "shared nothing" storage backend setup below.
To keep things simple, I want to pin each user to a server so I can avoid things like NFS or cluster-aware filesystems. The mapping for each account is then stored in the LDAP object for each user, and the frontend proxy (perdition at the moment) uses this information to route each access to the correct backend storage server running Dovecot.
Splitting the IMAP workload like this isn't keeping things simple, but increases complexity, on many levels. And there's nothing wrong with NFS and cluster filesystems if they are used correctly.
So far this has been working nicely with my test setup.
But: I also have to provide shared folders for users. Thankfully users don't have the right to share their own folders, which makes things easier (I hope).
Right now, the setup works like this, using Courier:
- complete virtual mail setup
- global shared folders configured in /etc/courier/shared/index
- inside /home/shared-folder-name/Maildir/courierimapacl, specific users get access to a folder
- each folder a user has access to is mapped to the namespace #shared, like #shared.shared-folder-name
Now, if I split my backend storage server into multiple ones and user-A is on server-1 and user-B is on server-2, but both need to access the same shared folder, I have a problem.
Yes, you do.
I could of course move all users needing access to a shared folder to the same server, but in the end this would be a nightmare for me, because I foresee having to move users around on a daily basis.
See my comments above.
Right now, I am pondering using an additional server with just the shared folders on it and using NFS (or a cluster FS) to mount the shared folder filesystem on each backend storage server, so each user has potential access to a shared folder's data.
So you're going to implement a special case of what you're desperately trying to avoid? This makes no sense.
Ideas? Suggestions? Nudges in the right direction?
Yes. We need more real information. Please provide:
- Mailbox count, total maildir file count and size
- Average/peak concurrent user connections
- CPU type/speed/total core count, total RAM, free RAM (incl buffers)
- Storage configuration--total spindles, RAID level, hard or soft RAID
- Filesystem type
- Backup software/method
- Operating system
Instead of telling us what you think the solution to your unidentified bottleneck is and then asking "yeah or nay", tell us what the problem is and allow us to recommend solutions. This way you'll get some education and multiple solutions that may very well be a better fit, will perform better, and possibly cost less in capital outlay and administration time/effort.
-- Stan
Stan Hoeppner stan@hardwarefreak.com wrote:
It's highly likely your problems can be solved without the drastic architecture change, and new problems it will introduce, that you describe below.
The main reason is I need to replace the hardware as its service contract ends this year and I am not able to extend it further.
The box so far is fine, there are normally no problems during normal operations with speed or responsiveness towards the end-user.
Sometimes, higher peak loads tend to strain the system a bit and this is starting to occur more often.
First thought was to move this setup into our VMware cluster (yeah, I know, spare me the screams), since the hardware used there is way more powerful than the hardware used now and I wouldn't have to buy new servers for my mail system (which is kind of painful to do in a university environment, especially in Germany, if you want to invest an amount of money above a certain threshold).
But then I thought about the problems with VMs this size and got to the idea with the distributed setup, splitting the one server into 4 or 6 backend servers.
As I said: "idea". Other ideas making my life easier are more than welcome.
Ideas? Suggestions? Nudges in the right direction?
Yes. We need more real information. Please provide:
- Mailbox count, total maildir file count and size
about 10,000 Maildir++ boxes
900GB for 1300GB used, "df -i" says 11 million inodes used
I know, this is very _tiny_ compared to the systems ISPs are using.
- Average/peak concurrent user connections
IMAP: Average 800 concurrent user connections, peaking at about 1400. POP3: Average 300 concurrent user connections, peaking at about 600.
- CPU type/speed/total core count, total RAM, free RAM (incl buffers)
Currently dual-core AMD Opteron 2210, 1.8GHz.
Right now, in the middle of the night (2:30 AM here) on a Sunday, thus a low point in the usage pattern:
             total       used       free     shared    buffers     cached
Mem:      12335820    9720252    2615568          0      53112     680424
-/+ buffers/cache:    8986716    3349104
Swap:      5855676      10916    5844760
The system reaches its 7th year this summer, which is the end of its service contract.
- Storage configuration--total spindles, RAID level, hard or soft RAID
RAID 6 with 12 SATA1.5 disks, external 4Gbit FC
Back in 2005, a SAS enclosure was way too expensive for us to afford.
- Filesystem type
XFS in a LVM to allow snapshots for backup
I of course aligned the partitions on the RAID correctly and of course created a filesystem with the correct parameters w.r.t. spindles, chunk size, etc.
- Backup software/method
Full backup with Bacula, taking about 24 hours right now. Because of this, I switched to virtual full backups, only ever doing incremental and differential backups off of the real system and creating synthetic full backups inside Bacula. Works fine though, incrementals taking 2 hours, differentials about 4 hours.
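The schedule behind this is nothing fancy; a rough sketch, not my literal config (names and times are placeholders, and the VirtualFull level additionally needs a "Next Pool" configured on the pool):
,----
| # rough sketch only -- names/times are placeholders
| Schedule {
|   Name = "MailCycle"
|   # synthetic full, consolidated inside Bacula from earlier backups
|   Run = Level=VirtualFull 1st sun at 03:00
|   Run = Level=Differential 2nd-5th sun at 03:00
|   Run = Level=Incremental mon-sat at 03:00
| }
`----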
The main problem with the backup time is Maildir++. During a test, I copied the mail storage to a spare box, converted it to mdbox (50MB file size) and the backup was lightning fast compared to the Maildir++ format.
Additionally, compressing the mails inside the mdbox and not having Bacula compress them for me reduces the backup time further (and speeds up access through IMAP and POP3).
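For the record, the settings I tested with look roughly like this (a sketch assuming Dovecot 2.0+; paths and the 50MB rotate size are just what I used on the spare box):
,----
| # sketch, assuming Dovecot 2.0+; paths/values are from my test only
| mail_location = mdbox:~/mdbox
| mdbox_rotate_size = 50M
|
| # store mails compressed via the zlib plugin instead of letting Bacula compress them
| mail_plugins = $mail_plugins zlib
| plugin {
|   zlib_save = gz
|   zlib_save_level = 6
| }
`----
The per-user conversion itself can be done with dsync; if I remember the wiki example correctly it is something like "dsync -u username mirror maildir:~/Maildir", run with mail_location already pointing at the new mdbox location.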
So this is the way to go, I think, regardless of which way I implement the backend mail server.
- Operating system
Debian Linux Lenny, currently with kernel 2.6.39
Instead of telling us what you think the solution to your unidentified bottleneck is and then asking "yeah or nay", tell us what the problem is and allow us to recommend solutions.
I am not asking for "yay or nay", I just pointed out my idea, but I am open to other suggestions.
If the general idea is to buy a new big single storage system, I am more than happy to do just this, because this will prevent any problems I might have with a distributed one before they even can occur.
Maybe two HP DL180s (one for production and one as a test/standby system) with a SAS-attached enclosure for storage?
Keeping in mind the new system has to work for some time (again 5 to 7 years), I have to be able to extend the storage space without too much hassle.
Grüße, S°
-- Sigmentation fault. Core dumped.
On 1/7/2012 7:55 PM, Sven Hartge wrote:
Stan Hoeppner stan@hardwarefreak.com wrote:
It's highly likely your problems can be solved without the drastic architecture change, and new problems it will introduce, that you describe below.
The main reason is I need to replace the hardware as its service contract ends this year and I am not able to extend it further.
The box so far is fine, there are normally no problems during normal operations with speed or responsiveness towards the end-user.
Sometimes, higher peak loads tend to strain the system a bit and this is starting to occur more often. ... First thought was to move this setup into our VMware cluster (yeah, I know, spare me the screams), since the hardware used there is way more powerful than the hardware used now and I wouldn't have to buy new servers for my mail system (which is kind of painful to do in a university environment, especially in Germany, if you want to invest an amount of money above a certain threshold).
What's wrong with moving it onto VMware? This actually seems like a smart move given your description of the node hardware. It also gives you much greater backup flexibility with VCB (or whatever they call it today). You can snapshot the LUN over the SAN during off peak hours to a backup server and do the actual backup to the library at your leisure. Forgive me if the software names have changed as I've not used VMware since ESX3 back in 07.
But then I thought about the problems with VMs this size and got to the idea with the distributed setup, splitting the one server into 4 or 6 backend servers.
Not sure what you mean by "VMs this size". Do you mean memory requirements or filesystem size? If the nodes have enough RAM that's no issue. And surely you're not thinking of using a .vmdk for the mailbox storage. You'd use an RDM SAN LUN. In fact you should be able to map in the existing XFS storage LUN and use it as is. Assuming it's not going into retirement as well.
If an individual VMware node doesn't have sufficient RAM, you could build a VM based Dovecot cluster: run these two VMs on separate nodes, and thin out the other VMs allowed to run on these nodes. Since you can't directly share XFS, build a tiny Debian NFS server VM and map the XFS LUN to it, then export the filesystem to the two Dovecot VMs. You could install the Dovecot director on this NFS server VM as well. Converting from maildir to mdbox should help eliminate the NFS locking problems. I would do the conversion before migrating to this VM setup with NFS.
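A minimal sketch of the export on that NFS VM, just to show the shape of it (host and path names are hypothetical):
,----
| # /etc/exports on the NFS server VM -- host/path names are hypothetical
| /srv/mail  dovecot1.example.edu(rw,sync,no_subtree_check,no_root_squash) dovecot2.example.edu(rw,sync,no_subtree_check,no_root_squash)
`----
Then "exportfs -ra" to publish it and mount it from the two Dovecot VMs.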
Also, run the NFS server VM on the same physical node as one of the Dovecot servers. The NFS traffic will be a memory-to-memory copy instead of going over the GbE wire, decreasing IO latency and increasing performance for that Dovecot server. If it's possible to have Dovecot director or your favorite load balancer weight more connections to one Dovecot node, funnel 10-15% more connections to this one. (I'm no director guru; in fact I haven't used it yet.)
Assuming the CPUs in the VMware cluster nodes are clocked a decent amount higher than 1.8GHz I wouldn't monkey with configuring virtual smp for these two VMs, as they'll be IO bound not CPU bound.
As I said: "idea". Other ideas making my life easier are more than welcome.
I hope my suggestions contribute to doing so. :)
Ideas? Suggestions? Nudges in the right direction?
Yes. We need more real information. Please provide:
- Mailbox count, total maildir file count and size
about 10,000 Maildir++ boxes
900GB for 1300GB used, "df -i" says 11 million inodes used
Converting to mdbox will take a large burden off your storage, as you've seen. With ~1.3TB consumed of ~15TB you should have plenty of space to convert to mdbox while avoiding filesystem fragmentation. With maildir you likely didn't see heavy fragmentation due to small file sizes. With mdbox, especially at 50MB, you'll likely start seeing more fragmentation. Use this to periodically check the fragmentation level:
$ xfs_db -r -c frag [device]

e.g.

$ xfs_db -r -c frag /dev/sda7
actual 76109, ideal 75422, fragmentation factor 0.90%
I'd recommend running xfs_fsr when the frag factor exceeds ~20-30%. The XFS developers recommend against running xfs_fsr too often as it can actually increase free space fragmentation while it decreases file fragmentation, especially on filesystems that are relatively full. Having heavily fragmented free space is worse than having fragmented files, as newly created files will automatically be fragmented.
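When you do need to run it, it's a one-liner; something like this, where -t caps the runtime in seconds (the device path is just the earlier example):
,----
| # defragment the mounted XFS filesystem, capped at 2 hours per run
| $ xfs_fsr -v -t 7200 /dev/sda7
`----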
I know, this is very _tiny_ compared to the systems ISPs are using.
Not everyone is an ISP, including me. :)
- Average/peak concurrent user connections
IMAP: Average 800 concurrent user connections, peaking at about 1400. POP3: Average 300 concurrent user connections, peaking at about 600.
- CPU type/speed/total core count, total RAM, free RAM (incl buffers)
Currently dual-core AMD Opteron 2210, 1.8GHz.
Heheh, yeah, a bit long in the tooth, but not horribly underpowered for 1100 concurrent POP/IMAP users. Though this may be the reason for the sluggishness when you hit that 2000 concurrent user peak. Any chance you have some top output for the peak period?
Right now, in the middle of the night (2:30 AM here) on a Sunday, thus a low point in the usage pattern:
             total       used       free     shared    buffers     cached
Mem:      12335820    9720252    2615568          0      53112     680424
-/+ buffers/cache:    8986716    3349104
Swap:      5855676      10916    5844760
Ugh... "-m" and "-g" options exist for a reason. :) So this box has 12GB RAM, currently ~2.5GB free during off peak hours. It would be interesting to see free RAM and swap usage values during peak. That would tell us whether we're CPU or RAM starved. If both turned up clean then we'd need to look at iowait. If you're not RAM starved then moving to VMware nodes with 16/24/32GB RAM should work fine, as long as you don't stack many other VMs on top. Enabling memory dedup may help a little.
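Something as simple as this, captured during the peak window, would tell us what we need (standard tools; iostat comes from the sysstat package):
,----
| $ free -m          # RAM and swap usage in MB
| $ vmstat 5 5       # run queue, swap in/out, CPU iowait
| $ iostat -x 5 3    # per-device utilization and await
`----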
The system reaches its 7th year this summer, which is the end of its service contract.
Enjoy your retirement old workhorse. :)
- Storage configuration--total spindles, RAID level, hard or soft RAID
RAID 6 with 12 SATA1.5 disks, external 4Gbit FC
I assume this means a LUN on a SAN array somewhere on the other end of that multi-mode cable, yes? Can you tell us what brand/model the box is?
Back in 2005, a SAS enclosure was way too expensive for us to afford.
How one affords an FC SAN array but not a less expensive direct attach SAS enclosure is a mystery... :)
- Filesystem type
XFS in a LVM to allow snapshots for backup
XFS is the only way to fly, IMNSHO.
I of course aligned the partitions on the RAID correctly and of course created a filesystem with the correct parameters w.r.t. spindles, chunk size, etc.
Which is critical for mitigating the RMW penalty of parity RAID. Speaking of which, why RAID6 for maildir? Given that your array is 90% vacant, why didn't you go with RAID10 for 3-5 times the random write performance?
- Backup software/method
Full backup with Bacula, taking about 24 hours right now. Because of this, I switched to virtual full backups, only ever doing incremental and differential backups off of the real system and creating synthetic full backups inside Bacula. Works fine though, incrementals taking 2 hours, differentials about 4 hours.
Move to VMware and use VCB. You'll fall in love.
The main problem with the backup time is Maildir++. During a test, I copied the mail storage to a spare box, converted it to mdbox (50MB file size) and the backup was lightning fast compared to the Maildir++ format.
Well of course. You were surprised by this? How long has it been since you used mbox? mbox backs up even faster than mdbox. Why? Larger files and fewer of them. Which means the disks can actually do streaming reads, and don't have to beat their heads to death jumping all over the platters to read maildir files, which are scattered all over the place when created. Which is why maildir is described as a "random" IO workload.
Additionally, compressing the mails inside the mdbox and not having Bacula compress them for me reduces the backup time further (and speeds up access through IMAP and POP3).
Again, no surprise here. When files exist on disk already compressed it takes less IO bandwidth to read the file data for a given actual file size. So if you have say 10MB files that compress down to 5MB, you can read twice as many files when the pipe is saturated, twice as much file data.
So this is the way to go, I think, regardless of which way I implement the backend mail server.
Which is why I asked my questions. :) mdbox would have been one of my recommendations, but you already discovered it.
- Operating system
Debian Linux Lenny, currently with kernel 2.6.39
:) Debian, XFS, Dovecot, FC SAN storage--I like your style. Lenny with 2.6.39? Is that a backport or rolled kernel? Not Squeeze? Interesting. I'm running Squeeze with rolled vanilla 2.6.38.6. It's been about 6 months so it's 'bout time I roll a new one. :)
Instead of telling us what you think the solution to your unidentified bottleneck is and then asking "yeah or nay", tell us what the problem is and allow us to recommend solutions.
I am not asking for "yay or nay", I just pointed out my idea, but I am open to other suggestions.
I think you've already discovered the best suggestions on your own.
If the general idea is to buy a new big single storage system, I am more than happy to do just this, because this will prevent any problems I might have with a distributed one before they even can occur.
One box is definitely easier to administer and troubleshoot. Though I must say that even though it's more complex, I think the VM architecture I described is worth a serious look. If your current 12x1.5TB SAN array is being retired as well, you could piggy back onto the array(s) feeding the VMware farm, or expand them if necessary/possible. Adding drives is usually much cheaper than buying a new populated array chassis. Given your service contract comments it's unlikely you're the type to build your own servers. Being a hardwarefreak, I nearly always build my servers and storage from scratch. This may be worth a look merely for educational purposes. I just happened to have finished spec'ing out a new high volume 20TB IMAP server recently which should handle 5000 concurrent users without breaking a sweat, for only ~$7500 USD:
Full parts list: http://secure.newegg.com/WishList/PublicWishDetail.aspx?WishListNumber=17069...
Summary:
- 2GHz 8-core 12MB L3 cache Magny Cours Opteron
- SuperMicro MBD-H8SGL-O w/32GB qualified quad channel reg ECC DDR3/1333
- dual Intel 82574 GbE ports
- LSI 512MB PCIe 2.0 x8 RAID, 24 port SAS expander
- 20x 1TB 7.2k WD RE4
- 20 bay SAS/SATA 6G hot swap Norco chassis
Create a RAID1 pair for /boot, the root filesystem, swap partition of say 8GB, 2GB partition for external XFS log, should have ~900GB left for utilitarian purposes. Configure two spares. Configure the remaining 16 drives as RAID10 with a 64KB stripe size (8KB, 16 sector strip size), yielding 8TB raw for the XFS backed mdbox mailstore. Enable the BBWC write cache (dang, forgot the battery module, +$175). This should yield approximately 8*150 = 1200 IOPS peak to/from disk, many thousands to BBWC, more than plenty for 5000 concurrent users given the IO behavior of most MUAs. Channel bond the NICs to the switch or round robin DNS the two IPs if pathing for redundancy.
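For the mailstore filesystem that works out to roughly the following (device names are hypothetical; su/sw reflect the 8KB strip across the 8 data spindles of the 16 drive RAID10, and the external log lives on the 2GB partition carved from the RAID1 pair, sized well under the XFS log maximum):
,----
| # hypothetical device names; 8KB strip x 8 data spindles = 64KB stripe
| $ mkfs.xfs -d su=8k,sw=8 -l logdev=/dev/sda3,size=512m /dev/sdb
| $ mount -o logdev=/dev/sda3,noatime /dev/sdb /srv/mail
`----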
What's that? You want to support 10K users? Simply drop in another 4 sticks of the 8GB Kingston Reg ECC RAM for 64GB total, and plug one of these into the external SFF8088 port on the LSI card: http://www.newegg.com/Product/Product.aspx?Item=N82E16816133047 populated with 18 of the 1TB RE4 drives. Configure 16 drives the same as the primary array, grow it into your existing XFS. Since you have two identical arrays comprising the filesystem, sunit/swidth values are still valid so you don't need to add mount options. Configure 2 drives as hot spares. The additional 16 drive RAID10 doubles our disk IOPS to ~2400, maintaining our concurrent user to IOPS ratio at ~4:1, and doubles our mail storage to ~16TB.
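Once the controller presents the grown volume to the OS, extending the filesystem is an online one-liner (mount point hypothetical):
,----
| # grow the mounted XFS filesystem to fill the enlarged device
| $ xfs_growfs /srv/mail
`----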
This expansion hardware will run an additional ~$6200. Grand total to support ~10K concurrent users (maybe more) with a quality DIY build is just over $14K USD, or ~$1.40 per mailbox. Not too bad for an 8-core, 64GB server with 32TB of hardware RAID10 mailbox storage and 38 total 1TB disks. I haven't run the numbers for a comparable HP system, but an educated guess says it would be quite a bit more expensive, not the server so much, but the storage. HP's disk drive prices are outrageous, though not approaching anywhere near the level of larceny EMC commits with its drive sales. $2400 for a $300 Seagate drive wearing an EMC cape? Please....
Maybe two HP DL180s (one for production and one as a test/standby system) with a SAS-attached enclosure for storage?
If you're hooked on 1U chassis (I hate em) go with the DL165 G7. If not I'd go 2U, the DL385 G7. Magny Cours gives you more bang for the buck in this class of machines. The performance is excellent, and, if everybody buys Intel, AMD goes bankrupt, and then Chipzilla charges whatever it desires. They've already been sanctioned, and fined by the FTC at least twice. They paid Intergraph $800 million in an antitrust settlement in 2000 after they forced them out of the hardware business. They recently paid AMD $1 Billion in an antitrust settlement. They're just like Microsoft, putting competitors out of business by any and all means necessary, even if their conduct is illegal. Yes, I'd much rather give AMD my business, given they had superior CPUs to Intel for many years, and their current chips are still more than competitive. /end rant. ;)
Keeping in mind the new system has to work for some time (again 5 to 7 years), I have to be able to extend the storage space without too much hassle.
Given you're currently only using ~1.3TB of ~15TB do you really see this as an issue? Will you be changing your policy or quotas? Will the university double its enrollment? If not I would think a new 12-16TB raw array would be more than plenty.
If you really want growth potential get a SATABeast and start with 14 2TB SATA drives. You'll still have 28 empty SAS/SATA slots in the 4U chassis, 42 total. Max capacity is 84TB. You get dual 8Gb/s FC LC ports and dual GbE iSCSI ports per controller, all ports active, two controllers max. The really basic SKU runs about $20-25K USD with the single controller and a few small drives, before institutional/educational discounts. www.nexsan.com/satabeast
I've used the SATABlade and SATABoy models (8 and 14 drives) and really like the simplicity of design and the httpd management interface. Good products, and one of the least expensive and feature rich in this class.
Sorry this was so windy. I am the hardwarefreak after all. :)
-- Stan
Stan Hoeppner stan@hardwarefreak.com wrote:
On 1/7/2012 7:55 PM, Sven Hartge wrote:
Stan Hoeppner stan@hardwarefreak.com wrote:
It's highly likely your problems can be solved without the drastic architecture change, and new problems it will introduce, that you describe below.
The main reason is I need to replace the hardware as its service contract ends this year and I am not able to extend it further.
The box so far is fine, there are normally no problems during normal operations with speed or responsiveness towards the end-user.
Sometimes, higher peak loads tend to strain the system a bit and this is starting to occur more often. ... First thought was to move this setup into our VMware cluster (yeah, I know, spare me the screams), since the hardware used there is way more powerful than the hardware used now and I wouldn't have to buy new servers for my mail system (which is kind of painful to do in a university environment, especially in Germany, if you want to invest an amount of money above a certain threshold).
What's wrong with moving it onto VMware? This actually seems like a smart move given your description of the node hardware. It also gives you much greater backup flexibility with VCB (or whatever they call it today). You can snapshot the LUN over the SAN during off peak hours to a backup server and do the actual backup to the library at your leisure. Forgive me if the software names have changed as I've not used VMware since ESX3 back in 07.
VCB as it was back in the day is dead. But yes, one of the reasons to use a VM was to be able to easily back up the whole shebang.
But then I thought about the problems with VMs this size and got to the idea with the distributed setup, splitting the one server into 4 or 6 backend servers.
Not sure what you mean by "VMs this size". Do you mean memory requirements or filesystem size? If the nodes have enough RAM that's no issue.
Memory size. I am a bit hesitant to deploy a VM with 16GB of RAM. My cluster nodes each have 48GB, so no problem on this side though.
And surely you're not thinking of using a .vmdk for the mailbox storage. You'd use an RDM SAN LUN.
No, I was not planning to use a VMDK backed disk for this.
In fact you should be able to map in the existing XFS storage LUN and use it as is. Assuming it's not going into retirement as well.
It is going to be retired as well, as it is as old as the server.
It is also not connected to any SAN; it is local to the backend server.
And our VMware SAN is iSCSI based, so there is no way to plug FC-based storage into it.
If an individual VMware node doesn't have sufficient RAM, you could build a VM based Dovecot cluster: run these two VMs on separate nodes, and thin out the other VMs allowed to run on these nodes. Since you can't directly share XFS, build a tiny Debian NFS server VM and map the XFS LUN to it, then export the filesystem to the two Dovecot VMs. You could install the Dovecot director on this NFS server VM as well. Converting from maildir to mdbox should help eliminate the NFS locking problems. I would do the conversion before migrating to this VM setup with NFS.
Also, run the NFS server VM on the same physical node as one of the Dovecot servers. The NFS traffic will be a memory-to-memory copy instead of going over the GbE wire, decreasing IO latency and increasing performance for that Dovecot server. If it's possible to have Dovecot director or your favorite load balancer weight more connections to one Dovecot node, funnel 10-15% more connections to this one. (I'm no director guru; in fact I haven't used it yet.)
So, this reads like my idea in the first place.
Only you place all the mails on the NFS server, whereas my idea was to just share the shared folders from a central point and keep the normal user dirs local to the different nodes, thus reducing network impact for the way more common user access.
Assuming the CPUs in the VMware cluster nodes are clocked a decent amount higher than 1.8GHz I wouldn't monkey with configuring virtual smp for these two VMs, as they'll be IO bound not CPU bound.
2.3GHz for most VMware nodes.
Ideas? Suggestions? Nudges in the right direction?
Yes. We need more real information. Please provide:
- Mailbox count, total maildir file count and size
about 10,000 Maildir++ boxes
900GB for 1300GB used, "df -i" says 11 million inodes used
Converting to mdbox will take a large burden off your storage, as you've seen. With ~1.3TB consumed of ~15TB you should have plenty of space to convert to mdbox while avoiding filesystem fragmentation.
You got the numbers wrong. And I got a word wrong ;)
Should have read "900GB _of_ 1300GB used".
I am using 900GB of 1300GB. The disks are SATA 1.5Gbit/s (not SATA 3Gbit/s or 6Gbit/s); the "1.5" refers to the data transfer rate. Each disk is 150GB in size, so the maximum size of my underlying VG is 1500GB.
root@ms1:~# vgs
VG #PV #LV #SN Attr VSize VFree
vg01 1 6 0 wz--n- 70.80G 40.80G
vg02 1 1 0 wz--n- 1.45T 265.00G
vg03 1 1 0 wz--n- 1.09T 0
Filesystem                  Size  Used Avail Use% Mounted on
/dev/mapper/vg02-home_lv    1.2T  867G  357G  71% /home
/dev/mapper/vg03-backup_lv  1.1T  996G  122G  90% /backup
So not much wiggle room left.
But modifications to our systems have been made which allow me to temp-disable a user, convert and move his mailbox, and re-enable him. This allows me to move users one at a time from the old system to the new one, without losing a mail or disrupting service too long or too often.
Right now, in the middle of the night (2:30 AM here) on a Sunday, thus a low point in the usage pattern:
             total       used       free     shared    buffers     cached
Mem:      12335820    9720252    2615568          0      53112     680424
-/+ buffers/cache:    8986716    3349104
Swap:      5855676      10916    5844760
Ugh... "-m" and "-g" options exist for a reason. :) So this box has 12GB RAM, currently ~2.5GB free during off peak hours. It would be interesting to see free RAM and swap usage values during peak. That would tell us whether we're CPU or RAM starved. If both turned up clean then we'd need to look at iowait. If you're not RAM starved then moving to VMware nodes with 16/24/32GB RAM should work fine, as long as you don't stack many other VMs on top. Enabling memory dedup may help a little.
Well, peak hours are somewhere between 10:00 and 14:00. I will check then.
The system reaches its 7th year this summer, which is the end of its service contract.
Enjoy your retirement old workhorse. :)
- Storage configuration--total spindles, RAID level, hard or soft RAID
RAID 6 with 12 SATA1.5 disks, external 4Gbit FC
I assume this means a LUN on a SAN array somewhere on the other end of that multi-mode cable, yes? Can you tell us what brand/model the box is?
This is a Transtec Provigo 610. This is a 24 disk enclosure, with 12 disks of 150GB (7,200 rpm) each for the main mail storage in RAID6 and another 10 disks of 150GB (5,400 rpm) for a backup LUN. I daily rsnapshot my /home onto this local backup (20 days of retention), because it is easier to restore from than firing up Bacula, which has the long retention time of 90 days. But most users need a restore of mails from $yesterday or $the_day_before.
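Stripped down, the rsnapshot side of this is roughly the following (a sketch, not my literal config; fields are tab-separated, and older rsnapshot versions call "retain" "interval"):
,----
| # sketch of the relevant rsnapshot.conf bits (tab-separated fields)
| snapshot_root   /backup/rsnapshot/
| retain          daily   20
| backup          /home/  localhost/
`----
plus a nightly "rsnapshot daily" run from cron.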
Back in 2005, a SAS enclosure was way too expensive for us to afford.
How one affords an FC SAN array but not a less expensive direct attach SAS enclosure is a mystery... :)
Well, it was either Parallel-SCSI or FC back then, as far as I can remember. The price difference between the U320 version and the FC one was not so big, and I wanted to avoid having to route those big SCSI-U320 cables through my racks.
- Filesystem type
XFS in a LVM to allow snapshots for backup
XFS is the only way to fly, IMNSHO.
I of course aligned the partitions on the RAID correctly and of course created a filesystem with the correct parameters w.r.t. spindles, chunk size, etc.
Which is critical for mitigating the RMW penalty of parity RAID. Speaking of which, why RAID6 for maildir? Given that your array is 90% vacant, why didn't you go with RAID10 for 3-5 times the random write performance?
See above, not 1500GB disks, but 150GB ones. RAID6, because I wanted the double security. I have been kind of burned by the previous system, and I tend to get nervous when thinking about data loss in my mail storage, because I know my users _will_ give me hell if that happens.
- Backup software/method
Full backup with Bacula, taking about 24 hours right now. Because of this, I switched to virtual full backups, only ever doing incremental and differential backups off of the real system and creating synthetic full backups inside Bacula. Works fine though, incrementals taking 2 hours, differentials about 4 hours.
Move to VMware and use VCB. You'll fall in love.
The main problem with the backup time is Maildir++. During a test, I copied the mail storage to a spare box, converted it to mdbox (50MB file size) and the backup was lightning fast compared to the Maildir++ format.
Well of course. You were surprised by this?
No, I was not surprised by the speedup itself, I _knew_ mdbox would back up faster; I was just surprised by how big the difference was. That a backup of 100 big files is faster than a backup of 100,000 little files is not exactly rocket science.
How long has it been since you used mbox? mbox backs up even faster than mdbox. Why? Larger files and fewer of them. Which means the disks can actually do streaming reads, and don't have to beat their heads to death jumping all over the platters to read maildir files, which are scattered all over the place when created. Which is why maildir is described as a "random" IO workload.
I never used mbox as an admin. The box before the box before this one used uw-imapd with mbox, and I experienced that system as a user; it was horrific. Most users back then had never heard of IMAP folders and just stored their mails inside INBOX, which of course got huge. If one of those users with a big mbox then deleted mails, it would literally lock the box up for everyone, as uw-imapd was copying (for example) a 600MB mbox file around to delete one mail.
Of course, this was mostly because of the crappy uw-imapd and secondly because of some poor design choices in the server itself (underpowered RAID controller, too small a cache, a RAID5 setup, low RAM in the server).
So the first thing we did back then, in 2004, was to change to Courier and convert from mbox to maildir, which made the mailsystem fly again, even on the same hardware, only the disk setup changed to RAID10.
Then we bought new hardware (the one previous to the current one), this time with more RAM, better RAID controller, smarter disk setup. We outgrew this one really fast and a disk upgrade was not possible; it lasted only 2 years.
So the next one got this external 24 disk array with 12 disks used at deployment.
But Courier is showing its age and things like Sieve are only possible with great pain, so I want to avoid it.
So this is the way to go, I think, regardless of which way I implement the backend mail server.
Which is why I asked my questions. :) mdbox would have been one of my recommendations, but you already discovered it.
And this is why I am sort of holding this upgrade back until Dovecot 2.1 is released, as it has some optimizations here.
- Operating system
Debian Linux Lenny, currently with kernel 2.6.39
:) Debian, XFS, Dovecot, FC SAN storage--I like your style. Lenny with 2.6.39? Is that a backport or rolled kernel? Not Squeeze?
That is a backports (bpo) kernel. Not yet Squeeze. I admin over 150 different systems here, plus I am the main VMware and SAN admin. So upgrades take some time until I grow an extra pair of eyes and arms. ;)
And since I have been planning to re-implement the mailsystem for some time now, I held the update to the storage backends back. No use in disrupting service for the end user if I'm going to replace the whole thing with a new one in the end.
Instead of telling us what you think the solution to your unidentified bottleneck is and then asking "yeah or nay", tell us what the problem is and allow us to recommend solutions.
I am not asking for "yay or nay", I just pointed out my idea, but I am open to other suggestions.
I think you've already discovered the best suggestions on your own.
If the general idea is to buy a new big single storage system, I am more than happy to do just this, because this will prevent any problems I might have with a distributed one before they even can occur.
One box is definitely easier to administer and troubleshoot. Though I must say that even though it's more complex, I think the VM architecture I described is worth a serious look. If your current 12x1.5TB SAN array is being retired as well, you could piggy back onto the array(s) feeding the VMware farm, or expand them if necessary/possible. Adding drives is usually much cheaper than buying a new populated array chassis. Given your service contract comments it's unlikely you're the type to build your own servers. Being a hardwarefreak, I nearly always build my servers and storage from scratch.
Naa, I have been doing this for too long. While I am perfectly capable of building such a server myself, I am now the kind of guy who wants to "yell" at a vendor, when their hardware fails.
Which does not mean I am using any "Express" package or preconfigured server; I still read the specs and pick the parts which make the most sense for a job and then have that one custom built by HP or IBM or Dell or ...
Personally built PCs and servers assembled from individual parts have been nothing but a nightmare for me. And: my coworkers need to be able to service them as well while I am not available, and they are not the hardware aficionados that I am.
So "professional" hardware with a 5 to 7 year support contract is the way to go for me.
If you're hooked on 1U chassis (I hate em) go with the DL165 G7. If not I'd go 2U, the DL385 G7. Magny Cours gives you more bang for the buck in this class of machines.
I have plenty of space for 2U systems and already use DL385 G7s. I am not fixed on Intel or AMD; I'll gladly use whichever is the best fit for a given job.
Grüße, Sven
-- Sigmentation fault. Core dumped.
Sven Hartge sven@svenhartge.de wrote:
Stan Hoeppner stan@hardwarefreak.com wrote:
If an individual VMware node doesn't have sufficient RAM, you could build a VM based Dovecot cluster: run these two VMs on separate nodes, and thin out the other VMs allowed to run on these nodes. Since you can't directly share XFS, build a tiny Debian NFS server VM and map the XFS LUN to it, then export the filesystem to the two Dovecot VMs. You could install the Dovecot director on this NFS server VM as well. Converting from maildir to mdbox should help eliminate the NFS locking problems. I would do the conversion before migrating to this VM setup with NFS.
Also, run the NFS server VM on the same physical node as one of the Dovecot servers. The NFS traffic will be a memory-to-memory copy instead of going over the GbE wire, decreasing IO latency and increasing performance for that Dovecot server. If it's possible to have Dovecot director or your favorite load balancer weight more connections to one Dovecot node, funnel 10-15% more connections to this one. (I'm no director guru; in fact I haven't used it yet.)
So, this reads like my idea in the first place.
Only you place all the mails on the NFS server, whereas my idea was to just share the shared folders from a central point and keep the normal user dirs local to the different nodes, thus reducing network impact for the way more common user access.
To be a bit more concrete on this one:
a) X backend servers which my frontend (being perdition or dovecot director) redirects users to, fixed, no random redirects.
I might start with 4 backend servers, but I can easily scale them, either vertically by adding more RAM or vCPUs or horizontally by adding more VMs and reshuffling some mailboxes during the night.
Why 4 and not 2? If I'm going to build a cluster, I already have to do the work to implement this, and with 4 backends I can distribute the load even further without much additional administrative overhead. The load impact on each node also gets lower with more nodes, as long as I am able to evenly spread my users across those nodes (for example by md5'ing the username and using the first 2 bits of the hash to determine which node the user resides on).
b) 1 backend server for the public shared mailboxes, exporting them via NFS to the user backend servers
Configuration like this, from http://wiki2.dovecot.org/SharedMailboxes/Public
,----
| # User's private mail location
| mail_location = mdbox:~/mdbox
|
| # When creating any namespaces, you must also have a private namespace:
| namespace {
|   type = private
|   separator = .
|   prefix = INBOX.
|   #location defaults to mail_location.
|   inbox = yes
| }
|
| namespace {
|   type = public
|   separator = .
|   prefix = #shared.
|   location = mdbox:/srv/shared/
|   subscriptions = no
| }
`----
With /srv/shared being the NFS mountpoint from my central public shared mailbox server.
This setup would keep the amount of data transferred via NFS small (only a tiny fraction of my 10,000 users have access to a shared folder, mostly users in the IT team or in the administration of the university).
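On each user backend the mount would then just be a single fstab line along these lines (the server name is hypothetical):
,----
| # /etc/fstab on each user backend -- server name is hypothetical
| sharedmail.example.de:/srv/shared  /srv/shared  nfs  rw,hard,intr,noatime  0  0
`----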
Wouldn't such a setup be the "Best of Both Worlds"? Having the main traffic going to local disks (being RDMs) and also being able to provide shared folders to every user who needs them without the need to move those users onto one server?
Grüße, Sven.
-- Sigmentation fault. Core dumped.
Sven Hartge sven@svenhartge.de wrote:
Sven Hartge sven@svenhartge.de wrote:
Stan Hoeppner stan@hardwarefreak.com wrote:
If an individual VMware node doesn't have sufficient RAM, you could build a VM based Dovecot cluster: run these two VMs on separate nodes, and thin out the other VMs allowed to run on these nodes. Since you can't directly share XFS, build a tiny Debian NFS server VM and map the XFS LUN to it, then export the filesystem to the two Dovecot VMs. You could install the Dovecot director on this NFS server VM as well. Converting from maildir to mdbox should help eliminate the NFS locking problems. I would do the conversion before migrating to this VM setup with NFS.
Also, run the NFS server VM on the same physical node as one of the Dovecot servers. The NFS traffic will be a memory-to-memory copy instead of going over the GbE wire, decreasing IO latency and increasing performance for that Dovecot server. If it's possible to have Dovecot director or your favorite load balancer weight more connections to one Dovecot node, funnel 10-15% more connections to this one. (I'm no director guru; in fact I haven't used it yet.)
So, this reads like my idea in the first place.
Only you place all the mails on the NFS server, whereas my idea was to just share the shared folders from a central point and keep the normal user dirs local to the different nodes, thus reducing network impact for the way more common user access.
To be a bit more concrete on this one:
a) X backend servers which my frontend (being perdition or dovecot director) redirects users to, fixed, no random redirects.
I might start with 4 backend servers, but I can easily scale them, either vertically by adding more RAM or vCPUs or horizontally by adding more VMs and reshuffling some mailboxes during the night.
Why 4 and not 2? If I'm going to build a cluster, I already have to do the work to implement this, and with 4 backends I can distribute the load even further without much additional administrative overhead. The load impact on each node also gets lower with more nodes, as long as I am able to evenly spread my users across those nodes (for example by md5'ing the username and using the first 2 bits of the hash to determine which node the user resides on).
Ah, I forgot: I _already_ have the mechanisms in place to statically redirect/route accesses for users to different backends, since some of the users are already redirected to a different mailsystem at another location of my university.
So using this mechanism to also redirect/route users internal to _my_ location is no big deal.
This is what led me to the idea of several independent storage backends, without the need to share the _whole_ storage, just the shared folders for some users.
(Are my words making any sense? I got the feeling I'm writing German with English words and nobody is really understanding anything ...)
Grüße, Sven.
-- Sigmentation fault. Core dumped.
On 1/8/2012 3:07 PM, Sven Hartge wrote:
Ah, I forgot: I _already_ have the mechanisms in place to statically redirect/route accesses for users to different backends, since some of the users are already redirected to a different mailsystem at another location of my university.
I assume you mean IMAP/POP connections, not SMTP.
So using this mechanism to also redirect/route users internal to _my_ location is no big deal.
This is what led me to the idea of several independent storage backends, without the need to share the _whole_ storage, just the shared folders for some users.
(Are my words making any sense? I got the feeling I'm writing German with English words and nobody is really understanding anything ...)
You're making perfect sense, and frankly, if not for the .de TLD in your email address, I'd have thought you were an American. Your written English is probably better than mine, and it's my only language. To be fair to the Brits, I speak/write American English. ;)
I'm guessing no one else has interest in this thread, or maybe simply lost interest as the replies have been lengthy, and not wholly Dovecot related. I accept some blame for that.
-- Stan
On 01/09/2012 08:38 AM, Stan Hoeppner wrote:
On 1/8/2012 3:07 PM, Sven Hartge wrote:
[...]
(Are my words making any sense? I got the feeling I'm writing German with English words and nobody is really understanding anything ...)
You're making perfect sense, and frankly, if not for the .de TLD in your email address, I'd have thought you were an American. Your written English is probably better than mine, and it's my only language. To be fair to the Brits, I speak/write American English. ;)
Concur. My American ear is also perfectly happy.
I'm guessing no one else has interest in this thread, or maybe simply lost interest as the replies have been lengthy, and not wholly Dovecot related. I accept some blame for that.
I've been following this thread with great interest, but no advice to offer. The content is entirely appropriate, and appreciated. Don't be embarrassed by your enthusiasm, Stan.
Sven, a follow-up report when you have it all working as desired would also be appreciated (and appropriate).
Thanks,
Phil
Stan Hoeppner stan@hardwarefreak.com wrote:
On 1/8/2012 3:07 PM, Sven Hartge wrote:
Ah, I forgot: I _already_ have the mechanisms in place to statically redirect/route accesses for users to different backends, since some of the users are already redirected to a different mailsystem at another location of my university.
I assume you mean IMAP/POP connections, not SMTP.
Yes. perdition uses its popmap feature to redirect users of the other location to the IMAP/POP servers there. So we only need one central mailserver for the users to configure while we are able to physically store their mails at different datacenters.
I'm guessing no one else has interest in this thread, or maybe simply lost interest as the replies have been lengthy, and not wholly Dovecot related. I accept some blame for that.
I will open a new thread with more concrete problems/questions after I have set up my test environment. This will be more technical and less philosophical, I hope :)
Grüße, Sven
-- Sigmentation fault. Core dumped.
On 1/8/2012 2:15 PM, Sven Hartge wrote:
Wouldn't such a setup be the "Best of Both Worlds"? Having the main traffic going to local disks (being RDMs) and also being able to provide shared folders to every user who needs them without the need to move those users onto one server?
The only problems I can see at this time are:
Some users will have much larger mailboxes than others. Each year ~1/4 of your student population rotates, so if you manually place existing mailboxes now based on current size you have no idea who the big users are in the next freshman class, or the next. So you may have to do manual re-balancing of mailboxes, maybe frequently.
If you lose a Dovecot VM guest due to image file or other corruption, or some other rare cause, you can't restart that guest, but will have to build a new image from a template. This could cause either minor or significant downtime for ~1/4 of your mail users w/4 nodes. This is likely rare enough it's not worth consideration.
You will consume more SAN volumes and LUNs. Most arrays have a fixed number of each. May or may not be an issue.
-- Stan
Stan Hoeppner stan@hardwarefreak.com wrote:
On 1/8/2012 2:15 PM, Sven Hartge wrote:
Wouldn't such a setup be the "Best of Both Worlds"? Having the main traffic going to local disks (being RDMs) and also being able to provide shared folders to every user who needs them without the need to move those users onto one server?
The only problems I can see at this time are:
- Some users will have much larger mailboxes than others. Each year ~1/4 of your student population rotates, so if you manually place existing mailboxes now based on current size you have no idea who the big users are in the next freshman class, or the next. So you may have to do manual re-balancing of mailboxes, maybe frequently.
The quota for students is 1GiB here. If I provide each of my 4 nodes with 500GiB of storage space, this gives me 2TiB now, which should be sufficient. If a node fills, I increase its storage space. Only if it fills too fast will I have to rebalance users.
And I never wanted to place the users based on their current size. I knew this was not going to work because of the reasons you mentioned.
I just want to hash their username and use this as a function to distribute the users, keeping it simple and stupid.
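Something as dumb as this would do; a sketch that takes the first hex digit of the md5 modulo the node count, which is close enough to "the first 2 bits" (node names are made up):
,----
| # sketch: map a username to one of 4 backends via its md5 hash
| user="jdoe"
| node=$(( 0x$(printf '%s' "$user" | md5sum | cut -c1) % 4 ))
| echo "route $user to backend-$node"
`----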
- If you lose a Dovecot VM guest due to image file or other corruption, or some other rare cause, you can't restart that guest, but will have to build a new image from a template. This could cause either minor or significant downtime for ~1/4 of your mail users w/4 nodes. This is likely rare enough it's not worth consideration.
Yes, I know. But right now, if I lose my one and only mail storage server, all users' mailboxes will be offline until I am either a) able to repair the server, b) move the disks to my identical backup system (or the backup system to the location of the failed one), or c) start the backup system and lose all mails not rsynced since the last rsync run.
It is not easy designing a mail system without a SPoF which still performs under load.
For example, at one time I had a DRBD (active/passive) setup between the two storage systems. This would allow me to start my standby system without losing (nearly) any mail. But it was awfully slow and sluggish.
- You will consume more SAN volumes and LUNs. Most arrays have a fixed number of each. May or may not be an issue.
Not really an issue here. The SAN is exclusive for the VMware cluster, so most LUNs are quite big (1TiB to 2TiB) but there are not many of them.
Grüße, Sven.
-- Sigmentation fault. Core dumped.
On 1/9/2012 8:08 AM, Sven Hartge wrote:
Stan Hoeppner stan@hardwarefreak.com wrote:
The quota for students is 1GiB here. If I provide each of my 4 nodes with 500GiB of storage space, this gives me 2TiB now, which should be sufficient. If a node fills, I increase its storage space. Only if it fills too fast will I have to rebalance users.
That should work.
And I never wanted to place the users based on their current size. I knew this was not going to work because of the reasons you mentioned.
I just want to hash their username and use this as a function to distribute the users, keeping it simple and stupid.
My apologies Sven. I just re-read your first messages and you did mention this method.
Yes, I know. But right now, if I lose my one and only mail storage server, all users' mailboxes will be offline until I am either a) able to repair the server, b) move the disks to my identical backup system (or the backup system to the location of the failed one), or c) start the backup system and lose all mails not rsynced since the last rsync run.
True. 3/4 of users remaining online is much better than none. :)
It is not easy designing a mail system without a SPoF which still performs under load.
And many other systems for that matter.
For example, at one time I had a DRBD (active/passive) setup between the two storage systems. This would allow me to start my standby system without losing (nearly) any mail. But it was awfully slow and sluggish.
Eric Rostetter at University of Texas at Austin has reported good performance with his twin Dovecot DRBD cluster. Though in his case he's doing active/active DRBD with GFS2 sitting on top, so there is no failover needed. DRBD is obviously not an option for your current needs.
- You will consume more SAN volumes and LUNs. Most arrays have a fixed number of each. May or may not be an issue.
Not really an issue here. The SAN is exclusive for the VMware cluster, so most LUNs are quite big (1TiB to 2TiB) but there are not many of them.
I figured this wouldn't be a problem. I'm just trying to be thorough, mentioning anything I can think of that might be an issue.
The more I think about your planned architecture the more it reminds me of a "shared nothing" database cluster--even a relatively small one can outrun a well tuned mainframe, especially doing decision support/data mining workloads (TPC-H).
As long as you're prepared for the extra administration, which you obviously are, this setup will yield better performance than the NFS setup I recommended. Performance may not be quite as good as 4 physical hosts with local storage, but you haven't mentioned the details of your SAN storage nor the current load on it, so obviously I can't say with any certainty. If the controller currently has plenty of spare IOPS then the performance difference would be minimal. And using the SAN allows automatic restart of a VM if a physical node dies.
As with Phil, I'm anxious to see how well it works in production. When you send an update please CC me directly as sometimes I don't read all the list mail.
I hope my participation was helpful to you Sven, even if only to a small degree. Best of luck with the implementation.
-- Stan
Stan Hoeppner stan@hardwarefreak.com wrote:
The more I think about your planned architecture the more it reminds me of a "shared nothing" database cluster--even a relatively small one can outrun a well tuned mainframe, especially doing decision support/data mining workloads (TPC-H).
As long as you're prepared for the extra administration, which you obviously are, this setup will yield better performance than the NFS setup I recommended. Performance may not be quite as good as 4 physical hosts with local storage, but you haven't mentioned the details of your SAN storage nor the current load on it, so obviously I can't say with any certainty. If the controller currently has plenty of spare IOPS then the performance difference would be minimal.
This is the beauty of the HP P4500: every node is a controller, load is automagically balanced between all nodes of a storage cluster. The more nodes (up to ten) you add, the more performance you get.
So far, I have not been able to push our current SAN to its limits, even with totally artificial benchmarks, so I am quite confident in its performance for the given task.
But if everything fails and the performance is not good, I can still go ahead and buy dedicated hardware for the mailsystem.
The only thing left is the NFS problem with caching Timo mentioned, but since accesses to a central public shared folder will be only a minor portion of a client's activity, I am hoping the impact will be minimal. Only testing will tell.
Grüße, Sven.
-- Sigmentation fault. Core dumped.
On 1/8/2012 9:39 AM, Sven Hartge wrote:
Memory size. I am a bit hesitant to deploy a VM with 16GB of RAM. My cluster nodes each have 48GB, so no problem on this side though.
Shouldn't be a problem if you're going to spread the load over 2 to 4 cluster nodes. 16/2 = 8GB per VM, 16/4 = 4GB per Dovecot VM. This, assuming you are able to evenly spread user load.
And our VMware SAN is iSCSI based, so no way to plug a FC-based storage into it.
There are standalone FC-iSCSI bridges but they're marketed to bridge FC SAN islands over an IP WAN. Director class SAN switches can connect anything to anything, just buy the cards you need. Both of these are rather pricey. These wouldn't make sense in your environment. I'm just pointing out that it can be done.
So, this reads like my idea in the first place.
Only you place all the mails on the NFS server, whereas my idea was to just share the shared folders from a central point and keep the normal user dirs local to the different nodes, thus reducing network impact for the far more common user accesses.
To be quite honest, after thinking this through a bit, many traditional advantages of a single shared mail store start to disappear. Whether you use NFS or a clusterFS, or 'local' disk (RDMs), all IO goes to the same array, so the traditional IO load balancing advantage disappears. The other main advantage, replacing a dead hardware node, simply mapping the LUNs to the new one and booting it up, also disappears due to VMware's unique abilities, including vmotion. Efficient use of storage isn't an issue as you can just as easily slice off a small LUN to each of 2/4 Dovecot VMs as a larger one to the NFS VM.
So the only disadvantages I see are with the 'local' disk RDM mailstore location: 'manual' connection/mailbox/size balancing, all increasing administrator burden.
2.3GHz for most VMware nodes.
How many total cores per VMware node (all sockets)?
You got the numbers wrong. And I got a word wrong ;)
Should have read "900GB _of_ 1300GB used".
My bad. I misunderstood.
So not much wiggle room left.
And that one is retiring anyway as you state below. So do you have plenty of space on your VMware SAN arrays? If not can you add disks or do you need another array chassis?
But modifications to our systems are being made which allow me to temp-disable a user, convert and move his mailbox and re-enable him. This lets me move users one at a time from the old system to the new one, without losing a mail or disrupting service for too long or too often.
As it should be.
This is a Transtec Provigo 610. This is a 24 disk enclosure, 12 disks with 150GB (7,200 rpm) each for the main mail storage in RAID6 and another 10 disks with 150GB (5,400 rpm) for a backup LUN. I daily rsnapshot my /home onto this local backup (20 days of retention), because it is easier to restore from than firing up Bacula, which has the long retention time of 90 days. But most users only need a restore of mails from $yesterday or $the_day_before.
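For reference, the rsnapshot side of that is only a few lines of configuration; a minimal sketch (paths are placeholders, rsnapshot insists on tabs between fields, and older versions spell "retain" as "interval"):

  snapshot_root   /backup/rsnapshot/
  retain  daily   20
  backup  /home/  localhost/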
And your current iSCSI SAN array(s) backing the VMware farm? Total disks? Is it monolithic, or do you have multiple array chassis from one or multiple vendors?
Well, it was either Parallel-SCSI or FC back then, as far as I can remember. The price difference between the U320 version and the FC one was not so big and I wanted to avoid having to route those big SCSI-U320 cables through my racks.
Can't blame you there. I take it you hadn't built the iSCSI SAN yet at that point?
See above, not 1500GB disks, but 150GB ones. RAID6, because I wanted the double security. I have been kind of burned by the previous system and I tend to get nervous while thinking about data loss in my mail storage, because I know my users _will_ give me hell if that happens.
And as it turns out RAID10 wouldn't have provided you enough bytes.
I never used mbox as an admin. The box before the box before this one used uw-imapd with mbox, and I experienced the system as a user and it was horrific. Most users back then had never heard of IMAP folders and just stored their mails inside of INBOX, which of course got huge. If one of those users with a big mbox then deleted mails, it would literally lock the box up for everyone, as uw-imapd was copying (for example) a 600MB mbox file around to delete one mail.
Yeah, ouch. IMAP with mbox works pretty well when users are marginally smart about organizing their mail, or a POP then delete setup. I'd bet if that was maildir in that era on that box it would have slowed things way down as well. Especially if the filesystem was XFS, which had horrible, abysmal really, unlink performance until 2.6.35 (2009).
Of course, this was mostly because of the crappy uw-imapd and secondly because of some poor design choices in the server itself (underpowered RAID controller, too small a cache, a RAID5 setup, low RAM in the server).
That's a recipe for disaster.
So the first thing we did back then, in 2004, was to change to Courier and convert from mbox to maildir, which made the mailsystem fly again, even on the same hardware, only the disk setup changed to RAID10.
I wonder how much gain you'd have seen if you stuck with RAID5 instead...
Then we bought new hardware (the one previous to the current one), this time with more RAM, better RAID controller, smarter disk setup. We outgrew this one really fast and a disk upgrade was not possible; it lasted only 2 years.
Did you need more space or more spindles?
But Courier is showing its age and things like Sieve are only possible with great pain, so I want to avoid it.
Don't blame ya. Lots of people migrate from Courier to Dovecot for similar reasons.
And this is why I kind of hold this upgrade back until dovecot 2.1 is released, as it has some optimizations here.
Sounds like it's going to be a bit more than an 'upgrade'. ;)
That is a BPO kernel; the box is not yet on Squeeze. I admin over 150 different systems here, plus I am the main VMware and SAN admin, so upgrades take some time, until I grow an extra pair of eyes and arms. ;)
/me nods
And since I have been planning to re-implement the mailsystem for some time now, I held the update to the storage backends back. No use in disrupting service for the end user if I'm going to replace the whole thing with a new one in the end.
/me nods
Naa, I have been doing this for too long. While I am perfectly capable of building such a server myself, I am now the kind of guy who wants to "yell" at a vendor, when their hardware fails.
At your scale it would simply be impractical, and impossible from a time management standpoint.
Personally built PCs and servers assembled from individual parts have been nothing but a nightmare for me.
I've had nothing but good luck with "DIY" systems. My background is probably a bit different than most though. Hardware has been in my blood since I was a teenager in about '86. I used to design and build relatively high end custom -48vdc white box servers and SCSI arrays for telcos back in the day, along with standard 115v servers for SMBs. Also, note the RHS of my email address. ;) That is a nickname given to me about 13 years ago. I decided to adopt it for my vanity domain.
And: my coworkers need to be able to service them as well while I am not available, and they are not the hardware aficionados I am.
That's the biggest reason right there. DIY is only really feasible if you run your own show, and will likely continue to be running it for a while. Or if staff is similarly skilled. Most IT folks these days aren't hardware oriented people.
So "professional" hardware with a 5 to 7 year support contract is the way to go for me.
Definitely.
I have plenty of space for 2U systems and already use DL385 G7s. I am not fixed on Intel or AMD; I'll gladly use whichever is the most fit for a given job.
Just out of curiosity do you have any Power or SPARC systems, or all x86?
-- Stan
Stan Hoeppner stan@hardwarefreak.com wrote:
On 1/8/2012 9:39 AM, Sven Hartge wrote:
Memory size. I am a bit hesitant to deploy a VM with 16GB of RAM. My cluster nodes each have 48GB, so no problem on this side though.
Shouldn't be a problem if you're going to spread the load over 2 to 4 cluster nodes. 16/2 = 8GB per VM, 16/4 = 4GB per Dovecot VM. This, assuming you are able to evenly spread user load.
I think I will be able to do that. If I divide my users by using a hash like MD5 or SHA1 over their username, this should give me an even distribution.
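A quick sketch of how that mapping could be computed at account-activation time (shell, assuming four backend nodes; the username is a made-up example):

  user="jdoe"                                       # hypothetical username
  h=$(printf '%s' "$user" | md5sum | cut -c1-8)     # first 32 bits of the MD5
  node=$(( (0x$h % 4) + 1 ))                        # map onto nodes 1..4
  echo "$user -> backend node $node"

The resulting node number would then go into the user's LDAP object for perdition to route against.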
So, this reads like my idea in the first place.
Only you place all the mails on the NFS server, whereas my idea was to just share the shared folders from a central point and keep the normal user dirs local to the different nodes, thus reducing network impact for the far more common user accesses.
To be quite honest, after thinking this through a bit, many traditional advantages of a single shared mail store start to disappear. Whether you use NFS or a clusterFS, or 'local' disk (RDMs), all IO goes to the same array, so the traditional IO load balancing advantage disappears. The other main advantage, replacing a dead hardware node, simply mapping the LUNs to the new one and booting it up, also disappears due to VMware's unique abilities, including vmotion. Efficient use of storage isn't an issue as you can just as easily slice off a small LUN to each of 2/4 Dovecot VMs as a larger one to the NFS VM.
Yes. Plus I can much more easily increase a LUN's size, if the need arises.
So the only disadvantages I see are with the 'local' disk RDM mailstore location: 'manual' connection/mailbox/size balancing, all increasing administrator burden.
Well, I don't see size balancing as a problem, since I can increase the size of the disk for a node very easily.
Load should be fairly even, if I distribute the 10,000 users across the nodes. Even if there is a slight imbalance, the systems should have enough power to smooth that out. I could measure the load every user creates and use that as a distribution key, but I believe this to be a wee bit over-engineered for my scenario.
Initial placement of a new user will be automatic, during the activation of the account, so no administrative burden there.
It seems my initial idea was not so bad after all ;) Now I "just" need to build a little test setup, put some dummy users on it and see if anything bad happens while accessing the shared folders and how the system reacts should the shared folder server be down.
2.3GHz for most VMware nodes.
How many total cores per VMware node (all sockets)?
8
You got the numbers wrong. And I got a word wrong ;)
Should have read "900GB _of_ 1300GB used".
My bad. I misunderstood.
Here are the memory statistics at 14:30:
             total       used       free     shared    buffers     cached
Mem:         12046      11199        847          0         88       7926
-/+ buffers/cache:       3185       8861
Swap:         5718         10       5707
So not much wiggle room left.
And that one is retiring anyway as you state below. So do you have plenty of space on your VMware SAN arrays? If not can you add disks or do you need another array chassis?
The SAN has plenty space. Over 70TiB at this time, with another 70TiB having just arrived and waiting to be connected.
This is a Transtec Provigo 610. This is a 24 disk enclosure, 12 disks with 150GB (7,200 rpm) each for the main mail storage in RAID6 and another 10 disks with 150GB (5,400 rpm) for a backup LUN. I daily rsnapshot my /home onto this local backup (20 days of retention), because it is easier to restore from than firing up Bacula, which has the long retention time of 90 days. But most users only need a restore of mails from $yesterday or $the_day_before.
And your current iSCSI SAN array(s) backing the VMware farm? Total disks? Is it monolithic, or do you have multiple array chassis from one or multiple vendors?
The iSCSI storage nodes (HP P4500) use 600GB SAS6 at 15k rpm with 12 disks per node, configured in 2 RAID5 sets with 6 disks each.
But this is internal to each storage node, which are kind of a blackbox and have to be treated as such.
The HP P4500 is a bit unique, since it does not consist of a head node with storage arrays connected to it, but of individual storage nodes forming a self-balancing iSCSI cluster. (The nodes consist of DL320s G2.)
So far, I had no performance or other problems with this setup and it scales quite nicely, as you <marketing> buy as you grow </marketing>.
And again, price was also a factor: deploying an FC SAN would have cost us more than three times what the iSCSI solution did, because the latter is "just" Ethernet, while the former would have needed a lot of totally new components.
Well, it was either Parallel-SCSI or FC back then, as far as I can remember. The price difference between the U320 version and the FC one was not so big and I wanted to avoid having to route those big SCSI-U320 cables through my racks.
Can't blame you there. I take it you hadn't built the iSCSI SAN yet at that point?
No, at that time (2005/2006) nobody thought of a SAN. That is a fairly "new" idea here, first implemented for the VMware cluster in 2008.
Then we bought new hardware (the one previous to the current one), this time with more RAM, better RAID controller, smarter disk setup. We outgrew this one really fast and a disk upgrade was not possible; it lasted only 2 years.
Did you need more space or more spindles?
More space. The IMAP usage became more prominent, which caused a steep rise in space needed on the mail storage server. But 74GiB SCA drives were expensive and 130GiB SCA drives were not available at that time.
And this is why I kind of hold this upgrade back until dovecot 2.1 is released, as it has some optimizations here.
Sounds like it's going to be a bit more than an 'upgrade'. ;)
Well, yes. It is more a re-implementation than an upgrade.
I have plenty of space for 2U systems and already use DL385 G7s. I am not fixed on Intel or AMD; I'll gladly use whichever is the most fit for a given job.
Just out of curiosity do you have any Power or SPARC systems, or all x86?
Central IT here these days only uses x86-based systems. There were some Sun SPARC systems, but both have been decommissioned. New SPARC hardware is just too expensive for our scale. And if you want to use virtualization, you can either use only SPARC systems and partition them or use x86-based systems. And then there is the need to virtualize Windows, so x86 is the only option.
Most bigger universities in Germany make nearly exclusive use of SPARC systems, but they have had a central IT with big irons (IBM, HP, etc.) since back in the 1960s, so naturally they continue on that path.
Grüße, Sven.
-- Sigmentation fault. Core dumped.
On 1/9/2012 7:48 AM, Sven Hartge wrote:
It seems my initial idea was not so bad after all ;)
Yeah, but you didn't know how "not so bad" it really was until you had me analyze it, flesh it out, and confirm it. ;)
Now I "just" need o built a little test setup, put some dummy users on it and see, if anything bad happens while accessing the shared folders and how the reaction of the system is, should the shared folder server be down.
It won't be down. Because instead of using NFS you're going to use GFS2 for the shared folder LUN so each user accesses the shared folders locally just as they do their mailbox. Pat yourself on the back Sven, you just eliminated a SPOF. ;)
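For the record, carving that shared LUN into a GFS2 filesystem is conceptually just a format plus a clustered mount on every node; a rough sketch, assuming a working cluster locking stack (corosync/cman + dlm) is already running, four journals for four nodes, and a placeholder device path:

  mkfs.gfs2 -p lock_dlm -t mailcluster:shared -j 4 /dev/mapper/shared-lun
  mount -t gfs2 /dev/mapper/shared-lun /srv/shared

The real work is in the cluster membership and locking layer underneath, which the commands above take as given.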
How many total cores per VMware node (all sockets)?
8
Fairly beefy. Dual socket quad core Xeons I'd guess.
Here are the memory statistics at 14:30:
             total       used       free     shared    buffers     cached
Mem:         12046      11199        847          0         88       7926
-/+ buffers/cache:       3185       8861
Swap:         5718         10       5707
That doesn't look too bad. How many IMAP user connections at that time? Is that a high average or low for that day? The RAM numbers in isolation only paint a partial picture...
The SAN has plenty space. Over 70TiB at this time, with another 70TiB having just arrived and waiting to be connected.
140TB of 15k storage. Wow, you're so under privileged. ;)
The iSCSI storage nodes (HP P4500) use 600GB SAS6 at 15k rpm with 12 disks per node, configured in 2 RAID5 sets with 6 disks each.
But this is internal to each storage node, which are kind of a blackbox and have to be treated as such.
I cringe every time I hear 'black box'...
The HP P4500 is a bit unique, since it does not consist of a head node with storage arrays connected to it, but of individual storage nodes forming a self-balancing iSCSI cluster. (The nodes consist of DL320s G2.)
The 'black box' is Lefthand Networks SAN/iQ software stack. I wasn't that impressed with it when I read about it 8 or so years ago. IIRC, load balancing across cluster nodes is accomplished by resending host packets from a receiving node to another node after performing special sauce calculations regarding cluster load. Hence the need, apparently, for a full power, hot running, multi-core x86 CPU instead of an embedded low power/wattage type CPU such as MIPS, PPC, i960 descended IOP3xx, or even the Atom if they must stick with x86 binaries. If this choice was merely due to economy of scale of their server boards, they could have gone with a single socket board instead of the dual, which would have saved money. So this choice of a dual socket Xeon board wasn't strictly based on cost or ease of manufacture.
Many/most purpose built SAN arrays on the market don't use full power x86 chips, but embedded RISC chips, to cut cost, power draw, and heat generation. These RISC chips are typically in order designs, don't have branch prediction or register renaming logic circuits and they have tiny caches. This is because block moving code handles streams of data and doesn't typically branch nor have many conditionals. For streaming apps, data caches simply get in the way, although an instruction cache is beneficial. HP's choice of full power CPUs that have such features suggests branching conditional code is used. Which makes sense when running algorithms that attempt to calculate the least busy node.
Thus, this 'least busy node' calculation and packet shipping adds non trivial latency to host SCSI IO command completion, compared to traditional FC/iSCSI SAN arrays, or DAS, and thus has implications for high IOPS workloads and especially those making heavy use of FSYNC, such as SMTP and IMAP servers. FSYNC performance may not be an issue if the controller instantly acks FSYNC before data hits platter, but then you may run into bigger problems as you have no guarantee data hit the disk. Or, you may not run into perceptible performance issues at all given the number of P4500s you have and the proportionally light IO load of your 10K mail users. Sheer horsepower alone may prove sufficient.
Just in case, it may prove beneficial to fire up ImapTest or some other synthetic mail workload generator to see if array response times are acceptable under heavy mail loads.
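ImapTest is driven with key=value arguments; an invocation along these lines would do for a first smoke test (host, credentials and numbers are placeholders, and the exact parameter names should be checked against the ImapTest documentation):

  imaptest host=10.0.0.10 port=143 user=testuser%d pass=secret mbox=dovecot-crlf clients=100 secs=600

where mbox points to a local mbox file used as the source for APPENDs.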
So far, I had no performance or other problems with this setup and it scales quite nicely, as you <marketing> buy as you grow </marketing>.
I'm glad the Lefthand units are working well for you so far. Are you hitting the arrays with any high random IOPS workloads as of yet?
And again, price was also a factor: deploying an FC SAN would have cost us more than three times what the iSCSI solution did, because the latter is "just" Ethernet, while the former would have needed a lot of totally new components.
I guess that depends on the features you need, such as PIT backups, remote replication, etc. I expanded a small FC SAN about 5 years ago for the same cost as an iSCSI array, simply due to the fact that the least expensive _quality_ unit with a good reputation happened to have both iSCSI and FC ports included. It was a 1U 8x500GB Nexsan Satablade, their smallest unit (since discontinued). Ran about $8K USD IIRC. Nexsan continues to offer excellent products.
For anyone interested in high density high performance FC+iSCSI SAN arrays at a midrange price, add Nexsan to your vendor research list: http://www.nexsan.com
No, at that time (2005/2006) nobody thought of a SAN. That is a fairly "new" idea here, first implemented for the VMware cluster in 2008.
You must have slower adoption on that side of the pond. As I just mentioned, I was expanding an already existing small FC SAN in 2006 that had been in place since 2004 IIRC. And this was at a small private 6-12 school with enrollment of about 500. iSCSI SANs took off like a rocket in the States around 06/07, in tandem with VMware ESX going viral here.
More space. The IMAP usage became more prominent, which caused a steep rise in space needed on the mail storage server. But 74GiB SCA drives were expensive and 130GiB SCA drives were not available at that time.
With 144TB of HP Lefthand 15K SAS drives it appears you're no longer having trouble funding storage purchases. ;)
And this is why I kind of hold this upgrade back until dovecot 2.1 is released, as it has some optimizations here.
Sounds like it's going to be a bit more than an 'upgrade'. ;)
Well, yes. It is more a re-implementation than an upgrade.
It actually sounds like fun. To me anyway. ;) I love this stuff.
Central IT here these days only uses x86-based systems. There were some Sun SPARC systems, but both have been decommissioned. New SPARC hardware is just too expensive for our scale. And if you want to use virtualization, you can either use only SPARC systems and partition them or use x86-based systems. And then there is the need to virtualize Windows, so x86 is the only option.
Definitely a trend for a while now.
Most bigger universities in Germany make nearly exclusive use of SPARC systems, but they have had a central IT with big irons (IBM, HP, etc.) since back in the 1960s, so naturally they continue on that path.
Siemens/Fujitsu machines or SUN machines? I've been under the impression that Fujitsu sold more SPARC boxen in Europe, or at least Germany, than SUN did, due to the Siemens partnership. I could be wrong here.
-- Stan
Too much text in the rest of this thread so I haven't read it, but:
On 8.1.2012, at 0.20, Sven Hartge wrote:
Right now, I am pondering with using an additional server with just the shared folders on it and using NFS (or a cluster FS) to mount the shared folder filesystem to each backend storage server, so each user has potential access to a shared folders data.
With NFS you'll run into problems with caching (http://wiki2.dovecot.org/NFS). Some cluster fs might work better.
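For reference, the settings that wiki page revolves around are global ones along these lines (a sketch only; which of them you actually need depends on the Dovecot version and on whether index files live on NFS too):

  mmap_disable = yes
  dotlock_use_excl = no     # only needed for NFSv2
  mail_fsync = always
  mail_nfs_storage = yes
  mail_nfs_index = yes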
The "proper" solution for this that I've been thinking about would be to use v2.1's imapc backend with master users. So that when user A wants to access user B's shared folder, Dovecot connects to B's IMAP server using master user login, and accesses the mailbox via IMAP. Probably wouldn't be a big job to implement, mainly I'd need to figure out how this should be configured..
Timo Sirainen tss@iki.fi wrote:
On 8.1.2012, at 0.20, Sven Hartge wrote:
Right now, I am pondering with using an additional server with just the shared folders on it and using NFS (or a cluster FS) to mount the shared folder filesystem to each backend storage server, so each user has potential access to a shared folders data.
With NFS you'll run into problems with caching (http://wiki2.dovecot.org/NFS). Some cluster fs might work better.
The "proper" solution for this that I've been thinking about would be to use v2.1's imapc backend with master users. So that when user A wants to access user B's shared folder, Dovecot connects to B's IMAP server using master user login, and accesses the mailbox via IMAP. Probably wouldn't be a big job to implement, mainly I'd need to figure out how this should be configured..
Luckily, in my case, User A does not access anything from User B; instead both User A and User B access the same public folder, which is different from any folder of User A or User B.
Grüße, Sven.
-- Sigmentation fault. Core dumped.
On 2012-01-09 9:51 AM, Timo Sirainen tss@iki.fi wrote:
The "proper" solution for this that I've been thinking about would be to use v2.1's imapc backend with master users. So that when user A wants to access user B's shared folder, Dovecot connects to B's IMAP server using master user login, and accesses the mailbox via IMAP. Probably wouldn't be a big job to implement, mainly I'd need to figure out how this should be configured.
Sounds interesting... would this be the new officially supported method for sharing mailboxes in all cases? Or is this just for shared mailboxes on NFS shares?
It sounds like this might be a proper (fully supported without kludges) way to get what I had asked about before, with respect to expanding on the concept of Master users for sharing an entire account with one or more other users...
--
Best regards,
Charles
On 9.1.2012, at 17.14, Charles Marcus wrote:
On 2012-01-09 9:51 AM, Timo Sirainen tss@iki.fi wrote:
The "proper" solution for this that I've been thinking about would be to use v2.1's imapc backend with master users. So that when user A wants to access user B's shared folder, Dovecot connects to B's IMAP server using master user login, and accesses the mailbox via IMAP. Probably wouldn't be a big job to implement, mainly I'd need to figure out how this should be configured.
Sounds interesting... would this be the new officially supported method for sharing mailboxes in all cases? Or is this just for shared mailboxes on NFS shares?
Well, it would be one officially supported way to do it. It would also help when using multiple UIDs.
Timo Sirainen tss@iki.fi wrote:
On 8.1.2012, at 0.20, Sven Hartge wrote:
Right now, I am pondering with using an additional server with just the shared folders on it and using NFS (or a cluster FS) to mount the shared folder filesystem to each backend storage server, so each user has potential access to a shared folders data.
With NFS you'll run into problems with caching (http://wiki2.dovecot.org/NFS). Some cluster fs might work better.
Can "mmap_disable = yes" and the other NFS options be set per namespace or only globally?
Grüße, Sven.
-- Sigmentation fault. Core dumped.
On 9.1.2012, at 20.25, Sven Hartge wrote:
Timo Sirainen tss@iki.fi wrote:
On 8.1.2012, at 0.20, Sven Hartge wrote:
Right now, I am pondering with using an additional server with just the shared folders on it and using NFS (or a cluster FS) to mount the shared folder filesystem to each backend storage server, so each user has potential access to a shared folders data.
With NFS you'll run into problems with caching (http://wiki2.dovecot.org/NFS). Some cluster fs might work better.
Can "mmap_disable = yes" and the other NFS options be set per namespace or only globally?
Currently only globally.
Timo Sirainen tss@iki.fi wrote:
On 9.1.2012, at 20.25, Sven Hartge wrote:
Timo Sirainen tss@iki.fi wrote:
On 8.1.2012, at 0.20, Sven Hartge wrote:
Right now, I am pondering with using an additional server with just the shared folders on it and using NFS (or a cluster FS) to mount the shared folder filesystem to each backend storage server, so each user has potential access to a shared folders data.
With NFS you'll run into problems with caching (http://wiki2.dovecot.org/NFS). Some cluster fs might work better.
Can "mmap_disable = yes" and the other NFS options be set per namespace or only globally?
Currently only globally.
Ah, too bad.
Back to the drawing board then.
Implementing my idea in my environment using a cluster filesystem would be a very big pain in the lower back, so I need a different idea to share the shared folders with all nodes but still keeping the user specific mailboxes fixed and local to a node.
The imapc backed namespace you mentioned sounds very interesting, but this is not implemented right now for shared folders, is it?
Grüße, Sven.
-- Sigmentation fault. Core dumped.
On 9.1.2012, at 20.47, Sven Hartge wrote:
Can "mmap_disable = yes" and the other NFS options be set per namespace or only globally?
Currently only globally.
Ah, too bad.
Back to the drawing board then.
mmap_disable=yes works pretty well even if you're only using it for local filesystems. It just spends some more memory when reading dovecot.index.cache files.
Implementing my idea in my environment using a cluster filesystem would be a very big pain in the lower back, so I need a different idea to share the shared folders with all nodes but still keeping the user specific mailboxes fixed and local to a node.
The imapc backed namespace you mentioned sounds very interesting, but this is not implemented right now for shared folders, is it?
Well.. If you don't need users sharing mailboxes to each others, then you can probably already do this with Dovecot v2.1:
- Configure the user Dovecots:
namespace {
  type = public
  prefix = Shared/
  location = imapc:~/imapc-shared
}
imapc_host = sharedmails.example.com
imapc_password = master-user-password
# With latest v2.1 hg you can do:
imapc_user = shareduser
imapc_master_user = %u
# With v2.1.rc2 and older you need to do:
imapc_user = shareduser*%u
auth_master_user_separator = *
- Configure the shared Dovecot:
You need master passdb that allows all existing users to log in as "shareduser" user. You can probably simply do (not tested):
passdb {
  type = static
  args = user=shareduser
  master = yes
}
The "shareduser" owns all of the actual shared mailboxes and has the necessary ACLs set up for individual users. ACLs use the master username (= the real username in this case) to do the ACL checks.
Timo Sirainen tss@iki.fi wrote:
On 9.1.2012, at 20.47, Sven Hartge wrote:
Can "mmap_disable = yes" and the other NFS options be set per namespace or only globally?
Currently only globally.
Ah, too bad.
Back to the drawing board then.
mmap_disable=yes works pretty well even if you're only using it for local filesystems. It just spends some more memory when reading dovecot.index.cache files.
Implementing my idea in my environment using a cluster filesystem would be a very big pain in the lower back, so I need a different idea to share the shared folders with all nodes but still keeping the user specific mailboxes fixed and local to a node.
The imapc backed namespace you mentioned sounds very interesting, but this is not implemented right now for shared folders, is it?
Well.. If you don't need users sharing mailboxes to each others,
Good heavens, no! If I allowed users to share their mailboxes with other users, hell would break loose. Nononono, just shared folders set up by the admin team, statically assigned to groups of users (for example, the central postmaster@ mail alias ends up in such a shared folder).
then you can probably already do this with Dovecot v2.1:
- Configure the user Dovecots:
namespace {
  type = public
  prefix = Shared/
  location = imapc:~/imapc-shared
}
imapc_host = sharedmails.example.com
imapc_password = master-user-password
# With latest v2.1 hg you can do:
imapc_user = shareduser
imapc_master_user = %u
# With v2.1.rc2 and older you need to do:
imapc_user = shareduser*%u
auth_master_user_separator = *
So, in my case, this would look like this:
,----
| # User's private mail location
| mail_location = mdbox:~/mdbox
|
| # When creating any namespaces, you must also have a private namespace:
| namespace {
|   type = private
|   separator = .
|   prefix = INBOX.
|   #location defaults to mail_location.
|   inbox = yes
| }
|
| namespace {
|   type = public
|   separator = .
|   prefix = #shared.
|   location = imapc:~/imapc-shared
|   subscriptions = no
| }
|
| imapc_host = m-st-sh-01.foo.bar
| imapc_password = master-user-password
| imapc_user = shareduser
| imapc_master_user = %u
`----
Where do I add "list = children"? In the user-dovecots shared namespace or on the shared-dovecots private namespace?
- Configure the shared Dovecot:
You need master passdb that allows all existing users to log in as "shareduser" user. You can probably simply do (not tested):
passdb {
  type = static
  args = user=shareduser pass=master-user-password
  master = yes
}
The "shareduser" owns all of the actual shared mailboxes and has the necessary ACLs set up for individual users. ACLs use the master username (= the real username in this case) to do the ACL checks.
So this is kind of "backwards", since normally the imapc_master_user would be the static user and imapc_user would be dynamic, right?
All in all, a _very_ interesting configuration.
Grüße, Sven.
-- Sigmentation fault. Core dumped.
On 9.1.2012, at 21.31, Sven Hartge wrote:
,----
| # User's private mail location
| mail_location = mdbox:~/mdbox
|
| # When creating any namespaces, you must also have a private namespace:
| namespace {
|   type = private
|   separator = .
|   prefix = INBOX.
|   #location defaults to mail_location.
|   inbox = yes
| }
|
| namespace {
|   type = public
|   separator = .
|   prefix = #shared.
I'd probably just use "Shared." as prefix, since it is visible to users. Anyway if you want to use # you need to put the value in "quotes" or it's treated as comment.
|   location = imapc:~/imapc-shared
|   subscriptions = no
list = children here
| }
|
| imapc_host = m-st-sh-01.foo.bar
| imapc_password = master-user-password
| imapc_user = shareduser
| imapc_master_user = %u
`----
Where do I add "list = children"? In the user-dovecots shared namespace or on the shared-dovecots private namespace?
Shared-dovecot always has mailboxes (at least INBOX), so list=children would equal list=yes.
- Configure the shared Dovecot:
You need master passdb that allows all existing users to log in as "shareduser" user. You can probably simply do (not tested):
passdb {
  type = static
  args = user=shareduser pass=master-user-password
  master = yes
}
The "shareduser" owns all of the actual shared mailboxes and has the necessary ACLs set up for individual users. ACLs use the master username (= the real username in this case) to do the ACL checks.
So this is kind of "backwards", since normally the imapc_master_user would be the static user and imapc_user would be dynamic, right?
Right. Also in this Dovecot you want a regular namespace without prefix:
namespace inbox {
  separator = /
  list = yes
  inbox = yes
}
You might as well use the proper separator here in case you ever change it for users.
Timo Sirainen tss@iki.fi wrote:
On 9.1.2012, at 21.31, Sven Hartge wrote:
,----
| # User's private mail location
| mail_location = mdbox:~/mdbox
|
| # When creating any namespaces, you must also have a private namespace:
| namespace {
|   type = private
|   separator = .
|   prefix = INBOX.
|   #location defaults to mail_location.
|   inbox = yes
| }
|
| namespace {
|   type = public
|   separator = .
|   prefix = #shared.
I'd probably just use "Shared." as prefix, since it is visible to users. Anyway if you want to use # you need to put the value in "quotes" or it's treated as comment.
I have to use "#shared.", because this is what Courier uses. Unfortunately I have to stick to prefixes and seperators used currently.
| location = imapc:~/imapc-shared
What is the syntax of this location? What does "imapc-shared" do in this case?
| subscriptions = no
list = children here
| }
|
| imapc_host = m-st-sh-01.foo.bar
| imapc_password = master-user-password
| imapc_user = shareduser
| imapc_master_user = %u
`----
Where do I add "list = children"? In the user-dovecots shared namespace or on the shared-dovecots private namespace?
Shared-dovecot always has mailboxes (at least INBOX), so list=children would equal list=yes.
OK, seems logical.
- Configure the shared Dovecot:
You need master passdb that allows all existing users to log in as "shareduser" user. You can probably simply do (not tested):
passdb {
  type = static
  args = user=shareduser pass=master-user-password
  master = yes
}
The "shareduser" owns all of the actual shared mailboxes and has the necessary ACLs set up for individual users. ACLs use the master username (= the real username in this case) to do the ACL checks.
So this is kind of "backwards", since normally the imapc_master_user would be the static user and imapc_user would be dynamic, right?
Right. Also in this Dovecot you want a regular namespace without prefix:
namespace inbox {
  separator = /
  list = yes
  inbox = yes
}
You might as well use the proper separator here in case you ever change it for users.
Is this separator converted to '.' on the frontend? The department supporting our users will give me hell if anything visible changes in the layout of the folders for the end user.
Grüße, Sven.
-- Sigmentation fault. Core dumped.
On 9.1.2012, at 21.45, Sven Hartge wrote:
| location = imapc:~/imapc-shared
What is the syntax of this location? What does "imapc-shared" do in this case?
It's the directory for index files. The backend IMAP server is used as a rather dummy storage, so if for example you do a FETCH 1:* BODYSTRUCTURE command, all of the message bodies are downloaded to the user's Dovecot server which parses them. But with indexes this is done only once (same as with any other mailbox format). If you want SEARCH BODY to be fast, you'd also need to use some kind of full text search indexes.
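If SEARCH BODY speed does become relevant, a full text search plugin can be added on the user Dovecots; a minimal sketch using the built-in squat indexer (other backends such as Solr exist):

  mail_plugins = $mail_plugins fts fts_squat
  plugin {
    fts = squat
  }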
If your users share the same UID (or 0666 mode would probably work too), you could share the index files rather than make them per-user. Then you could use imapc:/shared/imapc or something.
BTW. All message flags are shared between users. If you want per-user flags you'd need to modify the code.
Right. Also in this Dovecot you want a regular namespace without prefix:
namespace inbox {
  separator = /
  list = yes
  inbox = yes
}
You might as well use the proper separator here in case you ever change it for users.
Is this separator converted to '.' on the frontend?
Yes, as long as you explicitly specify the separator setting to the public namespace.
Timo Sirainen tss@iki.fi wrote:
On 9.1.2012, at 21.45, Sven Hartge wrote:
| location = imapc:~/imapc-shared
What is the syntax of this location? What does "imapc-shared" do in this case?
It's the directory for index files. The backend IMAP server is used as a rather dummy storage, so if for example you do a FETCH 1:* BODYSTRUCTURE command, all of the message bodies are downloaded to the user's Dovecot server which parses them. But with indexes this is done only once (same as with any other mailbox format). If you want SEARCH BODY to be fast, you'd also need to use some kind of full text search indexes.
The bodies are downloaded but not stored, right? Just the index files are stored locally.
If your users share the same UID (or 0666 mode would probably work too), you could share the index files rather than make them per-user. Then you could use imapc:/shared/imapc or something.
Hmm. Yes, this is a fully virtual setup, every user's mail is owned by the virtmail user. Does this sharing of index files have any security or privacy issues?
Not every user sees every shared folder, so an information leak has to be avoided at all costs.
BTW. All message flags are shared between users. If you want per-user flags you'd need to modify the code.
No, I need shared message flags, as this is the reason we introduced shared folders, so one user can see, if a mail has already been read or replied to.
Right. Also in this Dovecot you want a regular namespace without prefix:
namespace inbox {
  separator = /
  list = yes
  inbox = yes
}
You might as well use the proper separator here in case you ever change it for users.
Is this separator converted to '.' on the frontend?
Yes, as long as you explicitly specify the separator setting to the public namespace.
OK, good to know, one for my documentation with an '!' behind it.
Grüße, Sven
-- Sigmentation fault. Core dumped.
On 9.1.2012, at 22.13, Sven Hartge wrote:
Timo Sirainen tss@iki.fi wrote:
On 9.1.2012, at 21.45, Sven Hartge wrote:
| location = imapc:~/imapc-shared
What is the syntax of this location? What does "imapc-shared" do in this case?
It's the directory for index files. The backend IMAP server is used as a rather dummy storage, so if for example you do a FETCH 1:* BODYSTRUCTURE command, all of the message bodies are downloaded to the user's Dovecot server which parses them. But with indexes this is done only once (same as with any other mailbox format). If you want SEARCH BODY to be fast, you'd also need to use some kind of full text search indexes.
The bodies are downloaded but not stored, right? Just the index files are stored locally.
Right.
If your users share the same UID (or 0666 mode would probably work too), you could share the index files rather than make them per-user. Then you could use imapc:/shared/imapc or something.
Hmm. Yes, this is a fully virtual setup, every user's mail is owned by the virtmail user. Does this sharing of index files have any security or privacy issues?
There are no privacy issues, at least currently, since there is no per-user data. If you had wanted per-user flags this wouldn't have worked.
Not every user sees every shared folder, so an information leak has to be avoided at all costs.
Oh, that reminds me, it doesn't actually work :) Because Dovecot deletes those directories it doesn't see on the remote server. You might be able to use imapc:~/imapc:INDEX=/shared/imapc though. The nice thing about shared imapc indexes is that each user doesn't have to re-index the message.
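Putting that correction together with the earlier pieces, the shared-index variant of the user-side namespace would then look roughly like this (untested sketch; host name and index path are placeholders):

  namespace {
    type = public
    separator = .
    prefix = "#shared."
    location = imapc:~/imapc:INDEX=/shared/imapc
    list = children
    subscriptions = no
  }
  imapc_host = m-st-sh-01.foo.bar
  imapc_password = master-user-password
  imapc_user = shareduser
  imapc_master_user = %u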
Timo Sirainen tss@iki.fi wrote:
On 9.1.2012, at 22.13, Sven Hartge wrote:
Timo Sirainen tss@iki.fi wrote:
On 9.1.2012, at 21.45, Sven Hartge wrote:
| location = imapc:~/imapc-shared
What is the syntax of this location? What does "imapc-shared" do in this case?
It's the directory for index files. The backend IMAP server is used as a rather dummy storage, so if for example you do a FETCH 1:* BODYSTRUCTURE command, all of the message bodies are downloaded to the user's Dovecot server which parses them. But with indexes this is done only once (same as with any other mailbox format). If you want SEARCH BODY to be fast, you'd also need to use some kind of full text search indexes.
If your users share the same UID (or 0666 mode would probably work too), you could share the index files rather than make them per-user. Then you could use imapc:/shared/imapc or something.
Hmm. Yes, this is a fully virtual setup, every user's mail is owned by the virtmail user. Does this sharing of index files have any security or privacy issues?
There are no privacy issues, at least currently, since there is no per-user data. If you had wanted per-user flags this wouldn't have worked.
OK. I think I will go with the per-user index files for now and pay the extra in bandwidth and processing power needed.
All in all, of 10,000 users, only about 100 use shared folders.
Grüße, Sven.
-- Sigmentation fault. Core dumped.
Sven Hartge sven@svenhartge.de wrote:
I am currently in the planning stage for a "new and improved" mail system at my university.
OK, executive summary of the design ideas so far:
- deployment of X (starting with 4, but easily scalable) virtual servers on VMware ESX
- storage will be backed by an RDM on our iSCSI SAN:
  - main mailbox storage will be on 15k SAS6 600GB disks
  - backup rsnapshot storage will be on 7.2k SAS6 2TB disks
- XFS filesystem on LVM, allowing easy local snapshots for rsnapshot
- sharing folders from one user to another is not needed
- central public shared folders reside on their own storage server and are accessed through the imapc backend configured for the "#shared." namespace (needs dovecot 2.1~rc3 or higher)
- mdbox with compression (23h lifetime, 50MB max size)
- quota in MySQL, allowing my MXes to check the quota for a user _before_ accepting any mail for him. This is a much needed feature, currently not possible and thus leading to backscatter right now. (See the sketch after this list.)
- backup with Bacula for file-level backup every 24 hours (120 days retention)
- rsnapshot to node-local backup space for easier access (14 days retention)
- possibly SAN-based remote snapshots to a different storage tier
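A minimal sketch of the quota-in-MySQL piece, based on Dovecot's dict quota backend (untested; file names, credentials and the table layout are placeholders):

  # dovecot.conf
  mail_plugins = $mail_plugins quota
  plugin {
    quota = dict:User quota::proxy::quota
  }
  dict {
    quota = mysql:/etc/dovecot/dovecot-dict-sql.conf.ext
  }

  # /etc/dovecot/dovecot-dict-sql.conf.ext
  connect = host=localhost dbname=mails user=dovecot password=secret
  map {
    pattern = priv/quota/storage
    table = quota
    username_field = username
    value_field = bytes
  }
  map {
    pattern = priv/quota/messages
    table = quota
    username_field = username
    value_field = messages
  }

The MXes would then query the same quota table directly before accepting mail for a user.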
Because sharing an RDM (or VMDK) with multiple VMs pins the VM to an ESX server and prohibits HA and DRS in the ESX cluster, and because of my bad experience with cluster filesystems, I want to avoid one and use only local storage for the personal mailboxes of the users.
Each user is fixed to one server; routing/redirecting of IMAP/POP3 connections happens via perdition (popmap feature via LDAP lookup) in a frontend server (this component has already been working for some 3-ish years).
So each node is isolated from the other nodes, knows only its users and does not care about users on other nodes. This prevents usage of the dovecot director, which only works if all nodes are able to access all mailboxes (correct?)
I am aware this creates a SPoF for a 1/X portion of my users in the case of a VM failure, but this is deemed acceptable, since the use of VMs will allow me to quickly deploy a new one and reattach the RDM. (And if my whole iSCSI storage or ESX cluster fails, I have other, bigger problems than a non-functional mail system.)
Comments?
Grüße, Sven.
-- Sigmentation fault. Core dumped.