[Dovecot] Compressing existing maildirs
I've just enabled zlib for our users, and am looking at how to compress the existing files. The routine for doing this at http://wiki2.dovecot.org/Plugins/Zlib seems a bit complicated. What do you think about simply doing:
find /var/vmail -type f -name "*,S=*" -mtime +1 -exec gzip -S Z -6 '{}' +
I.e. find all maildir-files:
- with size in the name ("*,S=*")
- modified before I enabled zlib plugin
- compress them
- add the Z suffix
- keep timestamps (gzip does that by default)
It's of course racy without the maildirlock, but are there any other problems with this approach?
-jf
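(A possible middle ground between the wiki routine and the bare find command: take Dovecot's maildir lock per mailbox while compressing, so the operation isn't racy. This is only a sketch; the maildirlock path and the maildir layout under /var/vmail are assumptions, and the wiki's approach of compressing to a temporary file and renaming under the lock remains the safer one.)
#!/bin/bash
# Hypothetical per-mailbox wrapper: lock, compress, unlock.
MAILDIRLOCK=/usr/libexec/dovecot/maildirlock   # path varies by distribution
for maildir in /var/vmail/*/Maildir; do
    # maildirlock prints the PID of the lock-holding process; kill it to release.
    if LOCK_PID=$("$MAILDIRLOCK" "$maildir" 10); then
        find "$maildir" -type f -name "*,S=*" -mtime +1 -exec gzip -S Z -6 '{}' +
        kill "$LOCK_PID"
    else
        echo "could not lock $maildir, skipping" >&2
    fi
done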
The cleanest (though not necessarily simplest) way to go about this would be to use dsync to create a new maildir and incrementally direct traffic to a separate Dovecot instance.
Unless you have a legacy application that relies on maildir, switching to mdbox would be a good idea too.
I expect that with Dovecot, compression is something that can "just be turned on", but for fear of any possible issue, I chose to migrate mailboxes in batches using the approach mentioned above.
On Dec 24, 2011, at 7:20 AM, Jan-Frode Myklebust wrote:
I've just enabled zlib for our users, and am looking at how to compress the existing files. The routine for doing this at http://wiki2.dovecot.org/Plugins/Zlib seems a bit complicated. What do you think about simply doing:
find /var/vmail -type f -name "*,S=*" -mtime +1 -exec gzip -S Z -6 '{}' +
I.e. find all maildir-files:
- with size in the name ("*,S=*")
- modified before I enabled zlib plugin
- compress them
- add the Z suffix
- keep timestamps (gzip does that by default)
It's of course racy without the maildirlock, but are there any other problems with this approach?
-jf
On Wed, Dec 28, 2011 at 03:56:33PM -0800, Dovecot-GDH wrote:
The cleanest (though not necessarily simplest) way to go about this would be to use dsync to create a new maildir and incrementally direct traffic to a separate Dovecot instance.
Unless you have a legacy application that relies on maildir, switching to mdbox would be a good idea too.
We just got rid of the legacy app that worked directly against the maildirs, which is the reason we now can turn on compression. I intend to switch to mdbox, but first I need to free up some disks by compressing the existing maildirs (12 TB maildirs, should probably compress down to less than half).
I expect that with Dovecot, compression is something that can "just be turned on", but for fear of any possible issue, I chose to migrate mailboxes in batches using the approach mentioned above.
Migrating to mdbox is much scarier to me than an easily reversible compression of existing maildir files.
Could you please give a bit more detail about how you did this migration? Did you change the user home directory in the process? Seeing the scripts you used to run the migration would be very interesting..
-jf
On 12/29/2011 2:49 AM, Jan-Frode Myklebust wrote:
On Wed, Dec 28, 2011 at 03:56:33PM -0800, Dovecot-GDH wrote:
The cleanest (though not necessarily simplest) way to go about this would be to use dsync to create a new maildir and incrementally direct traffic to a separate Dovecot instance.
Unless you have a legacy application that relies on maildir, switching to mdbox would be a good idea too.
We just got rid of the legacy app that worked directly against the maildirs, which is the reason we now can turn on compression. I intend to switch to mdbox, but first I need to free up some disks by compressing the existing maildirs (12 TB maildirs, should probably compress down to less than half).
How much additional space do you expect the conversion to compressed mdbox to consume? It shouldn't need much. Using dsync, the conversion will be done one mailbox at a time and the existing emails will be compressed when written into the new mdbox mailbox.
After you've converted a few mailboxes by hand and have confirmed you're happy with the results, simply add commands to your bulk conversion script to delete each user maildir and contents after the new mdbox mailbox has been created and populated. Using this method shouldn't require much more additional filesystem space than that equal to your largest single user maildir.
Given your 12TB of mailstore, I'd convert users in small batches over a period of weeks or a month, depending on your total mailbox count. Firing up a conversion script and having it run non-stop until all 12TB are converted is probably asking for trouble due to many factors I shouldn't need to put down here. Time your first few manual conversions. Divide that average time into your daily off-peak hours so you know approximately how many mailboxes you can convert during off-peak hours. Run your script daily against these small sets of mailboxes until the entire process is complete.
-- Stan
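(For reference, a minimal sketch of the batch approach described above, assuming the Dovecot 2.0-era "dsync mirror" syntax and a plain text file of usernames per batch; the paths, log file, and commented-out cleanup step are illustrative, not taken from this thread.)
#!/bin/bash
# Hypothetical batch conversion loop: one dsync run per user.
while read -r user; do
    if dsync -u "$user" mirror mdbox:~/mdbox; then
        # Only enable the cleanup once a few hand-converted users check out:
        # rm -rf "/var/vmail/$user/Maildir"
        echo "$(date) $user converted" >> /var/log/mdbox-conversion.log
    else
        echo "$(date) $user FAILED" >> /var/log/mdbox-conversion.log
    fi
done < /root/todays-batch.txt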
On Thu, Dec 29, 2011 at 07:00:03AM -0600, Stan Hoeppner wrote:
We just got rid of the legacy app that worked directly against the maildirs, which is the reason we now can turn on compression. I intend to switch to mdbox, but first I need to free up some disks by compressing the existing maildirs (12 TB maildirs, should probably compress down to less than half).
How much additional space do you expect the conversion to compressed mdbox to consume?
Somewhere around 1/3 of the current usage, I expect..
It shouldn't need much. Using dsync, the conversion will be done one mailbox at a time and the existing emails will be compressed when written into the new mdbox mailbox.
Yes, I know, but I intend to do more than just convert to mdbox. I want to fix the whole folder structure*, in a new filesystem with different settings (turn on metadata-replication, and possibly also data replication). So I need to free up some disks before this can start.
[*] move away from @Mail's /atmail/a/b/abuser@domain folder structure to mdbox:/srv/mailbackup/%256Hu/%d/%n, stop having home=inbox, possibly use many smaller filesystems instead of one huge one, move the indexes inside home...
-jf
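(A hypothetical dovecot.conf fragment for the target layout sketched in the footnote above; the mdbox path and the %256Hu/%d/%n scheme are the ones quoted there, while the separate mail_home and the INDEX override are assumptions illustrating "home != INBOX" and "indexes inside home".)
mail_home     = /srv/mailhome/%256Hu/%d/%n
mail_location = mdbox:/srv/mailbackup/%256Hu/%d/%n:INDEX=~/index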
On 12/30/2011 8:41 AM, Jan-Frode Myklebust wrote:
On Thu, Dec 29, 2011 at 07:00:03AM -0600, Stan Hoeppner wrote:
We just got rid of the legacy app that worked directly against the maildirs, which is the reason we now can turn on compression. I intend to switch to mdbox, but first I need to free up some disks by compressing the existing maildirs (12 TB maildirs, should probably compress down to less than half).
How much additional space do you expect the conversion to compressed mdbox to consume?
Somewhere around 1/3 of the current usage, I expect..
It shouldn't need much. Using dsync, the conversion will be done one mailbox at a time and the existing emails will be compressed when written into the new mdbox mailbox.
Yes, I know, but I intend to do more than just convert to mdbox. I want to fix the whole folder structure*, in a new filesystem with different settings (turn on metadata-replication, and possibly also data replication). So I need to free up some disks before this can start.
[*] move away from @Mail's /atmail/a/b/abuser@domain folder structure to mdbox:/srv/mailbackup/%256Hu/%d/%n, stop having home=inbox, possibly use many smaller filesystems instead of one huge one, move the indexes inside home...
Roger that. Good strategy. You using SAN storage or local RAID? What filesystem do you plan to use for the new mailbox location? What OS is the Dovecot host? Lastly, how many users do you have? Sorry for prying; I'm always really curious about system details when someone states they have 12TB of mailbox data. ;)
-- Stan
On Fri, Dec 30, 2011 at 06:38:28PM -0600, Stan Hoeppner wrote:
Roger that. Good strategy. You using SAN storage or local RAID? What filesystem do you plan to use for the new mailbox location? What OS is the Dovecot host?
IBM DS4800 SAN-storage. Filesystem is IBM GPFS, which stripes all I/O over all the RAID5 LUNs it has assigned. Kind of like RAID5+0. To guard against disaster if one RAID5 array should fail, we plan on replicating the filesystem metadata on different sets of LUNs.
OS is RHEL (currently RHEL4 and RHEL5, but new servers are implemented on RHEL6).
Lastly, how many users do you have? Sorry for prying,
I'd rather not say.. but we're an ISP with about 250,000 residential customers and multiple mailboxes per customer.
I'm always really curious about system details when someone states they have 12TB of mailbox data. ;)
$ df -h /usr/local/atmail/users
Filesystem         Size  Used Avail Use% Mounted on
/dev/atmailusers    14T   12T  2.1T  85% /usr/local/atmail/users
$ df -hi /usr/local/atmail/users
Filesystem        Inodes IUsed IFree IUse% Mounted on
/dev/atmailusers    145M  109M   37M   75% /usr/local/atmail/users
Looking forward to reducing the number of inodes when we finally move to mdbox.. Should do wonders for the backup process.
-jf
On 12/31/2011 12:56 AM, Jan-Frode Myklebust wrote:
On Fri, Dec 30, 2011 at 06:38:28PM -0600, Stan Hoeppner wrote:
Roger that. Good strategy. You using SAN storage or local RAID? What filesystem do you plan to use for the new mailbox location? What OS is the Dovecot host?
IBM DS4800 SAN-storage. Filesystem is IBM GPFS, which stripes all I/O over all the RAID5 LUNs it has assigned. Kind of like RAID5+0. To guard against disaster if one RAID5 array should fail, we plan on replicating the filesystem metadata on different sets of LUNs.
Nice setup. I've mentioned GPFS for cluster use on this list before, but I think you're the only operator to confirm using it. I'm sure others would be interested in hearing of your first-hand experience: pros, cons, performance, etc. And a ballpark figure on the licensing costs, and whether one can only use GPFS on IBM storage or if storage from other vendors is allowed in the GPFS pool.
To this point IIRC everyone here doing clusters is using NFS, GFS, or OCFS. Each has its downsides, mostly because everyone is using maildir. NFS has locking issues with shared dovecot index files. GFS and OCFS have filesystem metadata performance issues. How does GPFS perform with your maildir workload?
OS is RHEL (currently RHEL4 and RHEL5, but new servers are implemented on RHEL6).
Lastly, how many users do you have? Sorry for prying,
I'd rather not say.. but we're an ISP with about 250,000 residential customers and multiple mailboxes per customer.
I'm always really curious about system details when someone states they have 12TB of mailbox data. ;)
$ df -h /usr/local/atmail/users
Filesystem         Size  Used Avail Use% Mounted on
/dev/atmailusers    14T   12T  2.1T  85% /usr/local/atmail/users
$ df -hi /usr/local/atmail/users
Filesystem        Inodes IUsed IFree IUse% Mounted on
/dev/atmailusers    145M  109M   37M   75% /usr/local/atmail/users
Looking forward to reducing the number of inodes when we finally move to mdbox.. Should do wonders for the backup process.
That will depend to a large degree on your mdbox_rotate_size value. The default is 2MB, which means you'll get multiple ~2MB mdbox files. If we assume the average email size including headers and attachments is 32KB, Dovecot will place ~64 such emails in a single mdbox file with the default 2MB setting. 32KB may be a high or low average depending on your particular users.
Considering there is no inherent performance downside to going larger than the default, and significant gains to be made, consider a setting of 8MB to 16MB. This will dramatically reduce both inode consumption and filesystem metadata IOPS vs maildir. Reducing IOPS on a shared SAN is always a plus, especially if you're going to be adding some extra GPFS replication traffic.
Timo, is there any technical or sanity based upper bound on mdbox size? Anything wrong with using 64MB, 128MB, or even larger for mdbox_rotate_size?
-- Stan
On Sat, Dec 31, 2011 at 01:54:32AM -0600, Stan Hoeppner wrote:
Nice setup. I've mentioned GPFS for cluster use on this list before, but I think you're the only operator to confirm using it. I'm sure others would be interested in hearing of your first-hand experience: pros, cons, performance, etc. And a ballpark figure on the licensing costs, and whether one can only use GPFS on IBM storage or if storage from other vendors is allowed in the GPFS pool.
I used to work for IBM, so I've been a bit uneasy about pushing GPFS too hard publicly, at the risk of being accused of bias. But I changed jobs in November, so now I'm only a satisfied customer :-)
Pros: Extremely simple to configure and manage. Assuming root on all nodes can ssh freely, and port 1191/tcp is open between the nodes, these are the commands to create the cluster, create an NSD (network shared disk), and create a filesystem:
# echo hostname1:manager-quorum > NodeFile # "manager" means this node can be selected as filesystem manager
# echo hostname2:manager-quorum >> NodeFile # "quorum" means this node has a vote in the quorum selection
# echo hostname3:manager-quorum >> NodeFile # all my nodes are usually the same, so they all have same roles.
# mmcrcluster -n NodeFile -p $(hostname) -A
### sdb1 is either a local disk on hostname1 (in which case the other nodes will access it over tcp to
### hostname1), or a SAN-disk that they can access directly over FC/iSCSI.
# echo sdb1:hostname1::dataAndMetadata:: > DescFile # This disk can be used for both data and metadata
# mmcrnsd -F DescFile
# mmstartup -A # starts GPFS services on all nodes
# mmcrfs /gpfs1 gpfs1 -F DescFile
# mount /gpfs1
You can add and remove disks from the filesystem, and change most settings without downtime. You can scale out your workload by adding more nodes (SAN attached or not), and scale out your disk performance by adding more disks on the fly. (IBM uses GPFS to create scale-out NAS solutions, http://www-03.ibm.com/systems/storage/network/sonas/ , which highlights a few of the features available with GPFS.)
There's no problem running GPFS on other vendors' disk systems. I've used a Nexsan SATABoy earlier, for an HPC cluster. One can easily move from one disk system to another without downtime.
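(To illustrate the disk add/remove workflow just described, here is a rough sketch using what I believe are the standard GPFS administration commands; the NSD names and file names are made up, so check the mm* man pages before relying on it.)
### sdc1 here stands for a LUN on the new disk system.
# echo sdc1:hostname1::dataAndMetadata:: > NewDescFile
# mmcrnsd -F NewDescFile
# mmadddisk gpfs1 -F NewDescFile      # add the new LUNs to the live filesystem
# mmdeldisk gpfs1 old_nsd_name        # drain data off an old LUN and remove it
# mmrestripefs gpfs1 -b               # rebalance data across the remaining disks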
Cons: It has its own page cache, statically configured. So you don't get the "all available memory used for page caching" behaviour you normally get on Linux.
There is a kernel module that needs to be rebuilt on every upgrade. It's a simple process, but it needs to be done and means we can't just run "yum update ; reboot" to upgrade.
% export SHARKCLONEROOT=/usr/lpp/mmfs/src
% cp /usr/lpp/mmfs/src/config/site.mcr.proto /usr/lpp/mmfs/src/config/site.mcr
% vi /usr/lpp/mmfs/src/config/site.mcr # correct GPFS_ARCH, LINUX_DISTRIBUTION and LINUX_KERNEL_VERSION
% cd /usr/lpp/mmfs/src/ ; make clean ; make World
% su - root
# export SHARKCLONEROOT=/usr/lpp/mmfs/src
# cd /usr/lpp/mmfs/src/ ; make InstallImages
To this point IIRC everyone here doing clusters is using NFS, GFS, or OCFS. Each has its downsides, mostly because everyone is using maildir. NFS has locking issues with shared dovecot index files. GFS and OCFS have filesystem metadata performance issues. How does GPFS perform with your maildir workload?
Maildir is likely a worst case type workload for filesystems. Millions of tiny-tiny files, making all IO random, and getting minimal controller read cache utilized (unless you can cache all active files). So I've concluded that our performance issues are mostly design errors (and the fact that there were no better mail storage formats than maildir at the time these servers were implemented). I expect moving to mdbox will fix all our performance issues.
I *think* GPFS is as good as it gets for maildir storage on clusterfs, but have no number to back that up ... Would be very interesting if we could somehow compare numbers for a few clusterfs'.
I believe our main limitation in this setup is the IOPS we can get from the backend storage system. It's hard to balance the I/O over enough RAID arrays (the fs is spread over 11 RAID5 arrays of 5 disks each), and we're always having hotspots. Right now two arrays are doing <100 IOPS, while others are doing 400-500 IOPS. I would very much like to replace it with something smarter where we can utilize SSDs for active data and something slower for stale data. GPFS can manage this by itself through its ILM interface, but we don't have the very fast storage to put in as tier 1.
-jf
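(As an illustration of the ILM interface mentioned above, GPFS tiering is driven by SQL-like policy rules, roughly like the sketch below; the pool names are invented and the syntax is reproduced from memory, so treat it as a pointer to the ILM documentation rather than a working policy.)
### policy.rules: place new files on the fast pool, migrate cold files to the slow pool
RULE 'placement' SET POOL 'fast'
RULE 'age_out' MIGRATE FROM POOL 'fast' THRESHOLD(85,70) TO POOL 'slow'
     WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 30
### install the placement rule, then run the migration rule periodically
# mmchpolicy gpfs1 policy.rules
# mmapplypolicy gpfs1 -P policy.rules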
On 1/3/2012 2:14 AM, Jan-Frode Myklebust wrote:
On Sat, Dec 31, 2011 at 01:54:32AM -0600, Stan Hoeppner wrote:
Nice setup. I've mentioned GPFS for cluster use on this list before, but I think you're the only operator to confirm using it. I'm sure others would be interested in hearing of your first-hand experience: pros, cons, performance, etc. And a ballpark figure on the licensing costs, and whether one can only use GPFS on IBM storage or if storage from other vendors is allowed in the GPFS pool.
I used to work for IBM, so I've been a bit uneasy about pushing GPFS too hard publicly, at the risk of being accused of bias. But I changed jobs in November, so now I'm only a satisfied customer :-)
Fascinating. And good timing. :)
Pros: Extremely simple to configure and manage. Assuming root on all nodes can ssh freely, and port 1191/tcp is open between the nodes, these are the commands to create the cluster, create an NSD (network shared disk), and create a filesystem:
# echo hostname1:manager-quorum > NodeFile      # "manager" means this node can be selected as filesystem manager
# echo hostname2:manager-quorum >> NodeFile     # "quorum" means this node has a vote in the quorum selection
# echo hostname3:manager-quorum >> NodeFile     # all my nodes are usually the same, so they all have same roles.
# mmcrcluster -n NodeFile -p $(hostname) -A
### sdb1 is either a local disk on hostname1 (in which case the other nodes will access it over tcp to
### hostname1), or a SAN-disk that they can access directly over FC/iSCSI.
# echo sdb1:hostname1::dataAndMetadata:: > DescFile   # This disk can be used for both data and metadata
# mmcrnsd -F DescFile
# mmstartup -A                                  # starts GPFS services on all nodes
# mmcrfs /gpfs1 gpfs1 -F DescFile
# mount /gpfs1
You can add and remove disks from the filesystem, and change most settings without downtime. You can scale out your workload by adding more nodes (SAN attached or not), and scale out your disk performance by adding more disks on the fly. (IBM uses GPFS to create scale-out NAS solutions http://www-03.ibm.com/systems/storage/network/sonas/ , which highlights a few of the features available with GPFS)
There's no problem running GPFS on other vendors' disk systems. I've used a Nexsan SATABoy earlier, for an HPC cluster. One can easily move from one disk system to another without downtime.
That's good to know. The only FC SAN arrays I've installed/used are the IBM FAStT 600 and Nexsan SATABlade/SATABoy. I much prefer the web management interface on the Nexsan units; it's much more intuitive and flexible. The FAStT is obviously much more suitable to random IOPS workloads with its 15k FC disks vs the 7.2k SATA disks in the Nexsan units (although Nexsan has offered 15k SAS disks and SSDs for a while now).
Cons: It has its own page cache, statically configured. So you don't get the "all available memory used for page caching" behaviour you normally get on Linux.
Yep, that's ugly.
There is a kernel module that needs to be rebuilt on every upgrade. It's a simple process, but it needs to be done and means we can't just run "yum update ; reboot" to upgrade.
% export SHARKCLONEROOT=/usr/lpp/mmfs/src
% cp /usr/lpp/mmfs/src/config/site.mcr.proto /usr/lpp/mmfs/src/config/site.mcr
% vi /usr/lpp/mmfs/src/config/site.mcr   # correct GPFS_ARCH, LINUX_DISTRIBUTION and LINUX_KERNEL_VERSION
% cd /usr/lpp/mmfs/src/ ; make clean ; make World
% su - root
# export SHARKCLONEROOT=/usr/lpp/mmfs/src
# cd /usr/lpp/mmfs/src/ ; make InstallImages
So is this, but it's totally expected since this is proprietary code and not in mainline.
To this point IIRC everyone here doing clusters is using NFS, GFS, or OCFS. Each has its downsides, mostly because everyone is using maildir. NFS has locking issues with shared dovecot index files. GFS and OCFS have filesystem metadata performance issues. How does GPFS perform with your maildir workload?
Maildir is likely a worst case type workload for filesystems. Millions of tiny-tiny files, making all IO random, and getting minimal controller read cache utilized (unless you can cache all active files). So I've
Yep. Which is the reason I've stuck with mbox everywhere I can over the years, minor warts and all, and will be moving to mdbox at some point. IMHO maildir solved one set of problems but created a bigger problem. Many sites hailed maildir as a savior in many ways, then decried it as their user base and IO demands exceeded their storage, scrambling for budget money to fix an "unforeseen" problem that was absolutely clear from day one, at least to anyone with more than a cursory knowledge of filesystem design and hardware performance.
concluded that our performance issues are mostly design errors (and the fact that there were no better mail storage formats than maildir at the time these servers were implemented). I expect moving to mdbox will fix all our performance issues.
Yeah, it should decrease FS IOPS by a couple orders of magnitude, especially if you go with large mdbox files. The larger the better.
I *think* GPFS is as good as it gets for maildir storage on clusterfs, but have no number to back that up ... Would be very interesting if we could somehow compare numbers for a few clusterfs'.
Apparently no one (vendor) with the resources to do so has the desire to do so.
I believe our main limitation in this setup is the IOPS we can get from the backend storage system. It's hard to balance the I/O over enough RAID arrays (the fs is spread over 11 RAID5 arrays of 5 disks each), and we're always having hotspots. Right now two arrays are doing <100 IOPS, while others are doing 400-500 IOPS. I would very much like to replace it with something smarter where we can utilize SSDs for active data and something slower for stale data. GPFS can manage this by itself through its ILM interface, but we don't have the very fast storage to put in as tier 1.
Obviously not news to you: balancing mail workload I/O across large filesystems and wide disk farms will always be a problem, due to which users are logged in at a given moment, and the fact you can't stripe all users' small mail files across all disks. And this is true of all mailbox formats to one degree or another, obviously worst with maildir. A properly engineered XFS can get far closer to linear I/O distribution across arrays than most filesystems due to its allocation group design, but it still won't be perfect. Simply getting away from maildir, with its extraneous metadata I/Os, is a huge win for decreasing clusterFS and SAN IOPS. I'm anxious to see your report on your SAN IOPS after you've converted to mdbox, especially if you go with 16/32MB or larger mdbox files.
-- Stan
On 31.12.2011, at 9.54, Stan Hoeppner wrote:
Timo, is there any technical or sanity based upper bound on mdbox size? Anything wrong with using 64MB, 128MB, or even larger for mdbox_rotate_size?
Should be fine. The only issue is the extra disk I/O required to recreate the files during doveadm purge.
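(A hypothetical dovecot.conf fragment tying the two settings discussed above together: zlib compression on save plus a larger-than-default mdbox rotation size. The exact values are illustrative, not recommendations from this thread.)
mail_plugins = $mail_plugins zlib
mdbox_rotate_size = 16M
plugin {
  zlib_save = gz        # gzip newly saved mails
  zlib_save_level = 6   # same level as the gzip command used for the old maildirs
}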
On 24.12.2011, at 17.20, Jan-Frode Myklebust wrote:
I've just enabled zlib for our users, and am looking at how to compress the existing files. The routine for doing this at http://wiki2.dovecot.org/Plugins/Zlib seems a bit complicated. What do you think about simply doing:
find /var/vmail -type f -name "*,S=*" -mtime +1 -exec gzip -S Z -6 '{}' +
I.e. find all maildir-files:
- with size in the name ("*,S=*")
- modified before I enabled zlib plugin
As long as it doesn't find any already compressed mails..
- compress them
- add the Z suffix
Make sure there's also a :2, suffix already. If someone hasn't logged in for a while, there are still files without it in the new/ directory.
It's of course racy without the maildirlock, but are there any other problems with this approach?
Other than being racy, I guess it should work.
On Thu, Dec 29, 2011 at 02:55:40PM +0200, Timo Sirainen wrote:
I.e. find all maildir-files:
- with size in the name ("*,S=*")
- modified before I enabled zlib plugin
As long as it doesn't find any already compressed mails..
Can't I trust that no mails with a timestamp from before I enabled compression are already compressed? Or will Dovecot compress old messages, keeping the old timestamp, when copying messages between folders, or something like that?
I want to avoid reading every file to check if it's compressed already, as that will add ages to an already slow process..
- compress them
- add the Z suffix
Make sure there's also a :2, suffix already. If someone hasn't logged in for a while, there are still files without it in the new/ directory.
So, find /var/vmail -type f -name "*,S=*:2*" -mtime +6 -exec gzip -S Z -6 '{}' +
Right ? I don't care too much if I miss on a few percent of the files..
(I'll probably have to use "! -newer /somefile" instead of -mtime since it will run for some days)
-jf
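(A small illustration of the "-newer" idea: create a marker file stamped with the time the zlib plugin went live, then let find select only files that are not newer than it. The marker path and date are made up.)
touch -t 201112240000 /var/tmp/zlib-enabled    # when zlib was enabled
find /var/vmail -type f -name "*,S=*:2*" ! -newer /var/tmp/zlib-enabled \
    -exec gzip -S Z -6 '{}' +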
On 29.12.2011, at 15.36, Jan-Frode Myklebust wrote:
On Thu, Dec 29, 2011 at 02:55:40PM +0200, Timo Sirainen wrote:
I.e. find all maildir-files:
- with size in the name ("*,S=*")
- modified before I enabled zlib plugin
As long as it doesn't find any already compressed mails..
Can't I trust that no mails with a timestamp from before I enabled compression are already compressed? Or will Dovecot compress old messages, keeping the old timestamp, when copying messages between folders, or something like that?
It's possible that a user saves a mail with an old IMAP INTERNALDATE (=file's mtime), which is already compressed. You could use ctime, but that could skip mails whose flags have been changed since compression.
I want to avoid reading every file to check if it's compressed already, as that will add ages to an already slow process..
You could use mtime, and just before compressing the mail check if it's already compressed. That won't add much overhead.
- compress them
- add the Z suffix
Make sure there's also a :2, suffix already. If someone hasn't logged in for a while, there are still files without it in the new/ directory.
So, find /var/vmail -type f -name "*,S=*:2*" -mtime +6 -exec gzip -S Z -6 '{}' +
Right ? I don't care too much if I miss on a few percent of the files..
Yes.
On Thu, Dec 29, 2011 at 03:48:05PM +0200, Timo Sirainen wrote:
Can't I trust that no mails with a timestamp from before I enabled compression are already compressed? Or will Dovecot compress old messages, keeping the old timestamp, when copying messages between folders, or something like that?
It's possible that a user saves a mail with an old IMAP INTERNALDATE (=file's mtime), which is already compressed. You could use ctime, but that could skip mails whose flags have been changed since compression.
Ok, if ctime is safer I think I'll use that and not worry too much about missing some messages.
I want to avoid reading every file to check if it's compressed already, as that will add ages to an already slow process..
You could use mtime, and just before compressing the mail check if it's already compressed. That won't add much overhead.
Ah, right.. I'll have to read the messages from disk anyway, so checking if they're compressed or not doesn't add much. So mtime + compression-check is what I'll need to do then.
-jf
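(Putting the pieces together, a sketch of the "mtime + compression check" plan: walk the old files, skip anything that already starts with the gzip magic bytes, and compress the rest. The marker file and paths are assumptions; gzip keeps the original timestamp by default.)
#!/bin/bash
find /var/vmail -type f -name "*,S=*:2*" ! -newer /var/tmp/zlib-enabled -print0 |
while IFS= read -r -d '' f; do
    # already gzip-compressed? (magic bytes 1f 8b)
    if [ "$(head -c 2 "$f" | od -An -tx1 | tr -d ' ')" = "1f8b" ]; then
        continue
    fi
    gzip -S Z -6 "$f"
done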
participants (4)
- Dovecot-GDH
- Jan-Frode Myklebust
- Stan Hoeppner
- Timo Sirainen