[Dovecot] SSD drives are really fast running Dovecot
I just replaced my drives for Dovecot using Maildir format with a pair of Solid State Drives (SSD) in a raid 0 configuration. It's really, really fast. Kind of expensive, but it's like getting 20x the speed for 20x the price. I think the big gain is in the zero seek time.
Here's what I bought.
Crucial RealSSD C300 CTFDDAC256MAG-1G1 2.5" 256GB SATA III MLC Internal Solid State Drive (SSD) http://www.newegg.com/Product/Product.aspx?Item=N82E16820148349
* 2.5"
* 256GB
* SATA III
* *Sequential Access - Read:* 355MB/sec (SATA 6Gb/s), 265MB/sec (SATA 3Gb/s)
* *Sequential Access - Write:* 215MB/sec (SATA 6Gb/s), 215MB/sec (SATA 3Gb/s)
* *Power Consumption (Active):* 2.1W READ, 4.3W WRITE
* *Power Consumption (Idle):* 0.094W
Running it on an Asus motherboard that supports SATA III, with a 6-core AMD CPU and 16 GB of RAM. Might be slightly off topic, but this server screams!
Quoting Marc Perkel marc@perkel.com:
I just replaced my drives for Dovecot using Maildir format with a pair of Solid State Drives (SSD) in a raid 0 configuration. [...]
Hey Marc,
Just for testing purposes, what does a dd speed test give you?
http://it.toolbox.com/blogs/database-soup/testing-disk-speed-the-dd-test-310... IMHO, the key part is exceeding the RAM size, but for a "closer to Maildir" comparison, a decent file size that exceeds the drive cache is good too..
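A minimal sketch of such a run (the size and target path are only illustrative, chosen so the file exceeds the 16 GB of RAM mentioned above; conv=fdatasync makes dd itself wait for the data to reach the disk instead of relying on a trailing sync):

dd if=/dev/zero of=/var/spool/testfile bs=1M count=32768 conv=fdatasync
rm /var/spool/testfile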
Rick
On 1/12/2011 9:58 AM, Rick Romero wrote:
Quoting Marc Perkel marc@perkel.com:
I just replaced my drives for Dovecot using Maildir format with a pair of Solid State Drives (SSD) in a raid 0 configuration. It's really really fast. Kind of expensive but it's like getting 20x the speed for 20x the price. I think the big gain is in the 0 seek time.
Here's what I bought.
Crucial RealSSD C300 CTFDDAC256MAG-1G1 2.5" 256GB SATA III MLC Internal Solid State Drive (SSD) http://www.newegg.com/Product/Product.aspx?Item=N82E16820148349
2.5"
256GB
SATA III
*Sequential Access - Read:* 355MB/sec (SATA 6Gb/s) 265MB/sec (SATA 3Gb/s)
*Sequential Access - Write:* 215MB/sec (SATA 6Gb/s) 215MB/sec (SATA 3Gb/s)
*Power Consumption (Active):* 2.1W READ, 4.3W WRITE
*Power Consumption (Idle):* 0.094W
Running it on an Asus motherboard that supports SATA III - 6 core AMD CPU and 16 gigs of ram. Might be slightly off topic but this server screams! Hey Marc,
Just for testing purposes, what does a dd speed test give you?
http://it.toolbox.com/blogs/database-soup/testing-disk-speed-the-dd-test-310...
IMHO, the key part is exceeding the RAM size, but for a "closer to Maildir" comparison, a decent file size that exceeds the drive cache is good too..
Rick
Looks like a good test. Here's my results.
time sh -c "dd if=/dev/zero of=ddfile bs=8k count=2000000 && sync"
2000000+0 records in
2000000+0 records out
16384000000 bytes (16 GB) copied, 55.403 s, 296 MB/s

real    1m4.738s
user    0m0.336s
sys     0m20.199s
Marc Perkel put forth on 1/12/2011 12:18 PM:
time sh -c "dd if=/dev/zero of=ddfile bs=8k count=2000000 && sync"
16384000000 bytes (16 GB) copied, 55.403 s, 296 MB/s [...]
That's a horrible test case for a mail server, especially one using maildir storage. Streaming read/write b/w results are meaningless for mail I/O. You need a random I/O test such as bonnie++ or iozone to see your IOPS. We already know it'll be off the chart compared to mechanical drives though. Would still be neat to see the numbers. There's also the most realistic test, ironically, given the list on which you asked this question:
http://www.imapwiki.org/Benchmarking
:)
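For illustration, hedged invocations of both kinds of tests (directories, user names and sizes are placeholders; bonnie++'s -s should exceed RAM and its -n option exercises small-file creation/deletion, which is closer to Maildir behaviour; imaptest is the tool the benchmarking page above describes):

bonnie++ -d /var/spool/bench -s 32g -n 128 -u vmail
imaptest host=127.0.0.1 user=testuser pass=testpass mbox=dovecot-crlf clients=20 secs=60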
-- Stan
I just replaced my drives for Dovecot using Maildir format with a pair of Solid State Drives (SSD) in a raid 0 configuration. [...]
I thought about doing this on my email server since its troubles are mostly disk I/O saturation but I was concerned about reliability. Have heard that after so many read/writes SSD will go bad. There are an awful lot of read/writes on an email server.
I will be interested to hear how it stands up for you.
On 12.1.2011, at 21.15, Matt wrote:
I thought about doing this on my email server since its troubles are mostly disk I/O saturation but I was concerned about reliability. Have heard that after so many read/writes SSD will go bad.
There's no need to worry about that in any modern drives.
On Wed, 12 Jan 2011, Timo Sirainen wrote:
There's no need to worry about that in any modern drives.
Hi Timo. Wear levelling often isn't as good as is claimed on the box. Often wear levelling is only across subsets of the SSD not across the entire device.
I've seen several SSD drives fail in production after about 12 months of use, and this in low-write environments (eg, I log syslog to a remote syslog server). I'm refraining from deploying SSD for mail servers for now, much as I would love to.
Rob
-- Email: robert@timetraveller.org Linux counter ID #16440 IRC: Solver (OFTC & Freenode) Web: http://www.practicalsysadmin.com Contributing member of Software in the Public Interest (http://spi-inc.org/) Open Source: The revolution that silently changed the world
On 13.1.2011, at 19.37, Robert Brockway wrote:
I've seen several SSD drives fail in production after about 12 months of use, and this in low-write environments. [...]
How do they fail? Supposedly once a cell has reached its erase-limit it should become read-only. Maybe the failures had nothing to do with wearing?
On Thu, 13 Jan 2011, Timo Sirainen wrote:
How do they fail? Supposedly once a cell has reached its erase-limit it should become read-only. Maybe the failures had nothing to do with wearing?
Hi Timo. I start seeing I/O errors on read or write. I must admit I don't have definitive proof that it is a wear level problem, but there isn't any other obvious cause either. I.e., no known heat or other problems have occurred on the systems.
It's left me a bit gun-shy of deploying SSD more widely. I would certainly love to be able to trust it as an alternative to disk.
Yes cells certainly should go read-only when they run out of write-cycles, which would be a far less painful way for them to fail :)
One of the systems to fail was a firewall running off SSD. A Linux based firewall can lose entire filesystems and keep running[1] so I first noticed the problem when the backups started to fail.
[1] Although you can't change the firewall ruleset without userspace tools.
Cheers,
Rob
-- Email: robert@timetraveller.org Linux counter ID #16440 IRC: Solver (OFTC & Freenode) Web: http://www.practicalsysadmin.com Contributing member of Software in the Public Interest (http://spi-inc.org/) Open Source: The revolution that silently changed the world
One of the systems to fail was a firewall running off SSD.
SSD or CF?
It would appear it's also possible to damage some flash memory by powering off at the wrong moment? I had a router running on a nearly new SLC flash card and it kept suffering errors every 24 hours - perhaps filesystem corruption, since a reboot mostly fixed it. Then after a few more days it died completely: briefly I could repartition it, and an hour later I could no longer even get the OS to detect it, so it appeared absolutely, completely dead.
So that's a new 4GB SLC card, using around 500MB of it and a light writeable filesystem running pfsense (perhaps a few writes per minute) and it died inside a month... I don't have enough data to see if it died from wear or if I was just unlucky...
This is a cheap (ish) CF card though, not an SSD drive
Ed W
On Sat, 2011-01-15 at 10:41 +0000, Ed W wrote:
One of the systems to fail was a firewall running off SSD.
SSD or CF?
That doesn't make a lot of difference. They're all broadly similar. There are better devices and worse devices, but they're mostly crap.
And as I said earlier, even if you think you've worked out which is which, it may change from batch to batch of what is allegedly the *same* product.
It would appear it's also possible to damage some flash memory by powering off at the wrong moment?
Almost all of them will fail hard if you do any serious power-fail testing on them. It's not a hardware failure; it's just that their *internal* file system is corrupt and needs a fsck (or just wiping and starting again). But since you can't *access* the underlying medium, all you can do is cry and buy a new one.
The fun thing is that their internal garbage collection could be triggered by a *read* from the host computer, or could even happen purely due to a timeout of some kind. So there is *never* a time when you know it's "safe to power off because I haven't written to it for 5 minutes".
Yes, it's perfectly possible to design journalling file systems that *are* resilient to power failure. But the "file systems" inside these devices are generally written by the same crack-smoking hobos that write PC BIOSes; you don't expect quality software here.
By putting a logic analyser on some of these devices to watch what they're *actually* doing on the flash when they garbage-collect, we've found some really nasty algorithms. When garbage-collecting, one of them would read from the 'victim' eraseblock into RAM, then erase the victim block while the data were still only held in RAM — so that a power failure at that moment would lose it. And then, just to make sure its race window was nice and wide, it would then pick a *second* victim block and copy data from there into the freshly-erased block, before erasing that second block and *finally* writing the data from RAM back to it. It's just scary :)
-- dwmw2
Matt put forth on 1/12/2011 1:15 PM:
I thought about doing this on my email server since its troubles are mostly disk I/O saturation but I was concerned about reliability. Have heard that after so many read/writes SSD will go bad. There are an awful lot of read/writes on an email server.
From: http://www.storagesearch.com/ssdmyths-endurance.html
"As a sanity check - I found some data from Mtron (one of the few SSD oems who do quote endurance in a way that non specialists can understand). In the data sheet for their 32G product - which incidentally has 5 million cycles write endurance - they quote the write endurance for the disk as "greater than 85 years assuming 100G / day erase/write cycles" - which involves overwriting the disk 3 times a day."
That was written in 2007. SSD flash cell life has increased substantially in the 3-4 year period since.
From a flash cell longevity standpoint, any decent SSD with wear leveling is going to easily outlive the typical server replacement cycle of 3-5 years, and far beyond that. Note that striping two such SSDs (RAID 0) will double the wear cycle life, and striping 4 SSDs will quadruple it, so that 85 years becomes 340+ years of wear life with a 4 SSD stripe (RAID 0).
Your misgivings about using SSDs are based on obsolete data from many years ago.
-- Stan
"As a sanity check - I found some data from Mtron (one of the few SSD oems who do quote endurance in a way that non specialists can understand). In the data sheet for their 32G product - which incidentally has 5 million cycles write endurance - they quote the write endurance for the disk as "greater than 85 years assuming 100G / day erase/write cycles" - which involves overwriting the disk 3 times a day."
That was written in 2007. SSD flash cell life has increased substantially in the 3-4 year period since. Stan you are wrong on that... flash cell life decreases when they shrink process in which they are made.
With the new generation of 25nm flash it's down to 3,000 program/erase cycles. So the drive manufacturers have to battle that by increasing ECC length and using better wear leveling. Or, to quote Anand: "When I first started reviewing SSDs IMFT was shipping 50nm MLC NAND rated at 10,000 program/erase cycles per cell. As I mentioned in a recent SSD article, the move to 3xnm cut that endurance rating in half. Current NAND shipping in SSDs can only last half as long, or approximately 5,000 program/erase cycles per cell. Things aren't looking any better for 25nm. Although the first 25nm MLC test parts could only manage 1,000 P/E cycles, today 25nm MLC NAND is good for around 3,000 program/erase cycles per cell.
The reduction in P/E cycles is directly related to the physics of shrinking these NAND cells; the smaller they get, the faster they deteriorate with each write."
Please read the article on AnandTech [1]
1 - http://www.anandtech.com/show/4043/micron-announces-clearnand-25nm-with-ecc
Regards, Miha
Miha Vrhovnik put forth on 1/13/2011 3:17 AM:
"As a sanity check - I found some data from Mtron (one of the few SSD oems who do quote endurance in a way that non specialists can understand). In the data sheet for their 32G product - which incidentally has 5 million cycles write endurance - they quote the write endurance for the disk as "greater than 85 years assuming 100G / day erase/write cycles" - which involves overwriting the disk 3 times a day."
That was written in 2007. SSD flash cell life has increased substantially in the 3-4 year period since.
Stan you are wrong on that...
You are correct. I was wrong. Let me fix that with a two word edit:
SSD life has increased substantially in the 3-4 year period since.
Overall useful life of an SSD (the whole device) has continued to increase substantially thanks to better wear leveling controllers and RAISE, even though individual flash cell life has continued to decrease as process geometries continue to decrease.
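As a rough back-of-the-envelope illustration (ignoring write amplification and assuming ideal wear leveling), even the 3,000 P/E cycles cited above for 25nm MLC leaves a 256GB drive like the C300 with plenty of headroom:

256 GB x 3,000 P/E cycles = ~768 TB of total writes
768 TB / 100 GB written per day = ~7,680 days, i.e. roughly 21 years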
-- Stan
On Thu, 13 Jan 2011 10:17:20 +0100, "Miha Vrhovnik" miha.vrhovnik@cordia.si wrote:
"As a sanity check - I found some data from Mtron (one of the few SSD oems who do quote endurance in a way that non specialists can understand). In the data sheet for their 32G product - which incidentally has 5 million cycles write endurance - they quote the write endurance for the disk as "greater than 85 years assuming 100G / day erase/write cycles" - which involves overwriting the disk 3 times a day."
That was written in 2007. SSD flash cell life has increased substantially in the 3-4 year period since. Stan you are wrong on that... flash cell life decreases when they shrink process in which they are made.
With the new generation of 25nm flash it's down to 3000/programming cycles. So the drive manufactures have to battle that with increasing a ECC length and better wear leaving. Or if I quote anand "When I first started reviewing SSDs IMFT was shipping 50nm MLC NAND rated at 10,000 program/erase cycles per cell. As I mentioned in a recent SSD article, the move to 3xnm cut that endurance rating in half. Current NAND shipping in SSDs can only last half as long, or approximately 5,000 program/erase cycles per cell. Things aren’t looking any better for 25nm. Although the first 25nm MLC test parts could only manage 1,000 P/E cycles, today 25nm MLC NAND is good for around 3,000 program/erase cycles per cell.
The reduction in P/E cycles is directly related to the physics of shrinking these NAND cells; the smaller they get, the faster they deteriorate with each write."
I would not use MLC in a server environment. SLC has much better program/erase cycles per cell.
On Thu, 2011-01-13 at 20:19 +0100, Steve wrote:
I would not use MLC in a server environment. SLC has much better program/erase cycles per cell.
I wouldn't be overly worried about the underlying medium.
I'm more worried about the translation layer they use on top of it, to make it pretend to be spinning rust. It is essentially a file system, on top of which you are expected to layer another file system. Not particularly efficient, but at least TRIM addresses one of the biggest inefficiencies of that gratuitous extra layering.
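(For what it's worth, on Linux the usual way to let the filesystem pass TRIM down to the device is the discard mount option; an fstab line along these lines is purely illustrative -- device, mount point and filesystem are placeholders:)

/dev/sda1   /srv/mail   ext4   noatime,discard   0   2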
The inefficiency is one thing, but it's the reliability that worries me. It's generally accepted that it takes at least 5 years for a file system implementation to truly reach maturity. And that's for open source code that you can debug, on a medium that you can access directly to do diagnosis and data recovery.
But what we're talking about here is a file system implemented inside a black box where you can't do any of that. And what's more, they keep changing it. Even if you manage to find some device that passes your testing, you may find that the next batch of the *same* device (from your point of view) actually contains completely different software *and* hardware if you take it apart.
These translation layers are almost always a complete pile of crap. Especially in the face of power failures, since they so often completely fail to implement basic data integrity features (the same kind of journalling features that also have to be implemented in the 'real' file system on top of this fake disk).
The best way to use flash is to have a file system that's *designed* for use on flash. The only problem with that is that it wouldn't work with DOS; you can't provide an INT 13h DISK BIOS handler to use it...
-- dwmw2
On 1/12/11 , Jan 12, 9:53 AM, Marc Perkel wrote:
I just replaced my drives for Dovecot using Maildir format with a pair of Solid State Drives (SSD) in a raid 0 configuration. [...]
I've been considering getting a pair of SSDs in raid1 for just the dovecot indexes. The hope would be to minimize the impact of pop3 users hammering the server. Proposed design is something like 2 drives (ssd or platter) for OS and logs, 2 ssds for indexes (soft raid1), 12 sata or sas drives in RAID5 or 6 (hw raid, probably 3ware) for maildirs. The indexes and mailboxes would be mirrored with drbd. Seems like the best of both worlds -- fast and lots of storage.
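For reference, Dovecot can already point its indexes at a separate (SSD-backed) path while the maildirs stay on the big array; a minimal sketch, with purely illustrative paths:

mail_location = maildir:~/Maildir:INDEX=/var/dovecot/index/%u

With something like that, only the index I/O lands on the SSD mirror and the message files stay on the RAID array.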
Does anyone run a configuration like this? How does it work for you?
Anyone have any improvements on the design? Suggestions?
David Jonas put forth on 1/12/2011 6:37 PM:
I've been considering getting a pair of SSDs in raid1 for just the dovecot indexes. [...]
Let me get this straight. You're moving indexes to locally attached SSD for greater performance, and yet, you're going to mirror the indexes and store data between two such cluster hosts over a low bandwidth, high latency GigE network connection? If this is a relatively low volume environment this might work. But, if the volume is high enough that you're considering SSD for performance, I'd say using DRBD here might not be a great idea.
Anyone have any improvements on the design? Suggestions?
Yes. Go with a cluster filesystem such as OCFS or GFS2 and an inexpensive SAN storage unit that supports mixed SSD and spinning storage such as the Nexsan SATABoy with 2GB cache: http://www.nexsan.com/sataboy.php
Get the single FC controller model, two Qlogic 4Gbit FC PCIe HBAs, one for each cluster server. Attach the two servers to the two FC ports on the SATABoy controller. Unmask each LUN to both servers. This enables the cluster filesystem.
Depending on the space requirements of your indexes, put 2 or 4 SSDs in a RAID0 stripe. RAID1 simply DECREASES the overall life of SSDs. SSDs don't have the failure modes of mechanical drives thus RAID'ing them is not necessary. You don't duplex your internal PCIe RAID cards do you? Same failure modes as SSDs.
Occupy the remaining 10 or 12 disk bays with 500GB SATA drives. Configure them as RAID10. RAID5/6 aren't suitable for substantial random write workloads such as mail and database. Additionally, rebuild times for parity RAID schemes (5/6) are up in the many hours, or even days category, and degraded performance of 5/6 is horrible. RAID10 rebuild times are a couple of hours and RAID10 suffers zero performance loss when a drive is down. Additionally, RAID10 can lose HALF the drives in the array as long as no two are both drives in a mirror pair. Thus, with a RAID10 of 10 disks, you could potentially lose 5 drives with no loss in performance. The probability of this is rare, but it demonstrates the point. With a 10 disk RAID 10 of 7.2k SATA drives, you'll have ~800 random read/write IOPS performance. That may seem low, but that's an actual filesystem figure. The physical IOPS figure is double that, 1600. Since you'll have your indexes on 4 SSDs, and the indexes are where the bulk of IMAP IOPS take place (flags), you'll have over 50,000 random read/write IOPS.
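As a rough sketch of the arithmetic behind those spindle figures (the per-drive IOPS number is the assumption doing all the work):

10 drives x ~160 random IOPS each = ~1,600 physical IOPS
RAID10 write penalty of 2 (each write hits both sides of a mirror) -> ~1,600 / 2 = ~800 IOPS at the filesystem level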
Having both SSD and spinning drives in the same SAN controller eliminates the high latency low bandwidth link you were going to use with drbd. It also eliminates buying twice as many SSDs, PCIe RAID cards, and disks, one set for each cluster server. Total cost may end up being similar between the drbd and SAN based solutions, but you have significant advantages with the SAN solution beyond those already mentioned, such as using an inexpensive FC switch and attaching a D2D or tape backup host, installing the cluster filesystem software on it, and directly backing up the IMAP store while the cluster is online and running, or snapshotting it after doing a freeze at the VFS layer.
You should be able to acquire the FC SATABoy with all the drives and SSDs (depending on what size units you choose), plus the two server HBAs for $15-20k USD, maybe less. I've not purchased a unit with SSDs yet, only disk, and they're very reasonably priced compared to pretty much all other SAN arrays on the market. Nexsan's SSD pricing might be a little steep compared to Newegg, but the units they bundle are fully tested and certified with their array controllers. The performance is phenomenal for the price, but obviously there are higher performing units available.
-- Stan
On 1/12/11 , Jan 12, 11:46 PM, Stan Hoeppner wrote:
Let me get this straight. You're moving indexes to locally attached SSD for greater performance, and yet, you're going to mirror the indexes and store data between two such cluster hosts over a low bandwidth, high latency GigE network connection? If this is a relatively low volume environment this might work. But, if the volume is high enough that you're considering SSD for performance, I'd say using DRBD here might not be a great idea.
First, thanks for taking the time to respond! I appreciate the good information.
Currently running DRBD for high availability over directly attached bonded GigE with jumbo frames. Works quite well. Though indexes and maildirs are on the same partition.
The reason for mirroring the indexes is just for HA failover. I can only imagine the hit of rebuilding indexes for every connection after failover.
Yes. Go with a cluster filesystem such as OCFS or GFS2 and an inexpensive SAN storage unit that supports mixed SSD and spinning storage [...]
Depending on the space requirements of your indexes, put 2 or 4 SSDs in a RAID0 stripe. RAID1 simply DECREASES the overall life of SSDs. SSDs don't have the failure modes of mechanical drives thus RAID'ing them is not necessary. You don't duplex your internal PCIe RAID cards do you? Same failure modes as SSDs.
Interesting. I hadn't thought about it that way. We haven't had an SSD fail yet so I have no experience there yet. And I've been curious to try GFS2.
Occupy the remaining 10 or 12 disk bays with 500GB SATA drives. Configure them as RAID10. RAID5/6 aren't suitable for substantial random write workloads such as mail and database. [...]
Raid10 is our normal go to, but giving up half the storage in this case seemed unnecessary. I was looking at SAS drives and it was getting pricy. I'll work SATA into my considerations.
Having both SSD and spinning drives in the same SAN controller eliminates the high latency low bandwidth link you were going to use with drbd. [...]
As long as the SATAboy is reliable I can see it. Probably would be easier to sell to the higher ups too. They won't feel like they're buying everything twice.
David Jonas put forth on 1/14/2011 2:08 PM:
Raid10 is our normal go to, but giving up half the storage in this case seemed unnecessary. I was looking at SAS drives and it was getting pricy. I'll work SATA into my considerations.
That's because you're using the wrong equation for determining your disk storage needs. I posted a new equation on one of the lists a week or two ago. Performance and reliability are far more important now than total space. And today performance means transactional write IOPS, not streaming reads. In today's world, specifically for transaction oriented applications (db and mail), smaller faster more expensive disks are less expensive in total ROI than big fat slow drives. The reason is that few if any organizations actually need 28TB (14 2TB Caviar Green drives--popular with idiots today) of mail storage in a single mail store. That's 50 years worth of mail storage for a 50,000 employee company, assuming your employees aren't allowed porn/video attachments, which most aren't.
As long as the SATAboy is reliable I can see it. Probably would be easier to sell to the higher ups too. They won't feel like they're buying everything twice.
Hit their website and look at their customer list and industry awards. They've won them all pretty much. Simple, reliable, inexpensive SAN storage arrays. No advanced features such as inbuilt snapshots and the like. Performance isn't the fastest on the market but it's far more than adequate. The performance per dollar ratio is very high. I've installed and used a SATABlade and SATABoy myself and they're extremely reliable, and plenty fast. Those were spinning models. I've not used SSDs in their chassis yet.
You configure the controller and drives via a web interface over an ethernet port. There's a lot to love in the way Nexsan builds these things. At least, if you're a HardwareFreak like me.
-- Stan
On Fri, 2011-01-14 at 17:29 -0600, Stan Hoeppner wrote:
slow drives. The reason is that few if any organizations actually need 28TB (14 2TB Caviar Green drives--popular with idiots today) of mail storage in a single mail store. That's 50 years worth of mail storage for a 50,000 employee company, assuming your employees aren't allowed porn/video attachments, which most aren't.
WTF? 28TB of mail storage for some is rather small. Good to see you're still posting without a clue, Stanley. Remember there is a bigger world out there beyond your tiny SOHO.
WTF? 28TB of mail storage for some is rather small. Good to see you're still posting without a clue, Stanley. Remember there is a bigger world out there beyond your tiny SOHO.
I'm with you Noel.
We just bought 252TB of raw disk for about 5k users. Given, this is going in to Exchange on Netapp with multi-site database replication, so this cooks down to about 53TB of usable space with room for recovery databases, defragmentation, archives, etc, but still... 28TB is not much anymore.
Of course, Exchange has also gone in a different direction than folks have been indicating. 2010 has some pretty high memory requirements, but the actual IOPS demands are quite low compared to earlier versions. We're using 1TB 7200RPM SATA drives, and at the number of spindles we've got, combined with the cache in the controllers, expect to have quite a good bit of excess IOPS.
Even on the Dovecot side though - if you use the Director to group your users properly, and equip the systems with enough memory, disk should not be a bottleneck if you do anything reasonably intelligent. We support 12k concurrent IMAP users at ~.75 IOPS/user/sec. POP3, SMTP, and shell access on top of that is negligible.
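For scale, that works out to roughly 12,000 users x 0.75 IOPS/user = ~9,000 aggregate IOPS.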
I'm also surprised by the number of people trying to use DRBD to make local disk look like a SAN so they can turn around and put a cluster filesystem on it - with all those complex moving parts, how do you diagnose poor performance? Who is going to be able to support it if you get hit by a bus? Seems like folks would be better off building or buying a sturdy NFS server. Heck, even at larger budgets, you're probably just going to end up with something that's essentially a clustered NFS server with a SAN behind it.
-Brad
Brad Davidson put forth on 1/14/2011 6:25 PM:
We just bought 252TB of raw disk for about 5k users. Given, this is going in to Exchange on Netapp with multi-site database replication, so this cooks down to about 53TB of usable space with room for recovery databases, defragmentation, archives, etc, but still... 28TB is not much anymore.
The average size of an email worldwide today is less than 4KB, less than one typical filesystem block.
28TB / 4KB = 28,000,000,000,000 bytes / 4,096 bytes = 6,835,937,500 emails
6,835,937,500 emails / 5,000 users = ~1,367,188 emails per user
6.8 billion emails is "not much anymore" for a 5,000 seat org?
You work for the University of Oregon, correct?
From: http://sunshinereview.org/index.php/Oregon_state_budget
"Oregon's budget for FY2009-11 totals $61 billion.[1] The state faced a $3.8 billion biennium FY 2010-11 budget deficit, relying heavily on new taxes and federal stimulus money to close the gap in the final budget signed by Gov. Ted Kulongoski and passed by the Oregon Legislature.[2][3] In Aug. 2010, however, the state budget deficit increased and could top $1 billion.[4] As a result, the governor ordered 9% budget cuts.[5]"
How much did that 252TB NetApp cost the university? $300k? $700k? Just a drop in the bucket right? Do you think that was a smart purchasing decision, given your state's $3.8 Billion deficit? Ergo, do you think having an "unlimited email storage policy" is a smart decision, based on your $3.8 Billion deficit? Your fellow tax payers would probably suggest you need to rein in your email storage policy. Wouldn't you agree?
This is why people don't listen to Noel (I've had him kill filed for a year--but not to the extreme of body filtering him). They probably won't put much stock in what you say either Brad. Why?
You two don't live in reality. Either that, or the reality you live in is _VERY_ different from the rest of the sane world. Policies like U of O's unlimited email drive multi hundred thousand to million dollar systems and storage purchases, driving the state budget further into the red, and demanding more income tax from citizens to pay for it, since Oregon has no sales tax.
"28TB is not much anymore." Tell your fellow taxpayers what that 252TB cost them and I guarantee they'll think 28TB is overkill, especially after you tell them 28TB would store more than 1.3 million emails per each of those 5k students, faculty, staff, etc.
For comparison, as of Feb 2009, the entire digital online content of the Library of Congress was only 74TB. And you just purchased 252TB just for email for a 5,000 head count subsection of a small state university's population?
http://blogs.loc.gov/loc/2009/02/how-big-is-the-library-of-congress/
Sane email retention policies would allow any 5k seat organization to get 10+ years of life out of 28TB (assuming the hardware lived that long). Most could do it with 1TB, which would allow 244k emails per user mailbox. Assuming 28TB net, not raw. Subtract 20% for net and you're at 195k emails per user mailbox. Still overkill...
-- Stan
Stan,
On 1/14/11 7:09 PM, "Stan Hoeppner" stan@hardwarefreak.com wrote:
The average size of an email worldwide today is less than 4KB, less than one typical filesystem block.
28TB / 4KB = 28,000,000,000,000 bytes / 4096 bytes = 6,835,937,500 = 6.8 billion emails / 5,000 users = 1,367,188 emails per user
6.8 billion emails is "not much anymore" for a 5,000 seat org?
You obviously don't live in the same world I do. Have you ever been part of a grant approval process and seen what kinds of files are exchanged, and with what frequency? Complied with retention and archival policies? Dealt with folks who won't (or can't) delete a message once they've received it?
Blithely applying some inexplicable figure you've pulled out of who-knows-where and extrapolating from that hardly constitutes prudent planning. We based our requirement on real numbers observed in our environment, expected growth, and our budget cycle. How do you plan? More blind averaging?
How much did that 252TB NetApp cost the university? $300k? $700k? Just a drop in the bucket right? Do you think that was a smart purchasing decision, given your state's $3.8 Billion deficit?
You're close, if a bit high with one of your guesses. Netapp is good to Education. Not that it matters - you know very little about the financial state of my institution or how capital expenditures work within my department's funding model.
I suppose I shouldn't be surprised though, you seem to be very skilled at taking a little bit of information and making a convincing-sounding argument about it... regardless of how much you actually know.
For comparison, as of Feb 2009, the entire digital online content of the Library of Congress was only 74TB. And you just purchased 252TB just for email for a 5,000 head count subsection of a small state university's population?
I work for central IS, so this is the first stage of a consolidated service offering that we anticipate may encompass all of our staff and faculty. We bought what we could with what we had, anticipating that usage will grow over time as individual units migrate off their existing infrastructure. Again, you're guessing and casting aspersions.
This is enterprise storage; I'm not sure that you know what this actually means either. With Netapp you generally lose on the order of 35-45% due to right-sizing, RAID, spares, and aggregate/volume/snapshot reserves. What's left will be carved up into LUNs and presented to the hosts.
1/3 of the available capacity is passive 3rd-site disaster-recovery. The remaining 2 sites each host both an active and a passive copy of each mail store; we design to be able to sustain a site outage without loss of service. Each site has extra space for several years of growth, database restores, and archival / records retention reserves.
That's how 16TB of active mail can end up requiring 252TB of raw disk. Doing things right can be expensive, but it's usually cheaper in the long run than doing it wrong. It's like looking into a whole other world for you, isn't it? No Newegg parts here...
-Brad
On 1/14/11 8:59 PM, "Brandon Davidson" brandond@uoregon.edu wrote:
Oh, and you probably don't even want to think about what we did for our Dovecot infrastructure. Clustered NFS servers with seamless failover, snapshotting, and real-time block-level replication aren't cheap. The students and faculty/staff not supported by an existing Exchange environment aren't getting any less support, I'll say that much.
Folks trust us with their education, their livelihoods, and their personal lives. I'd like to think that 'my fellow taxpayers' understand the importance of what we do and appreciate the measures we take to ensure the integrity and availability of their data.
-Brad
Brandon Davidson put forth on 1/14/2011 10:59 PM:
You obviously don't live in the same world I do. Have you ever been part of
Not currently, no, thankfully.
a grant approval process and seen what kinds of files are exchanged, and
I've never worked in the public sector, only private, so I've not dealt with the grant process, but I'm not totally ignorant of them either. I've assisted a couple of colleagues in the past with grant proposals. And yes, they can, and often do, suck.
with what frequency? Complied with retention and archival policies? Dealt with folks who won't (or can't) delete a message once they've received it?
I have, unfortunately, had to deal with regulatory compliance and some of the less than sane communications retention policies.
Blithely applying some inexplicable figure you've pulled out of who-knows-where and extrapolating from that hardly constitutes prudent planning.
Statistics are guidelines. As I'm not planning anything in this thread, I don't see how such non existent planning could be categorized as prudent or not. What I did do is simply make the case that 252TB seems bleeping outrageously high for 5k users, whether that entails email alone or every other kind of storage those 5k users need. If my math is correct, that's about 50GB/user, including your snapshot LUNs, etc.
We based our requirement on real numbers observed in our environment, expected growth, and our budget cycle.
You forgot to mention the 35-45% (mentioned below) gross storage loss due to inefficiencies in your chosen hardware vendor's platform/architecture. Over a third to almost half of the drive cost is entangled there, is it not?
How do you plan? More blind averaging?
Ouija board.
You're close, if a bit high with one of your guesses. Netapp is good to Education.
Vendors with the largest profit margins (read: over priced) built into their products are those most willing and able to give big discounts to select customers.
Not that it matters - you know very little about the financial state of my institution or how capital expenditures work within my department's funding model.
That's true. I know nothing about the financial state of your institution. I didn't claim to. I simply know Oregon was/is facing a $3.8B deficit. Your institution is part of the state government budget. Thus, your institution's spending is part of that budget/deficit. That's simply fact. No?
I suppose I shouldn't be surprised though, you seem to be very skilled at taking a little bit of information and making a convincing-sounding argument about it... regardless of how much you actually know.
I know this: 252TB is bleeping ridiculously large for 5K seats at _any_ university, public or private, regardless of how much is wasted for "data management". Also, 35-45% consumption of raw capacity for any internal array functions/management is bleeping ridiculous. Is that your definition of "enterprise"? Massive required waste of raw capacity?
I work for central IS, so this is the first stage of a consolidated service offering that we anticipate may encompass all of our staff and faculty. We bought what we could with what we had, anticipating that usage will grow over time as individual units migrate off their existing infrastructure. Again, you're guessing and casting aspersions.
Guessing? Originally you stated that 252TB for email only, or specifically Exchange. You said nothing of a mass storage consolidation project:
Brad Davidson put forth on 1/14/2011 6:25 PM:
We just bought 252TB of raw disk for about 5k users. Given, this is going in to Exchange on Netapp
Casting aspersions? aspersion:
a : a false or misleading charge meant to harm someone's reputation <cast aspersions on her integrity>
What false or misleading charge did I make with the intention to harm your reputation Brad? I've merely made a technical argument for sane email retention policies, and against the need for 252TB for 5K users' email. I don't recall casting any aspersions.
This is enterprise storage; I'm not sure that you know what this actually means either. With Netapp you generally lose on the order of 35-45% due to right-sizing, RAID, spares, and aggregate/volume/snapshot reserves. What's left will be carved up into LUNs and presented to the hosts.
If your definition of "enterprise storage" is losing 35-45% of raw capacity for housekeeping chores, then I'll stick with my definition, and with Nexsan for my "enterprise" storage needs.
You didn't mention deduplication once yet. With Nexsan's DeDupe SG I cut my regulatory dictated on disk storage requirements in half, and Nexsan disk costs half as much as NetApp, for the same SATA disks. With Nexsan, my overall storage costs are less than half of a NetApp, for basically the same capability. The only "downside" is that I can't get all of the functionality in a single head controller--however in many ways this is actually an advantage. My total costs are still far lower than going with an FAS, CLARiiON, or HNAS+HCP. All software is integrated with no additional licensing costs. I have no restrictions on the number of SAN or CFS/CIFS hosts I can connect to a LUN or an exported filesystem.
Don't get me wrong--NetApp, EMC, and HDS all make great products with nice capabilities. However, you _really_ pay through the nose for it, and keep paying for it as your needs grow. With Nexsan I pay once and only once, unless/until I need more disks. No additional fees to unlock any capabilities already in the box. No restrictions.
1/3 of the available capacity is passive 3rd-site disaster-recovery. The remaining 2 sites each host both an active and a passive copy of each mail store; we design to be able to sustain a site outage without loss of service. Each site has extra space for several years of growth, database restores, and archival / records retention reserves.
Ok, so this 252TB of disk is actually spread out over multiple buildings with multiple FAS controllers, one in each building? And this 252TB isn't just for mail (and its safeguard data) as you previously stated?
That's how 16TB of active mail can end up requiring 252TB of raw disk. Doing things right can be expensive, but it's usually cheaper in the long run than doing it wrong. It's like looking into a whole other world for you, isn't it? No Newegg parts here...
There are many ways to "do things right". And I'd humbly suggest ballooning 16TB of mail (again ridiculously large) into 252TB of raw disk (you're still confusing us as to what is actually what in that 252TB) is not one of them. That's a 16x increase. Do you have deduplication and compression installed, enabled, and scheduled?
-- Stan
On Fri, 2011-01-14 at 21:09 -0600, Stan Hoeppner wrote:
The average size of an email worldwide today is less than 4KB, less than one typical filesystem block.
Standard in your eyes, maybe. Hell, 4KB would barely cover the headers in some messages. I guess you also haven't heard about these things called "email attachments" - please google it, we are not here to be your educators, though christ knows someone needs to be. PS: your small message alone was 6K.
This is why people don't listen to Noel (I've had him kill filed for a year--but not to the extreme of body filtering him). They probably won't put much stock in what you say either Brad. Why?
Oh my, what will I do - loss of sleep coming up? I think not... perhaps the truth just hurts Stanley that much more.
You two don't live in reality. Either that, or the reality you live in is
Really? WOW, cool, thanks. I guess I'll shut down all those smtp servers, the databases, all of it. I mean, if it's all a figment of our imagination, then all the power costs and the costs of BTU cooling requirements, the data costs, they will vanish when we wake up and come back to Stan's reality? AWESOME!
*chuckles* this is better than any sitcom :P
Stan Hoeppner stan@hardwarefreak.com wrote:
[...] The average size of an email worldwide today is less than 4KB, less than one typical filesystem block. [...]
Do not confuse "unix culture" of mostly plain text only email messages with "MS Junk" culture of overblown formatting with background images company logos as a few image files in every (internal) email.
-- [pl>en: Andrew] Andrzej Adam Filip : anfi@onet.eu Let thy maid servant be faithful, strong, and homely. -- Benjamin Franklin
Andrzej Adam Filip anfi@onet.eu wrote:
Stan Hoeppner stan@hardwarefreak.com wrote:
[...] The average size of an email worldwide today is less than 4KB, less than one typical filesystem block. [...]
Do not confuse "unix culture" of mostly plain text only email messages with "MS Junk" culture of overblown formatting with background images company logos as a few image files in every (internal) email.
I just did a rough analysis of the mail spool of my university (6.000 users, students and faculty staff, about 10 million mails) and the average mail size was at about 96KiB. Last year, this average was at 77KiB and in 2009 we were at 62KiB.
Mails the average size of 4KiB would then have been at a time when MIME was not yet invented, I believe. Somewhere in 1994.
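For what it's worth, a rough way to produce that kind of figure from a Maildir tree (the path is illustrative, and only cur/ is scanned):

find /var/vmail -type f -path '*/cur/*' -printf '%s\n' | awk '{ s += $1; n++ } END { printf "%d mails, avg %.1f KiB\n", n, s/n/1024 }'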
Grüße, Sven.
-- Sig lost. Core dumped.
Sven Hartge sven@svenhartge.de wrote:
I just did a rough analysis of the mail spool of my university (6.000 users, students and faculty staff, about 10 million mails) and the average mail size was at about 96KiB. Last year, this average was at 77KiB and in 2009 we were at 62KiB.
Mails the average size of 4KiB would then have been at a time when MIME was not yet invented, I believe. Somewhere in 1994.
I assume that in bigger organizations most of the mail stored in IMAP storage is internal. I also assume that the size of a typical mail in "unix/linux culture" and in "MS culture" differs. That may explain the quite different experiences.
Could you elaborate on the penetration of MS software/culture (especially MS Exchange) at your university?
BTW, I have seen a few (smaller) organizations where most (internal) mails were below 4KB but the remaining *huge* mails were able to very significantly influence the average size. It makes me doubt the value of a *bare* "average email size".
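As a quick made-up illustration of that skew:

999 mails x 2 KB + 1 mail x 100 MB = ~104,400 KB total
mean = ~104 KB per mail, while the median is still 2 KB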
P.S. Anyway, many organizations are legally obliged to archive all emails.
-- [pl>en: Andrew] Andrzej Adam Filip : anfi@onet.eu Oh wearisome condition of humanity! Born under one law, to another bound. -- Fulke Greville, Lord Brooke
Andrzej Adam Filip anfi@onet.eu wrote:
I assume that in bigger organizations most of the mail stored in IMAP storage is internal. I also assume that the size of a typical mail in "unix/linux culture" and in "MS culture" differs. That may explain the quite different experiences.
Could you elaborate on the penetration of MS software/culture (especially MS Exchange) at your university?
Zero on the server side.
We have a central IT which handles all mail and so far no Exchange has been requested. (Groupware features are handled by Egroupware.)
Many users of course use the mail client installed by default (which would be Outlook or Live! Mail) and thus produce and receive HTML mails.
From my SpamAssassin statistics I can see that about 50% of all incoming mails are HTML mails; I guess the proportion would be about the same for outgoing mail.
Grüße, Sven
-- Sig lost. Core dumped.
Sven Hartge put forth on 1/15/2011 9:29 AM:
I just did a rough analysis of the mail spool of my university (6.000 users, students and faculty staff, about 10 million mails) and the average mail size was at about 96KiB. Last year, this average was at 77KiB and in 2009 we were at 62KiB.
Mails the average size of 4KiB would then have been at a time when MIME was not yet invented, I believe. Somewhere in 1994.
No. You're doing a statistical mean. You need to be doing median. The reason should be obvious.
-- Stan
Stan Hoeppner put forth on 1/15/2011 11:03 PM:
Sven Hartge put forth on 1/15/2011 9:29 AM:
Mails the average size of 4KiB would then have been at a time when MIME was not yet invented, I believe. Somewhere in 1994.
No. You're doing a statistical mean. You need to be doing median. The reason should be obvious.
Correcting myself here. You are right, this should be a mean calculation. And the reason is obvious. ;)
-- Stan
Andrzej Adam Filip put forth on 1/15/2011 4:02 AM:
Do not confuse "unix culture" of mostly plain text only email messages with "MS Junk" culture of overblown formatting with background images company logos as a few image files in every (internal) email.
"average size of an email worldwide"
The bulk of all email is personal, not corporate: think Gmail, Hotmail, Yahoo, the 50k ISPs worldwide, etc. Average all of that together with the corporate mail (small percentage), and you're well under 4KB per message, especially considering the amount of SMS gatewaying going on with smart phones today. Most of those are one liners with the smtp header being 4 times the size of the body, with total message size being under 1KB.
-- Stan
"SH" == Stan Hoeppner stan@hardwarefreak.com writes:
SH> "average size of an email worldwide"
SH> The bulk of all email is personal, not corporate: [...]
,----
| More than 97% of all e-mails sent over the net are unwanted, according to a Microsoft security report.[39]
|
| MAAWG estimates that 85% of incoming mail is "abusive email", as of the second half of 2007. The sample size for the MAAWG's study was over 100 million mailboxes.[40][41][42]
|
| Spamhaus estimates that 90% of incoming e-mail traffic is spam in North America, Europe or Australasia.[43] By June 2008 96.5% of e-mail received by businesses was spam.[18][unreliable source?]
`----
http://en.wikipedia.org/wiki/E-mail_spam#As_a_percentage_of_the_total_volume...
---8<---[snipped 3 lines]---8<---
SH> under 4KB per message, especially considering the amount of SMS gatewaying going on with smart phones today. Most of those are one liners with the smtp header being 4 times the size of the body, with total message size being under 1KB.
SH> -- Stan
I just have a tiny set of about 4,000 spam mails, but they have an avg size of 39KB, i.e. well above 4KB.
-- Philipp Haselwarter
Philipp Haselwarter put forth on 1/15/2011 8:32 PM:
,----
| More than 97% of all e-mails sent over the net are unwanted, according
| to a Microsoft security report.[39]
|
| MAAWG estimates that 85% of incoming mail is "abusive email", as of the
| second half of 2007. The sample size for the MAAWG's study was over 100
| million mailboxes.[40][41][42]
|
| Spamhaus estimates that 90% of incoming e-mail traffic is spam in North
| America, Europe or Australasia.[43] By June 2008 96.5% of e-mail
| received by businesses was spam.[18][unreliable source?]
`----
I just have a tiny set of about 4,000 spam mails, but they have an avg size of 39KB, i.e. well above 4KB.
This discussion has been in the context of _storing_ user email. The assumption is that an OP is smart/talented enough to get his spam filters/appliances killing 99% before it reaches intermediate storage or mailboxes. Thus, in the context of this discussion, the average size of a spam message is irrelevant, because we're talking about what goes into the mail store.
If you're storing significantly more than 1% of spam you need to get that under control before doing any kind of meaningful analysis of mail storage needs.
-- Stan
LOL, this is just soooooooo funny, watching the "no no no, I'm right, you're wrong". Give up, Stanley; those on many lists are aware of your trolling. Nobody cares about your lil SOHO world; this list contains many different sized orgs, and like someone else mentioned, the 4K email size is SO 1994. But that about sums you up anyway.
On Sat, 2011-01-15 at 23:19 -0600, Stan Hoeppner wrote:
Philipp Haselwarter put forth on 1/15/2011 8:32 PM:
,----
| More than 97% of all e-mails sent over the net are unwanted, according
| to a Microsoft security report.[39]
|
| MAAWG estimates that 85% of incoming mail is "abusive email", as of the
| second half of 2007. The sample size for the MAAWG's study was over 100
| million mailboxes.[40][41][42]
|
| Spamhaus estimates that 90% of incoming e-mail traffic is spam in North
| America, Europe or Australasia.[43] By June 2008 96.5% of e-mail
| received by businesses was spam.[18][unreliable source?]
`----
I just have a tiny set of about 4,000 spam mails, but they have an avg size of 39KB, i.e. well above 4KB.
This discussion has been in the context of _storing_ user email. The assumption is that an OP is smart/talented enough to get his spam filters/appliances killing 99% before it reaches intermediate storage or mailboxes. Thus, in the context of this discussion, the average size of a spam message is irrelevant, because we're talking about what goes into the mail store.
If you're storing significantly more than 1% of spam you need to get that under control before doing any kind of meaningful analysis of mail storage needs.
On 16.01.2011 06:39, Noel Butler wrote:
LOL, this is just soooooooo funny, watching the "no no no, I'm right, you're wrong". Give up, Stanley; those on many lists are aware of your trolling. Nobody cares about your lil SOHO world; this list contains many different sized orgs, and like someone else mentioned, the 4K email size is SO 1994. But that about sums you up anyway.
On Sat, 2011-01-15 at 23:19 -0600, Stan Hoeppner wrote:
Philipp Haselwarter put forth on 1/15/2011 8:32 PM:
,----
| More than 97% of all e-mails sent over the net are unwanted, according
| to a Microsoft security report.[39]
|
| MAAWG estimates that 85% of incoming mail is "abusive email", as of the
| second half of 2007. The sample size for the MAAWG's study was over 100
| million mailboxes.[40][41][42]
|
| Spamhaus estimates that 90% of incoming e-mail traffic is spam in North
| America, Europe or Australasia.[43] By June 2008 96.5% of e-mail
| received by businesses was spam.[18][unreliable source?]
`----
I just have a tiny set of about 4,000 spam mails, but they have an avg size of 39KB, i.e. well above 4KB.
This discussion has been in the context of _storing_ user email. The assumption is that an OP is smart/talented enough to get his spam filters/appliances killing 99% before it reaches intermediate storage or mailboxes. Thus, in the context of this discussion, the average size of a spam message is irrelevant, because we're talking about what goes into the mail store.
If you're storing significantly more than 1% of spam you need to get that under control before doing any kind of meaningful analysis of mail storage needs.
The simple truth is that SSD drives are fast, and they may well be the future, but mail people are conservative (think about the hell of losing mail). At present, SSD technology is not ready for serving big mail stores (my opinion). Tech moves fast, so at some point we may all be using only SSD drives, or not; we will see. After all, it's not really a Dovecot theme and it shouldn't lead to endless flames on this list.
-- Best Regards
MfG Robert Schetterer
Germany/Munich/Bavaria
This discussion has been in the context of _storing_ user email. The assumption is that an OP is smart/talented enough to get his spam filters/appliances killing 99% before it reaches intermediate storage or mailboxes. Thus, in the context of this discussion, the average size of a spam message is irrelevant, because we're talking about what goes into the mail store.
The fact is, we all live in different realities, so we're all arguing about apples and oranges. If you're managing a SOHO, small company, large company, university, or in our case, an ISP, the requirements are all different. We have about a million mailboxes, about 20K active at the same time, and people pay for it.
Take for example Stan's spam quote above. In the real world of an ISP, killing 99% of all spam before it hits the storage is unthinkable. We only block spam that is guaranteed to be unwanted, mostly based on technical facts that can't ever happen in normal email. But email that our scanning system flags as probable spam, is just that, probable spam. We can not just throw that away, because in the real world, there are always, and I mean always, false positives. It is unthinkable to throw false positives away. So we have to put these emails in a spam folder in case the user wants to look at it. We block about 40% of all spam on technical grounds, our total spam percentage is 90%, so still about 80% of all customer email reaching the storage is spam.
But in other environments throwing away all probable spam may be perfectly fine. For my SOHO I'd have no problem throwing probable spam away. I never look in my spam folder anyway, so I can't be missing much.
The same goes for SSD. We use SSD drives extensively in our company. Currently mostly in database servers, but our experiences have been good enough that we're slowly starting to add them to more systems, even as boot drives. But we're not using them yet in email storage. Like Brad we're using Netapp filers because, as far as I know, they're one of the few companies offering a commercially available HA filesystem. We've looked at EMC and Sun as well, but haven't found a reason to move away from Netapp. In 12 years of Netapp we've only had 1 major outage that lasted half a day (and made the front page of national newspapers). So, understand that bit. Major outages make it to national newspapers for us. HA, failover, etc. are kind of important to us.
So why not build something ourselves and use SSD? I suppose we could, but it's not as easy as it sounds for us (your mileage may vary). It would take significant amounts of engineering time, testing, migrating, etc. And the benefits are uncertain. We don't know if an open source HA alternative can give us another 12 years of virtually faultless operation. It may. It may not. Email is not something to start gambling with. People get kind of upset when their email disappears. We know what we've got with Netapp.
I did dabble in using SSD for indexes for a while, and it looked very promising. Certainly indexes are a prime target for SSD drives. But when the director matured, we started using the director and the netapp for indexes again. I may still build my own NFS server and use SSD drives just for indexes, simply to offload IOPS from the Netapp. Indexes are a little less scary to experiment with.
So, if you're in the position to try out SSD drives for indexes or even for storage, go for it. I'm sure it will perform much better than spinning drives.
Cor
Btw, our average mail size last we checked was 30KB. That's a pretty good average, as we're an ISP with a very wide user base. I think a 4KB average is not a normal mail load.
Cor
Cor Bosman put forth on 1/16/2011 5:34 PM:
Btw, our average mail size last we checked was 30KB. That's a pretty good average, as we're an ISP with a very wide user base. I think a 4KB average is not a normal mail load.
As another OP pointed out, some ISPs apparently have to deliver a lot of spam to mailboxen to avoid FPs, bumping up that average mail size considerably. Do you accept and deliver a lot of spam to user mailboxen?
-- Stan
On Sun, 2011-01-16 at 20:33 -0600, Stan Hoeppner wrote:
Cor Bosman put forth on 1/16/2011 5:34 PM:
Btw, our average mail size last we checked was 30KB. That's a pretty good average, as we're an ISP with a very wide user base. I think a 4KB average is not a normal mail load.
As another OP pointed out, some ISPs apparently have to deliver a lot of spam to mailboxen to avoid FPs, bumping up that average mail size considerably. Do you accept and deliver a lot of spam to user mailboxen?
Still assuming that the 30K average is spam, huh, Stanley? Accept the fact that you screwed up; your data is from 1994, and in fact I question its accuracy even back then.
Keep peddling, it's funny to watch. Thankfully it's in the archives and Google for any of your new prospective employers.
On 17.1.2011, at 5.16, Noel Butler wrote:
Keep peddling, it's funny to watch. Thankfully it's in the archives and Google for any of your new prospective employers.
You should be more worried about prospective employers finding your own mails. I'll help by moderating your mails before they reach the list.
-------- Original Message --------
Date: Sun, 16 Jan 2011 20:33:23 -0600
From: Stan Hoeppner stan@hardwarefreak.com
To: dovecot@dovecot.org
Subject: Re: [Dovecot] SSD drives are really fast running Dovecot
Cor Bosman put forth on 1/16/2011 5:34 PM:
Btw, our average mail size last we checked was 30KB. That's a pretty good average, as we're an ISP with a very wide user base. I think a 4KB average is not a normal mail load.
As another OP pointed out, some ISPs apparently have to deliver a lot of spam to mailboxen to avoid FPs, bumping up that average mail size considerably.
Spam does not bump the average mail size considerably. The average spam mail is way smaller than the average normal mail. The reason for this is very simple: spammers need to reach as many end users as possible, and they need to get those mails out as fast as possible.
Do you accept and deliver a lot of spam to user mailboxen?
-- Stan
On Mon, 17 Jan 2011, Steve wrote:
Spam does not bump the average mail size considerably. The average spam mail is way smaller than the average normal mail. The reason for this is very simple: spammers need to reach as many end users as possible, and they need to get those mails out as fast as possible.
Somewhat correct. Because a lot of spam filter setups skip messages above a certain size, we've seen an increase in such big messages. These affect the average quite severely.
An average, however, is only just that: an average. There may not even be 1 message that has exactly the average size...
When looking at the last two weeks' worth of spam that didn't come from obvious blacklisted sources, I see:
45 messages below 4KB (including quite a few miserable failures that forgot to include a message body...)
127 messages above 8KB, of which only 14 above 20KB
940 messages between 4KB and 8KB
Yet, the _average_ was well above 8KB, due to a few 500KB+ messages.
So, mean, median, or whatever, it's just lies, damn lies, and statistics.
-- Maarten
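A rough way to reproduce this kind of size breakdown on a spam folder, assuming a Maildir-style layout and GNU find (the folder path is only an example):

# bucket spam message sizes into <4KB, 4-8KB, >8KB
find ~/Maildir/.Junk/cur -type f -printf '%s\n' | \
  awk '{ if ($1 < 4096) small++; else if ($1 <= 8192) medium++; else big++ }
       END { printf "<4KB: %d   4KB-8KB: %d   >8KB: %d\n", small, medium, big }'

Comparing buckets like these with the plain mean makes the skew from a handful of 500KB+ messages obvious.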
-------- Original Message --------
Date: Mon, 17 Jan 2011 11:13:19 +0100 (CET)
From: Maarten Bezemer mcbdovecot@robuust.nl
To: Dovecot Mailing List dovecot@dovecot.org
Subject: Re: [Dovecot] SSD drives are really fast running Dovecot
On Mon, 17 Jan 2011, Steve wrote:
Spam does not bump the average mail size considerably. The average spam mail is way smaller than the average normal mail. The reason for this is very simple: spammers need to reach as many end users as possible, and they need to get those mails out as fast as possible.
Somewhat correct. Because a lot of spam filter setups skip messages above a certain size, we've seen an increase in such big messages. These affect the average quite severely.
An average, however, is only just that: an average. There may not even be 1 message that has exactly the average size...
When looking at the last two weeks' worth of spam that didn't come from obvious blacklisted sources, I see:
45 messages below 4KB (including quite a few miserable failures that forgot to include a message body...)
127 messages above 8KB, of which only 14 above 20KB
940 messages between 4KB and 8KB
Yet, the _average_ was well above 8KB, due to a few 500KB+ messages.
You get 500KB+ sized spam messages? That is not usual. I have not done any computation on my part, but I remember seeing last year (or so) a study showing that spam messages are usually below 64KB.
Anyway... why is it so ultra-important how big spam messages are?
So, mean, median, or whatever, it's just lies, damn lies, and statistics.
-- Maarten
On 17/01/2011 13:41, Steve wrote:
You get 500KB+ sized spam messages? That is not usual. I have not done any computation on my part, but I remember seeing last year (or so) a study showing that spam messages are usually below 64KB.
That can depend on what you classify as SPAM. Many 'newsletters' which you've been 'subscribed to' by negative-option web forms are considered SPAM by some, and those may contain PDF attachments of 500KB+.
-- Best Regards,
Giles Coochey NetSecSpec Ltd NL T-Systems Mobile: +31 681 265 086 NL Mobile: +31 626 508 131 Gib Mobile: +350 5401 6693 Email/MSN/Live Messenger: giles@coochey.net Skype: gilescoochey
-------- Original Message --------
Date: Mon, 17 Jan 2011 13:45:51 +0100
From: Giles Coochey giles@coochey.net
To: dovecot@dovecot.org
Subject: Re: [Dovecot] SSD drives are really fast running Dovecot
On 17/01/2011 13:41, Steve wrote:
You get 500KB+ sized spam messages? That is not usual. I have not done any computation on my part, but I remember seeing last year (or so) a study showing that spam messages are usually below 64KB.
That can depend on what you classify as SPAM. Many 'newsletters' which you've been 'subscribed to' by negative-option web forms are considered SPAM by some, and those may contain PDF attachments of 500KB+.
Well... I wrote about "usual", and those newsletters that you tag as spam but have subscribed to are definitely not the norm.
-- Best Regards,
Giles Coochey NetSecSpec Ltd NL T-Systems Mobile: +31 681 265 086 NL Mobile: +31 626 508 131 Gib Mobile: +350 5401 6693 Email/MSN/Live Messenger: giles@coochey.net Skype: gilescoochey
On 17/01/2011 14:18, Steve wrote:
On 17/01/2011 13:41, Steve wrote:
You get 500KB+ sized spam messages? That is not usual. I have not done any computation on my part, but I remember seeing last year (or so) a study showing that spam messages are usually below 64KB.

That can depend on what you classify as SPAM. Many 'newsletters' which you've been 'subscribed to' by negative-option web forms are considered SPAM by some, and those may contain PDF attachments of 500KB+.
Well... I wrote about "usual", and those newsletters that you tag as spam but have subscribed to are definitely not the norm.
I think that was Maarten's point: while they are not the norm, they are off the scale enough to severely influence the 'average'. They account for 0.2% of the events in the sample, yet are more than 10,000% above what is considered the 'norm'.
-- Best Regards,
Giles Coochey NetSecSpec Ltd NL T-Systems Mobile: +31 681 265 086 NL Mobile: +31 626 508 131 Gib Mobile: +350 5401 6693 Email/MSN/Live Messenger: giles@coochey.net Skype: gilescoochey
On Mon, 17 Jan 2011, Steve wrote:
Von: Giles Coochey giles@coochey.net
That can depend on what you classify as SPAM. Many 'newsletters' which you've been 'subscribed to' by negative-option web forms are considered SPAM by some, and those may contain PDF attachments of 500KB+.
Well... I wrote about "usual", and those newsletters that you tag as spam but have subscribed to are definitely not the norm.
I didn't count "newsletters" I subscribed to.. Always using traceable addresses for those. In this case, it was JPG spam with large pics. Some claiming to be LED lighting newsletter, others disguised as new year's greetings. But content showed something not quite related to LEDs or happy new year stuff :-P All these big spams are addressed to bogus addresses, and/or standard addresses like info@domain. Usually with info@some-other-domain in the From: header.
But these are my last 2 cents for this thread as it has been derailing for quite some time now ;-)
-- Maarten
Quoting Stan Hoeppner stan@hardwarefreak.com:
David Jonas put forth on 1/14/2011 2:08 PM:
Raid10 is our normal go to, but giving up half the storage in this case seemed unnecessary. I was looking at SAS drives and it was getting pricy. I'll work SATA into my considerations.
That's because you're using the wrong equation for determining your disk storage needs. I posted a new equation on one of the lists a week or two ago. Performance and reliability are far more important now than
total space. And today performance means transactional write IOPS not streaming reads. In today's world, specifically for transaction oriented applications (db and mail) smaller faster more expensive disks are less expensive in total ROI that big fat slow drives. The reason is that few if any organizations actually need 28TB (14 2TB Cavier Green drives--popular with idiots today) of mail storage in a single mail store. That's 50 years worth of mail storage for a 50,000 employee company, assuming your employees aren't allowed porn/video
attachments, which most aren't.

And that's assuming a platter squeezing in 1TB of data at 7200RPMs doesn't get a comparable performance improvement to a higher rotational speed on a lower volume platter... Hell for the price of a single 250gb SSD drive, you can RAID 10 TEN 7200 RPM 500GB SATAs.
So while, yes, my 10 drive SATA RAID 10 ONLY performs 166MB/sec with a 'simplistic' dd test, in reality I just don't think Joe User is going to notice the difference between that and the superior performance of a single SSD drive when he POPs his 10 3k emails.
Rick
Rick Romero put forth on 1/14/2011 8:29 PM:
And that's assuming a platter squeezing in 1TB of data at 7200RPMs doesn't get a comparable performance improvement to a higher rotational speed on a lower volume platter...
Size and density are irrelevant. Higher density will allow greater streaming throughput at the same spindle speed, _however_ this does nothing for seek performance. Streaming performance is meaningless for transaction servers. IOPS performance is critical for transaction servers. Seek performance equals IOPS performance. The _only_ way to increase mechanical disk IOPS is to increase the spindle speed or the speed of the head actuator. If you've watched mechanical drive evolution for the past 20 years you've seen that actuator speed hasn't increased due to the physical properties of voice coil drive actuators.
Hell for the price of a single 250gb SSD drive, you can RAID 10 TEN 7200 RPM 500GB SATAs.
I think your pricing ratio is a bit off but we'll go with it. You'd get 50,000 4KB random IOPS from the SSD and only 750 IOPS from the RAID 10. The SSD could handle 67 times as many emails per second for 10 times the cost. Not a bad trade.
So while, yes, my 10 drive SATA RAID 10 ONLY performs 166MB/sec with a 'simplistic' dd test, in reality I just don't think Joe User is going to notice the difference between that and the superior performance of a single SSD drive when he POPs his 10 3k emails.
But Joe User _will_ notice a difference if this server with the RAID 10 mentioned above is supporting 5000 concurrent users, not just Joe. Responses will lag. With the SSD you can support 10000 concurrent users (assuming the rest of the hardware is up to that task and you have enough RAM) and responses for all of them will be nearly instantaneous. This is the difference SSD makes, and why it's worth the cost in many situations. However, doing so will require an email retention policy that doesn't allow unlimited storage--unless you can afford that much SSD capacity.
You can get 240,000 4k random IOPS and 1.9TB of capacity from two of these in a software RAID0 for $6,400 USD: http://www.newegg.com/Product/Product.aspx?Item=N82E16820227665
That's enough transactional IOPS throughput to support well over 50,000 concurrent IMAP users, probably far more. Of course this would require a server likely on the order of at least a single socket G34 AMD 12 core Magny Cours system w/2GHz cores, 128GB of RAM, and two free PCIe X4/X8 slots for the SSD cards, based on a board such as this SuperMicro: http://www.newegg.com/Product/Product.aspx?Item=N82E16813182240 (Actually this is the perfect board for running two of these RevoDrive X2 cards)
-- Stan
It would be nice if some of you could stop with the personal attacks.
While I agree that assuming all users only receive 4K emails is not realistic in most environments, neither is assuming a requirement for super-duper triple-redundant hot failover for a mail store with no quota enforcement.
On 1/14/2011 11:16 PM, Stan Hoeppner wrote:
But Joe User _will_ notice a difference if this server with the RAID 10 mentioned above is supporting 5000 concurrent users, not just Joe. Responses will lag. With the SSD you can support 10000 concurrent users (assuming the rest of the hardware is up to that task and you have enough RAM) and responses for all of them will be nearly instantaneous. This is the difference SSD makes, and why it's worth the cost in many situations. However, doing so will require an email retention policy that doesn't allow unlimited storage--unless you can afford that much SSD capacity.
One thing we are looking at here (small 50+ userbase) is kind of a 'best of both worlds' setup - using SSD's (haven't decided yet to trust a bare striped set or go with a 4 drive RAID10 - probably the latter so I can sleep at night) for the main OS and a limited amount of storage space per user (maildir) for active/recent email, then use another namespace with a much higher quota - I'm thinking about 10GB per user should do in our environment - for 'slow' storage (cheap mechanical RAID10 setup) - ie, emails that are only accessed on occasion (mdbox).
Then, enforce a smallish per user quota (how much would depend on your particular environment, but I'm thinking something like 250 or maybe 500MB, since our users do get a lot of large attachments in the course of doing business) on their INBOX - & Sent, Drafts and Templates folders too, but that's a question on my list of 'how to do' - how to easily place these 'special' folders on the 'fast' namespace, and all user created folders in the 'slow' namespace. It would be really nice if there were some kind of native way that dovecot could 'assign' the 'special' folders to the same namespace as the INBOX, and all other user created folders to another...
Doing this will also help train users in proper email management - treating their INBOX just like they would a physical INBOX tray on their desk. They wouldn't just let paper pile up there, why do so in their INBOX (because they 'can')? Ie, it should be something they should always strive to keep totally EMPTY. Of course this practically never happens, but the point is, they need to learn to make a decision once they are finished with it, and most importantly, take said action - either delete it, or file it.
--
Best regards,
Charles
On Jan 15, 2011, at 6:30 AM, Charles Marcus wrote:
One thing we are looking at here (small 50+ userbase) is kind of a 'best of both worlds' setup - using SSD's (haven't decided yet to trust a bare striped set or go with a 4 drive RAID10 - probably the latter so I can sleep at night) for the main OS and a limited amount of storage space per user (maildir) for active/recent email, then use another namespace with a much higher quota - I'm thinking about 10GB per user should do in our environment - for 'slow' storage (cheap mechanical RAID10 setup) - ie, emails that are only accessed on occasion (mdbox).

Then, enforce a smallish per user quota (how much would depend on your particular environment, but I'm thinking something like 250 or maybe 500MB, since our users do get a lot of large attachments in the course of doing business) on their INBOX - & Sent, Drafts and Templates folders too, but that's a question on my list of 'how to do' - how to easily place these 'special' folders on the 'fast' namespace, and all user created folders in the 'slow' namespace. It would be really nice if there were some kind of native way that dovecot could 'assign' the 'special' folders to the same namespace as the INBOX, and all other user created folders to another...

Doing this will also help train users in proper email management - treating their INBOX just like they would a physical INBOX tray on their desk. They wouldn't just let paper pile up there, why do so in their INBOX (because they 'can')? Ie, it should be something they should always strive to keep totally EMPTY. Of course this practically never happens, but the point is, they need to learn to make a decision once they are finished with it, and most importantly, take said action - either delete it, or file it.
Sounds like a great idea. I work with media companies where quotas can be challenging.
-- Brad
On 2011-01-15 9:30 AM, Charles Marcus wrote:
Then, enforce a smallish per user quota (how much would depend on your particular environment, but I'm thinking something like 250 or maybe 500MB, since our users do get a lot of large attachments in the course of doing business) on their INBOX - & Sent, Drafts and Templates folders too, but that's a question on my list of 'how to do' - how to easily place these 'special' folders on the 'fast' namespace, and all user created folders in the 'slow' namespace. It would be really nice if there were some kind of native way that dovecot could 'assign' the 'special' folders to the same namespace as the INBOX, and all other user created folders to another...
Timo - any chance you could comment on the best way to accomplish this - or if it is even possible right now? I'm hoping to start testing this in the next few weeks...
Thanks,
--
Best regards,
Charles
On Mon, 2011-01-17 at 09:07 -0500, Charles Marcus wrote:
On 2011-01-15 9:30 AM, Charles Marcus wrote:
Then, enforce a smallish per user quota (how much would depend on your particular environment, but I'm thinking something like 250 or maybe 500MB, since our users do get a lot of large attachments in the course of doing business) on their INBOX - & Sent, Drafts and Templates folders too, but that's a question on my list of 'how to do' - how to easily place these 'special' folders on the 'fast' namespace, and all user created folders in the 'slow' namespace. It would be really nice if there were some kind of native way that dovecot could 'assign' the 'special' folders to the same namespace as the INBOX, and all other user created folders to another...
Timo - any chance you could comment on the best way to accomplish this - or if it is even possible right now? I'm hoping to start testing this in the next few weeks...
Well, you can have per-namespace quotas. But that of course means that the other namespace must have a prefix, so users would have to have something like:
- INBOX
- Drafts
- Sent
- bignamespace
- work
- etc.
Which wouldn't be very pretty. So if you then wanted to arbitrarily move mailboxes across different quota roots .. I'm not really even sure what would be a good way to configure that.
For INBOX only a separate namespace should be possible to implement without trouble though.
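To make that concrete, a minimal dovecot.conf sketch of the two-namespace layout being discussed (Dovecot 2.x assumed; paths, prefix and limits are placeholders, and the ns= quota-root parameter should be double-checked against the quota plugin documentation for the version in use):

mail_plugins = $mail_plugins quota

namespace {
  prefix =
  separator = /
  inbox = yes
  location = maildir:~/Maildir            # 'fast' SSD-backed storage
}
namespace {
  prefix = Archive/
  separator = /
  location = maildir:~/Maildir-archive    # 'slow' mechanical RAID storage
}

plugin {
  quota = maildir:User quota:ns=
  quota_rule = *:storage=500M
  quota2 = maildir:Archive quota:ns=Archive/
  quota2_rule = *:storage=10G
}

This only gives one quota root per namespace; the "special folders follow the INBOX quota" part Timo mentions is the piece that has no clean configuration yet.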
On 2011-01-17 2:47 PM, Timo Sirainen wrote:
On Mon, 2011-01-17 at 09:07 -0500, Charles Marcus wrote:
On 2011-01-15 9:30 AM, Charles Marcus wrote:
Then, enforce a smallish per user quota (how much would depend on your particular environment, but I'm thinking something like 250 or maybe 500MB, since our users do get a lot of large attachments in the course of doing business) on their INBOX - & Sent, Drafts and Templates folders too, but that's a question on my list of 'how to do' - how to easily place these 'special' folders on the 'fast' namespace, and all user created folders in the 'slow' namespace. It would be really nice if there were some kind of native way that dovecot could 'assign' the 'special' folders to the same namespace as the INBOX, and all other user created folders to another...
Timo - any chance you could comment on the best way to accomplish this - or if it is even possible right now? I'm hoping to start testing this in the next few weeks...
Well, you can have per-namespace quotas. But that of course means that the other namespace must have a prefix, so users would have to have something like:
- INBOX
- Drafts
- Sent
- bignamespace
- work
- etc.
Which wouldn't be very pretty. So if you then wanted to arbitrarily move mailboxes across different quota roots .. I'm not really even sure what would be a good way to configure that.
For INBOX only a separate namespace should be possible to implement without trouble though.
Ok, thanks... hmmm... have to do some more thinking on this one...
--
Best regards,
Charles
Quoting Stan Hoeppner stan@hardwarefreak.com:
Rick Romero put forth on 1/14/2011 8:29 PM:
And that's assuming a platter squeezing in 1TB of data at 7200RPMs doesn't get a comparable performance improvement to a higher rotational speed on a lower volume platter...

Size and density are irrelevant. Higher density will allow greater streaming throughput at the same spindle speed, _however_ this does nothing for seek performance. Streaming performance is meaningless for transaction servers. IOPS performance is critical for transaction servers. Seek performance equals IOPS performance. The _only_ way to increase mechanical disk IOPS is to increase the spindle speed or the speed of the head actuator. If you've watched mechanical drive evolution for the past 20 years you've seen that actuator speed hasn't increased due to the physical properties of voice coil drive actuators.

Hell for the price of a single 250gb SSD drive, you can RAID 10 TEN 7200 RPM 500GB SATAs.

I think your pricing ratio is a bit off but we'll go with it. You'd get 50,000 4KB random IOPS from the SSD and only 750 IOPS from the RAID 10. The SSD could handle 67 times as many emails per second for 10 times the cost. Not a bad trade.

So while, yes, my 10 drive SATA RAID 10 ONLY performs 166MB/sec with a 'simplistic' dd test, in reality I just don't think Joe User is going to notice the difference between that and the superior performance of a single SSD drive when he POPs his 10 3k emails.

But Joe User _will_ notice a difference if this server with the RAID 10 mentioned above is supporting 5000 concurrent users, not just Joe. Responses will lag. With the SSD you can support 10000 concurrent users (assuming the rest of the hardware is up to that task and you have enough RAM) and responses for all of them will be nearly instantaneous. This is the difference SSD makes, and why it's worth the cost in many situations. However, doing so will require an email retention policy that doesn't allow unlimited storage--unless you can afford that much SSD capacity.

You can get 240,000 4k random IOPS and 1.9TB of capacity from two of these in a software RAID0 for $6,400 USD: http://www.newegg.com/Product/Product.aspx?Item=N82E16820227665

That's enough transactional IOPS throughput to support well over 50,000 concurrent IMAP users, probably far more. Of course this would require a server likely on the order of at least a single socket G34 AMD 12 core Magny Cours system w/2GHz cores, 128GB of RAM, and two free PCIe X4/X8 slots for the SSD cards, based on a board such as this SuperMicro: http://www.newegg.com/Product/Product.aspx?Item=N82E16813182240 (Actually this is the perfect board for running two of these RevoDrive X2 cards)
I use pricewatch - so, yes, we may be talking refurb drives, but this is not an issue when you're saving enough money to just buy a few more of the items you're already buying.
Also, if your filesystem is using 4k clusters, aren't you only using 1 random IOPS for a 4k email? It just sounds to me like if you plan 'smarter', anyone can avoid the excessive costs of SSD and get 'end user similar' performance with commodity hardware.
Rick
Rick Romero put forth on 1/15/2011 10:34 AM:
Also, if your filesystem is using 4k clusters, aren't you only using 1 random IOPS for a 4k email? It just sounds to me like if you plan 'smarter', anyone can avoid the excessive costs of SSD and get 'end user similar' performance with commodity hardware.
This depends heavily on which filesystem you use and the flow of mails/sec. For instance, if multiple writes of small (<4KB) files (maildir), or multiple writes to a single file (mbox) on an XFS filesystem are pending simultaneously, because XFS uses a delayed allocation scheme, a variable extent size, and because it is optimized for parallel workloads, it can pack many small files into a single 4KB extent (if they add up to 4KB or less) and write them to disk in a single IOP. Likewise, with mbox storage XFS can coalesce multiple writes to the same file extent in a single IOP. Moreover, XFS can take a pending write of say, 37 small files or write fragments to the same file, and if they fit within say, 12KB, it can write all 37 in 3 pipelined IOPS. Using O_DIRECT with mbox files, the IOPS performance can be even greater. However, I don't know if this applies to Dovecot because AFAIK MMAP doesn't work well with O_DIRECT...
***Hey Timo, does/can Dovecot use Linux O_DIRECT for writing the mail files?

Now, here's where the load issue comes in. If the pending writes are more than a few seconds apart, you lose this XFS coalesced write advantage. I don't know how other filesystems handle writing multiple small files <4KB.
-- Stan
On Sun, 2011-01-16 at 00:05 -0600, Stan Hoeppner wrote:
Using O_DIRECT with mbox files, the IOPS performance can be even greater. However, I don't know if this applies to Dovecot because AFAIK MMAP doesn't work well with O_DIRECT... ***Hey Timo, does/can Dovecot use Linux O_DIRECT for writing the mail files?
mmap doesn't matter, because mbox files aren't read with mmap. But I doubt it's a good idea to use O_DIRECT for mbox files, because even if it gives higher iops, you're using more iops because you keep re-reading the same data from disk since it's not cached to memory.
As for O_DIRECT writes.. I don't know if it's such a good idea either. If client is connected, it's often going to read the mail soon after it was written, so it's again a good idea that it stays in cache.
I once wrote a patch to free message contents from OS cache once the message was read entirely, because it probably wouldn't be read again. No one ever reported if it gave any better or worse performance. http://dovecot.org/patches/1.1/fadvise.diff
Timo Sirainen put forth on 1/16/2011 12:48 PM:
On Sun, 2011-01-16 at 00:05 -0600, Stan Hoeppner wrote:
Using O_DIRECT with mbox files, the IOPS performance can be even greater. However, I don't know if this applies to Dovecot because AFAIK MMAP doesn't work well with O_DIRECT... ***Hey Timo, does/can Dovecot use Linux O_DIRECT for writing the mail files?
mmap doesn't matter, because mbox files aren't read with mmap. But I doubt it's a good idea to use O_DIRECT for mbox files, because even if it gives higher iops, you're using more iops because you keep re-reading the same data from disk since it's not cached to memory.
As for O_DIRECT writes.. I don't know if it's such a good idea either. If client is connected, it's often going to read the mail soon after it was written, so it's again a good idea that it stays in cache.
I once wrote a patch to free message contents from OS cache once the message was read entirely, because it probably wouldn't be read again. No one ever reported if it gave any better or worse performance. http://dovecot.org/patches/1.1/fadvise.diff
I'd gladly test it but I don't have the resources currently, and frankly, at this time, the prerequisite knowledge of building from source.
-- Stan
Quoting David Jonas djonas@vitalwerks.com:
I've been considering getting a pair of SSDs in raid1 for just the dovecot indexes.
While raid-1 is better than the raid-0 of the previous poster, do you really want to slow down your fast SSDs with software raid-1 on top of them?
The hope would be to minimize the impact of pop3 users hammering the server.
Proposed design is something like 2 drives (ssd or platter) for OS and logs
The OS is mostly cached, so any old drive should work. Logs are lots of writes, so a dedicated drive might be nice (no raid). Be sure to tune the OS for the logs...
2 ssds for indexes (soft raid1),
I'd probably just go with one drive, maybe have a spare as a cold- or hot-spare in case of failure.
12 sata or sas drives in RAID5 or 6 (hw raid, probably 3ware) for maildirs.

I'd say either raid-6 or raid-10 for this, depending on budget and size needs.

The indexes and mailboxes would be mirrored with drbd. Seems like the best of both worlds -- fast and lots of storage.
drbd to where, at what level? There was some other discussion about this which basically said "Don't use drbd to mirror between VM guests", which I agree with. If you want to do this, use DRBD between VM servers (physical hosts) and not between VM guests (virtual hosts).
I do use a VM cluster with DRBD between the physical hosts, but not for mail services, and it works fine. Doing DRBD inside the virtual hosts though would not be good...
Does anyone run a configuration like this? How does it work for you?
No. I do 2 nodes, with DRBD between them, using GFS on them for both the mbox files and indexes... No virtualization at all... No SSD drives at all...
Anyone have any improvements on the design? Suggestions?
Only my advice about where to drbd if you are virtualizing, and what raid levels to use... But these are just my opinions and your mileage may vary...
-- Eric Rostetter The Department of Physics The University of Texas at Austin
Go Longhorns!
On Wed, 2011-01-12 at 09:53 -0800, Marc Perkel wrote:
I just replaced my drives for Dovecot using Maildir format with a pair of Solid State Drives (SSD) in a raid 0 configuration. It's really really fast. Kind of expensive but it's like getting 20x the speed for 20x the price. I think the big gain is in the 0 seek time.
You may find ramfs is even faster :)
I hope you have backups.
-- dwmw2
On 13/01/11 17:01, David Woodhouse wrote:
On Wed, 2011-01-12 at 09:53 -0800, Marc Perkel wrote:
I just replaced my drives for Dovecot using Maildir format with a pair of Solid State Drives (SSD) in a raid 0 configuration. It's really really fast. Kind of expensive but it's like getting 20x the speed for 20x the price. I think the big gain is in the 0 seek time.

You may find ramfs is even faster :)

ramfs (tmpfs in linux-land) is useful for indexes. If you lose the indexes, they will be created automatically the next time a user logs in.
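A minimal sketch of that indexes-on-tmpfs idea (mount point, size and mail location are only examples; "vmail" is a placeholder for whatever user your mail processes run as):

# RAM-backed filesystem for indexes; add an equivalent /etc/fstab line to survive reboots
mount -t tmpfs -o size=2g tmpfs /var/dovecot-indexes
chown vmail:vmail /var/dovecot-indexes    # let the mail user create its per-user dirs

# dovecot.conf: mail stays on disk, indexes go to tmpfs
mail_location = maildir:~/Maildir:INDEX=/var/dovecot-indexes/%u

The indexes vanish on reboot, but as noted above Dovecot simply rebuilds them the next time each user logs in.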
We are now trying the zlib plugin to lower the number of IOPS to our maildir storage systems. We are using gzip (bzip2 increases the latency a lot). LZMA/xz seems interesting (high compression and rather good decompression speed) and lzo also seems interesting (blazing fast compression AND decompression, not much compression savings though).
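For reference, the kind of zlib setup described above is roughly this in dovecot.conf (the compression level is just an example; the setting names are from the 2.x zlib plugin and worth verifying against its documentation):

# compress newly saved mails with gzip; already-stored mails are left alone
mail_plugins = $mail_plugins zlib
plugin {
  zlib_save = gz         # gzip; bz2 is also accepted but, as noted, adds latency
  zlib_save_level = 6    # 1..9, trades CPU for compression ratio
}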
What kind of "tricks" do you use to lower the number of IOPS of your dovecot servers?
Regards
Javier
Javier de Miguel Rodríguez put forth on 1/16/2011 12:00 PM:
What kind of "tricks" do you use to lower the number of IOPS of your dovecot servers?
Using hardware SAN RAID controllers with 'large' (2GB) write cache. The large write cache allows for efficient use of large queue depths. A deeper queue allows the drives to order reads/writes most efficiently decreasing head seek movement. This doesn't necessarily decrease IO per se, but it makes the drives more efficient, allowing for more total physical drive IOPS.
Using XFS with delayed logging mount option (requires kernel 2.6.36 or later).
XFS has natively used delayed allocation for quite some time, coalescing multiple pending writes before pushing them into the buffer cache. This not only decreases physical IOPS, but it also decreases filesystem fragmentation by packing more files into each extent. Decreased fragmentation means fewer disk seeks required per file read, which also decreases physical IOPS. This also greatly reduces the wasted space typical of small file storage. Works very well with maildir, but also with the other mail storage formats.
Using the delayed logging feature, filesystem metadata write operations are pushed almost entirely into RAM. Not only does this _dramatically_ decrease physical metadata write IOPS but it also increases metadata write performance by an order of magnitude. Really shines with maildir, obviously, but would also help the s/mdbox formats since they make use of multiple files. Delaylog doesn't help mbox at all, and it doesn't do anything for index file performance. The caveat here is _load_. You won't get much benefit on a mostly idle server. The benefits of delayed logging increase as the filesystem metadata write load increases. Busy servers benefit the most.
-- Stan
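As a concrete example of the delaylog setup described above, an fstab-style entry might look like this (device, mount point and log buffer size are placeholders; the delaylog option needs kernel 2.6.36 or later):

# /etc/fstab: XFS mail store with delayed logging and a larger in-memory log buffer
/dev/sdb1  /var/vmail  xfs  noatime,delaylog,logbsize=256k  0  0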
Using XFS with delayed logging mount option (requires kernel 2.6.36 or later).
XFS has natively used delayed allocation for quite some time, coalescing multiple pending writes before pushing them into the buffer cache. This not only decreases physical IOPS, but it also decreases filesystem fragmentation by packing more files into each extent. Decreased fragmentation means fewer disk seeks required per file read, which also decreases physical IOPS. This also greatly reduces the wasted space typical of small file storage. Works very well with maildir, but also with the other mail storage formats.
What happens if you pull out the wrong cable in the rack, kernel lockup/oops, power failure, hot swap disk pulled, or something else which causes an unexpected loss of a few seconds of written data?
Surely your IOPs are hard limited by the number of fsyncs (and size of any battery backed ram)?
Ed W
Ed W put forth on 1/16/2011 4:11 PM:
Using XFS with delayed logging mount option (requires kernel 2.6.36 or later).
XFS has natively used delayed allocation for quite some time, coalescing multiple pending writes before pushing them into the buffer cache. This not only decreases physical IOPS, but it also decreases filesystem fragmentation by packing more files into each extent. Decreased fragmentation means fewer disk seeks required per file read, which also decreases physical IOPS. This also greatly reduces the wasted space typical of small file storage. Works very well with maildir, but also with the other mail storage formats.
What happens if you pull out the wrong cable in the rack, kernel lockup/oops, power failure, hot swap disk pulled, or something else which causes an unexpected loss of a few seconds of written data?
Read the XFS FAQ. These questions have been answered hundreds of times since XFS was released in Irix in 1994. I'm not your personal XFS tutor.
Surely your IOPs are hard limited by the number of fsyncs (and size of any battery backed ram)?
Depends on how your applications are written and how often they call fsync. Do you mean BBWC? WRT delayed logging BBWC is mostly irrelevant. Keep in mind that for delayed logging to have a lot of metadata writes in memory someone, or many someones, must be doing something like an 'rm -rf' or equivalent on a large dir with many thousands of files. Even in this case, the processing is _very_ fast.
If your assumption is that your system is unstable, or you assume you will do stupid things to break your system, then don't use a high performance filesystem. This behavior is not limited to XFS.
-- Stan
On 17/01/2011 02:20, Stan Hoeppner wrote:
Ed W put forth on 1/16/2011 4:11 PM:
Using XFS with delayed logging mount option (requires kernel 2.6.36 or later).
XFS has natively used delayed allocation for quite some time, coalescing multiple pending writes before pushing them into the buffer cache. This not only decreases physical IOPS, but it also decreases filesystem fragmentation by packing more files into each extent. Decreased fragmentation means fewer disk seeks required per file read, which also decreases physical IOPS. This also greatly reduces the wasted space typical of small file storage. Works very well with maildir, but also with the other mail storage formats.

What happens if you pull out the wrong cable in the rack, kernel lockup/oops, power failure, hot swap disk pulled, or something else which causes an unexpected loss of a few seconds of written data?

Read the XFS FAQ. These questions have been answered hundreds of times since XFS was released in Irix in 1994. I'm not your personal XFS tutor.
Why the hostile reply?
The question was deeper than your response?
Surely your IOPs are hard limited by the number of fsyncs (and size of any battery backed ram)?

Depends on how your applications are written and how often they call fsync. Do you mean BBWC? WRT delayed logging BBWC is mostly irrelevant. Keep in mind that for delayed logging to have a lot of metadata writes in memory someone, or many someones, must be doing something like an 'rm -rf' or equivalent on a large dir with many thousands of files. Even in this case, the processing is _very_ fast.
You have completely missed my point.
Your data isn't safe until it hits the disk. There are plenty of ways to spool data to ram rather than committing it, but they are all vulnerable to data loss until the data is written to disk.
You wrote: "filesystem metadata write operations are pushed almost entirely into RAM", but if the application requests an fsync then you still have to write it to disk? As such you are again limited by disk IO, which itself is limited by the performance of the device (and temporarily accelerated by any persistent write cache). Hence my point that your IOPs are generally limited by the number of fsyncs and any persistent write cache?
As I write this email I'm struggling with getting a server running again that has just been rudely powered down due to a UPS failing (power was fine, UPS failed...). This isn't such a rare event (IMHO) and hence I think we do need to assume that at some point every machine will suffer a rude and unexpected event which loses all in-progress write cache. I have no complaints about XFS in general, but I think it's important that filesystem designers in general give some thought to this event and recovering from it?
Please try not to be so hostile in your email construction - we aren't all idiots here, and even if we were, your writing style is not conducive to us wanting to learn from your apparent wealth of experience?
Regards
Ed W
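On the fsync point above: Dovecot itself exposes that durability-versus-IOPS trade-off. A minimal sketch, assuming the mail_fsync setting from Dovecot 2.x (verify against doveconf output for the version in use):

# dovecot.conf: how aggressively Dovecot fsyncs mail and index writes
mail_fsync = optimized    # fsync only where needed to avoid corruption (default)
#mail_fsync = always      # safest, most IOPS; generally recommended with NFS
#mail_fsync = never       # fewest IOPS, but a crash can lose recent changes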
Ed W put forth on 1/17/2011 12:23 PM:
On 17/01/2011 02:20, Stan Hoeppner wrote:
Ed W put forth on 1/16/2011 4:11 PM:
Using XFS with delayed logging mount option (requires kernel 2.6.36 or later).
XFS has natively used delayed allocation for quite some time, coalescing multiple pending writes before pushing them into the buffer cache. This not only decreases physical IOPS, but it also decreases filesystem fragmentation by packing more files into each extent. Decreased fragmentation means fewer disk seeks required per file read, which also decreases physical IOPS. This also greatly reduces the wasted space typical of small file storage. Works very well with maildir, but also with the other mail storage formats.

What happens if you pull out the wrong cable in the rack, kernel lockup/oops, power failure, hot swap disk pulled, or something else which causes an unexpected loss of a few seconds of written data?

Read the XFS FAQ. These questions have been answered hundreds of times since XFS was released in Irix in 1994. I'm not your personal XFS tutor.
Why the hostile reply?
If you think the above is "hostile" you have lived a privileged and sheltered life, and I envy you. :) That isn't "hostile" but a combination of losing patience and being blunt. "Hostile" is "f--k you!". Obviously I wasn't being "hostile".
The question was deeper than your response?
Do you want to troll or learn something?
Prior to 2007 there was a bug in XFS that caused filesystem corruption upon power loss under some circumstances--actual FS corruption, not simply zeroing of files that hadn't been fully committed to disk. Many (uneducated) folk in the Linux world still to this day tell others to NOT use XFS because "Power loss will always corrupt your file system." Some probably know better but are EXT or JFS (or god forbid, BTRFS) fans and spread fud regarding XFS. This is amusing considering XFS is hands down the best filesystem available on any platform, including ZFS. Others are simply ignorant and repeat what they've heard without looking for current information.
Thus, when you asked the question the way you did, you appeared to be trolling, just like the aforementioned souls who do the same. So I directed you to the XFS FAQ where all of the facts are presented and all of your questions would be answered, from the authoritative source, instead of wasting my time on a troll.
Surely your IOPs are hard limited by the number of fsyncs (and size of any battery backed ram)?

Depends on how your applications are written and how often they call fsync. Do you mean BBWC? WRT delayed logging BBWC is mostly irrelevant. Keep in mind that for delayed logging to have a lot of metadata writes in memory someone, or many someones, must be doing something like an 'rm -rf' or equivalent on a large dir with many thousands of files. Even in this case, the processing is _very_ fast.
You have completely missed my point.
No, I haven't.
Your data isn't safe until it hits the disk. There are plenty of ways to spool data to ram rather than committing it, but they are all vulnerable to data loss until the data is written to disk.
The delayed logging code isn't a "ram spooler", although that is a mild side effect. Apparently I didn't explain it fully, or precisely. And keep in mind, I'm not the dev who wrote the code. So I'm merely repeating my recollection of the description from the architectural document and what was stated on the XFS list by the author, Dave Chinner of Red Hat.
You wrote: "filesystem metadata write operations are pushed almost entirely into RAM", but if the application requests an fsync then you still have to write it to disk? As such you are again limited by disk IO, which itself is limited by the performance of the device (and temporarily accelerated by any persistent write cache). Hence my point that your IOPs are generally limited by the number of fsyncs and any persistent write cache?
In my desire to be brief I didn't fully/correctly explain how delayed logging works. I attempted a simplified explanation that I thought most would understand. Here is the design document: http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
Early performance numbers: http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
As I write this email I'm struggling with getting a server running again that has just been rudely powered down due to a UPS failing (power was fine, UPS failed...). This isn't such a rare event (IMHO) and hence I think we do need to assume that at some point every machine will suffer a rude and unexpected event which loses all in-progress write cache. I have no complaints about XFS in general, but I think it's important that filesystem designers in general give some thought to this event and recovering from it?
Rest assured this is a top priority. Ever heard of SGI by chance? They sell supercomputers with 1024 CPUs, 16 terabytes of RAM, and petabyte FC RAID systems, in a shared memory NUMA configuration, i.e "SMP", but the memory access times aren't symmetric. In short, it's a 1024 CPU server--that costs something like $4+ million USD. SGI was the creator of XFS in 93/94 and open sourced it in 2000 when they decided to move from MIPS/IRIX to Itanium/Linux. SGI has used nothing but XFS since 1994 on all their systems. NASA currently has almost a petabyte of XFS storage, and 10 petabytes of CXFS storage. CXFS is the proprietary clustered version of XFS.
NASA is but one high profile XFS user on this planet. There are hundreds of others, including many US Government labs of all sorts. With customers such as these, data security/reliability is a huge priority.
Please try not to be so hostile in your email construction - we aren't all idiots here, and even if we were, your writing style is not conducive to us wanting to learn from your apparent wealth of experience?
You're overreacting. Saying "I'm not your personal XFS tutor" is not being hostile. Heh, if you think that was hostile, go live on NANAE for a few days or a week and report back on what real hostility is. ;)
-- Stan
On 20/01/2011 06:06, Stan Hoeppner wrote:
If you think the above is "hostile" you have lived a privileged and sheltered life, and I envy you. :) That isn't "hostile" but a combination of losing patience and being blunt. "Hostile" is "f--k you!". Obviously I wasn't being "hostile".
I'm living in the "Dovecot mailing list" which has historically been a very tolerant and forgiving place to learn? Do you mind if I continue to remain "sheltered"?
You're overreacting. Saying "I'm not your personal XFS tutor" is not being hostile. Heh, if you think that was hostile, go live on NANAE for a few days or a week and report back on what real hostility is. ;)
I for one don't want the tone of this list to deteriorate to "NANAE" levels....
There are plenty of lists and forums where you can get sarcastic answers from folks with more experience than ones self. Please lets try and keep the tone of this list as the friendly, helpful place it has been?
To offer just an *opinion*, being sarcastic (or just less than fully helpful) to "idiots" who "can't be bothered" to learn the basics before posting is rarely beneficial. Many simply leave and go elsewhere. Some do the spadework and become "experienced", but in turn they usually respond in the same sharp way to new "inexperienced" questions... The circle continues...
I find it helpful to always presume there is a reason I should respect the poster, despite what might look like a lazy question to me. Does someone with 10 years of experience in their own field deserve me to be sharp with them because they tried to skip a step and ask a "lazy question" without doing their own leg work? Only yesterday I was that dimwit, having spent 5 hours applying the wrong patch to a kernel and wondering why it failed to build, until I finally asked their list and got a polite reply pointing out my very trivial mistake...
Lets assume everyone deserves some respect and take the time to answer the dim questions politely?
Oh well, pleading over. Good luck and genuinely thanks to Stan for spending his valuable time here. Here's hoping you will continue to do so, but also being nice to the dummies?
Regards
Ed W
Ed W put forth on 1/20/2011 6:54 AM:
Oh well, pleading over. Good luck and genuinely thanks to Stan for spending his valuable time here. Here's hoping you will continue to do so, but also being nice to the dummies?
"Dummies" isn't what this was about. Again, I misread the intent of your question as being troll bait against XFS. That's why I responded with a blunt, short reply. I misread you, you misread me, now we're all one big happy family. Right? :)
-- Stan
On 1/20/11 12:06 AM -0600 Stan Hoeppner wrote:
This is amusing considering XFS is hands down the best filesystem available on any platform, including ZFS. Others are simply ignorant and repeat what they've heard without looking for current information.
Not to be overly brusque, but that's a laugh. The two "best" filesystems out there today are vxfs and zfs, for almost any enterprise workload that exists. I won't argue that xfs doesn't stand out for specific workloads such as sequential writes; it might, and I don't know quite enough about it to be sure. But for general workloads, including a mail store, zfs is leaps ahead. I'd include WAFL in the top 3 but it's only accessible via NFS. Well, there is a SAN version, but it doesn't really give you access to the best of the filesystem feature set (a tradeoff for other features of the hardware).
Your pronouncement that others are simply ignorant is telling.
Your data isn't safe until it hits the disk. There are plenty of ways to spool data to ram rather than committing it, but they are all vulnerable to data loss until the data is written to disk.
The delayed logging code isn't a "ram spooler", although that is a mild side effect. Apparently I didn't explain it fully, or precisely. And keep in mind, I'm not the dev who wrote the code. So I'm merely repeating my recollection of the description from the architectural document and what was stated on the XFS list by the author, Dave Chinner of Red Hat. ... In my desire to be brief I didn't fully/correctly explain how delayed logging works. I attempted a simplified explanation that I thought most would understand. Here is the design document: http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
I guess I understand your championing of it if you consider that a design document. That brief piece of email hardly describes it at all, and the performance numbers are pretty worthless (due to the caveat that barriers are disabled).
Given the paragraph in the "design document":
The best IO behaviour comes from the delayed logging version of XFS, with the lowest bandwidth and iops to sustain the highest performance. All the IO is to the log - no metadata is written to disk at all, which is the way this test should execute. As a result, the delayed logging code was the only configuration not limited by the IO subsystem - instead it was completely CPU bound (8 CPUs worth)...
it is indeed a "ram spooler", for metadata, which is a standard (and good) approach. That's not a side effect, that's the design. AFAICT from the brief description anyway.
This is guaranteed to lose data on power loss or drive failure.
Frank Cusack put forth on 1/20/2011 2:30 PM:
On 1/20/11 12:06 AM -0600 Stan Hoeppner wrote:
This is amusing considering XFS is hands down the best filesystem available on any platform, including ZFS. Others are simply ignorant and repeat what they've heard without looking for current information.
Your pronouncement that others are simply ignorant is telling.
So is your intentionally quoting me out of context. In context:
Me: "Prior to 2007 there was a bug in XFS that caused filesystem corruption upon power loss under some circumstances--actual FS corruption, not simply zeroing of files that hadn't been fully committed to disk. Many (uneducated) folk in the Linux world still to this day tell others to NOT use XFS because "Power loss will always corrupt your file system." Some probably know better but are EXT or JFS (or god forbid, BTRFS) fans and spread fud regarding XFS. This is amusing considering XFS is hands down the best filesystem available on any platform, including ZFS. Others are simply ignorant and repeat what they've heard without looking for current information."
The "ignorant" are those who blindly accept the false words of others regarding 4+ year old "XFS corruption on power fail" as being true today. They accept but without verification. Hence the "rumor" persists in many places.
In my desire to be brief I didn't fully/correctly explain how delayed logging works. I attempted a simplified explanation that I thought most would understand. Here is the design document: http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
I guess I understand your championing of it if you consider that a design document. That brief piece of email hardly describes it at all, and the performance numbers are pretty worthless (due to the caveat that barriers are disabled).
You quoted me out of context again, intentionally leaving out the double paste error I made of the same URL.
Me: "In my desire to be brief I didn't fully/correctly explain how delayed logging works. I attempted a simplified explanation that I thought most would understand. Here is the design document: http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
Early performance numbers: http://oss.sgi.com/archives/xfs/2010-05/msg00329.html"
Note the double URL paste error? Frank? Why did you twist an honest mistake into something it's not? Here's the correct link:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=Do...
Given the paragraph in the "design document":
Stop being an ass. Or get off yours and Google instead of requiring me to spoon feed you.
The best IO behaviour comes from the delayed logging version of XFS, with the lowest bandwidth and iops to sustain the highest performance. All the IO is to the log - no metadata is written to disk at all, which is the way this test should execute. As a result, the delayed logging code was the only configuration not limited by the IO subsystem - instead it was completely CPU bound (8 CPUs worth)...
it is indeed a "ram spooler", for metadata, which is a standard (and good) approach. That's not a side effect, that's the design. AFAICT from the brief description anyway.
As you'll see in the design doc, that's not the intention of the patch. XFS already had a delayed metadata update design, but it was terribly inefficient in implementation. Dave increased the efficiency several fold. The reason I mentioned it on Dovecot is that it directly applies to large/busy maildir style mail stores.
XFS just clobbers all other filesystems in parallel workload performance, but historically its metadata performance was pretty anemic, about half that of other FSes. Thus, parallel creates and deletes of large numbers of small files were horrible. This patch fixes that issue, and brings the metadata performance of XFS up to the level of EXT3/4, Reiser, and others, for single process/thread workloads, and far surpasses their performance with large parallel process/thread workloads, as is shown in the email I linked.
This now makes XFS the perfect Linux FS for maildir and [s/m]dbox on moderate to heavy load IMAP servers. Actually it's now the perfect filesystem for all Linux server workloads. Previously it was for all workloads but metadata heavy ones.
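For anyone wanting to try it, here is a minimal sketch, assuming a 2.6.36 or later kernel and an XFS filesystem on /dev/sdb1 mounted at /var/vmail (device and mount point are examples only, not a recommendation):

mount -t xfs -o delaylog,noatime /dev/sdb1 /var/vmail
grep /var/vmail /proc/mounts     # verify delaylog shows up in the mount options

or the equivalent /etc/fstab entry:

/dev/sdb1   /var/vmail   xfs   delaylog,noatime   0 0

In later kernels delayed logging became the default and the explicit mount option was eventually dropped, so check your kernel's XFS documentation before relying on the flag.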
This is guaranteed to lose data on power loss or drive failure.
On power loss, on a busy system, yes. Due to a single drive failure? That's totally incorrect. How are you coming to that conclusion?
As with every modern Linux filesystem that uses the kernel buffer cache, which is all of them, you will lose in-flight data that's in the buffer cache when power drops.
Performance always has a trade off. The key here is that the filesystem isn't corrupted due to this metadata loss. Solaris with ZFS has the same issues. One can't pipeline anything in a block device queue and not have some data loss on power failure, period. If one syncs every write then you have no performance. Solaris and ZFS included.
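To illustrate the cost of syncing every write (a rough sketch only; file name and sizes are arbitrary), compare a buffered dd run with one using oflag=sync, which forces each block to stable storage before the next is written:

time dd if=/dev/zero of=synctest bs=8k count=100000
time dd if=/dev/zero of=synctest bs=8k count=100000 oflag=sync

On most hardware the second run is dramatically slower; that is the performance trade off being described.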
-- Stan
Seriously, isn't it time this thread died a peaceful death. It has long since failed to have any real relevance to Dovecot, except in the most extreme sense. It has evolved into a few testosterone poisoned individuals attempting to make this forum a theater for some mating ritual. If they seriously want to continue this convoluted thread, perhaps they would be so kind as to take it off-list and find a platform better suited for this public display. At the very least, I would hope that Timo might consider closing this thread. I know that Wietse would never have let this thread reach this point on the Postfix forum.
In any case, I am not creating a kill filter to dispense with it.
-- Jerry ✌ Dovecot.user@seibercom.net
Disclaimer: off-list followups get on-list replies or get ignored. Please do not ignore the Reply-To header.
In any case, I am not creating a kill filter to dispense with it.
not or now?
-- Ralf Hildebrandt Geschäftsbereich IT | Abteilung Netzwerk Charité - Universitätsmedizin Berlin Campus Benjamin Franklin Hindenburgdamm 30 | D-12203 Berlin Tel. +49 30 450 570 155 | Fax: +49 30 450 570 962 ralf.hildebrandt@charite.de | http://www.charite.de
Jerry put forth on 1/21/2011 7:53 AM:
Seriously, isn't it time this thread died a peaceful death. It has long since failed to have any real relevance to Dovecot, except in the most extreme sense. It has evolved into a few testosterone poisoned individuals attempting to make this forum a theater for some mating ritual. If they seriously want to continue this convoluted thread, perhaps they would be so kind as to take it off-list and find a platform better suited for this public display. At the very least, I would hope that Timo might consider closing this thread. I know that Wietse would never have let this thread reach this point on the Postfix forum.
In any case, I am not creating a kill filter to dispense with it.
I'm guilty as charged. Consider it dead. Sorry for the noise Jerry, everyone.
-- Stan
On 1/20/11 11:49 PM -0600 Stan Hoeppner wrote:
Frank Cusack put forth on 1/20/2011 2:30 PM:
On 1/20/11 12:06 AM -0600 Stan Hoeppner wrote:
This is amusing considering XFS is hands down the best filesystem available on any platform, including ZFS. Others are simply ignorant and repeat what they've heard without looking for current information.
Your pronouncement that others are simply ignorant is telling.
So is your intentionally quoting me out of context.
Not at all. Your statement about ignorance needs no context.
The "ignorant" are those who blindly accept the false words of others regarding 4+ year old "XFS corruption on power fail" as being true today. They accept but without verification. Hence the "rumor" persists in many places.
Indeed, those folks are more than ignorant, they are in fact idiots. (Ignorant meaning simply unaware.)
"In my desire to be brief I didn't fully/correctly explain how delayed logging works. I attempted a simplified explanation that I thought most would understand. Here is the design document: http://oss.sgi.com/archives/xfs/2010-05/msg00329.html
Early performance numbers: http://oss.sgi.com/archives/xfs/2010-05/msg00329.html"
Note the double URL paste error? Frank? Why did you twist an honest mistake into something it's not? Here's the correct link:
Wow so you are basically an asshole as well as arrogant.
Stop being an ass. Or get off yours and Google instead of requiring me to spoon feed you.
LOL that actually made me laugh, thanks.
This is guaranteed to lose data on power loss or drive failure.
On power loss, on a busy system, yes. Due to a single drive failure? That's totally incorrect. How are you coming to that conclusion?
Why don't you re-read the design. I'm not going to spoon feed you.
Performance always has a trade off. The key here is that the filesystem isn't corrupted due to this metadata loss. Solaris with ZFS has the same issues. One can't pipeline anything in a block device queue and not have some data loss on power failure, period. If one syncs every write then you have no performance. Solaris and ZFS included.
You might want to get current on ZFS as well.
- Frank Cusack frank+lists/dovecot@linetwo.net:
On 1/20/11 11:49 PM -0600 Stan Hoeppner wrote:
Frank Cusack put forth on 1/20/2011 2:30 PM:
On 1/20/11 12:06 AM -0600 Stan Hoeppner wrote:
...
Wow so you are basically an asshole as well as arrogant.
Please give it a break. Take private things offlist.
p@rick
-- state of mind Digitale Kommunikation
Franziskanerstraße 15 Telefon +49 89 3090 4664 81669 München Telefax +49 89 3090 4666
Amtsgericht München Partnerschaftsregister PR 563
On 1/16/11 2:10 PM -0600 Stan Hoeppner wrote:
Using XFS with delayed logging mount option (requires kernel 2.6.36 or later). ... Using the delayed logging feature, filesystem metadata write operations are pushed almost entirely into RAM. Not only does this _dramatically_ decrease physical metadata write IOPS but it also increases metadata write performance by an order of magnitude. Really shines with maildir,
ext3 has the same "feature". It's fantastic for write performance, and especially for NFS, but for obvious reasons horrible for reliability. I'm sure XFS fixes the consistency problems of ext3, ie on power failure your fs is still consistent, but clearly this strategy is not good for a mail store, ie where you also care about not losing data.
I personally like ZFS with a SLOG for the sync writes. Plus you get the extra guarantees of zfs, ie it guarantees that your disk isn't lying to you.
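As a minimal sketch of that setup, assuming an existing pool named tank and spare SSDs at /dev/sdc and /dev/sdd (all names are examples only):

zpool add tank log /dev/sdc                    # dedicated SLOG so synchronous writes land on the SSD
zpool add tank log mirror /dev/sdc /dev/sdd    # or mirror the log device to avoid a single point of failure
zpool status tank                              # confirm the log vdev is attached

How much it helps depends on how fsync-heavy the delivery workload is.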
On Monday, January 17, 2011, Stan Hoeppner stan@hardwarefreak.com wrote:
Cor Bosman put forth on 1/16/2011 5:34 PM:
Btw, our average mailsize last we checked was 30KB. That's a pretty good average as we're an ISP with a very wide user base. I think a 4KB average is not a normal mail load.
As another OP pointed out, some ISPs apparently have to deliver a lot of spam to mailboxen to avoid FPs, bumping up that average mail size considerably. Do you accept and deliver a lot of spam to user mailboxen?
At an ISP I worked at, we did a study (just over 2 years ago) on the average size of spam mail that was being delivered to the users. It worked out to an average size of between 8KB and 10KB. This was based on data over a period of 12 months, with an average of 180 million mails per month being received. Legit mail averaged 32KB.
I doubt the size of spam has changed much.
.warren
-- .warren
Warren Baker put forth on 1/18/2011 2:53 AM:
On Monday, January 17, 2011, Stan Hoeppner stan@hardwarefreak.com wrote:
Cor Bosman put forth on 1/16/2011 5:34 PM:
Btw, our average mailsize last we checked was 30KB. That's a pretty good average as we're an ISP with a very wide user base. I think a 4KB average is not a normal mail load.
As another OP pointed out, some ISPs apparently have to deliver a lot of spam to mailboxen to avoid FPs, bumping up that average mail size considerably. Do you accept and deliver a lot of spam to user mailboxen?
At an ISP I worked at, we did a study (just over 2 years ago) on the average size of spam mail that was being delivered to the users. It worked out to an average size of between 8KB and 10KB. This was based on data over a period of 12 months, with an average of 180 million mails per month being received. Legit mail averaged 32KB.
I doubt the size of spam has changed much.
What was the ratio of spam to ham you were delivering to user mailboxes?
-- Stan
On Tue, Jan 18, 2011 at 11:44 AM, Stan Hoeppner stan@hardwarefreak.com wrote:
At an ISP I worked at, we did a study (just over 2 years ago) on the average size of spam mail that was being delivered to the users. It worked out to an average size of between 8KB and 10KB. This was based on data over a period of 12 months, with an average of 180 million mails per month being received. Legit mail averaged 32KB.
I doubt the size of spam has changed much.
What was the ratio of spam to ham you were delivering to user mailboxes?
If I remember correctly it was around 85/15.
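For context, combining those figures as a rough weighted average (assuming the 9KB midpoint for spam and the 32KB figure for legit mail): 0.85 × 9KB + 0.15 × 32KB ≈ 12.5KB per delivered message, which helps explain why average message size estimates vary so widely between sites.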
-- .warren
participants (30)
- Andrzej Adam Filip
- Brad Davidson
- Bradley Giesbrecht
- Brandon Davidson
- Charles Marcus
- Cor Bosman
- David Jonas
- David Woodhouse
- Ed W
- Eric Rostetter
- Frank Cusack
- Giles Coochey
- Javier de Miguel Rodríguez
- Jerry
- Maarten Bezemer
- Marc Perkel
- Matt
- Miha Vrhovnik
- Noel Butler
- Patrick Ben Koetter
- Philipp Haselwarter
- Ralf Hildebrandt
- Rick Romero
- Robert Brockway
- Robert Schetterer
- Stan Hoeppner
- Steve
- Sven Hartge
- Timo Sirainen
- Warren Baker