[Dovecot] index performance issues
Running 1.0.rc22
We have a single file server running NFS (a single large SAS disk) with about 80GB of mail:

    mount -o remount,rsize=2048,wsize=2044,actimeo=0,soft,rw xx.xx.xx.xx:/mailboxes /nfs1
There are 3 front-end POP/IMAP servers behind an LVS director. Indexes are stored locally on each front-end server.
    -rw------- 1 admin 500   144 Mar 12 17:47 dovecot.index
    -rw------- 1 admin 500 10272 Mar 12 17:47 dovecot.index.cache
    -rw------- 1 admin 500   120 Mar 12 17:47 dovecot.index.log
Normally everything runs fine. There are about 1500 active POP accounts and 100 IMAP. Typically, 10 to 20 POP accounts are downloading at any one time, and there may be 10 or 20 active IMAP sessions.
If we find that the index partition has got to 100% full and corrupted the indexes, we'll delete the entire index dir and let Dovecot rebuild all the indexes. However, we then end up with 100Mbit (about 10MB/sec) of network traffic between the NFS server and the POP/IMAP servers. The NFS server load climbs to 20+, as does the load on the front-end mail servers.
It can take 4 or 5 hours for the indexes to rebuild, during which time IMAP is slow but works, and POP downloads all end up as dead processes that need clearing every few minutes:
    admin  5869  0.0  0.1  4384 1060 ?  D  17:56  0:00  \_ pop3 [info@.co.uk 180.98]
    admin  5877  0.0  0.1  4392 1112 ?  D  17:57  0:00  \_ pop3 [bro94 62.182]
    admin  5879  0.0  0.1  4396 1052 ?  D  17:57  0:00  \_ pop3 [spamcatch@.com 81.163]
    admin  5884  0.0  0.1  4392 1056 ?  D  17:57  0:00  \_ pop3 [ctaylor@.com 81.175]
Normally the NFS server delivers perhaps 200KB/sec of traffic on average, spiking to 400KB/sec, but the sustained 900+KB/sec rate for 5 or 6 hours is killing the service.
Is there something we've got wrong here?
(I know a SCSI RAID NFS server would help, as would gigabit networking, but I can't see why the setup we've got should need that sort of hardware spec.)
Thanks
John
On 3/12/2007 Netserve (john@nsnoc.com) wrote:
If we find that the index partition has got to 100% full and corrupted the indexes, we'll delete the entire index dir and let Dovecot rebuild all the indexes. However,
<snip>
Is there something we've got wrong here?
I would say that treating it as normal for the partition storing the indexes to routinely fill up and require deleting all indexes is wrong.
Increase the size of the partition storing the indexes to a sane size so that this doesn't happen.
Just my .02 clad coins worth
--
Best regards,
Charles
I would say that treating it as normal for the partition storing the indexes to routinely fill up and require deleting all indexes is wrong.
I appreciate that being in a position where the indexes are corrupt is not good, but there may be unavoidable reasons why they become out of sync or corrupt.
If we add a new POP/IMAP server to the cluster, is it to be expected that we'll have the service down for 5 or 6 hours while the new server creates indexes and chews through 100Mbit of network capacity in the process?
If this is all perfectly normal, I'd rather have a separate index-building script that I can run on the file server and then export the built indexes to each of the POP servers to get things running again. I'm guessing indexes for 80GB of mail could be built on a local file system in 30 to 60 minutes rather than 6+ hours over a network?
John
John Lyons wrote:
I would say that treating it as normal for the partition storing the indexes to routinely fill up and require deleting all indexes is wrong.
I appreciate that being in a position where the indexes are corrupt is not good, but there may be unavoidable reasons why they become out of sync or corrupt.
That's not what you said - you said the disk *filled up*, after which you chose to delete *all* of the indexes, causing *all* of them to have to be rebuilt. This (deleting *all* indexes) is very different from isolated, occasional corruption of an index or three, causing just *those* indexes to have to be rebuilt (trivial).
If we add a new POP/IMAP server to the cluster, is it to be expected that we'll have the service down for 5 or 6 hours while the new server creates indexes and chews through 100Mbit of network capacity in the process?
Not sure why this would be the case... why would bringing a new box into an *existing* cluster cause all of the indexes to have to be rebuilt?
Maybe I'm missing something obvious? It wouldn't be the first time... ;)
--
Best regards,
Charles
That's not what you said - you said the disk *filled up*, after which you chose to delete *all* of the indexes, causing *all* of them to have to be rebuilt. This (deleting *all* indexes) is very different from isolated, occasional corruption of an index or three, causing just *those* indexes to have to be rebuilt (trivial).
In the last 8 weeks we've ended up 4 times in a situation where we've seen no alternative but to delete the indexes and start from scratch. The first was a version upgrade which pushed the NFS server to 100% network usage and caused all of the POP downloads to die. It looked like Dovecot was rebuilding the indexes, or they were in an old version's format. Either way, we started from scratch as we couldn't guarantee that it was fixing itself.
Then there was a filled index partition; recovering the space and restarting Dovecot didn't get us anywhere, POP downloads were still failing and dying.
The last two issues have just been plain odd: massive volumes of mail arriving to POP accounts, IMAP sessions working fine, but POP downloads dying after a few seconds.
If we were in a position to see 'an index or three' as being the cause, we'd have been happy to fix those, but we're seeing 20+ POP logins with 90% of them dead. Kill the processes and 60 seconds later there are another 20 dead processes, the POP server has a load of 20+, and NFS traffic is at 100Mbit.
Regards
John
On Tue, 2007-03-13 at 00:58 +0000, John Lyons wrote:
If we were in a position to see 'an index or three' as being the cause, we'd have been happy to fix those, but we're seeing 20+ POP logins with 90% of them dead. Kill the processes and 60 seconds later there are another 20 dead processes, the POP server has a load of 20+, and NFS traffic is at 100Mbit.
If this happens again, it would help in fixing the problem if you:
Strace some of the hanging processes. What are they doing? If there are multiple processes for the same user, I suppose most of them are waiting for a lock. If it's not a locking problem, then:
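For example (the PID here is just one taken from the ps listing earlier in the thread):

    # attach to a running pop3 process; -tt adds timestamps,
    # -f follows any forked children
    strace -tt -f -p 5869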
Copy some of the hanging users' mailboxes and their indexes to some temporary location. Once everything is working again, try whether logging into those saved broken mailboxes still hangs. If it does, I'd like to get the dovecot.index, dovecot.index.log and dovecot-uidlist files and a list of the files in the maildir. Those are probably enough to reproduce the bug, and they don't contain any actual mail contents.
Although if the hang depends on a broken dovecot.index.cache file as well, it can get more problematic, since that file might contain some message headers. But with POP3-only users it should contain only message sizes and no headers.
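A sketch of the kind of copy I mean (the username and index path are only placeholders; adjust them for your layout):

    # snapshot one hanging user's maildir and locally stored
    # index files for later debugging
    tar czf /tmp/broken-user.tar.gz \
        /nfs1/someuser/Maildir /var/indexes/someuser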
On Mon, 2007-03-12 at 19:53 +0000, John Lyons wrote:
I appreciate that being in a position where the indexes are corrupt is not good, but there may be unavoidable reasons why they become out of sync or corrupt.
Dovecot should fix all those problems internally. If it doesn't, it's a bug and if I can reproduce it I'll fix it. And they can't become "out of sync", because they're constantly synced with the backend.
If we add a new pop/imap server to the cluster is it to be expected that we'll have the service down for 5 or 6 hours while the new server creates indexes and chews through 100mbit of network capacity in the process?
I guess you use maildir? The dovecot-uidlist files are stored on NFS, right?
I think you have one or two problems which cause the disk I/O:
- POP3 users who keep their mails stored on the server. POP3 needs to get a list of all the mails' virtual sizes, which requires reading all the mails' contents. This could be avoided by adding ,W=<size> to the maildir filenames (http://wiki.dovecot.org/MailboxFormat/Maildir; see the filename example after this list), although you can't do that for existing mails without causing their UIDs to change.
- IMAP webmail, if you use one. They often use sorting/threading, which requires reading all the mails' headers.
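To illustrate what such a filename looks like (this particular name is invented for the example; only the ,S= and ,W= fields are what matter):

    1173724620.P5869Q1.mailhost,S=4523,W=4611:2,S

S= is the physical file size in bytes and W= is the virtual size, i.e. the size with CRLF line endings, which is the number POP3 has to report.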
If this is all perfectly normal, I'd rather have a separate index-building script that I can run on the file server and then export the built indexes to each of the POP servers to get things running again. I'm guessing indexes for 80GB of mail could be built on a local file system in 30 to 60 minutes rather than 6+ hours over a network?
Or just copy the indexes from another existing server?
I guess that since you're using mostly POP3, the indexing script could be pretty simple (POP3 login/logout), but if you want it to do something useful for IMAP users you'll need to make it more complex (mailbox open/close doesn't update the dovecot.index.cache file at all, which is really the only thing that matters much).
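Something along these lines might work for the POP3 part - an untested sketch, assuming a one-username-per-line userlist file, the default binary path, maildirs laid out as /nfs1/<user>/Maildir, and that you run it as the mail user so the index files end up with the right owner:

    # Run Dovecot's pop3 binary directly for each user. Its startup
    # scan of the maildir builds the indexes; then we just QUIT.
    while read user; do
      printf 'QUIT\r\n' |
        MAIL="maildir:/nfs1/$user/Maildir:INDEX=/var/indexes/$user" \
        /usr/libexec/dovecot/pop3 > /dev/null
    done < userlist.txt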
On Mon, 2007-03-12 at 16:27 -0500, Richard Laager wrote:
On Mon, 2007-03-12 at 23:01 +0200, Timo Sirainen wrote:
This could be avoided by adding ,W=<size> to the maildir filenames
Is there a way to make Dovecot's deliver program do this?
Nope. It's probably not even easy to do with v1.0's code. With CVS HEAD it should be easier.
- POP3 users who keep their mails stored on the server. POP3 needs to get a list of all the mails' virtual sizes, which requires reading all the mails' contents. This could be avoided by adding ,W=<size> to the maildir filenames (http://wiki.dovecot.org/MailboxFormat/Maildir), although you can't do that for existing mails without causing their UIDs to change.
The wiki doesn't explain how to add the W=<size> to the file names; the conf refers to

    mail_cache_fields = size.virtual size.physical

Having set that on a test server/POP account, deleted the index and UID files, and logged in, I don't see any size details in either the filenames or the UID list.
It's not clear whether the mail_cache_fields are added to both the file names and the index cache, or what?
John
On Tue, 2007-03-13 at 13:11 +0000, Netserve wrote:
- POP3 users who keep their mails stored on the server. POP3 needs to get a list of all the mails' virtual sizes, which requires reading all the mails' contents. This could be avoided by adding ,W=<size> to the maildir filenames (http://wiki.dovecot.org/MailboxFormat/Maildir), although you can't do that for existing mails without causing their UIDs to change.
The wiki doesn't explain how to add the W=<size> to the file names,
There's no way to do that right now. But are you using Dovecot's deliver to store the mails? If not, then Dovecot can't even do that.
It's not clear whether the mail_cache_fields are added to both the file names and the index cache, or what?
It affects what's cached in the dovecot.index.cache file.
Perhaps the message size could be stored in the dovecot-uidlist file as well when POP3 is used. But this won't happen before v1.0, so it won't help you now..
On 12.3.2007, at 20.01, Netserve wrote:
If we find that the index partition has got to 100% full and corrupted the indexes, we'll delete the entire index dir
The upcoming 1.0.rc27 won't break indexes if it runs out of disk space. But then again, if it has already run out of disk space there, there's really nothing that frees more space.. unless you keep running some find script to delete the oldest indexes.
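For example, a guess at what such a script might look like (the index path and the 30-day cutoff are arbitrary choices):

    # delete index files that haven't been accessed in 30 days;
    # Dovecot rebuilds them on the next login
    find /var/indexes -type f -atime +30 -delete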
participants (5)
- Charles Marcus
- John Lyons
- Netserve
- Richard Laager
- Timo Sirainen