[Dovecot] Big sites using Dovecot [long]

Jonathan dovecot at redigloo.org
Thu Sep 28 01:29:07 EEST 2006


Hi,

Having asked if there are any big sites (50,000-100,000 users), it
seems there are a few.  I'd like to ask some fairly general questions.

I have inherited responsibility for a Cyrus mail store, at a UK university.

It is front-ended by a pair of mail gateways running Exim which handle
spam, A/V etc.

Local delivery is a dedicated SUSE Linux box running Postfix, feeding Cyrus
over LMTP.  There are around 80,000 accounts, of which around 20,000 are
active (one or more messages per day, often many).  I suspect we peak at
around 500 simultaneous users.  The message store is around 600 GB.

Cyrus back-end storage is a fibre-channel SAN.  We use most of the
Cyrus functionality including Sieve, quotas and shared mailboxes.  Clients
access the mail store using their choice of client, predominantly IMAP/SSL
from Horde/IMP or Outlook, although some of us use Thunderbird.  In theory
we have a stand-by box, which is a similar configuration (but with a local
RAID array).  The two used to be connected by DRBD, which was replaced by
rsync - I believe this is because following any comms failure the entire
mail store had to be resynced.  Backups run over the net to tape and take
around 24 hours to complete.

A small number of users are on an Exchange server instead of Cyrus.
They will not be moving.  User authentication runs over LDAP and there
is an attribute in LDAP which identifies whether the user is a Cyrus
user or an Exchange user, so that Exim knows which mail store to send
their mail to, and Webmail knows whether to redirect them to Horde or
to a Microsoft Outlook Web server.
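To make the routing concrete, here is a small sketch of the decision the
LDAP attribute drives (the attribute name "mailStore", its values, and the
host names are invented for illustration; our real schema differs):

```python
# Sketch of the mail-store routing decision driven by an LDAP attribute.
# "mailStore", its values, and the host names are invented examples;
# the real directory schema and hosts will differ.

def route_user(ldap_entry: dict) -> dict:
    """Pick mail store and webmail target from a user's LDAP entry."""
    # LDAP attribute values arrive as lists; default to the Cyrus store.
    store = ldap_entry.get("mailStore", ["cyrus"])[0]
    if store == "exchange":
        return {"smtp_target": "exchange.example.ac.uk",
                "webmail": "outlook-web"}
    return {"smtp_target": "cyrus.example.ac.uk",
            "webmail": "horde-imp"}

print(route_user({"mailStore": ["exchange"]})["webmail"])  # outlook-web
print(route_user({})["webmail"])                           # horde-imp
```

Exim consults the same attribute to choose the delivery target, and Webmail
uses it to decide between Horde and the Outlook Web server.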

It is time for a refresh which needs to take place seamlessly, and in
short order (complete roll-out in the next couple of months).  We need
to add a few extras into the equation...

1. It is corporate policy to move all storage to a NetApp filer which is
    replicated using frequent snap-mirrors to a second site over a shared
    1 Gb/s link. (Due to possible bandwidth issues, the two filers do not
    update synchronously, but the backup NetApp should be no more than a
    couple of minutes behind, and this much loss of data would be
    tolerated in the event of a disaster recovery deployment.)

2. NFS is preferred over iSCSI, due to file recovery and disk space
    utilisation on the NetApp.

3. The two servers (or two clusters, if we go that way) will be sited
    one at each site.  In the event of a data centre failure, we need to have
    quick and effective fail over to the other site (manual intervention
    is acceptable).  It is possible that the redundant link between the sites
    could fail, leading to the servers losing touch with each other but both
    still running.

4. We have user communities at both sites.  Currently they both talk to
    the single Cyrus server at "HQ".

5. Clustered servers would be preferred so we can do rolling upgrades by
    removing individual machines for OS patches etc.  We have layer 4
    load balancers available.

6. Our preferred corporate platform is SUSE Linux Enterprise Server 9 running
    on Intel hardware.

Cyrus is generally seen as a very competent solution, and greatly preferred to
the UW IMAP server it replaced (though that may be down to the NFS servers UW used).
Reasons for leaving Cyrus are (1) NFS and (2) replication - although I understand
the Cyrus 2.3 tree has some good support for keeping multiple servers loosely
synchronised over a WAN.

I am very nervous about comments on this list concerning NFS lock-ups.  This
system has to be bullet-proof 24/7.  I would consider Solaris x86 (or possibly
FreeBSD) if the NFS implementation is robust out-of-the-box.  Management would
like the warm feeling that a vendor-supported operating system would give them
(so SUSE and Sun are preferred).

My gut feeling is that I would like to split the users into two
communities, with half on each NetApp, and with the two NetApps mirroring
to each other.  In practice users will work from both sites (and remotely)
but each one has a "home" site in terms of their home directory, etc.
At each site, I'd like 2 identical Dovecot boxes.  I'll call this a 2 x 2
solution.

All users (Exchange users excepted) have the same address wired into their
e-mail client for IMAP/SSL and SMTP/SSL, so there would have to be some
magic to ensure that the user ended up talking to a Dovecot server which
could see the appropriate NetApp.  I don't think the load balancers are clever
enough to be able to do this.  I think I've read it's possible for an IMAP
server to hand a user off to a different IMAP server, but can Dovecot do this,
and is there client support?  Or should I just proxy users who hit the wrong
server?  Or should I just put everyone on the same NetApp and use 4 servers?
I'll call this a 4 x 1 solution.
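On the hand-off question: IMAP login referrals do exist as a standard
(RFC 2221), but I gather client support is patchy, so transparent proxying
looks safer.  From my reading of the Dovecot wiki, proxying is driven by
"proxy" and "host" extra fields returned from the password database;
something like this sketch, where the SQL schema and host column are
entirely made up and I have not tested it:

```
# Sketch only - based on my reading of the Dovecot wiki, untested.
# The passdb returns "proxy" and "host" extra fields telling Dovecot
# to proxy the login to the user's home server.  The "users" table and
# its columns are hypothetical.
# dovecot-sql.conf:
password_query = SELECT password, home_host AS host, 'Y' AS proxy \
  FROM users WHERE userid = '%u'
```

The idea would be that the passdb only returns the proxy/host fields for
users whose home NetApp sits behind the other pair of servers, so a user
landing on the "wrong" side gets proxied rather than refused.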

If we lost a site with a live NetApp, I would expect the surviving site to
mount the latest snap-mirror and serve it.  In the case we are running 2 x 2
it would become 1 x 2.  If we were running 4 x 1 it would become 2 x 1 which
is arguably more robust.
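For the failover itself, I imagine something like the following on the
surviving filer (Data ONTAP 7-style commands from memory; the volume name
is invented, and the exact syntax will depend on the ONTAP version):

```
# Make the read-only SnapMirror destination writable so it can be
# served ("mailstore_mirror" is an invented volume name):
snapmirror quiesce mailstore_mirror
snapmirror break  mailstore_mirror
# ...then remount over NFS from the surviving filer and swing client
# traffic across on the load balancers.
```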

Does anyone have any comments on any of this?  If it were your site, what would
you be doing?  What kit would you use?  Which operating system?  How will it
play with our load balancers?  4 x 1 or 2 x 2?  Would anyone else in UK academia
like to compare notes?

Many thanks,
Jonathan.
