questions for an unusual master/slave-like mail-server setup

Christoph Anton Mitterer calestyo at scientia.net
Mon Dec 21 05:57:47 UTC 2015


Hey folks.

For a long time I've been planning a more advanced mail server setup
(described below), which is basically built on some weird ;)
master/slave ideas.
It's probably a bit more complicated (and maybe also a bit more insane)
than the usual mail server setup, so I'd be happy about any advice and
about answers to some questions:



There are two mailservers (dovecot, as you may have guessed ;) ):
- IMAP shall be used; using POP3 is not planned.
- A format in which each mail is stored in a separate file shall
  be used.
  That leaves either maildir or sdbox, of which I'd prefer maildir,
  as I'd like to be able to lose the indexes.
- Each (per user) mailbox shall have many subfolders and will receive
  loads of mail (including from several high traffic mailing lists).
- Mail filtering shall take place *at least* on the "online" server,
  probably either with sieve or maildrop.
- One of the mailservers is always "online",
  directly connected to the internet (and globally available).
- The other mailserver is "offline", not directly connected to the
  internet (and only locally available).


The "online" mailserver:
- Is in a not fully trusted location (i.e. at some ISP).
- Has somewhat limited storage.
- Makes the mails on it available to MUAs connecting from anywhere.
- Receives mail from various sources, e.g. via postfix/LDA or
  something like fetchmail.
- Has only a part of *all* mails, that means for example:
  - *All* mail from some personal contacts.
  - Only the last 1 year for low volume mailing lists.
  - Only the last 4 months for high volume mailing lists.

The "offline" mailserver:
- Is in a trusted location (beneath my pillow ;) ).
- Has enough storage for the purposes of mail.
- Makes all mails available to MUAs connecting from the LAN.
- Receives mails from the "online" mailserver, when manually "moved" to
  it via some MUA over IMAP.
- Has *all* mails.


Clients' MUAs may connect to both, but the "offline" server will only
be available at home.


As you can imagine, the idea is basically the following:
I get large amounts of mail, from all kinds of mailing lists as well as
personal mail. All that is filtered into a tree hierarchy of folders.
There's a not fully secure mailserver, where mail arrives and which I
can use when I'm on the road. It contains either just recent mails (the
amount depending on the traffic of the list) or, for selected
senders/lists, everything, and I control basically on a per folder basis
how much should be retained (all, last 10 years, last week, etc.).
There's another server, where *all* mail should eventually end up, for
archival and security reasons.
This implies obviously, that some of the mail is on both servers, some
only on the "offline" server, and that this may vary from folder to
folder.

Further, on both servers, I may want to add filters, which basically
split up a folder, e.g. of high traffic lists, so that mail of previous
years or months is moved to subfolder structures, e.g.:
linux-kernel
+--2014
\--2013
and so on.
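
A rough sketch of how such a split could be decided per mail, in
Python. It only computes the target folder from the Date: header; the
function names are made up for illustration, and how the logical name
"linux-kernel/2014" maps to an on-disk maildir name depends on the
configured layout, which is an assumption left open here.

```python
from datetime import datetime
from email import message_from_bytes
from email.utils import parsedate_to_datetime


def mail_date(raw: bytes, fallback: datetime) -> datetime:
    """Return the date from the Date: header, or `fallback` if unusable."""
    header = message_from_bytes(raw).get("Date")
    if header is None:
        return fallback
    try:
        return parsedate_to_datetime(header)
    except (TypeError, ValueError):
        # Unparseable header (the exception type depends on the Python version).
        return fallback


def archive_folder(base: str, when: datetime, current_year: int) -> str:
    """Keep mail of the current year in `base`, put older mail in base/YYYY."""
    if when.year >= current_year:
        return base
    return f"{base}/{when.year}"
```

The fallback could for example be the file's mtime, for mails with a
missing or broken Date: header.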


So far, all that wouldn't be too difficult,... but as you can imagine,
I'd like to have a bit more, which actually makes things quite tricky.


How the two mailservers shall synchronise:
I don't want to "simply" rsync - the whole system needs to be a bit
smarter:
1) It's okay to assume, that synchronisation happens only at discrete
   times and not continuously (the "offline" server doesn't run
   always).
2) It would be ok, if the synchronisation requires a downtime of the
   "online" mailserver (i.e. the LDA).
   Since btrfs is used, I could also easily do snapshots.
3) One can also assume that, when synchronisation happens, the
   "offline" server contains at least all those folders, that the
   "online" server has as well (and possibly more).

4) Obviously, only maildir's cur/ directories should be synced (stuff
   in tmp&new may be still arriving).

5) As mentioned previously, not all mail may stay on the "online"
   server: depending on the folder, it will either stay permanently or
   only for some period.
6) However, on the "offline" server, all mail stays, unless I manually 
   delete it.
7) One must assume, that mails on *both* servers:
   - change their IMAP  status (as in (S)een, (A)nswered, (F)lagged,
     etc.)
   - are moved to other mail folders
   - may be deleted via MUAs (not clear yet where this can happen)
   - "Fresh" mail, however, only appears on the "online" server.
8) Mails, once on the "offline" server, shouldn't be rewritten, in the
   sense that the mail is deleted and recreated (hopefully with the
   same content), which could happen either when the status of the mail
   changes or when it's moved to another folder (on the other server).
   The reason is simply that I wouldn't fully trust the online server,
   so once the mail is local, it should be immutable (except for the
   status and folder location).

The last two points make things pretty difficult, or at least I
don't see a straightforward solution.

(4) would be quite easy: if rsync were used, I'd just rsync all the
cur/ directories.
The rsync would run without --delete, so that mail which is no longer
on the "online" server doesn't get deleted from the "offline"
server.
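
To make the intended copy semantics concrete, here is a small Python
stand-in (not rsync itself): it copies only files below cur/
directories, never deletes anything and never overwrites an existing
file. The function name is hypothetical. Note that it compares full
filenames, so a mail whose flags changed would be copied again under
its new name; a real version would have to compare IDs instead.

```python
import os
import shutil


def copy_new_cur_files(src_root, dst_root):
    """Copy files from every cur/ below src_root that dst_root lacks.

    Nothing is deleted and existing files are never overwritten.
    Returns the relative paths that were copied.
    """
    copied = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        if os.path.basename(dirpath) != "cur":
            continue  # skip tmp/, new/ and the folder directories themselves
        rel_dir = os.path.relpath(dirpath, src_root)
        os.makedirs(os.path.join(dst_root, rel_dir), exist_ok=True)
        for name in sorted(filenames):
            rel = os.path.join(rel_dir, name)
            dst = os.path.join(dst_root, rel)
            if os.path.exists(dst):
                continue  # treat mail that is already local as immutable
            shutil.copy2(os.path.join(dirpath, name), dst)
            copied.append(rel)
    return copied
```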

(5) could be done by "simply" removing, on a per folder basis, mails
older than X from the "online" server, after they have been copied to
the "offline" server.
An already existing tool for doing that would be nice, though.
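
Lacking such a tool, the expiry could look roughly like the following
Python sketch. It assumes that the mtime of a mail file is a usable
age, that the part of the filename before the first "," or ":" serves
as ID, and that synced_ids (a hypothetical set) holds the IDs known to
be present on the "offline" server.

```python
import os
import time


def expire_folder(cur_dir, max_age_days, synced_ids, now=None):
    """Delete files in cur_dir that are older than max_age_days, but only
    if their ID is in synced_ids. Returns the IDs that were removed."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for name in sorted(os.listdir(cur_dir)):
        # ID = filename up to the first "," or ":" (assumption, see above)
        mail_id = name.split(":", 1)[0].split(",", 1)[0]
        path = os.path.join(cur_dir, name)
        if mail_id in synced_ids and os.path.getmtime(path) < cutoff:
            os.remove(path)
            removed.append(mail_id)
    return removed
```

The returned IDs could be remembered, so that a later sync can tell
expired mail from mail that was deleted on purpose.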

(7) and (8) are IMHO the most difficult:
The only idea I have so far assumes that each mail file has an ID,
which is unique for the *whole* mailbox (not just for each folder, as
the mails may be moved between folders).
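
As a sketch of what I mean by ID: assuming the usual maildir naming,
where the status flags follow ":2," and size fields such as S=/W= follow
the first ",", the ID and the flags could be separated like this in
Python (the function name is made up):

```python
def split_maildir_name(filename):
    """Split a maildir filename into (mail_id, flags).

    mail_id is everything before the first "," or ":",
    flags are the letters after ":2," (empty if there are none).
    """
    base, _sep, info = filename.partition(":")
    mail_id = base.split(",", 1)[0]
    flags = info[2:] if info.startswith("2,") else ""
    return mail_id, flags


print(split_maildir_name("1234567890.M20046P2137.mailserver,S=4542,W=4642:2,Sb"))
# → ('1234567890.M20046P2137.mailserver', 'Sb')
```

Whether that first part is really unique across the whole mailbox is
exactly what this scheme depends on.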

Roughly, the idea would be the following:
* When I synchronise, I either shut down the LDA or make a snapshot of
  the "online" server's mailbox and work with that.

* Then, I make a list (ID, status and pathname) of *all* mails on the
  online server and another one of *all* mails of the offline server.

* Then, I probably scan for duplicate IDs in each list... just to be
  safe.

* Then I process the mail lists as follows:
  a) If an ID is on *both* servers with the same status and pathname,
     then that mail had been previously synced already and has neither
     been moved nor changed its status since then. Nothing to do.
  b) If an ID is on *both* servers but with different pathnames or
     status, then the mail has either been moved or its status has
     changed.
     Consequently, I would move/rename the mail to match the new
     location and/or IMAP-status.
     I would move/rename the copy with the older ctime to match the one
     with the newer ctime (ctime gets set at move or rename and thus at
     status change).
     Even if an attacker had fiddled with the "online" server's time,
     or my system times were bogus for some other reason, the worst
     that could happen is that status changes are lost or mail ends up
     in the wrong folder.
  c) If an ID is on the "offline" server but not the "online" server,
     it could be either a mail that has already been garbage collected
     on the "online" server (so nothing would need to be done),...
     or it could be mail that was removed (via a MUA) on the "online"
     server and that should be removed on the "offline" server as well,
     either automatically, or manually after checking.
     How to find out which of these applies!?
  d) If an ID is on the "online" server but not the "offline" server,
     it could be either fresh mail that would need to be synced,...
     or it could be mail that was removed on the "offline" server and
     should be deleted on the "online" server as well, either
     automatically, or manually after checking.
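
The listing and the duplicate check from the first steps could be
sketched in Python as follows. This is only an illustration under the
assumption that the ID is the filename up to the first "," or ":" and
that the flags follow ":2,"; the function names are hypothetical.

```python
import os
from collections import defaultdict


def inventory(root):
    """Map mail ID -> list of (folder, flags) for every file in a cur/ dir."""
    found = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        if os.path.basename(dirpath) != "cur":
            continue  # ignore tmp/ and new/, mail there may still be arriving
        folder = os.path.relpath(os.path.dirname(dirpath), root)
        for name in files:
            base, _sep, info = name.partition(":")
            mail_id = base.split(",", 1)[0]
            flags = info[2:] if info.startswith("2,") else ""
            found[mail_id].append((folder, flags))
    return dict(found)


def duplicates(inv):
    """Return the IDs that occur more than once within one inventory."""
    return sorted(mail_id for mail_id, hits in inv.items() if len(hits) > 1)
```

A non-empty result from duplicates() would mean that the ID is not
unique for the whole mailbox, and the sync should better stop there.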

(a) and (b) are again rather simple, (c) and (d) not so much...
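
For illustration, the four cases could be told apart like this in
Python. The input format (one dict per server, mapping ID to folder,
flags and ctime) and all names are assumptions made for this sketch;
for case (b) it only reports which side has the newer ctime.

```python
def classify(online, offline):
    """Sort mail IDs into the cases (a) to (d).

    online/offline map mail ID -> (folder, flags, ctime).
    """
    result = {"a_in_sync": [], "b_changed": [],
              "c_offline_only": [], "d_online_only": []}
    for mail_id in sorted(online.keys() | offline.keys()):
        on, off = online.get(mail_id), offline.get(mail_id)
        if on is None:
            result["c_offline_only"].append(mail_id)
        elif off is None:
            result["d_online_only"].append(mail_id)
        elif on[:2] == off[:2]:
            result["a_in_sync"].append(mail_id)  # same folder and flags
        else:
            # The side with the newer ctime wins; the other gets renamed.
            winner = "online" if on[2] > off[2] else "offline"
            result["b_changed"].append((mail_id, winner))
    return result
```

Cases (c) and (d) are only detected here, not resolved; that needs
additional information.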

Possible solutions that I could think of:
- Either simply don't sync the removal of mails from the "online"
  server to the "offline" server and vice-versa.
  It will anyway happen rather rarely.
- Remember a list of all IDs, that have been garbage collected on the
  "online" server.
- Remember a list of all IDs that have been removed via MUAs.
  => not sure if that's easily doable, probably not. Is there any
     direct way one could do this with dovecot?
- Something to find out which mail has disappeared from each of the
  two servers, "offline" and "online", since the last
  synchronisation, via tricks using the snapshot features of btrfs.
- Somehow using the date of now and the last synchronisation and the
  mtimes (which are here basically like creation times, as the mail
  file contents never change).
  But I guess that would only work for (d) and not for (c),... and I'd
  rather not depend on times too much.
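
A minimal Python sketch of the "remember garbage collected IDs" idea,
assuming a plain text file with one ID per line (file format and
function names are made up). IDs that are only on the "offline" server
and are in that list were expired; the rest are candidates for a
deletion that should be reviewed.

```python
import os


def load_ids(path):
    """Read the remembered IDs, one per line; a missing file means none."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def record_ids(path, ids):
    """Append newly garbage collected IDs to the list."""
    with open(path, "a", encoding="utf-8") as fh:
        for mail_id in ids:
            fh.write(mail_id + "\n")


def resolve_offline_only(offline_only, collected):
    """Split offline-only IDs into (expired, to_review)."""
    offline_only = set(offline_only)
    expired = sorted(offline_only & collected)
    to_review = sorted(offline_only - collected)
    return expired, to_review
```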

* In any case, I could give the user a list of mails that would be
  deleted, for double checking.
  I actually think this would scale, as I don't expect many mails to
  be manually deleted through the MUA *once they have been synced*.


After that, I think the two servers should be in sync - in the sense
described in the beginning.


Some further parts of the idea:
As said, snapshots and the refcopy feature of btrfs could be quite
helpful. So I could basically keep:
- the "online" mailserver's mailbox
  - as of the last sync
  - as of now
- the "offline" mailserver's mailbox
  - as of the last sync
  - as of now
on the disk of the "offline" server without much additional cost.
They'd all share reflinks and wouldn't use (much) additional space
(except for fs meta-data) and I could use the snapshots to find out
which files were deleted on either of the two servers since last time.
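
Finding those files would then be a comparison of two directory trees.
A Python sketch, assuming both snapshots are mounted as ordinary
directories (their paths are placeholders) and that the ID is the
filename up to the first "," or ":":

```python
import os


def ids_below(root):
    """Collect the IDs of all mail files in cur/ directories below root."""
    ids = set()
    for dirpath, _dirs, files in os.walk(root):
        if os.path.basename(dirpath) == "cur":
            for name in files:
                ids.add(name.split(":", 1)[0].split(",", 1)[0])
    return ids


def vanished_since(last_sync_root, now_root):
    """IDs present in the 'last sync' snapshot but gone in the 'now' tree."""
    return sorted(ids_below(last_sync_root) - ids_below(now_root))
```

Because only IDs are compared, a mail that was merely moved or
reflagged between the two snapshots does not show up as vanished.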

Actually copying new mail from the "online" server to the offline
server's disk (in the "as of now" area), could then happen via
rsync'ing every cur/ from the "online" server to the "offline"
server, with --delete and --ignore-existing (as I don't trust the
"online" server and mail file contents should be immutable).



So... questions... :D
1) Obviously: does anyone know a better solution, or have any
   improvements or other ideas?
   Are there any problems with the above which I haven't realised,
   or do you think it would work?
2) Does dovecot *ever* change the contents (i.e. the file contents) of
   a maildir mail file?
3) As ID I'd have thought that the maildir mail filename might work,
   e.g. 1234567890.M20046P2137.mailserver,S=4542,W=4642:2,Sb
   I mean especially the first part: 1234567890.M20046P2137.mailserver
   Is that unique across *all* folders of a mailbox?
   AFAIU, that name would be determined by whatever delivers the mail,
   right?
   So can I somehow configure that to make it unique?
4) Does dovecot ever change anything from the maildir mail filenames,
   except anything after the ":" above? I.e. would it ever change:
   1234567890.M20046P2137.mailserver,S=4542,W=4642
   ?
5) Would dovecot ever copy, truncate+copy or in some other way rewrite
   files, instead of just moving them (i.e. thus possibly changing
   inodes, reflinks or the like)?
6) AFAIU, when using maildir, dovecot's index files can get completely
   lost,... and one can still fully "recover" all mails and their
   status.
   The only "bad" thing that might happen is that MUAs re-load all
   mail, right?
7) How gracefully does dovecot handle the situation when mail is added
   to or removed from cur/ without going through dovecot's IMAP,...
   especially with respect to index files?
   In the above solution I'd do a lot of adding/moving/deleting of
   files in maildirs, which would be done purely at the (file) system
   level, so the index would quite often get invalidated.
   How does it recognise that (if at all)? And is this generally
   stable, or should I rather remove any indexes after syncing and have
   them rebuilt?
8) All the above doesn't handle yet the thing of having
   yearly/monthly/etc. archive subfolders as described above, like:
   linux-kernel
   +--2014
   \--2013
   possibly on *both* "online" and "offline" server and possibly even
   in a different layout.
   I'm not yet sure whether this would be easily doable with the above
   solution,... my basic idea would be to simply treat any such
   "archive" subfolders as the same as the "base" folder during
   syncing... and afterwards re-order the mails in the "offline" and
   "online" server's folders as necessary.
   For example, if the "online" server has:
   linux-kernel
   +--01
   +--02
   +--03
    ...
   \--12
   in other words, one archive folder for each of the last 12
   months.
   While the "offline" server has:
   linux-kernel
   +--2014
   \--2013
   The script would then e.g. move the files to their new folders.
   => does anyone know whether such a thing exists already, i.e. a tool
      that sorts the mails into maildir subfolders, based on years,
      months, etc?
9) As said before, I'd like to be able to lose any indexes, but
   AFAICS, my solution wouldn't work with sdbox anyway.
   But if there's an sdbox based solution that makes everything much
   easier, do not hesitate to tell me :)
10) In the long-term I hope to be able to do something that makes
    server-side search (i.e. through the big fat mail archive) much
    faster (or is that already supported in dovecot?). I haven't really
    looked at all into that topic (just know about notmuch[0]).
    Anyway, if anyone sees problems in my above solution, when that
    would be used with something that gives me better server side
    search (i.e. some indexing solution),... I'd be happy to hear about
    that as well.


Thanks a lot so far,
Chris.


[0] https://notmuchmail.org/