On 5.12.2011, at 0.07, Lorens Kockum wrote:
Timo Sirainen wrote:
And before designing it I'd need to look into how backup software usually works. If anyone has any ideas about this, I'd like to hear them.
Simple or even moderately efficient backup programs like rsync copy all the files.
I'm mainly wondering if it's common for backup programs to support using a separate program to generate the backups. For example, if there was a "dovecot-backup" binary that just dumps all (or new-since-last-backup) of the users' mails to stdout, which the backup program can use. Or perhaps in that case there wouldn't really be much of anything for the backup to do except to write it to tape.
Also backing up the attachment links could be problematic if the backup system doesn't support hard links. Each attachment always has at least 2 links, so if the backup doesn't realize that, it at minimum doubles the space used by attachments.
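To put a rough number on that, here is a minimal Python sketch that compares what a hard-link-unaware backup would copy with what is actually on disk. The function name and the idea of pointing it at the attachment directory are only assumptions for illustration; it is not part of Dovecot.

```python
import os
import stat


def space_used(root):
    """Return (naive_bytes, deduplicated_bytes) for regular files under root.

    naive_bytes counts every directory entry, as a backup tool that ignores
    hard links would. deduplicated_bytes counts each inode only once.
    """
    seen = set()  # (device, inode) pairs that were already counted
    naive = 0
    dedup = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            if not stat.S_ISREG(st.st_mode):
                continue  # skip symlinks, sockets and so on
            naive += st.st_size
            key = (st.st_dev, st.st_ino)
            if key not in seen:
                seen.add(key)
                dedup += st.st_size
    return naive, dedup
```

If both numbers come out the same for the mail store, the backup tool's hard link handling doesn't matter; if the first is about twice the second, it matters a lot.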
rsync recognizes hard links with option -H, but at a very noticeable performance cost when dealing with millions of files. If the aa/bb/aabccddeeff-etc is unique across the whole mailstore, it would be easy to replace the hard link with a symlink, as you said:
SIS was designed to work with hard links. They couldn't be replaced with symlinks without a redesign (which would be less efficient in normal operation).
maybe not storing the attachments directly to backups, but add symlinks to them so they can be used to figure out what to restore. Or maybe the backing up wouldn't need a special tool, but the restoring tool could just read through the dbox files to see what attachments are also needed and write a list of them somewhere so they can be taken from backups as well.
In the second way, you would have a separate hierarchy for multiple-recipient attachments, or would the attachment be "really" stored in the box of a recipient chosen at random?
I meant that SIS would work exactly like it works now, with hard links and everything, but on top of that it would also create symlinks to the used files simply to make it easier to find what files are used. The annoying thing about that is that in error situations the symlinks can get out of sync with the reality.
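One direction of that drift is at least cheap to detect. A minimal sketch, assuming the symlinks live under some index directory (the layout and the function name are hypothetical): it lists symlinks whose target no longer exists. It doesn't catch the opposite case, where a file is in use but its symlink was never created.

```python
import os


def dangling_symlinks(index_dir):
    """Return a sorted list of symlinks under index_dir whose target is missing."""
    broken = []
    # os.walk does not follow symlinks by default, so links to directories
    # show up in dirnames and broken links show up in filenames.
    for dirpath, dirnames, filenames in os.walk(index_dir):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() looks at the link itself, exists() follows it.
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)
```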
Just some random thoughts: professionally, I use Zimbra. Messages are stored in Maildir-equivalents. The time it takes to back up is quite a severe constraint on the backup technique. For example, compressing the backup files takes too long, so the zip files are not compressed. Instead, the individual mails are stored compressed on disk. Each backup zips up the mails in a few big backup files.
You mean you first create uncompressed zip files (why not just tar?) of all the mails to the filesystem and the backup software then backs up those zip files?
An improvement could be to sort mails into backup zip files so that once a zip file is made, it stays the same. After all, if a mail is not deleted a month after it is read, then it will probably stay in the same state forever, or at least until the user starts a keep-me-under-quota cleaning-up spree. During this time, backing up that big zip file can just be a check to see if it is already OK in the backup, which is much quicker. I have no idea if this could be applied to Dovecot, but who knows.
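A rough Python sketch of that idea, just to make it concrete. It assumes one message per file in a single directory and uses the file mtime as the message's age; the function name, the one-zip-per-month grouping and the 30-day threshold are all illustrative choices, not how Zimbra or Dovecot do it.

```python
import os
import time
import zipfile
from collections import defaultdict


def freeze_old_mail(mail_dir, archive_dir, min_age_days=30, now=None):
    """Write one uncompressed zip per calendar month of old mail.

    A month is frozen only when every message in it is older than
    min_age_days. Existing archives are never rewritten, so the backup
    only has to check that they are already there.
    Returns the archives created on this run.
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400
    by_month = defaultdict(list)
    for name in sorted(os.listdir(mail_dir)):
        path = os.path.join(mail_dir, name)
        if os.path.isfile(path):
            mtime = os.path.getmtime(path)
            month = time.strftime("%Y-%m", time.gmtime(mtime))
            by_month[month].append((path, mtime))
    os.makedirs(archive_dir, exist_ok=True)
    created = []
    for month, entries in sorted(by_month.items()):
        if any(mtime > cutoff for _path, mtime in entries):
            continue  # month still contains recent mail, leave it alone
        target = os.path.join(archive_dir, month + ".zip")
        if os.path.exists(target):
            continue  # already frozen on an earlier run
        # ZIP_STORED: no compression, the mails are assumed compressed already
        with zipfile.ZipFile(target, "w", compression=zipfile.ZIP_STORED) as zf:
            for path, _mtime in entries:
                zf.write(path, arcname=os.path.basename(path))
        created.append(target)
    return created
```

The sketch ignores deletions and mails that change after their month has been frozen, which a real implementation would have to handle somehow.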
Dovecot's mdbox files already contain multiple messages in each file, so it should be a lot more efficient to do backups on those. And each message in an mdbox file can be compressed if the zlib plugin is enabled. So I think that sounds quite a lot like what you propose.