[Dovecot] (Single instance) attachment storage

Mon Jul 19 18:24:01 EEST 2010

Now that v2.0.0 is only waiting for people to report bugs (and me to
figure out how to fix them), I've finally had time to start doing what I
actually came here (Portugal Telecom/SAPO) to do. :)

The idea is to have dbox and mdbox support saving attachments (or MIME
parts in general) to separate files, which with some magic gives a
possibility to do single instance attachment storage. Comments welcome.

Reading attachments
-------------------

dbox metadata would contain entries like (this is a wrapped single line
entry):

 X1442 2742784 94/b2/01f34a9def84372a440d7a103a159ac6c9fd752b
  2744378 27423 27/c8/a1dccc34d0aaa40e413b449a18810f600b4ae77b

So the format is:

 "X" 1*(<offset> <byte count> <link path>)

So when reading a dbox message body, it's read as:

 offset=0: <first 1442 bytes from dbox body>
 offset=1442: <next 2742784 bytes from external file>
 offset=2744226: <next 152 bytes from dbox body>
 offset=2744378: <next 27423 bytes from external file>
 offset=2771801: <the rest from dbox body>

This is all done internally by creating a single istream that lazily
opens the external files only when data is actually read from that part
of the message.
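
To make the offset arithmetic above concrete, here is a minimal sketch in
Python (not the actual Dovecot code, which is C). The names Segment,
parse_ext_refs and layout are made up for illustration, and the sketch
assumes link paths contain no whitespace:

```python
from typing import List, NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    offset: int               # offset in the reassembled message body
    size: int                 # number of bytes in this segment
    link_path: Optional[str]  # None means "read from the dbox body itself"


def parse_ext_refs(value: str) -> List[Tuple[int, int, str]]:
    """Parse an "X" metadata value into (offset, byte count, link path) triplets."""
    if not value.startswith("X"):
        raise ValueError("not an X entry")
    fields = value[1:].split()
    if not fields or len(fields) % 3 != 0:
        raise ValueError("expected <offset> <byte count> <link path> triplets")
    return [(int(fields[i]), int(fields[i + 1]), fields[i + 2])
            for i in range(0, len(fields), 3)]


def layout(refs: List[Tuple[int, int, str]], total_size: int) -> List[Segment]:
    """Interleave dbox body segments with the external file segments."""
    segments = []
    pos = 0
    for offset, size, path in refs:
        if offset < pos:
            raise ValueError("overlapping or unsorted references")
        if offset > pos:
            # Gap before the external part comes from the dbox body.
            segments.append(Segment(pos, offset - pos, None))
        segments.append(Segment(offset, size, path))
        pos = offset + size
    if total_size > pos:
        # Whatever follows the last external part is in the dbox body.
        segments.append(Segment(pos, total_size - pos, None))
    return segments
```

A real istream would walk such a segment list and open each link path only
when a read first reaches that segment.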

The link paths don't have to be in any specific format. In future
perhaps it can recognize different formats (even http:// urls and such).

Saving attachments separately
-----------------------------

The message's MIME structure is parsed while the message is saved. After
each MIME part's headers are parsed, it's determined if this part should
be stored into attachment storage. By default it only checks that the
MIME part isn't multipart/* (because then its child parts would contain
attachments). Plugins can also override this. For example they could try
to determine if the commonly used clients/webmails always download and
show the MIME part when opening the mail (text/*, inline images, etc).

dbox_attachment_min_size specifies the minimum MIME part size that can
be saved as an attachment. Anything smaller than that will be stored
normally. While reading a potential attachment MIME part body, it's
first buffered into memory until the minimum size is reached. After that
the attachment file is actually created and the buffer is flushed to it.
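
The buffering rule could look roughly like this. It's only a sketch: the
AttachmentWriter class and the open_file callback are hypothetical names,
and the real code works on Dovecot's streams rather than Python objects:

```python
import io
from typing import BinaryIO, Callable, Optional


class AttachmentWriter:
    """Buffers a MIME part body until min_size is reached, then spills to a file."""

    def __init__(self, min_size: int, open_file: Callable[[], BinaryIO]):
        self.min_size = min_size    # plays the role of dbox_attachment_min_size
        self.open_file = open_file  # hypothetical callback creating the attachment file
        self.buffer = bytearray()
        self.file: Optional[BinaryIO] = None

    def write(self, data: bytes) -> None:
        if self.file is not None:
            # Already past the threshold: write straight to the attachment file.
            self.file.write(data)
            return
        self.buffer += data
        if len(self.buffer) >= self.min_size:
            # Threshold reached: create the file and flush what was buffered.
            self.file = self.open_file()
            self.file.write(self.buffer)
            self.buffer.clear()

    def finish(self) -> Optional[bytes]:
        """Return the bytes to store normally, or None if an attachment file was used."""
        if self.file is not None:
            return None
        return bytes(self.buffer)
```

With this shape a part smaller than the minimum never touches the
attachment storage at all.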

Each attachment filename contains a global UID part, so that no two
(even identical) attachments will ever have the same filename. But
there can be multiple attachment storages in different mount points, and
each one could be configured to do deduplication internally. So
identical attachments should somehow be stored in the same storage. This
is done by taking a hash of the body and using a part of it as the path
to the file. For example:

 mail_location = dbox:~/dbox:ATTACHMENTS=/attachments/$/$

Each $ would be expanded to 8 bits of the hash in hex (00..ff). So the
full path to an attachment could look like:

 /attachments/04/f1/5ddf4d05177b3b4c7a7600008c4a11c1

The sysadmin can then create /attachments/00..ff as symlinks to different
storages.
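
As an illustration of the $ expansion, here is a small Python sketch. The
function name is made up, and it assumes that the $ components consume
the hash bytes in order starting from the first one, which isn't decided
above:

```python
def expand_attachment_path(template: str, digest: bytes, guid: str) -> str:
    """Expand each "$" path component to one hash byte in hex and append the GUID."""
    components = []
    used = 0
    for component in template.split("/"):
        if component == "$":
            if used >= len(digest):
                raise ValueError("template has more $ than there are hash bytes")
            # Assumption: the n-th $ maps to the n-th byte of the hash.
            component = "%02x" % digest[used]
            used += 1
        components.append(component)
    components.append(guid)
    return "/".join(components)
```

For example, a hash beginning with bytes 0x04 0xf1 and the template
/attachments/$/$ gives /attachments/04/f1/<guid>.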

Hashing problems
----------------

Some problematic design decisions:

1) Hash is taken from hardcoded first n kB vs. first
dbox_attachment_min_size bytes?

 + With first n kB, dbox_attachment_min_size can be changed without
causing duplication of attachments, otherwise after the change the same
attachment could get a hash to a different storage than before the
change.
 - If n kB is larger than dbox_attachment_min_size, it uses more memory.
 - If n kB is determined to be too small to get uniform attachment
distribution to different storages, it can't be changed without
recompiling.

2) Hash is taken from first n bytes vs. everything?

 + First n bytes are already read to memory anyway and can be hashed
efficiently. The attachment file can be created without wasting extra
memory or disk I/O. If everything is hashed, the whole attachment has to
be first stored to memory or to a temporary file and from there written
to final storage.
 - With first n bytes it's possible for an attacker to generate lots of
different large attachments that begin with the same bytes and then
overflow a single storage. If everything is hashed with a secure hash
function and a system-specific secret random value is added to the hash,
this attack isn't possible.

I'm thinking that even though taking a hash of everything is the least
efficient option, it's the safest option. It's pretty much guaranteed to
give a uniform distribution across all storages, even against
intentional attacks. Also the worse performance probably isn't that
noticeable, especially assuming a system where local disk isn't used for
storing mails, and the temporary files would be created there.
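
A sketch of the "hash everything with a secret" variant, in Python. Using
HMAC-SHA256 is my assumption here, since no specific hash function is
chosen above, and storage_prefix is a made-up name:

```python
import hashlib
import hmac
from typing import Iterable


def storage_prefix(chunks: Iterable[bytes], secret: bytes, levels: int = 2) -> str:
    """Hash the whole attachment body with a system-specific secret.

    Returns the directory prefix (e.g. "04/f1") that selects the storage.
    The body is fed in chunks, so it never has to be in memory all at once.
    """
    mac = hmac.new(secret, digestmod=hashlib.sha256)
    for chunk in chunks:
        mac.update(chunk)
    digest = mac.digest()
    # One path level per hash byte, each in two-digit hex (00..ff).
    return "/".join("%02x" % byte for byte in digest[:levels])
```

Since the prefix is only known after the last chunk, the body still has
to be kept in memory or in a temporary file until then, which is the cost
discussed above.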

Single instance storage
-----------------------

All of the above assumes that if you want a single instance storage,
you'll need to enable it in your storage. Now, what if you can't do
that?

I've been planning on making all index/dbox code use an abstracted-out
simple filesystem API rather than using POSIX directly. This work
can be started by making the attachment reading/writing code use the FS
API and then create a single instance storage FS plugin. The plugin
would work like:

open(ha/sh/hash-guid): The destination storage is in ha/sh/ directory,
so a new temp file can be created under it. The hash is part of the
filename to make unlink() easier to handle.

Since the hash is already known at open() time, look up if hashes/<hash>
file exists. If it does, open it.

write(): Write to the temp file. If hashes/ file is open, do a
byte-by-byte comparison of the inputs. If there's a mismatch, close the
hashes/ file and mark it as unusable.

finish():
 a) If hashes/ file is still open and it's at EOF, link() it to our
final destination filename and delete the temp file. If link() fails
with ENOENT (it was just expunged), goto b. If link() fails with EMLINK
(too many links), goto c.
 b) If hashes/ file didn't exist, link() the temp file to the hash and
rename() it to the destination file.
 c) If the hashed file existed but wasn't the same, or if link() failed
with EMLINK, link() our temp file to a second temp file and rename() it
over the hashes/ file and goto a.

unlink(): If hashes/<hash> has the same inode as our file and the link
count is 2, unlink() the hash file. After that unlink() our file.
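
The open/write/finish/unlink steps above can be sketched in Python as
follows. This is a simplified illustration, not the plugin code: SisWriter
and sis_unlink are made-up names, everything lives in one root directory
with a hashes/ subdirectory, and races between concurrent writers are
mostly ignored:

```python
import errno
import os


class SisWriter:
    """Writes one attachment and deduplicates it against hashes/<hash>."""

    def __init__(self, root: str, hash_hex: str, guid: str):
        self.hash_path = os.path.join(root, "hashes", hash_hex)
        self.dest_path = os.path.join(root, hash_hex + "-" + guid)
        self.tmp_path = self.dest_path + ".tmp"
        self.tmp = open(self.tmp_path, "wb")
        self.mismatch = False
        try:
            # The hash is known at open() time, so look for an existing copy.
            self.existing = open(self.hash_path, "rb")
        except FileNotFoundError:
            self.existing = None

    def write(self, data: bytes) -> None:
        self.tmp.write(data)
        if self.existing is not None and self.existing.read(len(data)) != data:
            # Same hash but different content: mark hashes/ file unusable.
            self.existing.close()
            self.existing = None
            self.mismatch = True

    def finish(self) -> None:
        self.tmp.close()
        identical = False
        if self.existing is not None:
            identical = self.existing.read(1) == b""  # must also be at EOF
            self.existing.close()
            if not identical:
                self.mismatch = True
        replace = self.mismatch
        if identical:
            try:
                # a) reuse the existing copy and drop our temp file
                os.link(self.hash_path, self.dest_path)
                os.unlink(self.tmp_path)
                return
            except OSError as e:
                if e.errno == errno.EMLINK:
                    replace = True          # too many links: go to c)
                elif e.errno != errno.ENOENT:
                    raise                   # ENOENT: just expunged, go to b)
        if replace:
            # c) rename a second link to our data over the hashes/ file
            tmp2 = self.tmp_path + "2"
            os.link(self.tmp_path, tmp2)
            os.replace(tmp2, self.hash_path)
        else:
            # b) no hashes/ file yet: publish ours
            try:
                os.link(self.tmp_path, self.hash_path)
            except FileExistsError:
                pass  # lost a race; this sketch just skips deduplication
        os.rename(self.tmp_path, self.dest_path)


def sis_unlink(root: str, name: str) -> None:
    """Remove an attachment; drop hashes/<hash> if we were its last user."""
    path = os.path.join(root, name)
    hash_path = os.path.join(root, "hashes", name.split("-", 1)[0])
    st = os.stat(path)
    try:
        hst = os.stat(hash_path)
        if (hst.st_dev, hst.st_ino) == (st.st_dev, st.st_ino) and st.st_nlink == 2:
            os.unlink(hash_path)
    except FileNotFoundError:
        pass
    os.unlink(path)
```

In step c) the sketch renames the temp file to the destination directly
instead of looping back to a), which ends in the same state when nothing
else touches the files in between.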

One alternative to avoid using <hash> as part of the filename would be
for unlink() to read the file and recalculate its hash, but that would
waste disk I/O.

Another possibility would be to not unlink() the hashes/ files
immediately, but rather let some nightly cronjob stat() through all
of the files and unlink() the ones that have link count=1. This could be
wastefully inefficient though.

Yet another possibility would be for the plugin to internally calculate
the hash and write it somewhere. If it's at the beginning of the file,
it could be read from there with some extra disk I/O. But is it worth
it?..

Extra features
--------------

The attachment files begin with an extensible header. This allows a
couple of extra features to reduce disk space:

1) The attachment could be compressed (header contains compressed-flag)

2) If base64 attachment is in a standardized form that can be 100%
reliably converted back to its original form, it could be stored decoded
and then encoded back to original on the fly.
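
One way to check the "100% reliably converted back" condition is to decode,
re-encode with the same line length and newline style, and only accept the
decoded form if the result is byte-for-byte identical. A Python sketch
under that assumption (function names are made up, and what would go into
the attachment file header isn't shown):

```python
import base64
from typing import Optional, Tuple


def encode_base64(decoded: bytes, line_len: int, newline: bytes) -> bytes:
    """Encode to base64 with fixed-length lines, each terminated by newline."""
    encoded = base64.b64encode(decoded)
    return b"".join(encoded[i:i + line_len] + newline
                    for i in range(0, len(encoded), line_len))


def try_decode_base64(body: bytes) -> Optional[Tuple[bytes, int, bytes]]:
    """Return (decoded, line_len, newline) if body can be rebuilt exactly, else None."""
    newline = b"\r\n" if b"\r\n" in body else b"\n"
    lines = body.split(newline)
    line_len = len(lines[0])
    if line_len == 0 or line_len % 4 != 0:
        return None
    try:
        decoded = base64.b64decode(b"".join(lines), validate=True)
    except ValueError:
        # Non-alphabet characters or bad padding: keep the original body.
        return None
    if encode_base64(decoded, line_len, newline) != body:
        # Irregular line lengths, missing final newline, etc.
        return None
    return decoded, line_len, newline
```

Anything that fails the check would simply be stored in its original
encoded form.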

It would be nice if it was also possible to compress (and decompress)
attachments after they were already stored. This would be possible, but
it would require finding all the links to the attachment and recreating
them to point to the new file. (Simply overwriting the file in place
would require there are no readers at the same time, and that's not easy
to guarantee, except if Dovecot was entirely stopped. I also considered
some symlinking schemes but they seemed too complex and they'd also
waste inodes and performance.)

Code status
-----------

Initial version of the attachment reading/writing code is already done
and works (lacks some error handling and probably performance
optimizations). The SIS plugin code is also started and should be
working soon.

This code is very isolated and can't cause any destabilization unless
it's enabled, so I'm thinking about just adding it to v2.0 as soon as it
works, although the config file comments should indicate that it's still
considered unstable.