[Dovecot] (Single instance) attachment storage
William Blunn
bill+dovecot at blunn.org
Mon Jul 19 19:29:09 EEST 2010
Timo Sirainen wrote:
> Now that v2.0.0 is only waiting for people to report bugs (and me to figure out how to fix them), I've finally had time to start doing what I actually came here (Portugal Telecom/SAPO) to do. :)
>
> The idea is to have dbox and mdbox support saving attachments (or MIME parts in general) to separate files, which with some magic gives a possibility to do single instance attachment storage. Comments welcome.
>
Cool.
> Extra features
> --------------
>
> The attachment files begin with an extensible header. This allows a couple of extra features to reduce disk space:
>
> 1) The attachment could be compressed (header contains compressed-flag)
>
Cool.
> 2) If base64 attachment is in a standardized form that can be 100% reliably converted back to its original form, it could be stored decoded and then encoded back to original on the fly.
>
Cool.
I have thought about this issue in the past. What follows may be obvious
to you already, but I might as well mention it rather than risk missing
something.
Presumably you want to be able to recreate the original base64 stream
exactly verbatim?
Under base64, the number of 4-byte (encoded) / 3-byte (decoded) cells
per line is not fixed by the specs.
I believe the optimal value is 19 cells per line (76 encoded characters,
which is the maximum line length RFC 2045 allows), but I have seen some
systems use 18 cells per line, and I think I have seen 15 as well. Once
you have three possibilities, you might as well just store the number of
cells per line.
I would suggest considering the base64 format as (conceptually) having
an (integer) parameter for the number of cells in each line (except for
the last line).
So base64(19) would have on each line 19 cells encoding 57 (19 × 3)
bytes into 76 (19 × 4) bytes.
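To make the arithmetic concrete, here is a small Python sketch. The name
line_geometry is just something I made up for illustration; it is not
anything in Dovecot.

```python
import base64

def line_geometry(cells_per_line):
    """Return (decoded_bytes, encoded_chars) for one full base64 line."""
    return cells_per_line * 3, cells_per_line * 4

# The three line widths mentioned above.
GEOMETRIES = {cells: line_geometry(cells) for cells in (15, 18, 19)}

# Cross-check against a real encoder: 57 raw bytes fill exactly one
# 76-character line with no padding.
FULL_LINE = base64.b64encode(bytes(57))
```

So base64(18) would be 54 bytes into 72 characters, and base64(15) would
be 45 bytes into 60 characters.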
Probably you would need to have a base64 matcher/decoder which is
smarter than normal base64 decoders and checks to make sure that all
lines (apart from the last) are encoded (a) canonically (e.g. with no
trailing whitespace), and (b) using the same number of cells per line.
The base64 matcher/decoder needs to return information about the cell
count as well as the decoded data.
If any line is not canonical base64 or uses a different number of cells,
then the base64 may still be valid but "weird" so would just be stored
as the original base64 stream.
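A rough sketch of such a matcher/decoder in Python, just to illustrate
the idea. The function name match_base64 is hypothetical, and I am
assuming the stream is handed over without any trailing newline and with
a known line separator (CRLF here). The final re-encode comparison is my
own belt-and-braces addition: it rejects anything that decodes fine but
would not come back byte-for-byte.

```python
import base64
import binascii

def match_base64(stream, newline=b"\r\n"):
    """Return (cells_per_line, decoded) for regular base64, else None.

    None means "valid or not, this is weird": store it verbatim.
    """
    lines = stream.split(newline)
    width = len(lines[0])
    # Empty first line, or a width that is not a whole number of 4-byte
    # cells (e.g. because of trailing whitespace).
    if width == 0 or width % 4 != 0:
        return None
    # Every line except the last must use the same number of cells.
    if any(len(line) != width for line in lines[:-1]):
        return None
    # The last line may be shorter, but not empty and not longer.
    last = lines[-1]
    if not last or len(last) > width:
        return None
    flat = b"".join(lines)
    try:
        decoded = base64.b64decode(flat, validate=True)
    except binascii.Error:
        return None
    # Only accept if a canonical encoder reproduces the exact characters.
    if base64.b64encode(decoded) != flat:
        return None
    return width // 4, decoded
```

Note that a stream consisting of a single line reports that line's own
cell count, which is enough to reproduce it.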
When recovering message data, obviously your base64 encoder needs to use
a parameter which is the number of cells per line to encode. Then you
get back your original base64 stream verbatim.
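Something along these lines, as a Python sketch. encode_base64 is a
made-up name, and the CRLF default is my assumption about the line
separator in the stored message.

```python
import base64

def encode_base64(data, cells_per_line, newline=b"\r\n"):
    """Encode data as base64 with cells_per_line 4-byte cells per line.

    No newline is added after the last line.
    """
    flat = base64.b64encode(data)
    width = cells_per_line * 4
    return newline.join(
        flat[i:i + width] for i in range(0, len(flat), width)
    )
```

For example, encode_base64(data, 19) gives 76-character lines and
encode_base64(data, 18) gives 72-character lines from the same decoded
data, which is why the cell count has to be stored alongside it.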
==
Some systems finish the base64 stream with a newline (which in a
multipart manifests as a blank line between the base64 stream and the
'--' of the MIME boundary), whereas some systems finish the base64
stream at the end of final 4-byte cell (which in a multipart manifests
as the '--' of the MIME boundary appearing on the line immediately
following the base64 encoded data). Your encoding allows for arbitrary
data between the objects, so you would have no problem storing these two
cases verbatim. But it is something to watch out for when storing.
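One way to handle it, sketched in Python: peel off at most one trailing
newline before trying to decode, and leave that newline in the
surrounding message data. split_trailer is a hypothetical helper name
and CRLF is again an assumption.

```python
def split_trailer(part_body, newline=b"\r\n"):
    """Split a part body into (base64_stream, trailer).

    trailer is either one newline or empty, so that
    base64_stream + trailer always equals the original bytes.
    """
    if part_body.endswith(newline):
        return part_body[:-len(newline)], newline
    return part_body, b""
```

Only the first element would go to the attachment file; the trailer
stays with the rest of the message.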
==
Maybe it would be a good idea to have the ability to say that an object
was base64 decoded AND compressed (i.e. to recover the original stream
fragment you need to decompress and then base64 encode, with the relevant
number of base64 cells per line), as well as options for just base64
decoded or just compressed.
You could go nuts and say that it is an arbitrarily-sized filter stack,
but my first guess would be that this would be too much flexibility.
It might be better to say that there can be zero or one decode/encode
layers (like base64 or something else), and zero or one compression
layers (like gzip or bzip2 or xz/LZMA).
Obviously whatever translations are required to recover the original
stream should be encoded into the attachment file so that sysadmins can
tune the storage algorithm without affecting previously stored attachments.
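To illustrate what I mean, here is a Python sketch of a two-slot header.
Everything in it (the Header fields, the store and restore names, the
compressor labels) is invented for the example and says nothing about
what the real attachment header will look like. It assumes the cell
count has already been determined by some matcher and that lines are
separated by CRLF.

```python
import base64
import bz2
import gzip
import lzma
from dataclasses import dataclass
from typing import Optional

# label -> (compress, decompress); labels are hypothetical
COMPRESSORS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}

@dataclass
class Header:
    cells_per_line: Optional[int] = None  # set: payload was base64 decoded
    compression: Optional[str] = None     # set: payload was compressed

def store(stream, cells_per_line=None, compression=None, newline=b"\r\n"):
    """Apply zero or one decode layer, then zero or one compression layer."""
    data = stream
    if cells_per_line is not None:
        data = base64.b64decode(stream.replace(newline, b""))
    if compression is not None:
        data = COMPRESSORS[compression][0](data)
    return Header(cells_per_line, compression), data

def restore(header, data, newline=b"\r\n"):
    """Undo the layers in reverse order, driven only by the header."""
    if header.compression is not None:
        data = COMPRESSORS[header.compression][1](data)
    if header.cells_per_line is not None:
        flat = base64.b64encode(data)
        width = header.cells_per_line * 4
        data = newline.join(
            flat[i:i + width] for i in range(0, len(flat), width)
        )
    return data
```

Because restore looks only at the header stored with the attachment, a
sysadmin could switch new attachments from gzip to xz without touching
anything already on disk.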
Bill