[Dovecot] (Single instance) attachment storage

Mon Jul 19 19:29:09 EEST 2010

Timo Sirainen wrote:
> Now that v2.0.0 is only waiting for people to report bugs (and me to figure out how to fix them), I've finally had time to start doing what I actually came here (Portugal Telecom/SAPO) to do. :)
>
> The idea is to have dbox and mdbox support saving attachments (or MIME parts in general) to separate files, which with some magic gives a possibility to do single instance attachment storage. Comments welcome.
>   

Cool.

> Extra features
> --------------
>
> The attachment files begin with an extensible header. This allows a couple of extra features to reduce disk space:
>
> 1) The attachment could be compressed (header contains compressed-flag)
>   

Cool.

> 2) If base64 attachment is in a standardized form that can be 100% reliably converted back to its original form, it could be stored decoded and then encoded back to original on the fly.
>   

Cool.

I have thought about this issue in the past. What follows may be obvious 
to you already, but I might as well mention it rather than risk missing 
something.

Presumably you want to be able to recreate the original base64 stream 
exactly verbatim?

Under base64, the number of 4-byte (encoded) / 3-byte (decoded) cells 
per line is not fixed by the specs.

I believe the optimal value is 19 cells per line (76 characters, the 
longest encoded line MIME allows), but I have seen some systems use 18 
cells per line, and I think I have seen 15 as well. Once you have three 
possibilities, you might as well just store the number of cells per line.

I would suggest considering the base64 format as (conceptually) having 
an (integer) parameter for the number of cells in each line (except for 
the last line).

So base64(19) would have on each line 19 cells encoding 57 (19 × 3) 
bytes into 76 (19 × 4) bytes.
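
The arithmetic for other cell counts follows the same pattern. A tiny 
Python sketch (the function name is made up for illustration):

```python
from typing import Tuple


def base64_line_sizes(cells_per_line: int) -> Tuple[int, int]:
    """Return (decoded_bytes, encoded_chars) for one full base64 line."""
    # Each cell is 3 raw bytes encoded as 4 base64 characters.
    return cells_per_line * 3, cells_per_line * 4


if __name__ == "__main__":
    for cells in (15, 18, 19):
        raw, enc = base64_line_sizes(cells)
        print(f"base64({cells}): {raw} bytes -> {enc} characters per line")
```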

Probably you would need to have a base64 matcher/decoder which is 
smarter than normal base64 decoders and checks to make sure that all 
lines (apart from the last) are encoded (a) canonically (e.g. with no 
trailing whitespace), and (b) using the same number of cells per line.

The base64 matcher/decoder needs to return information about the cell 
count as well as the decoded data.

If any line is not canonical base64 or uses a different number of cells, 
then the base64 may still be valid but "weird" so would just be stored 
as the original base64 stream.
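
A rough sketch of such a matcher/decoder in Python, not a real 
implementation. The name match_base64, the CRLF default and the 
assumption that the stream has no newline after its last line are all 
mine. It returns None for anything "weird", meaning "store the original 
stream as-is":

```python
import base64
import binascii
from typing import Optional, Tuple


def match_base64(stream: bytes,
                 newline: bytes = b"\r\n") -> Optional[Tuple[bytes, int]]:
    """Return (decoded_data, cells_per_line), or None if not uniform."""
    lines = stream.split(newline)
    # Empty lines (or an empty stream) are not canonical.
    if any(not line for line in lines):
        return None
    width = len(lines[0])
    last = lines[-1]
    # Every line must be a whole number of 4-character cells, and the
    # last line may be shorter than the others but never longer.
    if width % 4 or len(last) % 4 or len(last) > width:
        return None
    # All lines except the last must use the same number of cells.
    if any(len(line) != width for line in lines[:-1]):
        return None
    joined = b"".join(lines)
    try:
        decoded = base64.b64decode(joined, validate=True)
    except binascii.Error:
        return None
    # Re-encoding must give back exactly what we read; this rejects
    # odd padding and similar non-canonical input.
    if base64.b64encode(decoded) != joined:
        return None
    return decoded, width // 4
```

The final re-encode comparison is the important part: it is what makes 
the "100% reliably converted back" claim checkable at store time rather 
than something to hope for at read time.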

When recovering message data, obviously your base64 encoder needs to use 
a parameter which is the number of cells per line to encode. Then you 
get back your original base64 stream verbatim.
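
For illustration, an encoder with that parameter could look like this in 
Python. The function name and the CRLF default are assumptions; the 
newline convention would have to match whatever the stored message uses:

```python
import base64


def encode_base64(data: bytes, cells_per_line: int,
                  newline: bytes = b"\r\n") -> bytes:
    """Encode data as base64 with a fixed number of cells per line.

    No newline is added after the last line.
    """
    if cells_per_line < 1:
        raise ValueError("cells_per_line must be at least 1")
    encoded = base64.b64encode(data)
    width = cells_per_line * 4  # 4 encoded characters per cell
    return newline.join(encoded[i:i + width]
                        for i in range(0, len(encoded), width))
```

With cells_per_line=19 and LF newlines this produces the same lines as 
Python's own base64.encodebytes(), minus the final newline.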

==

Some systems finish the base64 stream with a newline (which in a 
multipart manifests as a blank line between the base64 stream and the 
'--' of the MIME boundary), whereas some systems finish the base64 
stream at the end of the final 4-byte cell (which in a multipart manifests 
as the '--' of the MIME boundary appearing on the line immediately 
following the base64 encoded data). Your encoding allows for arbitrary 
data between the objects, so you would have no problem storing these two 
cases verbatim. It is still something to watch out for when storing.
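
One way to handle it, sketched in Python under my own assumptions (the 
helper name is hypothetical): peel off a final newline before matching 
the base64, and leave it in the message text that surrounds the object.

```python
from typing import Tuple


def split_trailer(stream: bytes,
                  newline: bytes = b"\r\n") -> Tuple[bytes, bytes]:
    """Split a base64 stream into (body, trailer).

    trailer is the final newline if the stream ends with one, otherwise
    empty. body + trailer always equals the original stream.
    """
    if stream.endswith(newline):
        return stream[:-len(newline)], newline
    return stream, b""
```

Only the body would go to the attachment file; the trailer stays with 
the message, so both variants come back verbatim.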

==

Maybe it would be a good idea to have the ability to say that an object 
was base64 decoded AND compressed (i.e. to recover the original stream 
fragment you need to decompress and base64 encode (with the relevant 
number of base64 cells per line)) --- as well as options for just base64 
decoded or just compressed.

You could go nuts and say that it is an arbitrarily-sized filter stack, 
but my first guess would be that this would be too much flexibility.

It might be better to say that there can be
zero or one decode/encode layers (like base64 or something else), and
zero or one compression layers (like gzip or bzip2 or xz/LZMA).

Obviously whatever translations are required to recover the original 
stream should be encoded into the attachment file so that sysadmins can 
tune the storage algorithm without affecting previously stored attachments.
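
To make the layering concrete, here is a minimal Python sketch. The 
one-line JSON header, the field names and the pack/unpack functions are 
invented for illustration and are not the dbox/mdbox header format; the 
CRLF newline is also an assumption.

```python
import base64
import bz2
import gzip
import json
import lzma
from typing import Optional

# Hypothetical table of compression layers: name -> (compress, decompress).
COMPRESSORS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}


def pack(payload: bytes, base64_cells: Optional[int] = None,
         compression: Optional[str] = None) -> bytes:
    """Build an attachment blob: header line, then the stored payload.

    If base64_cells is given, payload must already be base64-decoded.
    """
    header = {"base64_cells": base64_cells, "compression": compression}
    if compression is not None:
        payload = COMPRESSORS[compression][0](payload)
    return json.dumps(header).encode("ascii") + b"\n" + payload


def unpack(blob: bytes, newline: bytes = b"\r\n") -> bytes:
    """Recover the original stream using only what the header says."""
    head, payload = blob.split(b"\n", 1)
    header = json.loads(head)
    # Undo the layers in reverse order: decompress, then re-encode.
    if header["compression"] is not None:
        payload = COMPRESSORS[header["compression"]][1](payload)
    cells = header["base64_cells"]
    if cells is not None:
        encoded = base64.b64encode(payload)
        width = cells * 4
        payload = newline.join(encoded[i:i + width]
                               for i in range(0, len(encoded), width))
    return payload
```

Because unpack() reads everything it needs from the header, switching 
the default from, say, gzip to xz only affects newly written files.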

Bill