Timo Sirainen wrote:
Now that v2.0.0 is only waiting for people to report bugs (and me to figure out how to fix them), I've finally had time to start doing what I actually came here (Portugal Telecom/SAPO) to do. :)
The idea is to have dbox and mdbox support saving attachments (or MIME parts in general) to separate files, which with some magic gives a possibility to do single instance attachment storage. Comments welcome.
Cool.
Extra features
The attachment files begin with an extensible header. This allows a couple of extra features to reduce disk space:
- The attachment could be compressed (header contains compressed-flag)
Cool.
- If base64 attachment is in a standardized form that can be 100% reliably converted back to its original form, it could be stored decoded and then encoded back to original on the fly.
Cool.
I have thought about this issue in the past. What follows may be obvious to you already, but I might as well mention it rather than risk missing something.
Presumably you want to be able to recreate the original base64 stream exactly verbatim?
Under base64, the number of 4-byte (encoded) / 3-byte (decoded) cells per line is not fixed by the specs.
I believe the optimal value is 19 cells per line, but I have seen some systems use 18 cells per line, and I think I have seen 15 as well. Once you have three possibilities, you might as well just store the number of cells per line.
I would suggest considering the base64 format as (conceptually) having an (integer) parameter for the number of cells in each line (except for the last line).
So base64(19) would have on each line 19 cells encoding 57 (19 × 3) bytes into 76 (19 × 4) bytes.
Probably you would need to have a base64 matcher/decoder which is smarter than normal base64 decoders and checks to make sure that all lines (apart from the last) are encoded (a) canonically (e.g. with no trailing whitespace), and (b) using the same number of cells per line.
The base64 matcher/decoder needs to return information about the cell count as well as the decoded data.
If any line is not canonical base64 or uses a different number of cells, then the base64 may still be valid but "weird" so would just be stored as the original base64 stream.
When recovering message data, obviously your base64 encoder needs to use a parameter which is the number of cells per line to encode. Then you get back your original base64 stream verbatim.
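To make the matcher/encoder pairing concrete, here is a rough Python sketch of what I have in mind. The function names (encode_base64, match_base64) and the CRLF default are just my assumptions for illustration, not anything from Dovecot, and it assumes any trailing newline has already been stripped by the caller. The matcher decodes, then re-encodes with the detected cell count and compares, so anything that would not round-trip verbatim is rejected.

```python
import base64
import binascii


def encode_base64(data: bytes, cells_per_line: int, newline: bytes = b"\r\n") -> bytes:
    """Encode data with exactly cells_per_line 4-byte cells on every line but the last."""
    encoded = base64.b64encode(data)
    width = cells_per_line * 4
    return newline.join(encoded[i:i + width] for i in range(0, len(encoded), width))


def match_base64(stream: bytes, newline: bytes = b"\r\n"):
    """Return (decoded, cells_per_line) if stream is canonical base64, else None.

    None means "valid or not, store the original stream as-is".
    """
    lines = stream.split(newline)
    # An empty stream or a trailing newline is not handled here.
    if lines[-1] == b"":
        return None
    first_len = len(lines[0])
    # Lines must consist of whole 4-byte cells (this also rejects trailing whitespace).
    if first_len % 4 != 0:
        return None
    # Every line except the last must use the same number of cells.
    if any(len(line) != first_len for line in lines[:-1]):
        return None
    if len(lines[-1]) > first_len or len(lines[-1]) % 4 != 0:
        return None
    try:
        decoded = base64.b64decode(b"".join(lines), validate=True)
    except binascii.Error:
        return None
    cells = first_len // 4
    # Final safety net: only accept if re-encoding reproduces the input exactly.
    if encode_base64(decoded, cells, newline) != stream:
        return None
    return decoded, cells
```

For a stream that fits on a single line the cell count is ambiguous, but any value that round-trips is good enough for storage purposes.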
==
Some systems finish the base64 stream with a newline (which in a multipart manifests as a blank line between the base64 stream and the '--' of the MIME boundary), whereas some systems finish the base64 stream at the end of the final 4-byte cell (which in a multipart manifests as the '--' of the MIME boundary appearing on the line immediately following the base64 encoded data). Your encoding allows for arbitrary data between the objects, so you would have no problem storing these two cases verbatim. But it is something to watch out for when storing.
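One way to handle the two cases, sketched in Python, is to peel off a single trailing newline before looking at the base64 data and keep it as part of the data between the objects. The helper name split_trailing_newline is made up for this example.

```python
def split_trailing_newline(stream: bytes):
    """Split one trailing newline (CRLF or bare LF) off a base64 stream.

    Returns (body, trailer); trailer is b"" if the stream ends right after
    the final 4-byte cell. body + trailer always equals the input.
    """
    for newline in (b"\r\n", b"\n"):
        if stream.endswith(newline):
            return stream[:-len(newline)], newline
    return stream, b""
```

The trailer would then be stored verbatim alongside the surrounding message data, so both styles come back out exactly as they went in.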
==
Maybe it would be a good idea to have the ability to say that an object was base64 decoded AND compressed (i.e. to recover the original stream fragment you need to decompress and base64 encode (with the relevant number of base64 cells per line)) --- as well as options for just base64 decoded or just compressed.
You could go nuts and say that it is an arbitrarily-sized filter stack, but my first guess would be that this would be too much flexibility.
It might be better to say that there can be zero or one decode/encode layers (like base64 or something else), and zero or one compression layers (like gzip or bzip2 or xz/LZMA).
Obviously whatever translations are required to recover the original stream should be encoded into the attachment file so that sysadmins can tune the storage algorithm without affecting previously stored attachments.
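As a rough illustration of the "zero or one encoding layer, zero or one compression layer" idea, here is a Python sketch. Everything in it (the StoredAttachment fields, the store/recover names, the compression labels) is hypothetical and only stands in for whatever the real attachment header would record. It assumes the caller has already verified that the base64 is canonical with the given number of cells per line.

```python
import base64
import bz2
import gzip
import lzma
from dataclasses import dataclass
from typing import Optional

# Hypothetical compression labels mapped to (compress, decompress) functions.
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}


@dataclass
class StoredAttachment:
    payload: bytes
    compression: Optional[str] = None   # None means not compressed
    base64_cells: Optional[int] = None  # None means not base64 decoded
    newline: bytes = b"\r\n"


def wrap_base64(data: bytes, cells: int, newline: bytes) -> bytes:
    """Base64 encode with a fixed number of 4-byte cells per line."""
    encoded = base64.b64encode(data)
    width = cells * 4
    return newline.join(encoded[i:i + width] for i in range(0, len(encoded), width))


def store(raw: bytes, compression: Optional[str] = None,
          base64_cells: Optional[int] = None,
          newline: bytes = b"\r\n") -> StoredAttachment:
    """Apply the optional decode layer, then the optional compression layer."""
    payload = raw
    if base64_cells is not None:
        payload = base64.b64decode(raw)  # newlines are skipped by the decoder
    if compression is not None:
        payload = CODECS[compression][0](payload)
    return StoredAttachment(payload, compression, base64_cells, newline)


def recover(att: StoredAttachment) -> bytes:
    """Undo the layers in reverse order: decompress first, then re-encode."""
    data = att.payload
    if att.compression is not None:
        data = CODECS[att.compression][1](data)
    if att.base64_cells is not None:
        data = wrap_base64(data, att.base64_cells, att.newline)
    return data
```

Because each stored object carries its own description of the layers, changing the configured compression later only affects newly stored attachments.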
Bill