Re: [Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem

13 Aug 2009


      Steve wrote:
...
...
dbox-only is fine.  I could care less about the storage method chosen -
filesystem, db, encrypted, whatever - but I believe the impact on
storage - and possibly indexes & searching - would be huge.
On the personal greedy side, if you want to see a mass corporate
migration to Dovecot, with potential service contracts - that would be a
feature worth talking about.  I can see IT managers' eyes light up at
hearing about such an item - and I've never heard of any other mail
server supporting such a thing.
IBM Lotus Domino has had that feature for ages (they call it shared mail). And they have it not just for normal mail but for archives as well (called single instance store). The feature was first introduced in cc:Mail, was then integrated into Domino, and is still there; it has even been extended to work with various backends (like the new DB2 backend). Microsoft copied the concept from them (from my viewpoint, the way MS did it in the past was horrible. I think newer versions work better, but I am not sure).
...
From my experience of doing messaging for two decades, I can tell you that it is not worth doing single instance store (or whatever you call it). Storage is ultra cheap these days, and backup systems are so fast that all the benefits which were valid some years ago are gone today.
It might rock your geek heart to implement something like that, but doing the math on costs versus benefits will sooner or later show you that today it's not worth doing.
I have no experience with Domino, but I just did a Google search for
"lotus domino shared mail" and read the brief on lotus.com.  Based on
what I read, it has potential - but it only splits message headers from
bodies and stores the bodies as complete images, without separating
attachments.

That helps reduce the load when somebody blasts out a flier to everyone
in the company in a single message - but I'm asking for something more
ambitious.
If every attachment in a given message is individually scanned to
generate some unique identifier, and that identifier then used to
determine whether or not it exists in the database - this could have
HUGE effects.  This now addresses not just the simple broadcast - but
some really crazy possibilities.
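
A minimal sketch of that lookup, assuming SHA-256 as the "unique
identifier" and an in-memory dict standing in for the database.  Both
are assumptions made for illustration; the class and method names are
hypothetical and nothing here reflects how Dovecot or dbox works.

```python
import hashlib


class AttachmentStore:
    """Toy single-instance store: one copy per distinct attachment."""

    def __init__(self):
        self.blobs = {}      # identifier -> attachment bytes
        self.refcount = {}   # identifier -> number of messages using it

    def put(self, data: bytes) -> str:
        # The identifier depends only on the content, so identical
        # attachments get the same key no matter who sent them or when.
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blobs:
            self.blobs[key] = data          # first sighting: store it
        self.refcount[key] = self.refcount.get(key, 0) + 1
        return key                          # the message keeps only this

    def get(self, key: str) -> bytes:
        return self.blobs[key]

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


store = AttachmentStore()
brochure = b"product brochure" * 1000       # made-up attachment body
first = store.put(brochure)                 # original delivery
second = store.put(brochure)                # same file, forwarded later
```

Both calls return the same key and the bytes are stored once.  The
reference count is what a delayed delete would need in order to know
when the last message pointing at an attachment is gone.
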
User A receives a message with an attachment (like a product brochure),
likes it, and forwards it to Users B-Z.
User F recognizes that product, but has a counter-proposal, so he
attaches another brochure and replies to A-Z.  Being an idiot, he
leaves the original attachment in the reply.
User H forwards this message to a buddy at another company for discussion.
[...time passes...]
Three weeks later, User 101 at the other company gets back from
vacation and has just received a message with the original brochure.  He
forwards it to User A (who started this mess).
User A, being a total dimwit, doesn't recognize that he already spread
this junk throughout the company last month - so he broadcasts it again.
Under the structure I've proposed, net storage consumed by the
attachments should be one copy of attachment 1 and one copy of
attachment 2, plus headers and any comments in the messages times the
number of recipients.  Domino would store one copy of attachment 1, then
a copy of attachment 1 + attachment 2, then another copy of attachment 1.
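
To put rough numbers on that comparison, here is a small sketch that
stores the three message bodies from the story both ways.  The
attachment sizes and comment texts are invented, SHA-256 is an assumed
identifier, and the "whole body" model is only my reading of the
lotus.com brief, not a description of Domino internals.

```python
import hashlib

ATT1 = b"1" * 500_000   # original brochure (size is made up)
ATT2 = b"2" * 300_000   # counter-proposal brochure (size is made up)

# (comment text, attachments) for each distinct body in the story
bodies = [
    (b"Take a look at this", [ATT1]),          # A forwards to B-Z
    (b"I have a better one", [ATT1, ATT2]),    # F replies to A-Z
    (b"Have you all seen this?", [ATT1]),      # A broadcasts it again
]


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


# Whole-body store: one entry per distinct body.  Any change to the
# comment text makes a new body, so its attachments are stored again.
whole = {}
for comment, attachments in bodies:
    payload = b"".join(attachments)
    whole[digest(comment + payload)] = len(payload)

# Per-attachment store: one entry per distinct attachment.
per_attachment = {}
for _comment, attachments in bodies:
    for att in attachments:
        per_attachment[digest(att)] = len(att)

whole_total = sum(whole.values())
per_attachment_total = sum(per_attachment.values())
print(whole_total, per_attachment_total)
```

With these invented sizes the whole-body store holds 1,800,000 bytes of
attachment data and the per-attachment store holds 800,000.  The gap
grows with every further forward that changes the comment text but
keeps the same files.
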
This is a minor example - but I just wanted to show SOMETHING to justify
the effort.
As far as cheap storage - I agree costs are a fraction of what they once
were.  But by reducing the amount stored, consider the tradeoffs of
reduced caching, smaller differential backups, and reduced archival
costs (off-site storage costs often calculated per GB), just to name a
few.  To me the only down side (other than requiring Timo to invest more
blood, sweat, & tears in this project) is how much this costs in message
READ time.  For me, typical user interaction is reading.  As I believe
was previously mentioned, if the server implements some type of delayed
delete function, then delete times are not a concern.  And write times
are also (I think) a minor concern.  But the primary issue is how fast
we can retrieve a message + attachments and stream it to the client.  It
seems to me header lists won't be impacted, so simply pointing the mail
client at the server to see a list of mail shouldn't change at all.  So
then the question is the potential latency from when a user selects a
message to when it appears on their screen.  Will the time spent
searching the disk, and assembling the message, be significant when
compared with the network communication between server & client?
--
Daniel

Daniel L. Miller