[Dovecot] Scalability plans: Abstract out filesystem and make it someone else's problem

Thu Aug 13 00:54:59 EEST 2009

Steve wrote:
>
>> dbox-only is fine.  I couldn't care less about the storage method chosen - 
>> filesystem, db, encrypted, whatever - but I believe the impact on 
>> storage - and possibly indexes & searching - would be huge.
>>
>> On the personal greedy side, if you want to see a mass corporate 
>> migration to Dovecot, with potential service contracts - that would be a 
>> feature worth talking about.  I can see IT managers' eyes light up at 
>> hearing about such an item - and I've never heard of any other mail 
>> server supporting such a thing.
>>
>>     
> IBM Lotus Domino has had that feature for ages (they call it shared mail). And they have it not just for normal mail but for archives as well (called single instance store). This feature was first introduced in cc:Mail, then got integrated into Domino, and is still there and has even been extended to work with various backends (like the new DB2 backend). Microsoft copied that concept from them (from my viewpoint, the way MS did it in the past was horrible. I think newer versions work better, but I am not sure).
>
> From my experience of doing messaging for two decades, I can tell you that it is not worth doing single instance store (or whatever you call it). Storage is ultra cheap these days and backup systems are so fast that all the benefits which were valid some years ago are gone today.
>
> It might rock your geek heart to implement something like that, but doing the math on costs versus benefits will sooner or later show you that today it's not worth doing.
I have no experience with Domino, but I just did a Google for "lotus 
domino shared mail" and read the brief on lotus.com.  Based on what I 
read, it has potential - but it only splits message headers from bodies and 
stores the bodies as complete images, without separating attachments.  
That helps reduce the load when somebody blasts out a flier to everyone 
in the company in a single message - but I'm asking for something more 
ambitious.

If every attachment in a given message is individually scanned to 
generate some unique identifier, and that identifier is then used to 
determine whether or not the attachment already exists in the database - 
this could have HUGE effects.  This now addresses not just the simple 
broadcast - but some really crazy possibilities.
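
A minimal sketch of the identifier idea, assuming the "unique identifier" 
is a SHA-256 digest of the attachment bytes.  The AttachmentStore class 
and its methods are hypothetical names made up for illustration; this is 
not how Dovecot or dbox stores anything, just the general shape of a 
content-addressed store:

```python
import hashlib


class AttachmentStore:
    """Hypothetical in-memory single-instance store keyed by content hash."""

    def __init__(self):
        self.blobs = {}     # digest -> attachment bytes, kept exactly once
        self.refcount = {}  # digest -> number of messages pointing at it

    def put(self, data: bytes) -> str:
        """Store an attachment if it is new and return its identifier."""
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:
            self.blobs[digest] = data  # first time we see this content
        self.refcount[digest] = self.refcount.get(digest, 0) + 1
        return digest

    def get(self, digest: str) -> bytes:
        """Fetch attachment bytes by identifier (the read path)."""
        return self.blobs[digest]

    def stored_bytes(self) -> int:
        """Total bytes actually held, ignoring how many messages refer to them."""
        return sum(len(blob) for blob in self.blobs.values())


store = AttachmentStore()
brochure = b"original product brochure" * 1000

# The same brochure delivered to 25 mailboxes yields one identifier
# and one stored copy.
ids = [store.put(brochure) for _ in range(25)]
```

Each message would then only need to keep the returned identifier in 
place of the attachment itself.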

User A receives a message with an attachment (like a product brochure), 
likes it, and forwards it to Users B-Z.
User F recognizes that product, but has a counter-proposal, so he 
attaches another brochure and replies to A-Z.  Being an idiot, he leaves 
the original attachment in the reply.
User H forwards this message to a buddy at another company for discussion.
[...time passes...]
Three weeks later, User 101 at the other company gets back from 
vacation, having just received a message with the original brochure.  He 
forwards it to User A (who started this mess).
User A, being a total dimwit, doesn't recognize that he already spread 
this junk throughout the company last month - so he broadcasts it again.

Under the structure I've proposed, net storage consumed by the 
attachments should be one copy of attachment 1 and one copy of 
attachment 2, plus headers and any comments in the messages times the 
number of recipients.  Domino would store one copy of attachment 1, then 
a copy of attachment 1 + attachment 2, then another copy of attachment 1.
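
To put rough numbers on that comparison, here is a small sketch.  The 
attachment sizes are invented (the scenario gives none), and the 
"per-body" model simply follows the reading of the Domino brief given 
above rather than any verified description of how Domino behaves.  Only 
the three company-wide broadcasts are counted:

```python
# Hypothetical attachment sizes in bytes; the scenario gives no figures.
SIZES = {"brochure_1": 2_000_000, "brochure_2": 3_000_000}

# The three broadcasts, each listed by the attachments it carries.
broadcasts = [
    ["brochure_1"],                # User A forwards to B-Z
    ["brochure_1", "brochure_2"],  # User F replies to A-Z
    ["brochure_1"],                # User A broadcasts it again
]


def per_body_bytes(messages):
    """One stored body per broadcast: every attachment in it is kept again."""
    return sum(SIZES[name] for message in messages for name in message)


def per_attachment_bytes(messages):
    """One stored copy per distinct attachment across all broadcasts."""
    unique = {name for message in messages for name in message}
    return sum(SIZES[name] for name in unique)


print(per_body_bytes(broadcasts))        # 9000000
print(per_attachment_bytes(broadcasts))  # 5000000
```

With these made-up sizes the per-body approach keeps three copies of 
attachment 1, and the gap widens with every further forward of the same 
file.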

This is a minor example - but I just wanted to show SOMETHING to justify 
the effort.

As far as cheap storage - I agree costs are a fraction of what they once 
were.  But by reducing the amount stored, consider the tradeoffs of 
reduced caching, smaller differential backups, and reduced archival 
costs (off-site storage costs often calculated per GB), just to name a 
few.  To me the only down side (other than requiring Timo to invest more 
blood, sweat, & tears in this project) is how much this costs in message 
READ time.  For me, typical user interaction is reading.  As I believe 
was previously mentioned, if the server implements some type of delayed 
delete function, then delete times are not a concern.  And write times 
are also (I think) a minor concern.  But the primary issue is how fast 
can we retrieve a message + attachments and stream it to the client.  It 
seems to me header lists won't be impacted, so simply pointing the mail 
client at the server to see a list of mail shouldn't change at all.  So 
then the question is the potential latency from when a user selects a 
message to when it appears on their screen.  Will the time spent 
searching the disk, and assembling the message, be significant when 
compared with the network communication between server & client?

--
Daniel