On Wed, 2009-08-12 at 18:42 +0100, Ed W wrote:
Something like that. In dbox you have one storage directory containing all mailboxes' mails (so that copying can be done by simple index updates). Then you have a bunch of files, each about n MB (configurable, 2 MB by default). Expunging initially only marks the message as expunged in the index. Then later (or immediately, configurable) you run a cronjob that goes through all dboxes and actually reclaims the space used by expunged messages by recreating those dbox files.
Yeah, sounds good.
You might consider some kind of "head optimisation", where we can already assume that the latest chunk of mails will be noisy and have a mixture of deletes/appends, etc. Typically mail arrives, gets responded to, and gets deleted quickly, but I would *guess* that if a mail survives for XX hours in a mailbox then it's likely to stay there for quite a long time, until some kind of purge event happens (user goes on a purge, archive task, etc.).
If disk space usage isn't such a huge problem, I think the nightly purges solve this issue too. During the day a user may get mails and delete them, and at night the deleted mails are purged. Perhaps it could help a bit if new mails were all stored in separate file(s) and then at night appended to some larger existing file, but that optimization can be left until later. :)
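
To make the expunge-then-purge idea concrete, here is a minimal sketch in Python. It is not Dovecot's dbox format or code: the class ToyDboxFile, its in-memory index layout and the method names are all made up for illustration. It only shows that expunging touches the index alone, and that a later purge recreates the file without the expunged messages.

```python
import os
import tempfile


class ToyDboxFile:
    """Hypothetical stand-in for one dbox storage file (not the real format)."""

    def __init__(self, path):
        self.path = path
        # Assumed index layout: msg_id -> [offset, size, expunged flag].
        self.index = {}
        open(path, "ab").close()  # make sure the file exists

    def append(self, msg_id, data):
        offset = os.path.getsize(self.path)
        with open(self.path, "ab") as f:
            f.write(data)
        self.index[msg_id] = [offset, len(data), False]

    def expunge(self, msg_id):
        # Cheap: only the index changes, the bytes stay in the file.
        self.index[msg_id][2] = True

    def read(self, msg_id):
        offset, size, expunged = self.index[msg_id]
        if expunged:
            raise KeyError(msg_id)
        with open(self.path, "rb") as f:
            f.seek(offset)
            return f.read(size)

    def purge(self):
        # Recreate the file with only the surviving messages, then swap it
        # into place. This is the part a nightly cronjob would run.
        directory = os.path.dirname(self.path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        new_index = {}
        with os.fdopen(fd, "wb") as out, open(self.path, "rb") as src:
            for msg_id, (offset, size, expunged) in self.index.items():
                if expunged:
                    continue
                src.seek(offset)
                new_index[msg_id] = [out.tell(), size, False]
                out.write(src.read(size))
        os.replace(tmp_path, self.path)
        self.index = new_index
```

With this toy, appending three messages and expunging one leaves the file size unchanged until purge() is called; only then does the file shrink. A real implementation would also need locking and a persistent index, which are left out here.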
Oh, have you considered some "optional" API calls in the storage API? The logic might be to assume that someone wanted to do something clever and split the message up in some way, e.g. store headers separately from bodies, or bodies carved up into MIME parts. The motivation would be if there was a certain access pattern to optimise. E.g. for an SQL database it may well be sensible to split headers and the message body in order to optimise searching - the current API may not take advantage of that?
Well, files have paths. I think the storage backend can determine from that what type the data is. So if you're writing to mails/foo/bar/123 it means you're storing a message with ID 123 to mailbox "foo/bar". It could then internally parse the message and store its header/body/mime separately.
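
A rough sketch of that path-based dispatch, again in Python and again with made-up names (parse_storage_path, split_message and ToySplitBackend are hypothetical, not part of any Dovecot API). It assumes paths always look like mails/<mailbox>/<id> with a numeric ID, and it splits only header from body at the first blank line; MIME parts are not handled.

```python
def parse_storage_path(path):
    """Map 'mails/foo/bar/123' to (mailbox 'foo/bar', message ID 123)."""
    parts = path.split("/")
    # Assumption: first component is 'mails', last component is a numeric ID.
    if len(parts) < 3 or parts[0] != "mails" or not parts[-1].isdigit():
        raise ValueError("not a message path: %r" % path)
    return "/".join(parts[1:-1]), int(parts[-1])


def split_message(raw):
    """Split raw message bytes into (header, body) at the first blank line."""
    found = []
    for sep in (b"\r\n\r\n", b"\n\n"):
        pos = raw.find(sep)
        if pos != -1:
            found.append((pos, sep))
    if not found:
        return raw, b""  # header-only message
    pos, sep = min(found, key=lambda item: item[0])
    return raw[:pos], raw[pos + len(sep):]


class ToySplitBackend:
    """Hypothetical backend that stores headers and bodies separately."""

    def __init__(self):
        self.headers = {}  # (mailbox, msg_id) -> header bytes
        self.bodies = {}   # (mailbox, msg_id) -> body bytes

    def write(self, path, raw):
        key = parse_storage_path(path)
        header, body = split_message(raw)
        self.headers[key] = header
        self.bodies[key] = body
        return key
```

The caller just writes a whole message to a path; whether the backend keeps it as one blob or splits it (for example into separate SQL columns for searching) stays an internal decision of the backend.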