On Thu, Jun 28, 2012 at 1:14 PM, Timo Sirainen <tss@iki.fi> wrote:
On 28.6.2012, at 17.43, Gary Mort wrote:
First I want to add AWS S3 as a storage option for alternate storage.
Then instead of the above model, the new model would be that email is always stored in alternate storage, and may be in primary storage. So, when mail comes in, I'd have Dovecot save the email to the alternate storage S3 bucket and update the indexes and other information. [Ideally, for convenience, a few bits of relevant indexing information can be stored as metadata in the S3 object: enough that, instead of retrieving the entire S3 object, just the metadata can be pulled to build indexes.]
The indexes have to be in primary storage.
True, but I'm assuming the data they are based on does not include the full email message, just a few key pieces: unique ID, subject, from, to, etc.
For an always running server, the indexes are always up to date in primary.
For a server starting up with no index data [or for a second server running when new email has been delivered], the index information will need to be rebuilt. So rather than download every single email message just for a few bits of key info, I can run a re-index process that pulls just the metadata and grabs the data from there.
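A rough sketch of the re-index idea above, using an in-memory dictionary in place of a real S3 bucket. The names (`StoredMail`, `rebuild_index`, the metadata keys) are made up for illustration and are not Dovecot or AWS APIs. On real S3, user-defined metadata comes back on a HEAD request without the object body, which is what the `metadata` field stands in for here.

```python
from dataclasses import dataclass, field


@dataclass
class StoredMail:
    """Stand-in for one S3 object: small metadata plus the full message body."""
    metadata: dict
    body: bytes = field(repr=False, default=b"")


def rebuild_index(bucket: dict) -> list:
    """Build index rows from metadata only; message bodies are never read."""
    index = []
    for key, obj in bucket.items():
        meta = obj.metadata
        index.append({
            "key": key,
            "uid": int(meta["uid"]),
            "subject": meta.get("subject", ""),
            "from": meta.get("from", ""),
        })
    # Sort by UID so the rebuilt index has a stable order.
    index.sort(key=lambda row: row["uid"])
    return index


# Hypothetical bucket contents, keyed by object name.
bucket = {
    "mail/0002": StoredMail({"uid": "2", "subject": "Re: S3", "from": "tss@example.org"}, b"..."),
    "mail/0001": StoredMail({"uid": "1", "subject": "S3", "from": "gary@example.org"}, b"..."),
}
index = rebuild_index(bucket)
print([row["uid"] for row in index])
```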
When a client attempts to retrieve an email message, Dovecot would check primary storage as it does now; if the message is not found, then it will retrieve it from the alternate storage system AND store a copy in the primary storage.
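The lookup order described above could look roughly like this. It is only a sketch with plain dictionaries standing in for both storage tiers; `MailStore` and `fetch` are hypothetical names, not Dovecot internals.

```python
class MailStore:
    """Two-tier lookup: primary storage first, alternate storage as fallback."""

    def __init__(self, alternate: dict):
        self.primary = {}           # local copies, may be missing messages
        self.alternate = alternate  # assumed to always hold every message
        self.alt_fetches = 0        # how often alternate storage was needed

    def fetch(self, key: str) -> bytes:
        if key in self.primary:
            return self.primary[key]
        body = self.alternate[key]  # raises KeyError if the mail does not exist
        self.alt_fetches += 1
        self.primary[key] = body    # keep a copy for later reads
        return body


store = MailStore(alternate={"mail/0001": b"Subject: hello\r\n\r\nbody"})
store.fetch("mail/0001")  # comes from alternate storage and gets copied
store.fetch("mail/0001")  # served from primary storage
print(store.alt_fetches)
```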
I think the storing wouldn't be very useful. Most clients download the message once. There's no reason to cache it if it doesn't get downloaded again. The way it should work is that new mails are immediately delivered to both primary and alt storage.
I've got tons of space, so I don't mind having 750MB or so for primary email message storage. If I can track how many times a message was actually read, over time I can get an idea of how I use it and set up the primary storage purge rules accordingly.
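One way the read tracking could feed a purge rule, as a sketch only: count reads per message and, when primary storage is over budget, drop the least-read messages first. The policy and the names (`ReadStats`, `purge_candidates`) are assumptions for illustration; the sizes below are arbitrary units.

```python
from collections import Counter


class ReadStats:
    """Counts how many times each message has been read."""

    def __init__(self):
        self.reads = Counter()

    def record(self, key: str) -> None:
        self.reads[key] += 1


def purge_candidates(sizes: dict, stats: ReadStats, budget: int) -> list:
    """Keys to drop from primary storage, least-read first, until it fits the budget."""
    total = sum(sizes.values())
    victims = []
    # Unread messages have a count of 0; ties are broken by key name.
    for key in sorted(sizes, key=lambda k: (stats.reads[k], k)):
        if total <= budget:
            break
        victims.append(key)
        total -= sizes[key]
    return victims


stats = ReadStats()
for key in ["a", "a", "a", "c"]:
    stats.record(key)

sizes = {"a": 400, "b": 300, "c": 200}  # 900 in total
victims = purge_candidates(sizes, stats, budget=750)
print(victims)
```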
Secondly, I'd like to replace the MySQL database usage with a SimpleDB database. While SimpleDB lacks much of MySQL's sophistication, it doesn't seem that Dovecot is really using any of that, so SimpleDB can be functionally equivalent.
Dovecot will probably get a Redis and/or memcache backend for passdb+userdb. If SimpleDB is a similar key-value database, I guess the same code could be partially reused.
SimpleDB is more like SQLite: "Amazon SimpleDB is a highly available and flexible non-relational data store that offloads the work of database administration. Developers simply store and query data items via web services requests and Amazon SimpleDB does the rest." http://aws.amazon.com/simpledb/
Data model: http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/DataM...
Domain == table
Item == row
ItemName == primary key
Attribute == column
Value == data in column [multi-valued, so there can be multiple values for an attribute of an item]
There is no built-in key relationship between data; it's just one big flat table. Attribute values are all stored as strings; there is no separate integer or date type, so numbers are compared as text.
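To make the mapping concrete, here is a small in-memory model of that layout. It only mimics the shape of the data (domain, item name, multi-valued attributes) and is not the SimpleDB API; `put_attribute` and `select` are hypothetical helpers, and values are kept as strings in this sketch.

```python
from collections import defaultdict

# domain -> item name -> attribute -> set of string values
domains = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))


def put_attribute(domain: str, item_name: str, attribute: str, value) -> None:
    """Add one value to an attribute; an attribute can hold several values."""
    domains[domain][item_name][attribute].add(str(value))


def select(domain: str, attribute: str, value) -> list:
    """Names of items in the domain whose attribute contains the given value."""
    return sorted(
        name
        for name, attrs in domains[domain].items()
        if str(value) in attrs.get(attribute, ())
    )


put_attribute("users", "gary", "alias", "gary@example.org")
put_attribute("users", "gary", "alias", "gm@example.org")  # second value, same attribute
put_attribute("users", "timo", "alias", "tss@example.org")

matches = select("users", "alias", "gm@example.org")
print(matches)
```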
You query the data like an SQL table: http://docs.amazonwebservices.com/AmazonSimpleDB/latest/DeveloperGuide/Using...
Because there is no date type, it's best to store dates as UTC timestamps, zero-padded to a fixed width so that they sort and compare correctly as strings.
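A quick illustration of why the padding matters when values are compared as text. The helper name `encode_timestamp` and the width of 10 digits are choices made for this example.

```python
from datetime import datetime, timezone

WIDTH = 10  # enough digits for Unix timestamps until the year 2286


def encode_timestamp(dt: datetime) -> str:
    """UTC seconds since the epoch, zero-padded so string order matches time order.

    Pass a timezone-aware datetime; naive ones are interpreted as local time.
    """
    return f"{int(dt.timestamp()):0{WIDTH}d}"


early = encode_timestamp(datetime(1999, 1, 1, tzinfo=timezone.utc))
late = encode_timestamp(datetime(2012, 6, 28, tzinfo=timezone.utc))

print(early < late)                            # padded strings order correctly
print(str(915148800) < str(1340841600))        # unpadded strings do not
```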
The datastore is spread over multiple Amazon data servers and can take up to a second to sync, so there are two methods of querying the data.
Default: eventually consistent read: get the data quickly
Optional: consistent read: check /all/ datastores and get the latest data
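The difference between the two read modes can be shown with a toy model. This is not how SimpleDB is implemented internally; `ReplicatedStore` is a made-up class in which a write lands on one replica and reaches the others only on `sync()`.

```python
class ReplicatedStore:
    """Toy model: writes land on one replica and reach the others on sync()."""

    def __init__(self, replica_count: int = 3):
        self.replicas = [{} for _ in range(replica_count)]
        self.version = 0

    def write(self, key, value):
        self.version += 1
        self.replicas[0][key] = (self.version, value)

    def sync(self):
        for replica in self.replicas[1:]:
            replica.update(self.replicas[0])

    def read(self, key, consistent=False, replica=-1):
        if consistent:
            # Ask every replica and keep the newest version.
            found = [r[key] for r in self.replicas if key in r]
        else:
            # Ask a single replica, which may be stale.
            r = self.replicas[replica]
            found = [r[key]] if key in r else []
        if not found:
            return None
        return max(found, key=lambda pair: pair[0])[1]


store = ReplicatedStore()
store.write("quota", "100")
store.sync()
store.write("quota", "200")  # not yet synced to the other replicas

stale_value = store.read("quota")                   # last replica still has the old value
fresh_value = store.read("quota", consistent=True)  # newest version across all replicas
print(stale_value, fresh_value)
```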
Since the data in SimpleDB may not be updated frequently, a simple hack using the notification system could be:
1. Before updating SimpleDB, send an SNS notice that the data is being updated and where [domain, user, config].
2. Update the data.
3. After updating SimpleDB, send an SNS notice that the update is complete.
Other servers running can record data updating notices in memory and expire them in about 15 seconds. For any queries they want to make for that type of data in the next 15 seconds, they will use consistent read.
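On the receiving side, the bookkeeping could be as small as the sketch below. `UpdateNotices` is a hypothetical helper; the SNS delivery itself is left out, and the clock is injectable only so the example can run without waiting 15 seconds.

```python
import time


class UpdateNotices:
    """Remembers 'data is being updated' notices for a short time."""

    def __init__(self, ttl: float = 15.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.expires = {}  # (domain, user, config) -> expiry time

    def on_notice(self, domain: str, user: str, config: str) -> None:
        """Called when an update notice arrives."""
        self.expires[(domain, user, config)] = self.clock() + self.ttl

    def needs_consistent_read(self, domain: str, user: str, config: str) -> bool:
        """True while a recent notice for this data is still remembered."""
        key = (domain, user, config)
        expiry = self.expires.get(key)
        if expiry is None:
            return False
        if self.clock() >= expiry:
            del self.expires[key]  # notice is too old, forget it
            return False
        return True


now = [0.0]  # fake clock so the example is deterministic
notices = UpdateNotices(clock=lambda: now[0])
notices.on_notice("example.org", "gary", "quota")

fresh = notices.needs_consistent_read("example.org", "gary", "quota")
now[0] = 16.0  # more than 15 seconds later
stale = notices.needs_consistent_read("example.org", "gary", "quota")
print(fresh, stale)
```

A query for data with a remembered notice would then be issued as a consistent read, and everything else as the default eventually consistent read.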
The nice thing about using S3 and SimpleDB is that I can completely skip a lot of steps in replication/distributed services, as it is all handled already. And one can always take one set of API calls and substitute another for a different notification system, distributed database, and cloud file storage.