Replication is the New Durability (Thoughts About Dynamo)

Jul 27, 2010 / By Gwen Shapira

“Dynamo: Amazon’s Highly Available Key-value Store” is a high-level description of a data store, written by Amazon to solve the problem of a system where updates must never, ever fail and must take less than a specific amount of time in 99.99% of cases. No matter what happens to the servers or the network, updates to the system must continue as usual, and the authors emphasize that they deal with hardware and network failures nearly constantly.

The paper has one of the best descriptions of the trade-offs involved in eventual consistency and of when it makes sense. But even more interesting is the implicit decision that disks, commits and synchronous writes to redo logs are not really needed for durability.

To allow for a simple, robust and high-performance design, Amazon limited the query capabilities – data is accessed by primary key only. To implement the high-availability requirement, Dynamo replicates each key-value pair across multiple machines in different data centers. And to make sure that replication doesn’t impact performance, Amazon decided they can live without consistency – requests to the system can sometimes return old data or several contradicting versions of the data – for Amazon’s requirements, this is much better than failures or delays.

The consistency trade-off is dealt with very explicitly in the paper – there will be multiple versions of the data, a request may get any number of replies, and those replies may not include the most recent updates. Application developers have to build applications that can handle this situation and merge the different versions into the information they want.

The durability trade-off is made implicitly. The paper simply does not mention writing anything to any disk. Nothing like redo logs is mentioned either (except in the context of tracking changes for a missing node). This is obviously very fast, but can we really trust such a system not to lose any changes?

To avoid losing data, the system uses a quorum-like scheme. There are three important parameters to tune for replication: how many copies of each key-value pair are kept (N), how many nodes must participate in a successful write (W), and how many nodes must participate in a successful read (R).

When W=3, at least 3 nodes must signal that the write was successful before the application gets an “ack” from the data store. A coordinator node will attempt to write to N nodes, but once W nodes acknowledge the write, it is considered permanent. To lose an update, at least W nodes have to crash simultaneously after the update – probably more, since W is just the lower bound. As long as at least one of those nodes is up, the data is not lost.
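To make the W math concrete, here is a minimal sketch of a coordinator counting acknowledgements – my illustration in Python, not Dynamo’s actual code, and the node interface (node.put) is hypothetical:

    # Hypothetical sketch of quorum-style writes; node.put() is a stand-in,
    # not Dynamo's real interface.
    N = 3  # replicas each key is written to
    W = 2  # acks required before the client sees success
    R = 2  # replicas consulted on a read

    def quorum_write(key, value, preference_list):
        """Try up to N replicas; report success once W of them acknowledge."""
        acks = 0
        for node in preference_list[:N]:
            try:
                node.put(key, value)      # per-node store call (hypothetical)
                acks += 1
            except ConnectionError:
                continue                  # unreachable node: move to the next
            if acks >= W:
                return True               # W copies exist, the write "sticks"
        return False                      # could not assemble a write quorum

The paper’s rule of thumb is that setting R + W > N yields a quorum-like system, so a read is likely to overlap at least one replica that saw the latest write.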

The idea is to set the replication level high enough to satisfy durability requirements, but low enough to avoid too much latency. You also want to split those nodes between at least two data centers to reduce the chance of losing all of them at once. Considering that this system doesn’t allow for backups, I would set the number of replicas really high. I would expect about 50 nodes covering each piece of data before I’d consider it as durable as a disk-based system (considering how often servers crash vs. how often disks crash). Amazon, on the other hand, mention in the paper that they use a replication level of 3, and they keep stressing that not losing a single update is vital to their business (because each update can be a sale of an item), so maybe I’m under-estimating the durability of this system.
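To put my skepticism in numbers – and these are my own assumptions, not figures from the paper – if each replica is independently lost with probability p before repair catches up, the chance of losing all N copies at once is roughly p to the power N:

    # Back-of-the-envelope only; p is an assumed per-replica loss probability
    # within a repair window, not a measured figure.
    p = 0.01
    for n in (3, 5, 50):
        print(n, p ** n)   # 3 -> ~1e-06, 5 -> ~1e-10, 50 -> ~1e-100

The catch is that failures are rarely independent – a power or network event can take out a whole data center at once, which is exactly why Dynamo spreads the replicas across data centers.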

From the description, it also appears that it is impossible to dump the contents of the system. This means no backups, no point-in-time recovery, and ETL into a DW system also sounds impossible.

I like the idea of databases without disks – disks are notorious for just slowing everything down. I’m just not sure I would want my bank to use a replication-only system, even if it does seem durable in theory. So, if we can’t use Dynamo for banks, what is it good for?

Amazon’s paper gives the Amazon shopping cart as an example – the data is temporary anyway, there are many more writes to the system than reads (you keep putting items in the cart, but only look at the cart once before checking out), and there is a clear rule for merging conflicting versions (union). Social networks are another obvious use case, with tons of updates and a trivial way to merge different versions; I’m not sure access is limited to just the primary key, though – you want to see your own data and that of your friends.
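Here is a minimal sketch of that union rule – my illustration, not Amazon’s code; in the real system the client first uses vector clocks to detect that the versions conflict:

    def merge_carts(versions):
        """Reconcile conflicting cart versions by taking the union of items."""
        merged = set()
        for cart in versions:      # each version is a set of item ids
            merged |= cart         # union: keep every item any version saw
        return merged

    print(merge_carts([{"book", "kindle"}, {"book", "mug"}]))
    # e.g. {'book', 'kindle', 'mug'} (set order varies)

The paper itself admits the downside of this rule: an item deleted in one version can resurface after the merge, a price Amazon accepted so that an “add to cart” is never lost.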

I think an interesting possible use case is system-monitoring data – it is also update-heavy, can be mostly primary-key access (the metric, such as CPU on a specific machine, being the key), and merging versions is trivial (union). Monitoring data does need to be reported, though, so I would expect my data store to allow dumping the data into a more traditional store at specific intervals.
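As a purely hypothetical sketch of that key design (nothing like this appears in the paper), the value could be a set of samples so that concurrent versions merge with the same union rule as the cart:

    # Hypothetical: key = (metric, host), value = a set of (timestamp, reading)
    # samples, so concurrent versions reconcile by union.
    key = ("cpu_util", "web-042")
    version_a = {(1280246400, 0.71), (1280246460, 0.65)}
    version_b = {(1280246460, 0.65), (1280246520, 0.80)}
    merged = version_a | version_b   # union keeps every sample from both writers
    print(sorted(merged))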

The Dynamo paper was written four years ago; I’m sure the system has improved since. I’m also sure the paper gives only a partial description of the system’s features (and I suspect that they do write data to disks after all). Still, it paved the way for a future of special-purpose databases, ones that solve a specific use case better than a general-purpose RDBMS.

If you ask yourself “What are my requirements?” and “What is the best data store for my requirements?” instead of saying “We have Oracle, let’s throw everything there and it will take care of everything magically”, your application will be better for it. As Cary Millsap recently wrote – it is better to think clearly than to be correct.

4 Responses to “Replication is the New Durability (Thoughts About Dynamo)”

  • Gary says:

    “I’m not sure if access is limited to just primary key though – you want to see your own data and that of your friends.”

    That is possible with linked lists but very data-heavy. “Fred” has a linked list of all the updates he expects to see. He queries them by getting the most recent update, which links to the previous one, and so on.
    But that means that when one of “Fred’s” friends posts an update, that update has to be added to “Fred’s” linked list. So if “Wilma” has ten friends, any update means ten new entries in different linked lists. Workable for tens/hundreds (and indeed, that was the way in pre-relational days).

    At twitter scale, where some people have hundreds of thousands of followers, that would get scary.

  • Write-only is an attribute of many systems – for example, RFID tag tracking. However, this would be one level only, and this info needs to go somewhere for the next stage of processing, but this time in bulk, so dumping the data somehow is a must.

    Even in social media, how are they doing business? They need to be able to tap into this super-valuable information of connected human networks to make decisions about ad targeting, etc.

  • Gwen Shapira says:

    As I said – I’m sure Amazon didn’t give away everything in that paper.

    For example, they give recommendations based on items that I looked at, but didn’t buy – that means they run analytics of sorts on this data, which means they have it in some kind of DW. How did it get there? They don’t tell.

  • joel garry says:

    @Gwen:

    DW for Amazon may be far different than we expect. See http://mashable.com/2010/07/27/amazon-facebook-recommendations/

    My speculation is that they have generalized the interaction between data streams and little DW-type views. So if someone likes Pink Floyd, it would follow down a neural network of all things related to Pink Floyd and tell you to buy Ummagumma for their birthday. Or maybe an axe if the friend is named Gene. Or http://news.cnet.com/2100-1023-976435.html
