How Todd Hoff stoped worrying and used disk space to scale

Posted in: MySQL, Technical Track

Todd Hoff, who apparently learned a hell of a lot during a short stint at Yahoo followed by some startups has an extremely well-written and edutaining article about how scaling to a million or more users requires jettisoning more or less everything we know and love about relational modeling.

Even though he uses bigtable (Google’s distributed hash storage system) as his example, in reality this approach works well with relational datastores like MySQL and Oracle too, you just have to think about your data differently and use the databases differently. So I’m including this article in the MySQL and Oracle categories because I think it would be of interest.

Here’s a taste of how it reads:

How do you structure your database using a distributed hash table like BigTable? The answer isn’t what you might expect. If you were thinking of translating relational models directly to BigTable then think again. The best way to implement joins with BigTable is: don’t. You–pause for dramatic effect–duplicate data instead of normalize it. *shudder*

Flickr anticipated this design in their architecture when they chose to duplicate comments in both the commentor and the commentee user shards rather than create a separate comment relation. I don’t know how that decision was made, but it must have gone against every fiber in their relational bones…

But Flickr’s reasoning was genius. To scale you need to partition. User data must spread across the shards. So where do comments belong in a scalable architecture?

The answer is, in case you aren’t following yet, you store it everywhere you might need it and worry about keeping your multiple copies in sync later, if at all.

BigTable data ethics are more Mardi Gras than dinner with the in-laws. Data just wants to have fun. BigTable won’t stop you from hurting yourself. And to get the best results you may have to engage in some conventionally risky behaviors. But if those are the glass bead necklaces you have to give for a peak at scalability, why not take a walk on the wild side?

So anyway, this is awesome stuff and thanks Todd. For your reading and learning enjoyment: Todd Hoff’s “How I learned to stop worrying and use lots of disk space to scale”.

Interested in working with Paul? Schedule a tech call.

About the Author

As Pythian’s Chief Executive Officer, Paul leads this center of excellence for expert, outsourced technical services for companies whose systems are directly tied to revenue growth and business success. His passion and foresight for using data and technology to drive business success has helped Pythian become a high-growth global company with over 400 employees and offices in North America, Europe, and Asia. Paul, who started his career as a data scientist, founded Pythian when he was 25 years old. In addition to driving the business, Paul is a vocal proponent of diversity in the workplace, human rights, and economic empowerment. He supports his commitment through Pythian’s hiring and retention practices, his role as board member for the Basic Income Canada Network, and as a supporter of women in technology.

2 Comments. Leave new

Hi Paul,

Actually I was learning from the people mentioned in post and inflicting it on the blog. I don’t have an easy time with this stuff either. The post was my way of trying to understand how to write my own code and then writing up how it seems to me.

And I think you are exactly correct about the ideas working on other systems.


Yes, exactly. In fact I was jamming with a giant in the internet-scale world (a very early leader) today about how one could use Oracle’s object interface on a federated architecture to exactly mimic the performance and scaling characteristics of bigtable.



Leave a Reply

Your email address will not be published. Required fields are marked *