As some of you may know, in my current role at Pythian I am tackling OSDB, and Cassandra is on my radar. I have been trying to learn what Cassandra is, so in this post I'm going to share a bit of what I have learned so far.
According to the whitepaper "Solving Big Data Challenges for Enterprise Application Performance Management", Cassandra is a "distributed key value store developed at Facebook. It was designed to handle very large amounts of data spread out across many commodity servers while providing a highly available service without single point of failure allowing replication even across multiple data centers as well as for choosing between synchronous or asynchronous replication for each update."
Cassandra, in layman's terms, is a NoSQL database written in Java. One of its many benefits is that it is open source with deep developer support. It is also fully distributed, meaning that there is no master node (unlike a typical Oracle or MySQL setup), so the database has no single point of failure. It also touts linear scalability: if 2 nodes give you a throughput of 100,000 transactions per second and you add 2 more nodes, you should get 200,000 transactions per second, and so forth.
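To make the linear scalability claim concrete, here is a minimal sketch of the arithmetic. The function name `projected_throughput` and its parameters are my own, made up for illustration; this is not a Cassandra API, and it assumes ideal scaling, which a real cluster will only approximate.

```python
def projected_throughput(baseline_nodes, baseline_tps, target_nodes):
    """Project cluster throughput under the assumption of perfectly linear scaling.

    baseline_nodes / baseline_tps describe a measured cluster;
    target_nodes is the cluster size we want an estimate for.
    """
    if baseline_nodes <= 0:
        raise ValueError("baseline_nodes must be positive")
    if target_nodes < 0:
        raise ValueError("target_nodes cannot be negative")
    # Under linear scaling every node contributes the same share of throughput.
    per_node_tps = baseline_tps / baseline_nodes
    return per_node_tps * target_nodes


if __name__ == "__main__":
    # Hypothetical baseline taken from the example in the text: 2 nodes, 100,000 TPS.
    for nodes in (2, 4, 8):
        print(nodes, "nodes:", projected_throughput(2, 100_000, nodes), "TPS")
```

With the baseline from the text, going from 2 to 4 nodes doubles the projection to 200,000 transactions per second. Treat the result as an upper bound for planning rather than a promise.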
Cassandra is based on 2 core technologies: Google's Bigtable and Amazon's Dynamo. Facebook built Cassandra to power their Inbox Search feature. It was released as an open source project on Google Code and then incubated at Apache. Nowadays, Cassandra is an Apache Top-Level Project. Currently there are 2 versions of Cassandra:
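Since Cassandra is a distributed system, it is subject to the CAP Theorem, which states that a distributed system can provide only two of the following three guarantees across a write/read pair: consistency (a read returns the most recent write), availability (every request to a working node gets a response), and partition tolerance (the system keeps operating when the network between nodes fails).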
Cassandra is also a BASE (Basically Available, Soft state, Eventually consistent) system rather than an ACID (Atomicity, Consistency, Isolation, Durability) system. BASE is optimistic: it accepts that database consistency will be in a state of flux. ACID is pessimistic: it forces consistency at the end of every transaction.
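To get a feel for what "eventually consistent" means, here is a toy sketch in plain Python. It is not how Cassandra is implemented and uses no Cassandra API; the `Replica` class and the `synchronize` function are hypothetical names I made up, and the "newest timestamp wins" rule is a simplifying assumption.

```python
class Replica:
    """A toy in-memory replica that keeps the newest value per key."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        # Keep the incoming value only if it is newer than what we already hold.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def synchronize(replicas):
    """Bring every replica up to date with the newest version of each key."""
    newest = {}
    for replica in replicas:
        for key, (timestamp, value) in replica.data.items():
            if key not in newest or timestamp > newest[key][0]:
                newest[key] = (timestamp, value)
    for replica in replicas:
        for key, (timestamp, value) in newest.items():
            replica.write(key, value, timestamp)


replica_a, replica_b = Replica("a"), Replica("b")

# The write is accepted by one replica only, so the system stays available.
replica_a.write("user:1", "Rene", timestamp=1)

# Soft state: the other replica has not seen the write yet.
stale = replica_b.read("user:1")

# Once the replicas exchange data, they converge on the same value.
synchronize([replica_a, replica_b])
fresh = replica_b.read("user:1")

print("before sync:", stale)
print("after sync:", fresh)
```

The read before synchronization misses the write, and the read after it sees the value. An ACID system would not acknowledge the write until readers were guaranteed to see it.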
Cassandra stores data according to the column family data model where:
We also need to understand the basic architecture of Cassandra, which has the following key structures:
Hopefully this will help you understand the basic Cassandra concepts. In the next post, I will go over the architecture concepts of what a Seed node is, the purpose of the Snitch and topologies, the Coordinator node, replication factors, etc.
Related Blog Posts André Araújo, a great friend of mine and previous Pythianite, wrote about his first experience with Cassandra. The original post was published on René Antúnez’s blog.