Cassandra 101: Understanding Cassandra | Pythian Blog

Written by Rene Antunez | Mar 16, 2015 4:00:00 AM

An Introduction to Apache Cassandra

As some of you may know, in my current role at Pythian I am tackling OSDB, and Cassandra is on my radar. One of the things I have been trying to do is learn what Cassandra is, so in this post, I'm going to share a bit of what I have been able to learn.

According to the whitepaper "Solving Big Data Challenges for Enterprise Application Performance Management", Cassandra is a "distributed key value store developed at Facebook. It was designed to handle very large amounts of data spread out across many commodity servers while providing a highly available service without single point of failure allowing replication even across multiple data centers as well as for choosing between synchronous or asynchronous replication for each update."

Cassandra, in layman's terms, is a NoSQL database written in Java. One of Cassandra's many benefits is that it's an open source DB with strong developer support. It is also a fully distributed DB, meaning that there is no master node (unlike Oracle or MySQL), so the database has no single point of failure. It also touts being linearly scalable, meaning that if you have 2 nodes and a throughput of 100,000 transactions per second, and you add 2 more nodes, you would get roughly 200,000 transactions per second, and so forth.
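To make the linear scalability claim concrete, here is a minimal Python sketch of the arithmetic. The function name projected_throughput is hypothetical, and the model is the idealized one from the example above; real clusters will deviate from it depending on workload and hardware.

```python
def projected_throughput(base_nodes: int, base_tps: float, new_nodes: int) -> float:
    """Idealized linear scaling: throughput grows in proportion to node count."""
    if base_nodes <= 0 or new_nodes <= 0:
        raise ValueError("node counts must be positive")
    per_node_tps = base_tps / base_nodes  # assumed constant per node
    return per_node_tps * new_nodes


# 2 nodes at 100,000 TPS, scaled out to 4 nodes
print(projected_throughput(2, 100_000, 4))  # 200000.0
```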

Cassandra is based on 2 core technologies: Google's Bigtable and Amazon's Dynamo. Facebook originally built Cassandra to power their Inbox Search feature. It was released as an open source project on Google Code and then incubated at Apache; nowadays, Cassandra is an Apache top-level project. Currently there are 2 editions of Cassandra:

Community Edition: Distributed under the Apache™ License
Enterprise Edition: Distributed by DataStax

Theoretical Foundations: CAP Theorem and BASE vs. ACID

Since Cassandra is a distributed system, it is subject to the CAP Theorem, which states that a distributed system can provide only two of the following three guarantees across a write/read pair:

Consistency: A read is guaranteed to return the most recent write for a given client.
Availability: A non-failing node will return a reasonable response within a reasonable amount of time (no error or timeout).
Partition Tolerance: The system will continue to function when network partitions occur.

Cassandra is also a BASE (Basically Available, Soft state, Eventually consistent) type system, not an ACID (Atomicity, Consistency, Isolation, Durability) type system. A BASE system is optimistic and accepts that database consistency will be in a state of flux, whereas an ACID system is pessimistic and forces consistency at the end of every transaction.
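The "state of flux" idea can be illustrated with a toy Python simulation. This is only a sketch under stated assumptions, not how Cassandra is implemented: the Replica and ToyCluster classes are hypothetical, a write is acknowledged after reaching a single replica, and conflicts are resolved by keeping the value with the highest logical timestamp.

```python
class Replica:
    """One copy of the data; stores key -> (timestamp, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, timestamp, value):
        # Keep only the newest value (last write wins, an assumption of this sketch).
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


class ToyCluster:
    """Acknowledges a write after one replica has it; the rest catch up later."""

    def __init__(self, replica_names):
        self.replicas = [Replica(n) for n in replica_names]
        self.pending = []  # updates not yet delivered to the other replicas
        self.clock = 0     # logical timestamp

    def write(self, key, value):
        self.clock += 1
        first, *rest = self.replicas
        first.apply(key, self.clock, value)
        for replica in rest:
            self.pending.append((replica, key, self.clock, value))

    def propagate(self):
        # Deliver all outstanding updates; afterwards the replicas agree.
        for replica, key, timestamp, value in self.pending:
            replica.apply(key, timestamp, value)
        self.pending.clear()

    def read_all(self, key):
        return {r.name: r.read(key) for r in self.replicas}


cluster = ToyCluster(["a", "b", "c"])
cluster.write("greeting", "hello")
before = cluster.read_all("greeting")  # replicas disagree for a while
cluster.propagate()
after = cluster.read_all("greeting")   # replicas have converged
print(before)
print(after)
```

Between write and propagate, a read can return either the new value or nothing, depending on which replica answers. An ACID system would not expose that intermediate state.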

The Cassandra Data Model: Keyspaces and Column Families

Cassandra stores data according to the column family data model where:

Keyspace is the container for your application data, similar to a schema in a relational database. Keyspaces are used to group column families together, and each data object belongs to a single keyspace. Typically, a cluster has one keyspace per application. The keyspace also defines the replication strategy.
Column Family is a set of one or more rows with a similar structure.
Row is a collection of sorted columns. It is the smallest unit that stores related data in Cassandra, and any component of a Row can store data or metadata.
Row Key uniquely identifies a row in a column family.
Column Key uniquely identifies a column value in a row.
Column Value stores one value or a collection of values.
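One way to picture the hierarchy above is as nested maps: a keyspace maps column family names to rows, and each row maps column keys to column values. The Python sketch below is only a mental model, not Cassandra's storage format; the helpers put and get_row and the sample data are hypothetical.

```python
def put(keyspace, column_family, row_key, column_key, value):
    """Store a column value under keyspace -> column family -> row key -> column key."""
    rows = keyspace.setdefault(column_family, {})
    columns = rows.setdefault(row_key, {})
    columns[column_key] = value


def get_row(keyspace, column_family, row_key):
    """Return the row's columns sorted by column key, mirroring 'sorted columns'."""
    columns = keyspace.get(column_family, {}).get(row_key, {})
    return sorted(columns.items())


app_keyspace = {}  # one keyspace per application
put(app_keyspace, "users", "rene", "role", "dba")
put(app_keyspace, "users", "rene", "email", "rene@example.com")
put(app_keyspace, "users", "andre", "email", "andre@example.com")

print(get_row(app_keyspace, "users", "rene"))
```

Note that the two rows in the hypothetical users column family do not need identical columns; they only need a similar structure.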

Physical and Logical Infrastructure: Nodes, Racks, and Clusters

We also need to understand the basic architecture of Cassandra, which has the following key structures:

Node is one Cassandra instance and is the basic infrastructure component in Cassandra. Cassandra assigns data to nodes in the cluster; each node is assigned a part of the database based on the Row Key. A node usually corresponds to a host, but not necessarily, especially in Dev or Test environments.
Rack is a logical set of nodes.
Data Center is a logical set of Racks. A data center can be a physical data center or a virtual data center. Replication is set by data center.
Cluster contains one or more data centers and is the full set of nodes which map to a single complete token ring.
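The idea that each node owns a part of the database based on the Row Key, and that the nodes together form a token ring, can be sketched in Python as follows. This is a simplified illustration, not Cassandra's partitioner: the TokenRing class and token_for function are hypothetical, MD5 is only a stand-in hash, and each node is assumed to own a single token derived from its name.

```python
import bisect
import hashlib


def token_for(text: str) -> int:
    """Map a string to a position on the ring (MD5 is a stand-in hash here)."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class TokenRing:
    """Nodes sit at token positions; a row belongs to the next node clockwise."""

    def __init__(self, node_names):
        if not node_names:
            raise ValueError("a ring needs at least one node")
        # Assumption: one token per node, derived from the node's name.
        self._ring = sorted((token_for(name), name) for name in node_names)
        self._tokens = [token for token, _ in self._ring]

    def node_for(self, row_key: str) -> str:
        # First node whose token is >= the row's token, wrapping around the ring.
        index = bisect.bisect_left(self._tokens, token_for(row_key))
        return self._ring[index % len(self._ring)][1]


nodes = ["node1", "node2", "node3"]
ring = TokenRing(nodes)
for key in ["rene", "andre", "pythian"]:
    print(key, "->", ring.node_for(key))
```

The same Row Key always lands on the same node, which is what lets any node work out where a given row lives without consulting a master.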

Conclusion and Further Learning

Hopefully this will help you understand the basic Cassandra concepts. In the next post, I will go over the architecture concepts of what a Seed node is, the purpose of the Snitch and topologies, the Coordinator node, replication factors, etc.

Learn more about Pythian's Cassandra Services or get a free Cassandra assessment today.

Related Blog Posts André Araújo, a great friend of mine and previous Pythianite, wrote about his first experience with Cassandra. The original post was published on René Antúnez’s blog.
