Cassandra as a time series database
Reasons
Purpose
Whisper (Graphite), RRDtool, Ceres and Prometheus are special-purpose databases that were designed for, or are part of, existing monitoring and metrics systems; they are not general purpose. If you are building a monitoring and metrics system to keep track of the performance of your server farm, they are obvious candidates. If you are tracking freeway sensor or fitness tracker data, probably not so much.
Cost
Some of the databases in the list are proprietary licensed products. Nothing wrong with that, but it does mean a potentially significant upfront cost and, to be honest, in a world where open source is so popular this becomes a major negative for many organizations. Informix, InfluxDB Enterprise and DataStax Enterprise Cassandra (DSE) lose on this score.
Scalability
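How much scalability matters comes down to volume. The following is a back-of-the-envelope sketch, not a sizing guide. It assumes three metrics per server (processor, memory and disk utilization) sampled once a minute, reads "a couple dozen" as 24 servers, and uses a hypothetical 16 bytes per stored point (an 8-byte timestamp plus an 8-byte value, ignoring storage overhead, replication and compression).

```python
MINUTES_PER_DAY = 24 * 60


def points_per_day(servers, metrics_per_server=3, samples_per_minute=1):
    """Data points written per day for a fleet of servers."""
    return servers * metrics_per_server * samples_per_minute * MINUTES_PER_DAY


def raw_bytes_per_year(servers, bytes_per_point=16):
    """Raw payload per year; bytes_per_point is an assumed figure."""
    return points_per_day(servers) * 365 * bytes_per_point


for label, servers in [("couple dozen", 24), ("one hundred thousand", 100_000)]:
    daily = points_per_day(servers)
    yearly_gb = raw_bytes_per_year(servers) / 1e9
    print(f"{label}: {daily:,} points/day, ~{yearly_gb:,.1f} GB/year raw")
```

Under these assumptions the small fleet writes about a hundred thousand points a day and well under a gigabyte a year, while the large fleet writes 432 million points a day and roughly 2.5 TB a year before any overhead or replication.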
If you are tracking the processor, memory and disk utilization of a couple dozen servers, you don’t need much disk space to store many years’ worth of time series data, and you don’t much care about scalability. In that case Graphite and the community version of InfluxDB will probably work just fine. But if you want to track server metrics from one hundred thousand virtual servers with one-minute granularity, you probably need a really scalable platform. Then you are going to need InfluxDB Enterprise, Riak TS, OpenTSDB or, can I say it now? Cassandra.
Availability
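The contrast this section draws between master-slave and multi-master designs can be illustrated with a toy model. This sketch is not how any of the databases named here is implemented: the `Node` class, both write functions and the outage window are hypothetical, and real systems add failover, buffering and repair that the model leaves out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    up: bool = True
    points: list = field(default_factory=list)


def write_master_slave(nodes, point):
    """Only nodes[0] (the master) accepts writes; live replicas copy them."""
    if not nodes[0].up:
        return False  # rejected until a failover promotes a replica
    for node in nodes:
        if node.up:
            node.points.append(point)
    return True


def write_multi_master(nodes, point):
    """Any live node can accept the write."""
    live = [node for node in nodes if node.up]
    if not live:
        return False
    for node in live:
        node.points.append(point)
    return True


def simulate(write, total_points=10, outage=range(3, 7)):
    """Send one point per tick; node 'a' is down during the outage ticks."""
    nodes = [Node("a"), Node("b"), Node("c")]
    accepted = 0
    for tick in range(total_points):
        nodes[0].up = tick not in outage
        if write(nodes, tick):
            accepted += 1
    return accepted


print("master-slave accepted:", simulate(write_master_slave))
print("multi-master accepted:", simulate(write_multi_master))
```

In this toy run the master-slave store rejects the four points that arrive while its master is down, and the multi-master store accepts all ten. With a continuously incoming stream, every rejected point has to be buffered somewhere else or is lost.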
If you are monitoring a large number of somethings, you probably want to monitor them all the time, and an outage of your metrics database is going to be a BAD thing. So you want a highly available database. The best highly available systems are those designed from the ground up to be highly available, where you can write to or read from any node in the system and get meaningful results. Master-slave architectures, no matter how good, are just not as effective at handling a continuously incoming stream of data as multi-master databases are. With that in mind, you are limited to Riak TS, OpenTSDB (sort of) and Cassandra. All three are based on similar concepts involving a cluster of distributed nodes, although the implementations vary quite a bit.
Ease of use
Graphite and open source InfluxDB are incredibly easy to set up and use. Although they have their own API languages for accessing the data, which you have to learn, they provide powerful aggregates that are very useful for time series data. The Influx query language is SQL-like. Graphite doesn’t really support a query language outside of the graphical UI beyond a basic CSV export function. Cassandra and Riak TS are fairly easy to set up (more difficult primarily in that they are distributed in nature) and they offer an SQL-like language to access the data, although they do not have some of the very nice aggregates available in Graphite and InfluxDB. They are also a bit more general purpose and can be used for other things. OpenTSDB is based on Hadoop and HBase, requiring that you set up a Hadoop cluster and then install HBase and ZooKeeper. Finally, you add the OpenTSDB daemons. This is a lot of work to get your time series database up and running. OpenTSDB also uses its own proprietary query language which, although very powerful for time series data, is nothing like SQL.
Everyone is using it
Now we get to the more subjective, warm and fuzzy (pun intended) logic for deciding which database to use. If you are a big data user with Hadoop, ZooKeeper and HBase already running, OpenTSDB is the obvious choice. If you are a DevOps person with experience building monitoring systems, you are probably going to lean in the direction of Graphite, InfluxDB or Prometheus, at least until you run into scaling issues. If you know Erlang and Riak, you might try Riak TS (there are not so very many of us in that group and, if you haven’t heard, the company that developed Riak went bankrupt last year). If you don’t have DevOps experience, or the scale of your project eliminates InfluxDB and Graphite, and you don’t have a Hadoop cluster ready to go, then after a quick search of the internet or a chat with your buddy over at xyz organization you are going to hear about the wonders of Cassandra for storing time series data.
Conclusion
Cassandra is not purpose-built for time series data. In fact, its lack of aggregates can make the choice an odd one at times. But it accepts rapid writes (in fact, writes are usually an order of magnitude faster than reads), it scales to really large clusters with lots of disk space, and its multi-master design lets you write and read anywhere, even across geographically separate locations. With lots of other people doing the same, choosing it makes a lot of sense.
The mandatory chart