Introduction
I’ve been working with data in many forms for my entire career. During this time, I have occasionally needed to build or query existing databases to get statistical data. Traditional databases are usually designed to query specific data from the database quickly and support transactional operations. While the capability is there to perform statistical aggregates, they are not usually very fast. Especially when you are working with databases containing hundreds of millions of rows. To support purely analytic access to a database the column store was invented. See this link in Wikipedia for more on the history: https://en.wikipedia.org/wiki/Column-oriented_DBMSWhat makes a column store?
Most traditional databases store data in the form of records. Each record contains a set of fields containing values that are related to each other. In a relational database, records are rows and fields are columns. In general, what we get is something like this table:![](https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/RowStore-2018-03-16.png?width=616&height=149&name=RowStore-2018-03-16.png)
![](https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/columnStore-2018-03-18.png?width=614&height=195&name=columnStore-2018-03-18.png)
The two kinds of column stores
It turns out that there are really two very different kinds of database that can be referred to as a column store. Cassandra and HBase store each column separately for reasons very different from other column stores. As a result the performance is also very different. The HBase and Cassandra column store concept does not enable rapid analytics. Instead they store individual columns clumped together to support their sparse table (big table) model and fast writes. When you hear or read references to these kinds of databases as column stores remember they aren’t going to give you the same kind of performance analytics. Since both Hbase and Cassandra are so very poor at doing analytics, they don't support much in the way of built in aggregates at all.Some examples of specific open source column stores
While there are many examples of proprietary column stores since I am deeply engaged in the open source world my examples come primarily from the open source world.MariaDB AX
MariaDB AX is a relative newcomer to the column store world although the source code they are using has been around for quite a while. The MariaDB corporation adopted the codebase from Infinidb: https://en.wikipedia.org/wiki/InfiniDB MariaDB AX uses the storage engine interface to allow users to use the MySQL SQL dialect against the distributed column store. One of the nice features of using MariaDB AX is you can also create tables using the Transactional InnoDB storage engine and mix both transactional and analytic workloads on the same database. Although you won’t be doing them on the same tables.![](https://www.pythian.com/hs-fs/hubfs/Imported_Blog_Media/MariaDBAX_2018-03-16.png?width=609&height=251&name=MariaDBAX_2018-03-16.png)
Parquet
Parquet isn’t a database. Instead it’s a file format which can be used to store database tables on distributed file systems like HDFS, CEPH or AWS S3. Data is stored in Parquet in chunks that contain blocks of column data in a fashion that makes it possible to break up a Parquet file. Storing the file on many distributed hosts while allowing it to be processed in parallel. You can access Parquet files using Apache Spark, Hive, Pig, Apache Drill and Cloudera’s Impala. For more information on Parquet: https://parquet.apache.org/documentation/latest/Clickhouse
Clickhouse was created by the Russian company Yandix to support their own internal requirements to provide analytics to their customers. Unlike MariaDB AX or Parquet it has its own dialect of SQL and its own APIs. Nevertheless, it is rapidly becoming very popular because it is very fast. For more information on Clickhouse: https://en.wikipedia.org/wiki/ClickHouseCitus Data
The Citus Data corporation builds extensions to PostgreSQL. One, which they are best known for, is a distributed cluster extension that turns PostgreSQL into a distributed cluster database. The other is a Hybrid column store feature which allows you to choose to make some tables within your PostgreSQL database column oriented while allowing you to keep others in the older transaction friendly model. For more information on Citus Data the company: https://www.citusdata.com/download/ For more information on the Citus Data Column store: https://www.citusdata.com/blog/2014/04/03/columnar-store-for-analytics/Greenplum
Greenplum has been in the data warehouse and business intelligence business for quite a while. Their product like Citus Data is based on PostgreSQL. Albeit a much older version of PostgreSQL. They offer both a proprietary version and an open source community version of their product. For more information on Greenplum see: https://en.wikipedia.org/wiki/GreenplumAll the rest
The ones I have listed here are the ones I have dealt with in the last couple of years. There are a bunch of database technologies in the column store category I have not mentioned here but you can find out about them here: https://en.wikipedia.org/wiki/List_of_column-oriented_DBMSesConclusion
Column store databases store data in columns instead of rows. They make it possible to compute statistics on those columns one to two orders of magnitude or more, faster than on traditional row-oriented databases. A column-oriented table is very good for analytics but usually terrible for traditional transactional workloads. Most, although not all, column stores are designed to operated on a distributed cluster of servers.Share this
You May Also Like
These Related Stories
Datascape podcast episode 24 – MariaDB and the MySQL ecosystem
Datascape podcast episode 24 – MariaDB and the MySQL ecosystem
Mar 29, 2018
1
min read
Hadoop FAQ - But What About the DBAs?
Hadoop FAQ - But What About the DBAs?
Jan 25, 2013
5
min read
What you should know about Docker Containers for Oracle Data Integrator
What you should know about Docker Containers for Oracle Data Integrator
Aug 1, 2017
12
min read
No Comments Yet
Let us know what you think