You’ve probably heard the expression, “Data is the new oil”. That’s because data is at the heart of every organization’s decision making. However, the data you need isn’t always located in one place, such as a single, plain RDBMS. More often, it is scattered across different RDBMSs and other data silos.
So it is not uncommon to come across situations where data sits in all kinds of different sources, and the DBA or data engineer needs to implement a solution that pulls all the relevant data into a centralized location for further analysis.
Picture 1: The many possible data sources
Recently, one of our customers had sales data in an AWS RDS for PostgreSQL instance and needed that data (a set of tables, to be more precise) replicated, almost in real time, into tables in SQL Server running on an Azure VM. Unfortunately, SQL Server’s heterogeneous replication is deprecated and only works between SQL Server and Oracle; Postgres is not in that club. Moreover, there is no native heterogeneous replication on the Postgres side either. It can only be achieved with third-party tools, or manually and with high effort: building a CDC (change data capture) mechanism with triggers and then pulling the changes from control tables.
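To give a sense of the effort, such a hand-rolled pipeline would need triggers on every source table plus a polling job along the lines of the hypothetical Python sketch below (every table, column and connection string here is made up purely for illustration; this is the approach we decided not to pursue):

```python
# Hypothetical sketch of the "manual CDC" approach: triggers on the Postgres
# side would write every change into a control table (here called
# sales_changes), and this job would poll it and replay the rows into
# SQL Server. All names and credentials are placeholders.
import time

import psycopg2   # Postgres driver
import pyodbc     # SQL Server driver

PG_DSN = "host=pg-host dbname=sales user=repl password=secret"
MSSQL_DSN = ("DRIVER={ODBC Driver 17 for SQL Server};"
             "SERVER=sql-host;DATABASE=sales;UID=repl;PWD=secret")

def poll_once(pg_conn, ms_conn):
    """Read pending changes from the control table and apply them to SQL Server."""
    with pg_conn.cursor() as cur:
        cur.execute("SELECT change_id, operation, order_id, amount "
                    "FROM sales_changes ORDER BY change_id LIMIT 1000")
        rows = cur.fetchall()

    ms_cur = ms_conn.cursor()
    for change_id, operation, order_id, amount in rows:
        if operation in ("INSERT", "UPDATE"):
            # Upsert kept deliberately simple for the sketch
            ms_cur.execute("UPDATE dbo.orders SET amount = ? WHERE order_id = ?",
                           amount, order_id)
            if ms_cur.rowcount == 0:
                ms_cur.execute("INSERT INTO dbo.orders (order_id, amount) VALUES (?, ?)",
                               order_id, amount)
        elif operation == "DELETE":
            ms_cur.execute("DELETE FROM dbo.orders WHERE order_id = ?", order_id)
    ms_conn.commit()

    # Only discard changes once they have been applied on the destination
    if rows:
        with pg_conn.cursor() as cur:
            cur.execute("DELETE FROM sales_changes WHERE change_id <= %s",
                        (rows[-1][0],))
        pg_conn.commit()

if __name__ == "__main__":
    pg = psycopg2.connect(PG_DSN)
    ms = pyodbc.connect(MSSQL_DSN)
    while True:            # naive polling loop; scheduling, error handling and
        poll_once(pg, ms)  # restart logic are exactly the "high effort" part
        time.sleep(5)
```

Even this simplified version hides a lot: ordering guarantees, error handling, schema changes and the initial load would all have to be solved by hand, for every table.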
It sounded fun to implement, but time was of the essence. Developing such a solution would take quite some time.
In short, another solution had to be used to achieve the required data movement.
Picture 1 shows a possible scenario, but it doesn’t show the solution, that is, which tool could be used as the pipeline. The Big Data ecosystem has many tools that could arguably do the job, but none of them offered the advantages that Kafka does.
Originally, Apache Kafka was designed as a message-queue-like application (remember IBM MQ Series, or even SQL Server Service Broker?). The best definition of what it has become comes from Confluent: “Apache Kafka is a community distributed event streaming technology capable of handling trillions of events a day”. From a DBA’s point of view: “Well, streaming must be like replication, but massive!” That is accurate, but a streaming system also needs the ability to ingest data from many different sources and deliver its messages to a variety of endpoints.
There are many tools we could choose from to achieve this: Azure Event Hubs, AWS Kinesis, or even Kafka managed in Confluent Cloud. However, since they all share a similar background (the Apache Kafka open-source code) or at least the same goal, and for the sake of learning how things work under the hood, we decided to show the customer how to set up the environment from the ground up, without any managed services for the stream processing.
So, the previous picture would look something like this:
Picture 2: Using Kafka to handle many sources and deliver the message to one (or many) destinations
As a brief introduction, the basic Apache Kafka components are:
- Broker: a Kafka server; brokers are clustered together to store and serve the event streams.
- Topic: a named stream of events to which messages are published; topics are split into partitions for scalability.
- Producer: a client that writes (publishes) events to topics.
- Consumer: a client that reads (subscribes to) events from topics, usually as part of a consumer group.
- ZooKeeper: the service that coordinates broker metadata and cluster state.
- Kafka Connect: the framework that moves data between Kafka and external systems through source and sink connectors.
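To make those components concrete, here is a minimal producer/consumer round trip. It assumes a broker listening on localhost:9092 and the confluent-kafka Python client; the topic name and messages are only for illustration:

```python
# Minimal illustration of the core Kafka components: a producer publishes
# events to a topic, and a consumer (in a consumer group) reads them back.
# Assumes a broker on localhost:9092 and `pip install confluent-kafka`.
from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "localhost:9092"})
for i in range(3):
    # Key and value are plain strings/bytes; serialization is up to the client
    producer.produce("demo-topic", key=str(i), value=f"event number {i}")
producer.flush()  # block until the broker has acknowledged the messages

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "demo-group",         # consumers sharing a group.id share partitions
    "auto.offset.reset": "earliest",  # start from the beginning if no offset is stored
})
consumer.subscribe(["demo-topic"])

try:
    for _ in range(10):
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        print(f"key={msg.key().decode()}, value={msg.value().decode()}")
finally:
    consumer.close()
```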
In conclusion, Kafka is the tool of choice for this task, and we’re going to create the entire environment for the data movement from scratch. For more information, see the official documentation: https://kafka.apache.org/documentation.
In order to achieve the streaming functionality and see the data flow in action, we will use a set of tables in a PostgreSQL database as the source, a set of tables in SQL Server as the destination, and Apache Kafka acting as the “replication mechanism”. To connect to Postgres and detect its data changes (CDC), a connector plugin is also needed. In this case we will use Debezium, as it is also open source and built from the ground up to work with Kafka as easily as possible.
Picture 3: Moving data from Postgres to MS SQL in a nutshell
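As a preview of where the series is heading, registering the Debezium Postgres source connector is just an HTTP POST to the Kafka Connect REST API. The sketch below uses Python’s requests library; hostnames, credentials and table names are placeholders, and the exact property set depends on the Debezium version (for example, the 2.x releases use topic.prefix instead of database.server.name):

```python
# Hedged sketch: register a Debezium PostgreSQL source connector through the
# Kafka Connect REST API (listening on port 8083 by default). All connection
# details below are placeholders for illustration only.
import requests

connector = {
    "name": "sales-postgres-source",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",            # logical decoding plugin shipped with Postgres 10+
        "database.hostname": "pg-host",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "secret",
        "database.dbname": "sales",
        "database.server.name": "pg-sales",   # topic prefix (topic.prefix in Debezium 2.x)
        "table.include.list": "public.orders,public.order_items",
    },
}

resp = requests.post("http://connect-host:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())  # Kafka Connect echoes back the created connector definition
```

Once the connector is running, each table in table.include.list gets its own topic (for example pg-sales.public.orders), and those topics are what the destination side will consume.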
Diving a bit more into the details, by the end of this blog post series we should have:
- a VM running PostgreSQL, hosting the source tables;
- a VM running Apache Kafka through the Confluent Platform, including Kafka Connect;
- the Debezium connector capturing changes (CDC) from Postgres and publishing them to Kafka topics;
- a SQL Server instance holding the destination tables, kept in sync by a sink connector reading those topics.
Therefore, our architecture should look a little bit more like Picture 4.
Picture 4: Detailing the architecture
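On the delivery side of Picture 4, the Kafka-to-SQL Server hop is typically handled by a sink connector. As an assumption for now (the exact sink configuration will be covered later in the series), here is what registering Confluent’s JDBC sink connector could look like, again with placeholder names:

```python
# Hedged sketch: a JDBC sink connector that writes the change topics into
# SQL Server tables. Connector choice, topic names and credentials are
# assumptions for illustration; the series will detail the real configuration.
import requests

sink = {
    "name": "sales-sqlserver-sink",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "connection.url": "jdbc:sqlserver://sql-host:1433;databaseName=sales",
        "connection.user": "repl",
        "connection.password": "secret",
        "topics": "pg-sales.public.orders,pg-sales.public.order_items",
        "insert.mode": "upsert",   # apply updates as upserts on the primary key
        "pk.mode": "record_key",   # use the Kafka record key as the primary key
        "auto.create": "true",     # let the connector create missing tables
    },
}

resp = requests.post("http://connect-host:8083/connectors", json=sink)
resp.raise_for_status()
print(resp.json())
```

In practice, Debezium’s change-event envelope usually needs to be flattened first (for example with Debezium’s ExtractNewRecordState transform) before a plain JDBC sink can apply it; details like that are exactly what the rest of the series will walk through.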
Bear in mind that, in order to achieve the required goal (replicating data from Postgres into SQL Server using Kafka), we could simply use a Kafka-as-a-service product from any of the major cloud providers. However, to fully understand how the architecture operates, and simply as a showcase, we decided not to use any SaaS product, nor Docker, in this series. You may find a new series on this blog in the near future that covers the cloud services. With expectations set, we’re ready to proceed.
Let’s begin by installing our first VM. I’m using a Windows PC for all my testing, so I’m creating the VM in Hyper-V; feel free to use whichever hypervisor you prefer. My choice of OS is Red Hat Enterprise Linux 8.5, via the Red Hat Developer edition. You will need to sign up for the Red Hat Developer Portal: https://developers.redhat.com/products/rhel/overview.
In the next post we will install and configure the Postgres database, create a small dataset, and install the Confluent software required for Apache Kafka to work. So stay tuned for updates!