This article continues our deep dive into the GoldenGate Big Data adapters, specifically focusing on the Kafka adapter. If you missed the previous entries, check out Part 1 on HDFS and Part 2 on Flume.
Kafka is a distributed publish-subscribe messaging system. A common question is: how does it differ from Flume? Flume moves data through pre-built sources, sinks, and interceptors, while Kafka is a general-purpose system: the destination receives exactly what you publish at the source, and most of the control lives in the consumer programs you write. In this guide, we will use Kafka and Flume together to leverage the strengths of both.
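To see this pass-through behavior for yourself, you can read a topic directly with the console consumer that ships with Kafka. This is a sketch, assuming the sandbox host and oggtopic topic used later in this article, and the older Zookeeper-based consumer flags of that Kafka era:

```shell
# Print everything published to the topic so far, exactly as produced
bin/kafka-console-consumer.sh --zookeeper sandbox:2181 \
    --topic oggtopic --from-beginning
```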
Our configuration remains streamlined to focus on the adapter functionality:
To begin, copy the necessary configuration files from the adapter examples directory into your GoldenGate directory.
cp $OGG_HOME/AdapterExamples/big-data/kafka/* $OGG_HOME/dirdat/
In dirprm/kafka.props, you must define the Kafka topics for data flow and schema changes, as well as the Java classpath for the Kafka and Avro classes.
# Snippet of dirprm/kafka.props
gg.handlerlist = kafkahandler
gg.handler.kafkahandler.type = kafka
gg.handler.kafkahandler.TopicName = oggtopic
gg.handler.kafkahandler.SchemaTopicName = mySchemaTopic
gg.handler.kafkahandler.format = avro_op
gg.classpath = dirprm/:/u01/kafka/libs/*:/usr/lib/avro/*
Next, edit custom_kafka_producer.properties to point to your Kafka service and define compression settings.
bootstrap.servers=sandbox:9092
acks=1
compression.type=gzip
Kafka requires Zookeeper to manage topics. For this setup, we assume Zookeeper is running on port 2181 and Kafka is in standalone mode.
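If Zookeeper and Kafka are not already running, the standalone scripts bundled with the Kafka distribution can start both. This assumes you launch from the Kafka installation directory with the default configuration files:

```shell
# Start Zookeeper (listens on 2181 by default), then a single broker
bin/zookeeper-server-start.sh config/zookeeper.properties &
bin/kafka-server-start.sh config/server.properties &
```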
While OGG can create topics automatically, creating them manually allows for custom partitions and replication factors.
bin/kafka-topics.sh --zookeeper sandbox:2181 --create --topic oggtopic --partitions 1 --replication-factor 1
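You can then confirm the topic exists and check its partition and replication settings:

```shell
# Describe the topic created above
bin/kafka-topics.sh --zookeeper sandbox:2181 --describe --topic oggtopic
```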
Because Kafka is a streaming system, it doesn't natively "save" files to HDFS. We use Flume as a consumer to pick up data from Kafka topics and write them to HDFS in Avro format.
We define two sources (one for data, one for schema) and two sinks pointing to the same HDFS location.
agent.sources.ogg1.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.ogg1.topic = oggtopic
agent.sinks.hdfs1.hdfs.path = hdfs://sandbox/user/oracle/ggflume
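For context, a fuller sketch of one source/channel/sink chain of such an agent is shown below. The channel name, zookeeperConnect value, and fileType setting are assumptions based on the standard Flume Kafka source and HDFS sink, not taken from the article's actual file; the schema chain (ogg2/ch2/hdfs2) would mirror this one with the schema topic:

```
agent.sources = ogg1 ogg2
agent.channels = ch1 ch2
agent.sinks = hdfs1 hdfs2

agent.sources.ogg1.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.ogg1.zookeeperConnect = sandbox:2181
agent.sources.ogg1.topic = oggtopic
agent.sources.ogg1.channels = ch1

agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.hdfs.path = hdfs://sandbox/user/oracle/ggflume
agent.sinks.hdfs1.hdfs.fileType = DataStream
agent.sinks.hdfs1.channel = ch1

agent.channels.ch1.type = memory
```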
With the infrastructure ready, we can run a passive replicat for the initial data load.
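If the replicat has not been registered yet, a minimal GGSCI sketch looks like the following; the trail path assumes the sample trail shipped with the adapter examples, so adjust it to your own trail location:

```
GGSCI> add replicat rkafka, exttrail AdapterExamples/trail/tr
GGSCI> start replicat rkafka
```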
Upon running the replicat, Flume generates files in HDFS. Typically, schema definitions for each table are stored in separate files, while data changes are bundled together.
For active replication, ensure your replicat points to the correct trail file sequence. If your extract has progressed, you may need to alter the replicat position.
GGSCI> alter replicat rkafka, EXTSEQNO 43
GGSCI> start replicat rkafka
To verify the pipeline, we perform DML and DDL operations on the Oracle source and monitor the HDFS output.
Performing an INSERT or UPDATE results in new files on HDFS containing the operation flag and the data values. For updates, both the old and new values are captured, allowing for full change reconstruction.
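To inspect what landed, you can list and dump the Flume output directory on HDFS. The path matches the sink configuration shown earlier; the FlumeData wildcard assumes Flume's default file naming:

```shell
# List the files Flume has rolled out, then dump their contents
hdfs dfs -ls /user/oracle/ggflume
hdfs dfs -cat /user/oracle/ggflume/FlumeData.*
```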
DDL changes behave slightly differently:
A TRUNCATE command might not trigger a new file on HDFS until subsequent DML occurs.

In testing with JMeter, the combination of GoldenGate, Kafka, and Flume easily sustained 225 transactions per second (30% inserts, 70% updates) with negligible lag.
While this test is a simple standalone scenario, it demonstrates that the adapters are robust and stable. In a production environment, you would likely integrate Kafka with a processing engine like Spark or Storm, but for raw data movement to a Big Data lake, this setup is highly effective.
Stay tuned for our next post, where we will explore the HBase adapter.