GoldenGate 12.2 big data adapters: part 3 - Kafka
This article continues our deep dive into the GoldenGate Big Data adapters, specifically focusing on the Kafka adapter. If you missed the previous entries, check out Part 1 on HDFS and Part 2 on Flume.
1. What is Kafka?
Kafka is a distributed publish-subscribe messaging system. A common question is: how does it differ from Flume? Flume moves data through pre-built sources, sinks, and interceptors, while Kafka is a general-purpose system: the destination receives exactly what you put in at the source, and most of the processing logic lives in the consumer programs you build. In this guide, we will actually use Kafka and Flume together to leverage the strengths of both.
2. Environment and Architecture
Our configuration remains streamlined to focus on the adapter functionality:
- Source: Oracle Database with OGG 12.2 Integrated Extract capturing schema changes.
- Target: Oracle GoldenGate for Big Data installed on a Linux box, receiving changes via trail files.
3. Configuring the Kafka Adapter
To begin, copy the sample configuration files from the adapter examples directory into the GoldenGate dirprm directory (the properties files are referenced from dirprm later on):
cp $OGG_HOME/AdapterExamples/big-data/kafka/* $OGG_HOME/dirprm/
Adjusting kafka.props
You must define the Kafka/Zookeeper topics for data flow and schema changes, as well as the Java classpath for Kafka and Avro classes.
# Snippet of dirprm/kafka.props
gg.handlerlist = kafkahandler
gg.handler.kafkahandler.type = kafka
gg.handler.kafkahandler.TopicName = oggtopic
gg.handler.kafkahandler.SchemaTopicName = mySchemaTopic
gg.handler.kafkahandler.format = avro_op
gg.classpath = dirprm/:/u01/kafka/libs/*:/usr/lib/avro/*
Defining the Kafka Producer
Next, edit custom_kafka_producer.properties to point to your Kafka service and define compression settings.
bootstrap.servers=sandbox:9092
acks=1
compression.type=gzip
4. Topic Management and Zookeeper
Kafka requires Zookeeper to manage topics. For this setup, we assume Zookeeper is running on port 2181 and Kafka is in standalone mode.
Creating Topics
While OGG can create topics automatically, creating them manually allows for custom partitions and replication factors.
bin/kafka-topics.sh --zookeeper sandbox:2181 --create --topic oggtopic --partitions 1 --replication-factor 1
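Once created, the topics can be sanity-checked from the same Kafka installation. The commands below are a sketch reusing the sandbox host and ports from this setup; the console consumer shown uses the Zookeeper-based syntax of the Kafka releases contemporary with OGG 12.2:

```
# List the topics registered in Zookeeper to confirm creation
bin/kafka-topics.sh --zookeeper sandbox:2181 --list

# Tail the data topic to watch raw records arrive from the OGG handler
bin/kafka-console-consumer.sh --zookeeper sandbox:2181 --topic oggtopic --from-beginning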
5. Integrating Flume for HDFS Storage
Because Kafka is a streaming system, it doesn't natively "save" files to HDFS. We use Flume as a consumer that picks up records from the Kafka topics and writes them to HDFS in Avro format.
Flume Configuration Example
We define two sources (one for data, one for schema) and two sinks pointing to the same HDFS location.
agent.sources.ogg1.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.ogg1.topic = oggtopic
agent.sinks.hdfs1.hdfs.path = hdfs://sandbox/user/oracle/ggflume
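A fuller sketch of such an agent is shown below. The agent name, channel names, and the second source/sink pair for the schema topic are assumptions added for completeness (they are not from the original snippet); the property names follow the Flume 1.6-era Kafka source and HDFS sink:

```
# Hypothetical flume.conf sketch: two Kafka sources feeding two HDFS sinks
agent.sources = ogg1 ogg2
agent.channels = ch1 ch2
agent.sinks = hdfs1 hdfs2

# Source 1: the data topic
agent.sources.ogg1.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.ogg1.zookeeperConnect = sandbox:2181
agent.sources.ogg1.topic = oggtopic
agent.sources.ogg1.channels = ch1

# Source 2: the schema topic
agent.sources.ogg2.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.ogg2.zookeeperConnect = sandbox:2181
agent.sources.ogg2.topic = mySchemaTopic
agent.sources.ogg2.channels = ch2

agent.channels.ch1.type = memory
agent.channels.ch2.type = memory

# Both sinks write to the same HDFS location
agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.channel = ch1
agent.sinks.hdfs1.hdfs.path = hdfs://sandbox/user/oracle/ggflume
agent.sinks.hdfs1.hdfs.fileType = DataStream
agent.sinks.hdfs2.type = hdfs
agent.sinks.hdfs2.channel = ch2
agent.sinks.hdfs2.hdfs.path = hdfs://sandbox/user/oracle/ggflume
agent.sinks.hdfs2.hdfs.fileType = DataStream
```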
6. Executing the Initial Load and Ongoing Replication
With the infrastructure ready, we can run a replicat for the initial data load.
Initial Load Results
Upon running the replicat, Flume generates files in HDFS. Typically, schema definitions for each table are stored in separate files, while data changes are bundled together.
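To check the result, you can list the sink directory defined in the Flume configuration (the path below is the one used in the sink definition above):

```
hdfs dfs -ls /user/oracle/ggflume
```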
Starting Ongoing Replication
For active replication, ensure your replicat points to the correct trail file sequence. If your extract has progressed, you may need to alter the replicat position.
GGSCI> alter replicat rkafka EXTSEQNO 43
GGSCI> start replicat rkafka
7. Testing Data and Schema Synchronization
To verify the pipeline, we perform DML and DDL operations on the Oracle source and monitor the HDFS output.
DML (Insert/Update)
Performing an INSERT or UPDATE results in new files on HDFS containing the operation flag and the data values. For updates, both the old and new values are captured, allowing for full change reconstruction.
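As a rough illustration only, an UPDATE record in the avro_op format carries an operation type flag ("I" for insert, "U" for update, "D" for delete) plus before and after images. Rendered as JSON it might look like the sketch below; the table and column names are hypothetical and the field layout is approximate, not a byte-exact record:

```
{
  "table": "OGGTEST.TEST_TAB_1",
  "op_type": "U",
  "op_ts": "2016-03-24 12:00:00.000000",
  "before": { "ID": 1, "RND_STR": "old value" },
  "after":  { "ID": 1, "RND_STR": "new value" }
}
```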
DDL and Truncate Behavior
DDL changes behave slightly differently:
- Truncate: Interestingly, a TRUNCATE command might not trigger a new file on HDFS until subsequent DML occurs.
- Alter Table: Adding a column won't generate a new schema definition file until the next DML operation is performed on that table.
8. Performance and Summary
In testing with JMeter, the combination of GoldenGate, Kafka, and Flume easily sustained 225 transactions per second (30% inserts, 70% updates) with negligible lag.
While this test is a simple standalone scenario, it demonstrates that the adapters are robust and stable. In a production environment, you would likely integrate Kafka with a processing engine like Spark or Storm, but for raw data movement to a Big Data lake, this setup is highly effective.
Stay tuned for our next post, where we will explore the HBase adapter.