This article continues our deep dive into the GoldenGate Big Data adapters, specifically focusing on the Kafka adapter. If you missed the previous entries, check out Part 1 on HDFS and Part 2 on Flume.
Kafka is a distributed publish-subscribe messaging system. A common question is: how does it differ from Flume? Flume moves data through pre-built sources, sinks, and interceptors, while Kafka is a general-purpose system: the destination receives exactly what you publish at the source, and most of the control lives in the consumer programs you write. In this guide, we will use Kafka and Flume together to leverage the strengths of both.
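To see this pass-through behavior for yourself, you can read a topic directly with the console consumer that ships with Kafka. This is a sketch, assuming the sandbox host and oggtopic topic used later in this article, and the older Zookeeper-based consumer flags of that Kafka era:

```shell
# Print everything published to the topic so far, exactly as produced
bin/kafka-console-consumer.sh --zookeeper sandbox:2181 \
    --topic oggtopic --from-beginning
```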
Our configuration remains streamlined to focus on the adapter functionality:
To begin, copy the necessary configuration files from the adapter examples directory into your GoldenGate directory.
cp $OGG_HOME/AdapterExamples/big-data/kafka/* $OGG_HOME/dirdat/
In dirprm/kafka.props, you must define the Kafka topics for data flow and schema changes, as well as the Java classpath for the Kafka and Avro classes.
# Snippet of dirprm/kafka.props
gg.handlerlist = kafkahandler
gg.handler.kafkahandler.type = kafka
gg.handler.kafkahandler.TopicName = oggtopic
gg.handler.kafkahandler.SchemaTopicName = mySchemaTopic
gg.handler.kafkahandler.format = avro_op
gg.classpath = dirprm/:/u01/kafka/libs/*:/usr/lib/avro/*
Next, edit custom_kafka_producer.properties to point to your Kafka service and define compression settings.
bootstrap.servers=sandbox:9092
acks=1
compression.type=gzip
Kafka requires Zookeeper to manage topics. For this setup, we assume Zookeeper is running on port 2181 and Kafka is in standalone mode.
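If Zookeeper and Kafka are not already running, the standalone scripts bundled with the Kafka distribution can start both. This assumes you launch from the Kafka installation directory with the default configuration files:

```shell
# Start Zookeeper (listens on 2181 by default), then a single broker
bin/zookeeper-server-start.sh config/zookeeper.properties &
bin/kafka-server-start.sh config/server.properties &
```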
While OGG can create topics automatically, creating them manually allows for custom partitions and replication factors.
bin/kafka-topics.sh --zookeeper sandbox:2181 --create --topic oggtopic --partitions 1 --replication-factor 1
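You can then confirm the topic exists and check its partition and replication settings:

```shell
# Describe the topic created above
bin/kafka-topics.sh --zookeeper sandbox:2181 --describe --topic oggtopic
```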
Because Kafka is a streaming system, it doesn't natively "save" files to HDFS. We use Flume as a consumer to pick up data from Kafka topics and write them to HDFS in Avro format.
We define two sources (one for data, one for schema) and two sinks pointing to the same HDFS location.
agent.sources.ogg1.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.ogg1.topic = oggtopic
agent.sinks.hdfs1.hdfs.path = hdfs://sandbox/user/oracle/ggflume
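For context, a fuller sketch of one source/channel/sink chain of such an agent is shown below. The channel name, zookeeperConnect value, and fileType setting are assumptions based on the standard Flume Kafka source and HDFS sink, not taken from the article's actual file; the schema chain (ogg2/ch2/hdfs2) would mirror this one with the schema topic:

```
agent.sources = ogg1 ogg2
agent.channels = ch1 ch2
agent.sinks = hdfs1 hdfs2

agent.sources.ogg1.type = org.apache.flume.source.kafka.KafkaSource
agent.sources.ogg1.zookeeperConnect = sandbox:2181
agent.sources.ogg1.topic = oggtopic
agent.sources.ogg1.channels = ch1

agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.hdfs.path = hdfs://sandbox/user/oracle/ggflume
agent.sinks.hdfs1.hdfs.fileType = DataStream
agent.sinks.hdfs1.channel = ch1

agent.channels.ch1.type = memory
```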
With the infrastructure ready, we can run a passive replicat for the initial data load.
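If the replicat has not been registered yet, a minimal GGSCI sketch looks like the following; the trail path assumes the sample trail shipped with the adapter examples, so adjust it to your own trail location:

```
GGSCI> add replicat rkafka, exttrail AdapterExamples/trail/tr
GGSCI> start replicat rkafka
```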
Upon running the replicat, Flume generates files in HDFS. Typically, schema definitions for each table are stored in separate files, while data changes are bundled together.
For active replication, ensure your replicat points to the correct trail file sequence. If your extract has progressed, you may need to alter the replicat position.
GGSCI> alter replicat rkafka, EXTSEQNO 43
GGSCI> start replicat rkafka
To verify the pipeline, we perform DML and DDL operations on the Oracle source and monitor the HDFS output.
Performing an INSERT or UPDATE results in new files on HDFS containing the operation flag and the data values. For updates, both the old and new values are captured, allowing for full change reconstruction.
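To inspect what landed, you can list and dump the Flume output directory on HDFS. The path matches the sink configuration shown earlier; the FlumeData wildcard assumes Flume's default file naming:

```shell
# List the files Flume has rolled out, then dump their contents
hdfs dfs -ls /user/oracle/ggflume
hdfs dfs -cat /user/oracle/ggflume/FlumeData.*
```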
DDL changes behave slightly differently:
A TRUNCATE command might not trigger a new file on HDFS until subsequent DML occurs.

In testing with JMeter, the combination of GoldenGate, Kafka, and Flume easily sustained 225 transactions per second (30% inserts, 70% updates) with negligible lag.
While this test is a simple standalone scenario, it demonstrates that the adapters are robust and stable. In a production environment, you would likely integrate Kafka with a processing engine like Spark or Storm, but for raw data movement to a Big Data lake, this setup is highly effective.
Stay tuned for our next post, where we will explore the HBase adapter.