GoldenGate 12.2 big data adapters: part 1 - HDFS

8 min read
Feb 29, 2016 12:00:00 AM

December 2015 brought us a new version of GoldenGate and a new version of the Big Data adapters for GoldenGate. Let's have a look at what we have now and how it works. I am going to start with the HDFS adapter.

Preparing the Source Oracle Database

As a first step, we need to prepare our source database for replication. It becomes easier with every new GoldenGate version. We will need to perform several steps:

a) Enable archive logging on our database. This particular step requires downtime, since the database has to be shut down and restarted in mount mode first.

orcl> alter database mount;

Database altered.

orcl> alter database archivelog;

Database altered.

orcl> alter database open;

b) Enable force logging and minimal supplemental logging. No need to shut down the database for this.

orcl> alter database add supplemental log data;

Database altered.

orcl> alter database force logging;

Database altered.

orcl> SELECT supplemental_log_data_min, force_logging FROM v$database;

SUPPLEME FORCE_LOGGING
-------- ---------------------------------------
YES      YES

c) Switch the parameter "enable_goldengate_replication" to "TRUE". This can be done online.

orcl> alter system set enable_goldengate_replication=true sid='*' scope=both;

System altered.

And we are almost done. Now we can create a schema for a GoldenGate administrator and grant the required privileges. I've just granted the DBA role to the user to simplify the process; you will need broad privileges in any case for integrated capture. For a production installation, I advise you to consult the documentation to verify the necessary privileges and roles.

orcl> create user ogg identified by welcome1 default tablespace users temporary tablespace temp;
orcl> grant connect, dba to ogg;
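If you prefer not to hand out DBA outright, the documented route is the grant_admin_privilege procedure. A sketch of that alternative (verify the exact privileges required for your version against the documentation):

```sql
-- Documented alternative to a blanket DBA grant: give the ogg user
-- the GoldenGate capture/apply privileges explicitly.
BEGIN
  dbms_goldengate_auth.grant_admin_privilege('OGG');
END;
/
```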

Let's create a test schema to be replicated. I will call the schema ggtest on the source and name the destination schema bdtest. This will also let us check how mapping works in our replication.

orcl> create tablespace ggtest; -- optional step
orcl> create user ggtest identified by welcome1 default tablespace ggtest temporary tablespace temp;
orcl> grant connect, resource to ggtest;
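The post later replicates tables such as test_tab_1, whose DDL is not shown. Based on the columns visible in the replicated data further down, a plausible definition looks like this (column types are my assumption):

```sql
-- Hypothetical test table; column names follow the ones visible in
-- the replicated output later in the post, types are assumed.
create table ggtest.test_tab_1 (
  pk_id     number primary key,
  rnd_str   varchar2(10),
  use_date  date,
  rnd_str_1 varchar2(10),
  acc_date  date
);
```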

Installing and Configuring Oracle GoldenGate for Oracle

Everything is ready on our source database for the replication. Now we install Oracle GoldenGate for Oracle on our database server. We can get the software from the Oracle site's download page, in the Middleware section, under GoldenGate, Oracle GoldenGate for Oracle databases. We are going to use version 12.2.0.1.1 of the software.

The installation is easy: you unzip the software and run the installer, which guides you through a couple of simple steps. The installer will unpack the software to the destination location, create subdirectories, and register GoldenGate in the Oracle inventory.

[oracle@sandbox distr]$ unzip fbo_ggs_Linux_x64_shiphome.zip
[oracle@sandbox distr]$ cd fbo_ggs_Linux_x64_shiphome/Disk1/
[oracle@sandbox Disk1]$ ./runInstaller

We continue by setting up parameters for the Oracle GoldenGate (OGG) manager and starting it up. You can see that I've used the default Blowfish encryption for the password. In a production environment you may consider a stronger algorithm such as AES256. I've also used a non-default port for the manager, since I have more than one GoldenGate installation on my test sandbox.

[oracle@sandbox ~]$ export OGG_HOME=/u01/oggora
[oracle@sandbox ~]$ cd $OGG_HOME
[oracle@sandbox oggora]$ ./ggsci

GGSCI (sandbox.localdomain) 1> encrypt password welcome1 BLOWFISH ENCRYPTKEY DEFAULT
Using Blowfish encryption with DEFAULT key.
Encrypted password:  AACAAAAAAAAAAAIARIXFKCQBMFIGFARA
Algorithm used:  BLOWFISH

GGSCI (sandbox.localdomain) 2> edit params mgr

PORT 7829
userid ogg@orcl,password AACAAAAAAAAAAAIARIXFKCQBMFIGFARA, BLOWFISH, ENCRYPTKEY DEFAULT
purgeoldextracts /u01/oggora/dirdat/*, usecheckpoints

GGSCI (sandbox.localdomain) 3> start manager
Manager started.

Setting Up Initial Load and Continuous Extraction

Let's prepare everything for the initial load, and later for online replication. I've decided to use a GoldenGate initial-load extract for the initial load, to keep the resulting dataset on Hadoop consistent. I prepared a parameter file to replicate my ggtest schema and upload all data to a trail file on the remote site. I've used a minimum number of options for all my processes, providing only the handful of parameters required for replication.

Extract options are a subject deserving a dedicated blog post. Here is my simple initial-load extract:

[oracle@sandbox oggora]$ cat /u01/oggora/dirprm/ini_ext.prm
SOURCEISTABLE
userid ogg@orcl,password AACAAAAAAAAAAAIARIXFKCQBMFIGFARA, BLOWFISH, ENCRYPTKEY DEFAULT
--RMTHOSTOPTIONS
RMTHOST sandbox, MGRPORT 7839
RMTFILE /u01/oggbd/dirdat/initld, MEGABYTES 2, PURGE
--DDL include objname ggtest.*
TABLE ggtest.*;

Then we run the initial-load extract in passive mode; it will create a trail file with the data. That trail file will be used later for our initial load on the target side.

[oracle@sandbox oggora]$ ./extract paramfile dirprm/ini_ext.prm reportfile dirrpt/ini_ext.rpt
[oracle@sandbox oggora]$ ll /u01/oggbd/dirdat/initld*
-rw-r-----. 1 oracle oinstall 3028 Feb 16 14:17 /u01/oggbd/dirdat/initld

We can prepare our online extract on the source site as well. I haven't used a datapump in my configuration, limiting the topology to the simplest, straightforward extract-to-replicat setup. Of course, in any production configuration I would advise using a datapump on the source for staging our data.

Here are my extract parameters and how I added the extract. I am not starting it yet, because I must first have an Oracle GoldenGate Manager running on the target, and the directory for the trail file must exist. You may have guessed that the Big Data GoldenGate will be located in the /u01/oggbd directory.

[oracle@sandbox oggora]$ ggsci

GGSCI (sandbox.localdomain) 1> edit params ggext

extract ggext
userid ogg@orcl,password AACAAAAAAAAAAAIARIXFKCQBMFIGFARA, BLOWFISH, ENCRYPTKEY DEFAULT
--RMTHOSTOPTIONS
RMTHOST sandbox, MGRPORT 7839
RMTFILE /u01/oggbd/dirdat/or, MEGABYTES 2, PURGE
DDL include objname ggtest.*
TABLE ggtest.*;

GGSCI (sandbox.localdomain) 2> dblogin userid ogg@orcl,password AACAAAAAAAAAAAIARIXFKCQBMFIGFARA, BLOWFISH, ENCRYPTKEY DEFAULT
Successfully logged into database.

GGSCI (sandbox.localdomain as ogg@orcl) 3> register extract GGEXT database

2016-02-16 15:37:21  INFO    OGG-02003  Extract GGEXT successfully registered with database at SCN 17151616.

GGSCI (sandbox.localdomain as ogg@orcl) 4> add extract ggext, INTEGRATED TRANLOG, BEGIN NOW
EXTRACT (Integrated) added.

Installing and Initializing OGG for Big Data on the Target

Let's leave our source site for a while and switch to the target. Our target is going to be a box where we have a Hadoop client and all required Java classes. I used the same box just to save resources in my sandbox environment. You may run different GoldenGate versions on the same box, provided the Manager ports for each of them are different. Essentially, we need a Hadoop client on the box which can connect to HDFS and write data there.

Having all the required Hadoop classes in place, we continue by installing Oracle GoldenGate for Big Data, then configuring and starting it up. The adapters used to be well "hidden" on Oracle eDelivery, but now it is much simpler: you go to the GoldenGate download page on the Oracle site and find the section "Oracle GoldenGate for Big Data 12.2.0.1.0".

I created a directory /u01/oggbd as our GoldenGate home and unpacked the tar archive there. The next step is to create all necessary directories.

[oracle@sandbox ~]$ cd /u01/oggbd/
[oracle@sandbox oggbd]$ ./ggsci

GGSCI (sandbox.localdomain) 1> create subdirs

Creating subdirectories under current directory /u01/oggbd
...

We change the port for our manager process from the default and start it up.

GGSCI (sandbox.localdomain) 2> edit params mgr

PORT 7839
.....

GGSCI (sandbox.localdomain) 3> start manager
Manager started.

Configuring the HDFS Adapter and Replicat Processes

Now we have to prepare parameter files for our replicat processes. Examples of the parameter files can be taken from the $OGG_HOME/AdapterExamples/big-data directory. Today we are going to work with the HDFS adapter.

I copied the files to my parameter file directory ($OGG_HOME/dirprm) and modified them accordingly:

[oracle@bigdata dirprm]$ cat hdfs.props
gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.includeTokens=false
gg.handler.hdfs.maxFileSize=1g
gg.handler.hdfs.rootFilePath=/user/oracle/gg
gg.handler.hdfs.fileRollInterval=0
gg.handler.hdfs.inactivityRollInterval=0
gg.handler.hdfs.fileSuffix=.txt
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.rollOnMetadataChange=true
gg.handler.hdfs.authType=none
gg.handler.hdfs.format=delimitedtext
#gg.handler.hdfs.format.includeColumnNames=true
gg.handler.hdfs.mode=tx

goldengate.userexit.timestamp=utc
goldengate.userexit.writers=javawriter
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE

gg.log=log4j
gg.log.level=INFO
gg.report.time=30sec

gg.classpath=/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop/etc/hadoop/:/usr/lib/hadoop/lib/native/*
javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar

Key parameters to consider:

  • gg.handler.hdfs.rootFilePath: the location on HDFS where the adapter writes the replicated data.
  • gg.handler.hdfs.format: the output format; delimitedtext here, one of the formats supported by the adapter.
  • gg.classpath: the location of your Hadoop jar classes and native libraries.
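If you are not sure which paths to put into gg.classpath, the Hadoop client itself can tell you: the standard `hadoop classpath` command prints the jar and config locations of the local installation. A small guarded sketch (it only produces real paths on a box with the hadoop CLI installed):

```shell
# Print the local Hadoop client's jar/config paths as a starting point
# for gg.classpath; fall back gracefully where no hadoop CLI exists.
if command -v hadoop >/dev/null 2>&1; then
  hadoop classpath
else
  echo "hadoop CLI not on PATH"
fi
```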

The next parameter file is for our initial-load replicat.

[oracle@sandbox dirprm]$ cat /u01/oggbd/dirprm/irhdfs.prm
--passive REPLICAT for initial load irhdfs
SPECIALRUN
END RUNTIME
EXTFILE /u01/oggbd/dirdat/initld
--DDLERROR default discard
setenv HADOOP_COMMON_LIB_NATIVE_DIR=/usr/lib/hadoop/lib/native
DDL include all
TARGETDB LIBFILE libggjava.so SET property=dirprm/hdfs.props
REPORTCOUNT EVERY 1 MINUTES, RATE
GROUPTRANSOPS 10000
MAP ggtest.*, TARGET bdtest.*;

Executing the Initial Load to Hadoop

I ran the initial load in passive mode, without creating a managed process, just running it from the command line.

[oracle@sandbox oggbd]$ ./replicat paramfile dirprm/irhdfs.prm reportfile dirrpt/ini_rhdfs.rpt
[oracle@sandbox oggbd]$ hadoop fs -ls /user/oracle/gg/
Found 2 items
drwxr-xr-x   - oracle oracle          0 2016-02-16 14:37 /user/oracle/gg/bdtest.test_tab_1
drwxr-xr-x   - oracle oracle          0 2016-02-16 14:37 /user/oracle/gg/bdtest.test_tab_2
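To peek at the actual rows that landed on HDFS, you can cat the files under the rootFilePath. The `*.txt` glob follows the fileSuffix configured above but the exact file names are my assumption; the guard keeps the snippet harmless on a box without the hadoop CLI:

```shell
if command -v hadoop >/dev/null 2>&1; then
  # Show a few of the delimited-text rows the adapter produced.
  hadoop fs -cat '/user/oracle/gg/bdtest.test_tab_1/*.txt' | head -5
else
  echo "hadoop CLI not on PATH"
fi
```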

As soon as the initial load has finished, we can start our extract and replicat to keep the destination side updated.

GGSCI (sandbox.localdomain as ogg@orcl) 6> start extract ggext

Sending START request to MANAGER ...
EXTRACT GGEXT starting

On the target site, prepare and start the online replicat:

GGSCI (sandbox.localdomain) 1> add replicat rhdfs, exttrail dirdat/or
REPLICAT added.

GGSCI (sandbox.localdomain) 2> start replicat rhdfs
REPLICAT RHDFS starting

Integrating with Hive and Analyzing Replicated Metadata

Our replication is up and running. I've created an external Hive table for the table test_tab_1 to have a better look.

hive> CREATE EXTERNAL TABLE BDTEST.TEST_TAB_1 (tran_flag string, tab_name string, tran_time_utc timestamp, tran_time_loc string, something string, something1 string,
    > PK_ID INT, RND_STR VARCHAR(10), USE_DATE string, RND_STR_1 string, ACC_DATE string)
    > stored as textfile location '/user/oracle/gg/bdtest.test_tab_1';

hive> select * from BDTEST.TEST_TAB_1;
OK
I  BDTEST.TEST_TAB_1  2016-02-16 19:17:40.746699  2016-02-16T14:37:43.373000  00000000-10000002012  1  371O62FX  2014-01-24:19:09:20  RJ68QYM5  2014-01-22:12:14:30
...

You can see the table definition is a bit different because we have additional columns on the destination side for auditing and transaction tracking:

  1. tran_flag: Flag for operation ("I" for Insert, "D" for Delete, "U" for Update).
  2. tab_name: Source table name.
  3. tran_time_utc: Timestamp of when the transaction occurred.
  4. tran_time_loc: Commit time in local timezone.
  5. Commit sequence number.

Evaluating DML and DDL Support Constraints

It works for deletes too; the only difference is that the flag will be "D" instead of "I". Let's test an insert and an update:

orcl> insert into ggtest.test_tab_1 values (5,'TEST_1',sysdate,'TEST_1',sysdate);
orcl> commit;
orcl> update ggtest.test_tab_1 set RND_STR='TEST_1_1' where PK_ID=5;
orcl> commit;

The Hive table reflects these changes as new rows, preserving the history of operations.
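For example, a simple query against the external Hive table shows the full operation history for one key (column names follow the Hive DDL above; pk_id 5 is the row touched by the insert and update):

```sql
-- Operation history for one source row: one line per replicated operation,
-- ordered by the transaction timestamp the adapter recorded.
SELECT tran_flag, tran_time_utc, rnd_str
FROM bdtest.test_tab_1
WHERE pk_id = 5
ORDER BY tran_time_utc;
```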

Regarding DDL support, truncating the table results in no records in the HDFS files. However, creating a new table or adding/dropping columns works well by creating new directories or files.

One major limitation discovered: Create Table As Select (CTAS) broke the replicat with an error stating "DDL Replication is not supported for this database."

2016-02-16 15:48:49  ERROR   OGG-00453  Oracle GoldenGate Delivery, rhdfs.prm:  DDL Replication is not supported for this database.
2016-02-16 15:48:49  ERROR   OGG-01668  Oracle GoldenGate Delivery, rhdfs.prm:  PROCESS ABENDING.

What can we say in summary? The replication works and supports all DML and some DDL commands. You will need to do some extra work, using the operation flags and transaction timestamps, to derive a consistent dataset for any given point in time.
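As a minimal sketch of that extra work, here is a shell pipeline over made-up sample rows (I assume tab-separated delimitedtext output with the flag, table name, UTC timestamp, and primary key in the first columns; the real layout depends on the handler's format settings). It keeps only the latest operation per key and drops keys whose last operation was a delete, reconstructing the current state of the table:

```shell
# Made-up sample rows in a delimitedtext-like layout:
# flag <tab> table <tab> tran_time_utc <tab> pk_id <tab> rnd_str
printf 'I\tBDTEST.TEST_TAB_1\t2016-02-16 19:17:40\t5\tTEST_1\n'   >  /tmp/sample_rows.txt
printf 'U\tBDTEST.TEST_TAB_1\t2016-02-16 19:20:11\t5\tTEST_1_1\n' >> /tmp/sample_rows.txt
printf 'I\tBDTEST.TEST_TAB_1\t2016-02-16 19:21:00\t6\tTEST_2\n'   >> /tmp/sample_rows.txt
printf 'D\tBDTEST.TEST_TAB_1\t2016-02-16 19:25:30\t6\tTEST_2\n'   >> /tmp/sample_rows.txt

# Sort by key, then timestamp; keep the last row seen per key, and drop
# keys whose final operation was a delete.
sort -t"$(printf '\t')" -k4,4n -k3,3 /tmp/sample_rows.txt \
  | awk -F'\t' '{last[$4]=$0}
      END {for (k in last) {split(last[k], f, "\t"); if (f[1] != "D") print last[k]}}'
```

The same grouping logic carries over to a Hive or Spark job; treat the field positions here as placeholders for your actual column layout.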
