GoldenGate 12.2 big data adapters: part 1 - HDFS
December 2015 brought us a new version of GoldenGate, along with new versions of its Big Data adapters. Let's have a look at what we have now and how it works. I am going to start with the HDFS adapter.
Preparing the Source Oracle Database
As a first step, we need to prepare our source database for replication. It becomes easier with every new GoldenGate version. We will need to perform several steps:
a) Enable archive logging on our database. This particular step requires downtime.
orcl> shutdown immediate;
orcl> startup mount;
orcl> alter database archivelog;
Database altered.
orcl> alter database open;
b) Enable force logging and minimal supplemental logging. No need to shut down the database for this.
orcl> alter database add supplemental log data;
Database altered.
orcl> alter database force logging;
Database altered.
orcl> SELECT supplemental_log_data_min, force_logging FROM v$database;

SUPPLEME FORCE_LOGGING
-------- ---------------------------------------
YES      YES
c) Switch the parameter "enable_goldengate_replication" to "TRUE". This can be done online.
orcl> alter system set enable_goldengate_replication=true scope=both sid='*';
System altered.
And we are almost done. Now we can create a schema for a GoldenGate administrator and grant the required privileges. I've simply granted the DBA role to the user to keep the process short; you will need it anyway for integrated capture. For a production installation, I advise you to check the documentation to verify the necessary privileges and roles.
orcl> create user ogg identified by welcome1 default tablespace users temporary tablespace temp;
orcl> grant connect, dba to ogg;
Let's create a test schema to be replicated. The schema will be called ggtest on the source, and I will name the destination schema bdtest. This will also let us check how mapping works in our replication.
orcl> create tablespace ggtest; -- optional step
orcl> create user ggtest identified by welcome1 default tablespace ggtest temporary tablespace temp;
orcl> grant connect, resource to ggtest;
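The demo tables themselves aren't shown in this post. For reference, a source table consistent with the columns that appear later in the Hive output could look like this (an illustrative sketch, not the exact DDL used here):

```
-- Hypothetical definition for the test table; the column names match
-- what we will see replicated to HDFS later (PK_ID, RND_STR, ...).
create table ggtest.test_tab_1 (
  pk_id     number primary key,
  rnd_str   varchar2(10),
  use_date  date,
  rnd_str_1 varchar2(10),
  acc_date  date
);
```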
Installing and Configuring Oracle GoldenGate for Oracle
Everything is ready on our source database for the replication. Now we install Oracle GoldenGate for Oracle on our database server. We can get the software from the Oracle download page in the Middleware section, under GoldenGate, Oracle GoldenGate for Oracle databases. We are going to use version 12.2.0.1.1 of the software.
The installation is easy: unzip the software and run the installer, which guides you through a couple of simple steps. The installer unpacks the software to the destination location, creates subdirectories, and registers GoldenGate in the Oracle central inventory.
[oracle@sandbox distr]$ unzip fbo_ggs_Linux_x64_shiphome.zip
[oracle@sandbox distr]$ cd fbo_ggs_Linux_x64_shiphome/Disk1/
[oracle@sandbox Disk1]$ ./runInstaller
We continue by setting up parameters for the Oracle GoldenGate (OGG) manager and starting it up. You can see that I've used the default Blowfish encryption for the password. In a production environment you may consider a stronger algorithm such as AES256. I've also used a non-default port for the manager, since I have more than one GoldenGate installation on my test sandbox.
[oracle@sandbox ~]$ export OGG_HOME=/u01/oggora
[oracle@sandbox ~]$ cd $OGG_HOME
[oracle@sandbox oggora]$ ./ggsci

GGSCI (sandbox.localdomain) 1> encrypt password welcome1 BLOWFISH ENCRYPTKEY DEFAULT
Using Blowfish encryption with DEFAULT key.
Encrypted password:  AACAAAAAAAAAAAIARIXFKCQBMFIGFARA
Algorithm used:  BLOWFISH

GGSCI (sandbox.localdomain) 2> edit params mgr

PORT 7829
userid ogg@orcl, password AACAAAAAAAAAAAIARIXFKCQBMFIGFARA, BLOWFISH, ENCRYPTKEY DEFAULT
purgeoldextracts /u01/oggora/dirdat/*, usecheckpoints

GGSCI (sandbox.localdomain) 3> start manager
Manager started.
Setting Up Initial Load and Continuous Extraction
Let's prepare everything for the initial load and, later, online replication. I've decided to use a GoldenGate initial-load extract for the initial load, for the sake of consistency of the resulting dataset on Hadoop. I prepared a parameter file to replicate my ggtest schema and upload all data to a trail file on the remote site. I've used a minimal number of options for all my processes, providing only the handful of parameters required for replication.
Extract options are a subject deserving a dedicated blog post. Here is my simple initial-load extract:
[oracle@sandbox oggora]$ cat /u01/oggora/dirprm/ini_ext.prm
SOURCEISTABLE
userid ogg@orcl, password AACAAAAAAAAAAAIARIXFKCQBMFIGFARA, BLOWFISH, ENCRYPTKEY DEFAULT
--RMTHOSTOPTIONS
RMTHOST sandbox, MGRPORT 7839
RMTFILE /u01/oggbd/dirdat/initld, MEGABYTES 2, PURGE
--DDL include objname ggtest.*
TABLE ggtest.*;
Then we run the initial-load extract in passive mode, and it creates a trail file with the data. The trail file will be used later for our initial load on the target side.
[oracle@sandbox oggora]$ ./extract paramfile dirprm/ini_ext.prm reportfile dirrpt/ini_ext.rpt
[oracle@sandbox oggora]$ ll /u01/oggbd/dirdat/initld*
-rw-r-----. 1 oracle oinstall 3028 Feb 16 14:17 /u01/oggbd/dirdat/initld
We can also prepare our online extract on the source side. I haven't used a data pump in my configuration, limiting the topology to the simplest, straightforward extract-to-replicat setup. Of course, in any production configuration I would advise using a data pump on the source for staging the data.
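For completeness, had a data pump been used, the main extract would write to a local trail and a second, pump extract would ship it to the target. A minimal sketch, where the pump name and trail paths are hypothetical, not taken from this setup:

```
-- dirprm/pmphdfs.prm (hypothetical data pump parameter file)
EXTRACT pmphdfs
PASSTHRU
RMTHOST sandbox, MGRPORT 7839
RMTTRAIL /u01/oggbd/dirdat/rt
TABLE ggtest.*;

-- and in GGSCI, assuming the main extract writes to local trail ./dirdat/lt:
-- ADD EXTRACT pmphdfs, EXTTRAILSOURCE ./dirdat/lt
-- ADD RMTTRAIL /u01/oggbd/dirdat/rt, EXTRACT pmphdfs
```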
Here are my extract parameters and how I added the extract. I am not starting it yet, because the Oracle GoldenGate Manager must be running on the target and the directory for the trail file must exist. You may have guessed that the Big Data GoldenGate will be located in the /u01/oggbd directory.
[oracle@sandbox oggora]$ ggsci

GGSCI (sandbox.localdomain) 1> edit params ggext

extract ggext
userid ogg@orcl, password AACAAAAAAAAAAAIARIXFKCQBMFIGFARA, BLOWFISH, ENCRYPTKEY DEFAULT
--RMTHOSTOPTIONS
RMTHOST sandbox, MGRPORT 7839
RMTFILE /u01/oggbd/dirdat/or, MEGABYTES 2, PURGE
DDL include objname ggtest.*
TABLE ggtest.*;

GGSCI (sandbox.localdomain) 2> dblogin userid ogg@orcl, password AACAAAAAAAAAAAIARIXFKCQBMFIGFARA, BLOWFISH, ENCRYPTKEY DEFAULT
Successfully logged into database.

GGSCI (sandbox.localdomain as ogg@orcl) 3> register extract GGEXT database

2016-02-16 15:37:21  INFO    OGG-02003  Extract GGEXT successfully registered with database at SCN 17151616.

GGSCI (sandbox.localdomain as ogg@orcl) 4> add extract ggext, INTEGRATED TRANLOG, BEGIN NOW
EXTRACT (Integrated) added.
Installing and Initializing OGG for Big Data on the Target
Let's leave our source site for a while and switch to the target. Our target is going to be a box with a Hadoop client and all the required Java classes. I used the same box, just to save resources in my sandbox environment. You may run different GoldenGate versions on the same box, provided the Manager ports for each of them are different. Essentially, we need a Hadoop client on the box that can connect to HDFS and write data there.
Having all the required Hadoop classes in place, we continue by installing Oracle GoldenGate for Big Data, then configuring and starting it up. The adapters used to be well "hidden" on Oracle eDelivery, but now it is much simpler: go to the GoldenGate download page on the Oracle site and find the section "Oracle GoldenGate for Big Data 12.2.0.1.0".
I created a directory /u01/oggbd as our GoldenGate home and unpacked the tar archive there. The next step is to create all necessary directories.
[oracle@sandbox ~]$ cd /u01/oggbd/
[oracle@sandbox oggbd]$ ./ggsci

GGSCI (sandbox.localdomain) 1> create subdirs

Creating subdirectories under current directory /u01/oggbd ...

GGSCI (sandbox.localdomain) 2>
We change the port for our Manager process from the default and start it up.
GGSCI (sandbox.localdomain) 2> edit params mgr

PORT 7839
.....

GGSCI (sandbox.localdomain) 3> start manager
Manager started.
Configuring the HDFS Adapter and Replicat Processes
Now we have to prepare parameter files for our replicat processes. Examples of the parameter files can be found in the $OGG_HOME/AdapterExamples/big-data directories. Today we are going to work with the HDFS adapter.
I copied the files to my parameter file directory ($OGG_HOME/dirprm) and modified them accordingly:
[oracle@bigdata dirprm]$ cat hdfs.props
gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.includeTokens=false
gg.handler.hdfs.maxFileSize=1g
gg.handler.hdfs.rootFilePath=/user/oracle/gg
gg.handler.hdfs.fileRollInterval=0
gg.handler.hdfs.inactivityRollInterval=0
gg.handler.hdfs.fileSuffix=.txt
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.rollOnMetadataChange=true
gg.handler.hdfs.authType=none
gg.handler.hdfs.format=delimitedtext
#gg.handler.hdfs.format.includeColumnNames=true
gg.handler.hdfs.mode=tx

goldengate.userexit.timestamp=utc
goldengate.userexit.writers=javawriter
javawriter.stats.display=TRUE
javawriter.stats.full=TRUE

gg.log=log4j
gg.log.level=INFO
gg.report.time=30sec

gg.classpath=/usr/lib/hadoop/*:/usr/lib/hadoop/lib/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop/etc/hadoop/:/usr/lib/hadoop/lib/native/*

javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar
Key parameters to consider:
- gg.handler.hdfs.rootFilePath: the HDFS location where the adapter writes the replicated data.
- gg.handler.hdfs.format: the output format; we use delimitedtext here, and the adapter also supports formats such as Avro, JSON, and XML.
- gg.classpath: the location of your Hadoop jar files, configuration directory, and native libraries.
The next parameter file is for our initial-load replicat.
[oracle@sandbox dirprm]$ cat /u01/oggbd/dirprm/irhdfs.prm
--passive REPLICAT for initial load irhdfs
SPECIALRUN
END RUNTIME
EXTFILE /u01/oggbd/dirdat/initld
--DDLERROR default discard
setenv HADOOP_COMMON_LIB_NATIVE_DIR=/usr/lib/hadoop/lib/native
DDL include all
TARGETDB LIBFILE libggjava.so SET property=dirprm/hdfs.props
REPORTCOUNT EVERY 1 MINUTES, RATE
GROUPTRANSOPS 10000
MAP ggtest.*, TARGET bdtest.*;
Executing the Initial Load to Hadoop
I ran the initial load in passive mode, without creating a managed process, just running it from the command line.
[oracle@sandbox oggbd]$ ./replicat paramfile dirprm/irhdfs.prm reportfile dirrpt/ini_rhdfs.rpt
[oracle@sandbox oggbd]$ hadoop fs -ls /user/oracle/gg/
Found 2 items
drwxr-xr-x   - oracle oracle          0 2016-02-16 14:37 /user/oracle/gg/bdtest.test_tab_1
drwxr-xr-x   - oracle oracle          0 2016-02-16 14:37 /user/oracle/gg/bdtest.test_tab_2
As soon as the initial load has completed, we can start our extract and replicat to keep the destination side updated.
GGSCI (sandbox.localdomain as ogg@orcl) 6> start extract ggext

Sending START request to MANAGER ...
EXTRACT GGEXT starting
On the target site, prepare and start the online replicat:
GGSCI (sandbox.localdomain) 1> add replicat rhdfs, exttrail dirdat/or
REPLICAT added.

GGSCI (sandbox.localdomain) 2> start replicat rhdfs

REPLICAT RHDFS starting
Integrating with Hive and Analyzing Replicated Metadata
Our replication is up and running. I've created an external Hive table over the replicated files for test_tab_1 to take a closer look at the data.
hive> CREATE EXTERNAL TABLE BDTEST.TEST_TAB_1
    > (tran_flag string, tab_name string, tran_time_utc timestamp, tran_time_loc string,
    >  something string, something1 string,
    >  PK_ID INT, RND_STR VARCHAR(10), USE_DATE string, RND_STR_1 string, ACC_DATE string)
    > stored as textfile location '/user/oracle/gg/bdtest.test_tab_1';

hive> select * from BDTEST.TEST_TAB_1;
OK
I  BDTEST.TEST_TAB_1  2016-02-16 19:17:40.746699  2016-02-16T14:37:43.373000  00000000-10000002012  1  371O62FX  2014-01-24:19:09:20  RJ68QYM5  2014-01-22:12:14:30
...
You can see the table definition is a bit different, because we have additional columns on the destination side for auditing and transaction tracking:

- tran_flag: flag for the operation ("I" for insert, "D" for delete, "U" for update).
- tab_name: the source table name.
- tran_time_utc: the timestamp of when the transaction occurred.
- tran_time_loc: the commit time in the local timezone.
- The following columns carry position information for the operation, including the commit sequence number.
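Since the output is plain delimited text, it is also easy to post-process outside Hive. Here is a minimal Python sketch for parsing a record, assuming the default \x01 field delimiter of the delimitedtext format and the column layout shown above (the field names are mine, not produced by the adapter):

```python
# Parse one delimited-text record written by the HDFS handler.
# Assumes the default \x01 field delimiter and that the audit columns
# (flag, table, timestamps, position) precede the source columns.
FIELD_SEP = "\x01"
AUDIT_COLS = ["tran_flag", "tab_name", "tran_time_utc", "tran_time_loc", "pos"]

def parse_record(line, source_cols):
    fields = line.rstrip("\n").split(FIELD_SEP)
    return dict(zip(AUDIT_COLS + source_cols, fields))

sample = FIELD_SEP.join(["I", "BDTEST.TEST_TAB_1",
                         "2016-02-16 19:17:40.746699",
                         "2016-02-16T14:37:43.373000",
                         "00000000-10000002012", "5", "TEST_1"])
row = parse_record(sample, ["pk_id", "rnd_str"])
print(row["tran_flag"], row["pk_id"])  # prints: I 5
```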
Evaluating DML and DDL Support Constraints
Deletes work too; the flag will simply be "D" instead of "I". Let's test an insert and an update:
orcl> insert into ggtest.test_tab_1 values (5,'TEST_1',sysdate,'TEST_1',sysdate);
orcl> commit;
orcl> update ggtest.test_tab_1 set RND_STR='TEST_1_1' where PK_ID=5;
orcl> commit;
The Hive table reflects these changes as new rows, preserving the history of operations.
Regarding DDL support, truncating a table produces no records in the HDFS files. However, creating a new table or adding/dropping columns works well, creating new directories or files as needed.
One major limitation discovered: Create Table As Select (CTAS) broke the replicat with an error stating "DDL Replication is not supported for this database."
2016-02-16 15:48:49  ERROR   OGG-00453  Oracle GoldenGate Delivery, rhdfs.prm:  DDL Replication is not supported for this database.
2016-02-16 15:48:49  ERROR   OGG-01668  Oracle GoldenGate Delivery, rhdfs.prm:  PROCESS ABENDING.
What can we say in summary? The replication works and supports all DML and some DDL commands. To get a consistent dataset for any given point in time, you will need to use the operation flags and transaction timestamps.
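To illustrate that last point, here is a simplified Python sketch of reconstructing the table state at a point in time: replay the operation flags in timestamp order up to a cutoff. It assumes update records carry full row images, and uses the field names from the earlier examples.

```python
def snapshot_at(records, cutoff):
    """Replay I/U/D flags in commit-timestamp order up to `cutoff`
    to reconstruct the state of the table at that point in time."""
    state = {}
    for rec in sorted(records, key=lambda r: r["tran_time_utc"]):
        if rec["tran_time_utc"] > cutoff:
            break
        if rec["tran_flag"] in ("I", "U"):
            state[rec["pk_id"]] = rec        # insert or full-image update
        elif rec["tran_flag"] == "D":
            state.pop(rec["pk_id"], None)    # delete removes the row
    return state

# Records mirroring the insert/update/delete test above.
records = [
    {"tran_flag": "I", "tran_time_utc": "2016-02-16 19:17:40", "pk_id": 5, "rnd_str": "TEST_1"},
    {"tran_flag": "U", "tran_time_utc": "2016-02-16 19:20:11", "pk_id": 5, "rnd_str": "TEST_1_1"},
    {"tran_flag": "D", "tran_time_utc": "2016-02-16 19:25:03", "pk_id": 5, "rnd_str": "TEST_1_1"},
]
print(snapshot_at(records, "2016-02-16 19:21:00")[5]["rnd_str"])  # prints: TEST_1_1
```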