Using the MariaDB CONNECT engine to merge heterogeneous data sources

I was reminded recently of a project I did during a DevDays week at WebAssign in 2014. The engineering department wanted to move a large dataset into MongoDB, and my DevDays experiment was to demonstrate the Tungsten Replicator’s role in translating data from MySQL to MongoDB. It worked handily but was ultimately not implemented because the development teams got bogged down in the Tungsten replication filters desired for data transformation. At All Things Open in 2015, Max Mether presented new features in MariaDB and the Connect engine, introduced in MariaDB 10.0, caught my eye as having similar capabilities to combine heterogenous datasets. I finally got around to giving this feature a whirl over the holiday break. My goal for this test is opposite from that done in my DevDays example; I want to move data back into the RDBMS for querying; specifically, I want to test joining data from a MongoDB installation to a table in my MariaDB installation. I’ll export the MongoDB collection to JSON once a day and run my reports overnight. This is a good use case for the CONNECT engine, which is meant for BI applications rather than transactional workload. I first installed MongoDB on the same server (although the same server is not required), and imported sample data as described in the MongoDB docs. Then I installed the necessary ha_connect.so plugin: The ha_connect.so was not within the package release for Centos7 (and this appears to be a common problem). The easiest way for me to get it was to download the MariaDB binary with glib support and move the ha_connect.so to the plugins folder on the server. But the library was missing:

MariaDB [(none)]> INSTALL SONAME ‘ha_connect.so’;
 ERROR 1126 (HY000): Can’t open shared library ‘/usr/lib64/mysql/plugin/ha_connect.so’ (errno: 2, libodbc.so.1: cannot open shared object file: No such file or directory)
 MariaDB [(none)]> exit

I’ve had that library version error before, often. Here’s a fix:

ln -s libodbc.so.2 libodbc.so.1

Now the installation works:

MariaDB [(none)]> INSTALL SONAME ‘ha_connect.so’;
 Query OK, 0 rows affected (0.01 sec)

(On Ubuntu, it is possible to use “apt-get install libodbc1” and it appears on Centos you can also use “yum install unixODBC-libs” to install the same.) Here’s the daily export using the mongoexport tool:

mongoexport –db test –collection restaurants –out /var/lib/mysql/test/restaurants.json –jsonArray

The jsonArray flag is important, so the output is formatted the way the CONNECT engine is expecting it. Also, place the export within the relevant database directory inside /var/lib/mysql. Snippet of exported file (manually pretty-printed):

{
 “_id” : ObjectId(“580cda5aaf1de8b908ccfe40”),
 “address” : { “building” : “522”, “coord” : [ -73.95171, 40.767461 ], “street” : “East 74 Street”, “zipcode” : “10021” },
 “borough” : “Manhattan”,
 “cuisine” : “American”,
 “grades” : [ { “date” : ISODate(“2014-09-02T00:00:00Z”), “grade” : “A”, “score” : 12 }, { “date” : ISODate(“2013-12-19T00:00:00Z”), “grade” : “B”, “score” : 16 }, { “date” : ISODate(“2013-05-28T00:00:00Z”), “grade” : “A”, “score” : 9 }, { “date” : ISODate(“2012-12-07T00:00:00Z”), “grade” : “A”, “score” : 13 }, { “date” : ISODate(“2012-03-29T00:00:00Z”), “grade” : “A”, “score” : 11 } ],
 “name” : “Glorious Food”,
 “restaurant_id” : “40361521”
 }

I had MariaDB create the table by discovery instead of explicitly defining the fields:

use test;
 
 create table restaurants engine=connect table_type=JSON file_name=‘restaurants.json’ option_list=‘level=1’;
 
 MariaDB [test]> show create table restaurants\G
  Table: restaurants
 Create Table: CREATE TABLE `restaurants` (
  `id$oid` char(24) NOT NULL `FIELD_FORMAT`=’_id:$oid’,
  `address_building` char(9) NOT NULL `FIELD_FORMAT`=‘address:building’,
  `address_coord` varchar(256) DEFAULT NULL `FIELD_FORMAT`=‘address:coord’,
  `address_street` char(38) NOT NULL `FIELD_FORMAT`=‘address:street’,
  `address_zipcode` char(5) NOT NULL `FIELD_FORMAT`=‘address:zipcode’,
  `borough` char(13) NOT NULL,
  `cuisine` char(64) NOT NULL,
  `grades` char(1) DEFAULT NULL,
  `grades_date` varchar(256) DEFAULT NULL `FIELD_FORMAT`=‘grades::date’,
  `grades_grade` char(14) DEFAULT NULL `FIELD_FORMAT`=‘grades::grade’,
  `grades_score` bigint(2) DEFAULT NULL `FIELD_FORMAT`=‘grades::score’,
  `name` char(65) NOT NULL,
  `restaurant_id` char(8) NOT NULL
 ) ENGINE=CONNECT DEFAULT CHARSET=latin1 `TABLE_TYPE`=‘JSON’ `FILE_NAME`=‘restaurants.json’ `OPTION_LIST`=‘level=1’

Here’s the MariaDB table I’ll be joining to:

create table proprietors (
 restaurant_id char(80),
 owner char(128)
 ) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Now I can run my daily report of new inspections recorded in MongoDB, linked to randomly generated owner names:

MariaDB [test]> select proprietors.owner, restaurants.name, grades_grade from proprietors, restaurants where proprietors.restaurant_id=restaurants.restaurant_id and grades_date>‘2015-01-08’ and grades_grade in(‘A’,‘B’);
 +———————+————————————————+————–+
 | owner | name | grades_grade |
 +———————+————————————————+————–+
 | Treva Tullis | Bagels N Buns | A |
 | Maud Morningstar | La Fusta Restaurant | A |
 | Vanessa Vowell | The Greek Corner | A |
 | Teressa Truex | Mcdonald’S | A |
 | Eli Enright | Neron’S Pizza | A |
 | Cruz Cypert | Mcdonald’S | A |
 | Beatrice Bramlett | Sidewalk Bar & Restaurant | A |
 | Roscoe Rapozo | Tortilla Flats | A |
 | Justa Jasik | Veselka Restaurant | A |
 | Eliza Edmundson | Dakota Diner | A |
 | Becky Branson | Arturo’S Pizza | A |
 | Beatrice Bramlett | Areo Restaurant | A |
 | Mariko Moos | Pomodoro Ristorante | A |
 | Chae Champine | Miracali Bakery | A |
 …

Once everything is set up, this is a quick way to merge data from heterogenous data sources, and a reasonable solution for a reporting application. Thank you to Abhinav Gupta for the technical review.

What We Do

Using the MariaDB CONNECT engine to merge heterogeneous data sources

Share this

Share this

More resources

Duplicate indexes in MySQL

How to find unused indexes and drop them safely in MongoDB

What Data Type is Returned by a Mathematical Function?

Ready to unlock value from your data?