- Fetch the schema from all Hive environments and store each schema as a dictionary in a file.
- Create a class to perform comparisons, leveraging Python list, dict, and pandas DataFrame structures.
- Display the results / visualize the changes using a web interface (this approach uses Python Flask).
Step 1 - Fetch the Schema from Each Hive Environment

First, connect to the Hive metastore via Thrift:

```python
# Imports assume the thrift package plus the generated Hive metastore
# bindings (the module name can vary by Hive version).
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from hive_metastore import ThriftHiveMetastore

transport = TSocket.TSocket(host, port)
transport = TTransport.TBufferedTransport(transport)
protocol = TBinaryProtocol.TBinaryProtocol(transport)

global metastore_client
metastore_client = ThriftHiveMetastore.Client(protocol)
transport.open()
```

Once that is done, it's straightforward to get all of the tables in a database (there are lots of other useful methods that have been documented).
```python
db = 'myDatabase'
tables = metastore_client.get_all_tables(db)
```

Once we have the tables, we can call the get_fields() method:
```python
for table in tables:
    for field in metastore_client.get_fields(db, table):
        # field is a named tuple: FieldSchema(comment, type, name)
        print(field)
```

Using the methods above, we retrieve all of the values and store them in a dictionary, saved locally in a text file, one file per environment.
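Putting those calls together, a small dump routine can write one schema file per environment. This is a minimal sketch under the setup above; the dump_schema helper name and the output path are illustrative, not from the original repo:

```python
def dump_schema(metastore_client, out_path):
    """Illustrative helper: walk every database and table in the metastore
    and write the resulting {db: {table: [FieldSchema, ...]}} dict to a file."""
    schema = {}
    for db in metastore_client.get_all_databases():
        schema[db] = {}
        for table in metastore_client.get_all_tables(db):
            schema[db][table] = metastore_client.get_fields(db, table)
    with open(out_path, 'w') as f:
        f.write(repr(schema))  # repr() round-trips via eval() on load

# one file per environment, e.g.:
dump_schema(metastore_client, 'schema_files/stage_schema_thrift.out')
```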
The resulting file contains a dictionary of the form:

```python
{'myDatabase': {
    'tbl_customer': [
        FieldSchema(comment=None, type='int', name='user_id'),
        FieldSchema(comment=None, type='string', name='first_name'),
        FieldSchema(comment=None, type='string', name='last_name')
    ],
    'tbl_orders': [
        FieldSchema(comment=None, type='int', name='order_id'),
        FieldSchema(comment=None, type='string', name='item_id'),
        FieldSchema(comment=None, type='string', name='price')
    ],
    # etc.
}}
```

Note that Thrift gives us a tuple for the DDL, so for ease we declare this named tuple locally:

```python
import collections
FieldSchema = collections.namedtuple('FieldSchema', ['comment', 'type', 'name'])
```
Step 2 - Create Class to Compare Dictionaries

Now that we have a separate schema file for each Hive instance, we can load these files into dictionaries and begin to compare them.
```python
f_stage_dict = open('schema_files/stage_schema_thrift.out')
f_prod_dict = open('schema_files/prod_schema_thrift.out')
d_stage = eval(f_stage_dict.read())
d_prod = eval(f_prod_dict.read())
```

Create a class to compare the dictionaries. Comparisons are done leveraging a variety of nice Python tools (lists, set comparison, pandas DataFrames, etc.). More detail is in the repo.
```python
class DictCompare:
    def get_fields_by_table(self, db, table):
        ...  # returns lists of fields for the table in schemas A and B

    def get_databases(self):
        ...  # union of all dbs in schemas A and B

    def compare_fields(self, db, table):
        ...  # returns a tablediff tuple showing all field differences

    def compare_dbs(self, db):
        # returns list of differences between dbs
        tables_a = [table for table in self.dict_a[db]]
        tables_b = [table for table in self.dict_b[db]]
        db_diffs_a = set(tables_a).difference(set(tables_b))
        db_diffs_b = set(tables_b).difference(set(tables_a))
        ...
```

Instantiate the class by passing the path of the schema files and the friendly names used in the config.json file:

```python
dc = sc.DictCompare(schema_files_path, "dev", "stage")
dbs = dc.get_databases()  # show all of the databases
print("DBS=", dbs)

# get all diffs for a given database
compare_dict = dc.compare_dbs("my_db")
```
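The post mentions pandas DataFrames among the comparison tools; as one illustration (a hedged sketch, not the repo's actual compare_fields implementation), two field lists can be outer-merged in a DataFrame to surface mismatches:

```python
import pandas as pd

def diff_fields(fields_a, fields_b):
    """Illustrative only: outer-merge two FieldSchema lists on the field
    name and keep rows missing on one side or differing in type."""
    df_a = pd.DataFrame(fields_a, columns=FieldSchema._fields)
    df_b = pd.DataFrame(fields_b, columns=FieldSchema._fields)
    merged = df_a.merge(df_b, on='name', how='outer',
                        suffixes=('_a', '_b'), indicator=True)
    return merged[(merged['_merge'] != 'both') |
                  (merged['type_a'] != merged['type_b'])]
```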
Step 3 - Display the Results / Visualize the Changes

I used a quick and simple Python Flask website to display the differences. Note there is a form on the page that allows users to select the Hive instances to compare.
```python
@app.route('/home.html')
@app.route('/home')
def homepage():
    form = EnvForm(csrf_enabled=False)
    env_a = form.env_a.data  # 'stage', for example
    env_b = form.env_b.data  # 'prod', for example
    dc = sc.DictCompare(schema_files_path, env_a, env_b)
    dbs = dc.get_databases()
    return render_template("home.html", dbs=dbs, a_name=dc.a_name,
                           b_name=dc.b_name, form=form)
```
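From there, another route can surface the diffs for whichever database the user selects. A minimal sketch within the same Flask app (the route, template name, and hard-coded environment names are illustrative, not from the original post):

```python
@app.route('/diffs/<db>')
def show_diffs(db):
    # compare the selected database across two environments
    dc = sc.DictCompare(schema_files_path, "stage", "prod")
    compare_dict = dc.compare_dbs(db)
    return render_template("diffs.html", db=db, diffs=compare_dict)
```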
The source code is available (open sourced under the Apache 2.0 license). I look forward to any comments or feedback.
Discover more about our expertise in Big Data and Infrastructure technologies.