When running several different hive instances, we found the process of maintaining/confirming synchronization to be quite time consuming. While several tools made available by Cloudera and other frameworks do provide a visual interface for an individual instance of hive, there was no available tool for comparing across clusters. We came up with a useful way to compare hive schemas between clusters. Our process contains 3 main steps:
- Fetch the schema from all hive environments, store the schema as a dictionary in a file.
- Create a class to perform comparisons leveraging python list, dict and pandas dataframe structures.
- Display the results/visualize the changes using a web interface (this approach uses python Flask).
transport = TSocket.TSocket(host, port) transport = TTransport.TBufferedTransport(transport) protocol = TBinaryProtocol.TBinaryProtocol(transport)
global metastore_client metastore_client = ThriftHiveMetastore.Client(protocol) transport.open()
Once that is done, it's straightforward to get all of the tables in a db (there are lots of other great methods that have been documented).
db = 'myDatabase' tables = metastore_client.get_all_tables(db)
Once we get the tables, we can call the get_fields() method
for table in tables: for field in metastore_client.get_fields(db, table): # field is a named tuple FieldSchema(comment, type, name) print field
Using the methods above, we retrieve all of the values and store them in a dictionary, locally in a text file, one file per environment.
{'myDatabase': { 'tbl_customer': [ FieldSchema(comment=None, type=int, name='user_id'), FieldSchema(comment=None, type='string', name='first_name'), FieldSchema(comment=None, type='string', name='last_name') ], 'tbl_orders': [ FieldSchema(comment=None, type=int, name='order_id'), FieldSchema(comment=None, type='string', name='item_id'), FieldSchema(comment=None, type='string', name='price') ],
etc.. Note that thrift gives us a tuple for the ddl, so for ease, we declare this named tuple locally. FieldSchema = collections.namedtuple(‘FieldSchema’, [‘comment’,’type’, ‘name’])
Step 2 - Create Class to Compare Dictionaries Now that we have a separate schema file for each hive instance, we can load these files into dictionaries and begin to compare them.
f_stage_dict = open('schema_files/stage_schema_thrift.out') f_prod_dict = open('schema_files/prod_schema_thrift.out') d_stage = eval(f_stage_dict.read()) d_prod = eval(f_prod_dict.read())
Create a class to compare the dictionaries. Comparisons are done leveraging a variety of nice python tools (lists, set comparison, pandas dataframes, etc..). More detail in the
repo.
Class DictCompare: def get_fields_by_table(self, db, table):... # returns list of fields for tableA and tableB def get_databases(self):... # union of all dbs in all schemas A,B def compare_fields(self, db, table)... # returns tablediff tuple showing all field differences def compare_dbs(self, db)... #returns list of differences between dbs tables_a = [table for table in self.dict_a[db]] tables_b = [table for table in self.dict_b[db]] db_diffs_a = set(tables_a).difference(set(tables_b)) db_diffs_b = set(tables_b).difference(set(tables_a)) .....
Instantiate the class by passing the path of the schema files, and the friendly names used in the
config.json file. dc = sc.DictCompare(schema_files_path,"dev", "stage" ) dbs = dc.get_databases() # show all of the databases..
print "DBS=",dbs
# get all diffs for a given database
compare_dict = dc.compare_dbs("my_db")
Step 3 Display the Results / Visualize the Changes I used a quick and simple python flask website to display the differences. Note there is a form on the page that allows users to select the hive instances to compare.
@app.route('/home.html') @app.route('/home') def homepage(): form=EnvForm(csrf_enabled=False) env_a = form.env_a.data # 'stage', for example env_b = form.env_b.data # 'prod', for example dc = sc.DictCompare(schema_files_path,env_a, env_b ) dbs = dc.get_databases() return render_template("home.html", dbs=dbs, a_name =dc.a_name, b_name = dc.b_name,form =form )
Source code (open sourced under Apache2 license): I Look forward to any comments or feedback.
Discover more about our expertise in Big Data and Infrastructure technologies.
Share this
Previous story
← Oracle ASM rebalance - Turn it up. To 11?
Next story
High "cursor: pin S wait on X" waits? →
You May Also Like
These Related Stories
HOWTO: Oracle Cross-Platform Migration with Minimal Downtime
HOWTO: Oracle Cross-Platform Migration with Minimal Downtime
Aug 20, 2009
6
min read
Part 1: How to effectively use a performance schema
Part 1: How to effectively use a performance schema
May 30, 2015
12
min read
How to Fix the “Triggers on Memory-Optimized Tables Must Use WITH NATIVE_COMPILATION” Azure SQL Data Sync Error
How to Fix the “Triggers on Memory-Optimized Tables Must Use WITH NATIVE_COMPILATION” Azure SQL Data Sync Error
Jul 21, 2022
1
min read
No Comments Yet
Let us know what you think