- Fetch the schema from all Hive environments and store each schema as a dictionary in a file.
- Create a class to perform comparisons, leveraging Python list, dict, and pandas DataFrame structures.
- Display the results / visualize the changes using a web interface (this approach uses Python Flask).
Step 1 - Fetch the Schema from Each Hive Environment

First, connect to the Hive metastore via Thrift:

```python
# Imports assume the thrift package plus the generated Hive metastore
# bindings (the module name can vary by Hive version).
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from hive_metastore import ThriftHiveMetastore

transport = TSocket.TSocket(host, port)
transport = TTransport.TBufferedTransport(transport)
protocol = TBinaryProtocol.TBinaryProtocol(transport)

global metastore_client
metastore_client = ThriftHiveMetastore.Client(protocol)
transport.open()
```

Once that is done, it's straightforward to get all of the tables in a database (there are lots of other useful methods that have been documented).
```python
db = 'myDatabase'
tables = metastore_client.get_all_tables(db)
```

Once we have the tables, we can call the get_fields() method:
```python
for table in tables:
    for field in metastore_client.get_fields(db, table):
        # field is a named tuple: FieldSchema(comment, type, name)
        print(field)
```

Using the methods above, we retrieve all of the values and store them in a dictionary, saved locally in a text file, one file per environment.
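Putting those calls together, a small dump routine can write one schema file per environment. This is a minimal sketch under the setup above; the dump_schema helper name and the output path are illustrative, not from the original repo:

```python
def dump_schema(metastore_client, out_path):
    """Illustrative helper: walk every database and table in the metastore
    and write the resulting {db: {table: [FieldSchema, ...]}} dict to a file."""
    schema = {}
    for db in metastore_client.get_all_databases():
        schema[db] = {}
        for table in metastore_client.get_all_tables(db):
            schema[db][table] = metastore_client.get_fields(db, table)
    with open(out_path, 'w') as f:
        f.write(repr(schema))  # repr() round-trips via eval() on load

# one file per environment, e.g.:
dump_schema(metastore_client, 'schema_files/stage_schema_thrift.out')
```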
The resulting file contains a dictionary of the form:

```python
{'myDatabase': {
    'tbl_customer': [
        FieldSchema(comment=None, type='int', name='user_id'),
        FieldSchema(comment=None, type='string', name='first_name'),
        FieldSchema(comment=None, type='string', name='last_name')
    ],
    'tbl_orders': [
        FieldSchema(comment=None, type='int', name='order_id'),
        FieldSchema(comment=None, type='string', name='item_id'),
        FieldSchema(comment=None, type='string', name='price')
    ],
    # etc.
}}
```

Note that Thrift gives us a tuple for the DDL, so for ease we declare this named tuple locally:

```python
import collections
FieldSchema = collections.namedtuple('FieldSchema', ['comment', 'type', 'name'])
```
Step 2 - Create Class to Compare Dictionaries

Now that we have a separate schema file for each Hive instance, we can load these files into dictionaries and begin to compare them.
```python
f_stage_dict = open('schema_files/stage_schema_thrift.out')
f_prod_dict = open('schema_files/prod_schema_thrift.out')
d_stage = eval(f_stage_dict.read())
d_prod = eval(f_prod_dict.read())
```

Create a class to compare the dictionaries. Comparisons are done leveraging a variety of nice Python tools (lists, set comparison, pandas DataFrames, etc.). More detail is in the repo.
```python
class DictCompare:
    def get_fields_by_table(self, db, table):
        ...  # returns lists of fields for the table in schemas A and B

    def get_databases(self):
        ...  # union of all dbs in schemas A and B

    def compare_fields(self, db, table):
        ...  # returns a tablediff tuple showing all field differences

    def compare_dbs(self, db):
        # returns list of differences between dbs
        tables_a = [table for table in self.dict_a[db]]
        tables_b = [table for table in self.dict_b[db]]
        db_diffs_a = set(tables_a).difference(set(tables_b))
        db_diffs_b = set(tables_b).difference(set(tables_a))
        ...
```

Instantiate the class by passing the path of the schema files and the friendly names used in the config.json file:

```python
dc = sc.DictCompare(schema_files_path, "dev", "stage")
dbs = dc.get_databases()  # show all of the databases
print("DBS=", dbs)

# get all diffs for a given database
compare_dict = dc.compare_dbs("my_db")
```
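The post mentions pandas DataFrames among the comparison tools; as one illustration (a hedged sketch, not the repo's actual compare_fields implementation), two field lists can be outer-merged in a DataFrame to surface mismatches:

```python
import pandas as pd

def diff_fields(fields_a, fields_b):
    """Illustrative only: outer-merge two FieldSchema lists on the field
    name and keep rows missing on one side or differing in type."""
    df_a = pd.DataFrame(fields_a, columns=FieldSchema._fields)
    df_b = pd.DataFrame(fields_b, columns=FieldSchema._fields)
    merged = df_a.merge(df_b, on='name', how='outer',
                        suffixes=('_a', '_b'), indicator=True)
    return merged[(merged['_merge'] != 'both') |
                  (merged['type_a'] != merged['type_b'])]
```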
Step 3 - Display the Results / Visualize the Changes

I used a quick and simple Python Flask website to display the differences. Note there is a form on the page that allows users to select the Hive instances to compare.
```python
@app.route('/home.html')
@app.route('/home')
def homepage():
    form = EnvForm(csrf_enabled=False)
    env_a = form.env_a.data  # 'stage', for example
    env_b = form.env_b.data  # 'prod', for example
    dc = sc.DictCompare(schema_files_path, env_a, env_b)
    dbs = dc.get_databases()
    return render_template("home.html", dbs=dbs, a_name=dc.a_name,
                           b_name=dc.b_name, form=form)
```
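From there, another route can surface the diffs for whichever database the user selects. A minimal sketch within the same Flask app (the route, template name, and hard-coded environment names are illustrative, not from the original post):

```python
@app.route('/diffs/<db>')
def show_diffs(db):
    # compare the selected database across two environments
    dc = sc.DictCompare(schema_files_path, "stage", "prod")
    compare_dict = dc.compare_dbs(db)
    return render_template("diffs.html", db=db, diffs=compare_dict)
```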
The source code is available (open sourced under the Apache 2.0 license). I look forward to any comments or feedback.
Discover more about our expertise in Big Data and Infrastructure technologies.