Loading data into Azure Data Warehouse with Data Factory - SQL on the edge episode 18

Earlier this year Microsoft released the next generation of its data pipeline product, Azure Data Factory. The first release of Data Factory did not receive widespread adoption because of limitations in scheduling and execution triggers and a lack of pipeline flow control. Microsoft took this feedback to heart and came back with a more feature-rich version that covers a larger percentage of production scenarios and can be a good fit for many projects. With this in mind, I recently had to load about 1TB worth of data into Azure SQL Data Warehouse and thought this was a perfect opportunity to test Data Factory on higher volumes. I ran into a few issues that are worth documenting publicly, so I am sharing them here along with my current workarounds so others can benefit from my experience.
The importance of Polybase
Before we dive into the two specific issues I faced, it's important to touch on Polybase and how it relates to loading data into Azure SQL Data Warehouse (ASDW). There are two ways you can load data into ASDW:
- Through the Control Node: This is the "trickle INSERT" scenario, where you run individual INSERT statements over your connection, use a tool like BCP, or run a regular SSIS load through the SQL driver into ASDW.
- Through Polybase: Polybase is a highly parallel and efficient data loading module inside ASDW. It bypasses the Control Node and loads directly from storage into the different distributions using the compute capacity of the multiple Compute Nodes. It can be used directly through T-SQL or with tools that are Polybase-aware like Azure Data Factory or the Polybase target task in SSIS.
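To make the difference between the two paths concrete, here is a minimal sketch in Python that only builds the T-SQL text each path would send; it does not connect to anything. All object names (`dbo.Sales`, `ext.Sales`, `AzureStorage`, `TextFile`, the columns and the `/sales/` folder) are hypothetical, and the sketch assumes the external data source and external file format have already been created in the warehouse. The round-robin distribution is also just an assumption for illustration.

```python
from typing import Dict, List


def trickle_insert_sql(table: str, columns: List[str]) -> str:
    """Control Node path: one parameterized INSERT, executed once per row."""
    cols = ", ".join(columns)
    params = ", ".join("?" for _ in columns)
    return f"INSERT INTO {table} ({cols}) VALUES ({params});"


def polybase_load_sql(
    target: str,
    external: str,
    columns: Dict[str, str],
    data_source: str,
    file_format: str,
    location: str,
) -> List[str]:
    """Polybase path: expose the files as an external table, then load with CTAS."""
    col_defs = ",\n    ".join(f"{name} {sql_type}" for name, sql_type in columns.items())
    create_external = (
        f"CREATE EXTERNAL TABLE {external} (\n"
        f"    {col_defs}\n"
        f")\n"
        f"WITH (\n"
        f"    LOCATION = '{location}',\n"
        f"    DATA_SOURCE = {data_source},\n"
        f"    FILE_FORMAT = {file_format}\n"
        f");"
    )
    # CTAS reads from storage in parallel on the Compute Nodes.
    # ROUND_ROBIN is an assumed distribution choice, not a recommendation.
    ctas = (
        f"CREATE TABLE {target}\n"
        f"WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)\n"
        f"AS SELECT * FROM {external};"
    )
    return [create_external, ctas]


if __name__ == "__main__":
    # Hypothetical schema and object names, for illustration only.
    print(trickle_insert_sql("dbo.Sales", ["Id", "Amount"]))
    for statement in polybase_load_sql(
        target="dbo.Sales",
        external="ext.Sales",
        columns={"Id": "INT", "Amount": "DECIMAL(18, 2)"},
        data_source="AzureStorage",
        file_format="TextFile",
        location="/sales/",
    ):
        print(statement)
```

The first function produces a statement that has to be executed row by row through the Control Node, while the second produces two statements that are run once and let the Compute Nodes pull the files from storage. Tools that are Polybase-aware issue statements of the second kind on your behalf.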