Tips that Help Alleviate Pressures on DataOps
Data is hard. It's always been hard, and it's not getting easier. We always have more of it, there are more sources to integrate, it changes all the time, the quality is questionable, and the business wants it all right away. Working at this pace requires a sound operational mindset to avoid driving your teams crazy once the business starts using the data. That mindset needs to develop early in every data project so you can keep operational costs at a minimum and, most importantly, enable teams to easily maintain the data going forward. So how do you alleviate the pressures on DataOps teams? It comes down to four key components:
The better way to handle a batch full refresh is to ensure the business can always use what it had before the reload, and only show the new data once it has been successfully refreshed. This way, the DataOps team doesn't have to scramble to restore something usable, and the business keeps most of its functionality, losing only whatever requires the last 24 hours of data.
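A minimal sketch of this load-then-swap pattern, using SQLite and a table named `sales` purely for illustration: the reload happens in a staging table, and readers keep seeing the previous data until the swap succeeds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.execute("INSERT INTO sales VALUES (1, 100.0)")
conn.commit()

def refresh_sales(conn, new_rows):
    """Load into a staging table, then swap it in only on success.
    Readers keep seeing the old `sales` data until the swap."""
    conn.execute("DROP TABLE IF EXISTS sales_staging")
    conn.execute("CREATE TABLE sales_staging (id INTEGER, amount REAL)")
    try:
        conn.executemany("INSERT INTO sales_staging VALUES (?, ?)", new_rows)
        # Swap only after the load succeeded; until here, `sales` is intact.
        conn.execute("ALTER TABLE sales RENAME TO sales_old")
        conn.execute("ALTER TABLE sales_staging RENAME TO sales")
        conn.execute("DROP TABLE sales_old")
        conn.commit()
    except sqlite3.Error:
        conn.rollback()  # failed reload: the business still sees yesterday's data
        raise
```

On a failed load the exception still propagates for alerting, but the `sales` table the business queries is untouched.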
One fairly easy way to implement resiliency is to introduce the concept of auto-retry on failure. The goal is to have the pipeline try to correct itself a set number of times before requiring manual intervention. Oftentimes, something goes wrong during (for example) a file transfer, and simply re-running the transfer resolves the problem. Why wake someone up in the middle of the night when a little extra development effort can handle it?
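A sketch of that auto-retry idea (the function and parameter names here are my own, not from any particular orchestrator): retry with exponential backoff, and only let the final failure escape to page a human.

```python
import time

def with_retries(task, max_attempts=3, base_delay=1.0):
    """Run `task`, retrying with exponential backoff before escalating."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries: now it's worth waking someone up
            # back off: base_delay, 2x, 4x, ... before the next attempt
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Most schedulers offer a built-in equivalent; the point is to configure it deliberately instead of defaulting to a page on the first transient failure.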
The above scenarios are not new. I've seen them in the early days of databases, in data warehouses, and now in modern data engineering platforms. The old adage still applies: you either learn from history, or you're doomed to repeat it.
- Proper alerting hygiene
- Client visibility
Resiliency

One of the most important things you can do is ensure that what you create is built with resiliency in mind. I'm not talking about infrastructure redundancy or auto-scaling, but rather the end data product the business is using. In other words, the data should always be in a usable state. You might not always have the latest data, but what you do have is complete and accurate. One typical example is a traditional daily batch full refresh of a data source. I don't know how many times I've seen this scenario:
| Job Steps | Business Impact |
| --- | --- |
| Truncate the target data set. | No data available until the load is finished. Completely unusable. |
| Bulk load the data (can take minutes to hours). | No data available until the load is finished. Completely unusable. |
| Outcome: Success. | Usable data again. |
| Outcome: Error. | Empty or inconsistent data set. |
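The job steps above can be sketched in a few lines (SQLite stands in for the warehouse, and the `sales` table is illustrative) to show why the window is so dangerous:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.execute("INSERT INTO sales VALUES (1, 100.0)")
conn.commit()

def full_refresh_fragile(conn, new_rows):
    """Truncate-then-load: the table is empty until the load finishes,
    and a mid-load failure leaves it that way."""
    conn.execute("DELETE FROM sales")  # step 1: truncate the target
    conn.commit()                      # the old data is gone immediately
    conn.executemany("INSERT INTO sales VALUES (?, ?)", new_rows)
    conn.commit()                      # only now is the data usable again
```

From the truncate until the final commit, every business query comes back empty, and an error anywhere in the load strands the table in that state.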