We took two data experts and asked one to argue that data should be processed in the data warehouse using SQL, and the other to argue that it should be processed in the data lake using Spark, which is available as a managed service from all three major public cloud vendors (Azure Databricks, Google Cloud Dataproc, and Amazon EMR). Here's a summary of their arguments:
| | Processing data in the data lake via Spark | Processing data in the data warehouse via SQL |
|---|---|---|
| Flexibility | Processing done in the data lake brings additional flexibility; outputs can be used not just for data served in the data warehouse, but also for data delivered to or consumed by other users and systems. | Outputs of data processing are typically restricted to use in the data warehouse. |
| Developer Productivity | Once trained, developers typically appreciate the power and flexibility of Spark, with its sophisticated testing frameworks and libraries that help accelerate code delivery. | While not a general-purpose programming language, SQL's popularity means that finding people who know it is relatively easy, which can mean a faster time to value. |
| Data Governance | Processing data as close to the source as possible supports more consistent use of transformed data across different sinks, and reduces the risk of multiple people transforming and defining data differently. | Processing data in the data warehouse can support a data governance program, but if processing is also done in the data lake, conflicting data definitions may emerge. |
| Cross-Platform Portability | Spark produces fully portable code independent of the cloud vendor. That means changing from one data warehouse to another is easier: no migration is required, testing is minimal, and transformations don't need changing. | Transformations written in ANSI SQL are supported by all the major cloud providers' data warehouse offerings, and are portable provided no cloud vendor-specific add-ons have been used. Work, however, will be required to migrate and test code. |
| Performance | No amount of processing will impact data warehouse users when processing is done outside the data warehouse. | Most modern cloud data warehouses provide great out-of-the-box performance, but some may suffer as processing load increases. |
| Speed of Processing | Real-time analytics is always possible. | Real-time analytics is possible in some cloud data warehouses, but involves multiple steps and/or products. |
| Cost | It's typically less expensive to do processing in the data lake, BUT we strongly encourage data platform designers to research and understand their cloud consumption pricing, as we've seen instances where the opposite is true. | Depending on the data warehouse selected and the associated commercial terms, processing in the data warehouse can be expensive. |
| Reusability | Reusable functions and modules are readily available in Spark. All processing jobs are available not just to deliver processed data to the data warehouse, but also to other destinations (an increasingly popular use of cloud data platforms). | Where cloud data warehouses support them, stored procedures and functions can provide reusable code. |
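The portability point in the table can be made concrete. The sketch below runs a simple aggregation against SQLite, used here only as a convenient stand-in engine; because the statement sticks to ANSI SQL and avoids vendor-specific functions, the same query text would run unchanged on any major cloud warehouse. The table and column names are illustrative, not from any particular platform.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a cloud data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 75.0)],
)

# ANSI-SQL aggregation: no vendor-specific add-ons, so this exact
# statement is portable across warehouse engines.
rows = conn.execute(
    "SELECT region, SUM(amount) AS total "
    "FROM orders GROUP BY region ORDER BY region"
).fetchall()

print(rows)  # [('east', 150.0), ('west', 75.0)]
```

The migration work the table mentions would then be limited to moving the data and re-testing, not rewriting the transforms themselves.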
Creating, scaling and maintaining a modern data platform for the first time is a significant change for any organization. There will certainly be times when the need to minimize change – such as using the more popular SQL for processing – may outweigh the prevailing wisdom of using newer frameworks like Spark. SQL talent is usually readily available, and is likely already in your organization. It's also likely to get you to a solution fairly quickly, and for small platform solutions it might be fine for the mid- to long term.

In most cases, though, as your platform scales, the challenges of using SQL for processing in the data warehouse will continue to mount and at some point will likely become untenable. In our experience, it's certainly not the best route if you want to take full advantage of the flexibility of a cloud data platform. Most enterprises (especially those with large amounts of incoming data and varied data types) will want to do processing outside of the data warehouse, or possibly even consider blended solutions. For example, when the first use case for a data platform is to replace a traditional data warehouse, repurposing existing SQL code can be faster than redoing transforms in Spark. Once the data warehouse has been migrated, the transform jobs can be moved to the data lake in earnest.

No matter which route you take, Pythian experts can help you plan and architect the best enterprise data platform solution for your specific needs and budget. We've guided hundreds of organizations across dozens of business verticals to take their first steps toward a modern, efficient, scalable data platform. Want to discuss this topic with a technical expert? Schedule a tech call with our team to get the conversation started.
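The blended path described above – keep the SQL transforms at first, then move them to the lake later – is easier when each transform is wrapped as a named, reusable function, which is also the pattern Spark encourages. A minimal sketch in plain Python (not Spark itself; the roughly equivalent PySpark call is shown only as an illustrative comment, and all names here are made up for the example):

```python
from collections import defaultdict

def total_by_region(rows):
    """Reusable transform: sum order amounts per region.

    Roughly equivalent PySpark (illustrative only):
        df.groupBy("region").sum("amount")
    """
    totals = defaultdict(float)
    for region, amount in rows:
        totals[region] += amount
    return dict(totals)

orders = [("east", 100.0), ("east", 50.0), ("west", 75.0)]
print(total_by_region(orders))  # {'east': 150.0, 'west': 75.0}
```

Because the business logic lives in one named unit with a clear input and output, porting it later – whether from a stored procedure to a Spark job or between engines – means reimplementing and re-testing one function at a time rather than untangling a monolithic script.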