Building an EDP: Best Practices and Considerations: Part 2
When architecting a data platform that will be utilized across an enterprise, it is important to design with scale, operational efficiency, and platform growth in mind from the beginning. Think big picture during the platform’s inception, define a clear vision of the target state, and ideate how it will add value to consumers across the enterprise. At this stage, it’s not necessary to solve the technical challenges you may encounter with the target state; instead, outline some key principles and design tenets that will guide teams as the platform grows.
Once you establish the foundation of your platform and users onboard new use cases, it becomes challenging to pivot from that foundation without causing significant knock-on effects across the platform. Therefore, it is crucial that the platform design aligns with best practices from the beginning and that design considerations factor in the overall platform vision, rather than accommodating only the initial use cases.
The following are some best practices and design considerations we have learned through our experiences helping customers architect and implement large-scale data platforms:
Ensure security from the start
The data platform must begin and end with security. From ingestion of the raw, potentially sensitive data to delivery of cleaned reporting data to an end user, processes must be in place to ensure the data is secured and delivered to only the correct people. In addition, there must be the ability to audit that the processes are correctly secured and to detect any potential breaches.
Google Cloud offers a wide range of capabilities to ensure this is possible, including the following areas:
- Google Cloud organization structure and policy
- Authentication and authorization
- Resource hierarchy and deployment
- Networking (segmentation and security)
- Row- and column-level security in data tables
- Key and secret management
- ML-powered threat and risk detection
- VPC Service Controls
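As one concrete illustration of row-level security, BigQuery supports row access policies that restrict which rows a given principal can read. The sketch below simply builds the DDL for such a policy; the table, group, and filter expression are hypothetical, and in practice you would submit the statement through the BigQuery client or console.

```python
def row_access_policy_ddl(policy_name: str, table: str,
                          grantee: str, filter_expr: str) -> str:
    """Build BigQuery DDL for a row access policy.

    Principals matching `grantee` will only see rows where
    `filter_expr` evaluates to TRUE.
    """
    return (
        f"CREATE ROW ACCESS POLICY {policy_name} "
        f"ON `{table}` "
        f'GRANT TO ("{grantee}") '
        f"FILTER USING ({filter_expr})"
    )

# Hypothetical example: EU analysts may only read EU rows.
ddl = row_access_policy_ddl(
    policy_name="eu_only",
    table="analytics.sales.orders",
    grantee="group:eu-analysts@example.com",
    filter_expr='region = "EU"',
)
print(ddl)
```

Generating security DDL from code like this (rather than writing it ad hoc) also makes the policies easy to review, version, and audit, which ties directly into the auditability requirement above.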
Manage configuration programmatically using automation
Programmatic configuration management is a systems engineering process that tracks and monitors changes to a software system's configuration metadata. Organizations commonly use configuration management in software development alongside version control and CI/CD infrastructure. Configuration management helps teams build robust and stable systems by using tools that automatically manage and monitor updates to configuration data.
For a data platform, several areas should be considered for programmatic configuration management. The most critical of these are the pipelines themselves. When a data pipeline template is available to extract, stage, clean, and land the data in a secure, repeatable method, the configuration of any given pipeline can be trusted by the business with fewer concerns.
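To make this concrete, a pipeline can be declared as data and validated before deployment, so every pipeline instantiated from the template is checked the same way. The sketch below is a minimal, hypothetical example: the configuration schema, key names, and transform names are illustrative, not a real product's format.

```python
import json

# A hypothetical declarative pipeline configuration: the pipeline is
# described as data, kept in version control, and validated in CI
# rather than assembled by hand.
PIPELINE_CONFIG = {
    "name": "orders_daily",
    "source": {"type": "gcs", "uri": "gs://raw-zone/orders/*.csv"},
    "transforms": ["deduplicate", "mask_pii"],
    "destination": {"type": "bigquery", "table": "warehouse.orders"},
}

REQUIRED_KEYS = {"name", "source", "transforms", "destination"}

def validate(config: dict) -> list:
    """Return a list of validation errors (empty if the config is valid)."""
    errors = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - config.keys())]
    if not isinstance(config.get("transforms", []), list):
        errors.append("transforms must be a list")
    return errors

errors = validate(PIPELINE_CONFIG)
print(json.dumps({"valid": not errors, "errors": errors}))
```

Because the configuration is plain data, the same validation runs in code review, in CI, and at deploy time, which is what lets the business trust any pipeline stamped out from the template.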
Leverage managed services whenever possible
There are many reasons to use Google Cloud managed services. Often, the engineering investment required to build a comparable system in-house is far beyond the reach of an IT team. For example, consider the effort required to build a custom equivalent of BigQuery or Dataflow.
Generally, unless the business requires a vital feature that a managed service cannot provide, Google Cloud managed services should be the first choice for any data platform.
Some of the more critical items to consider when choosing between a native cloud-managed service and a third-party application include:
- License fees
- Pricing model
- Cost to run on Google Cloud
- Future optimization and upgrade effort
- Existing investments in open-source or third-party technologies
- Ability to interact with other managed services
- Ease of deployment
- Availability of technical support resources within your organization, Google, and the wider community
Enable data lineage and traceability
The ability of your organization to understand the source and destination of your data, along with a complete picture of any transformations performed along the way, is extremely valuable.
As a data platform grows, the lineage becomes more complex, and representing it visually gives analysts and executives a way to see exactly how the data is made. In addition, it creates a foundation of trust by showing the willingness to surface the logic and transformations.
As data lineage can be a challenging problem, it is often the best approach to use a specialist tool such as Dataform, Data Fusion, Collibra, or Informatica.
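At its core, table-level lineage is a directed graph from sources to derived datasets. The toy sketch below shows the idea with hypothetical table names; the specialist tools named above capture and visualize this automatically, at much finer (column- and transformation-level) granularity.

```python
# A minimal sketch of table-level lineage as a directed graph.
# Keys are derived tables; values are the tables they read from.
# All table names are hypothetical.
LINEAGE = {
    "reporting.daily_sales": ["staging.orders", "staging.customers"],
    "staging.orders": ["raw.orders_export"],
    "staging.customers": ["raw.crm_dump"],
}

def upstream_sources(table: str) -> set:
    """Walk the lineage graph to find every dataset feeding a table."""
    sources = set()
    for parent in LINEAGE.get(table, []):
        sources.add(parent)
        sources |= upstream_sources(parent)
    return sources

print(sorted(upstream_sources("reporting.daily_sales")))
```

Being able to answer "what feeds this report?" with a traversal like this is exactly the trust-building capability described above: the logic and data flow are surfaced rather than hidden.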
Design for scalability, not only scale
As your platform gains enterprise adoption and onboards new use cases, it is imperative that each component of the architecture can scale linearly with the increased volume.
In a traditional on-premises platform, you define estimated volume, then provision computing and storage for this scale. However, with a cloud-based data platform, you should instead have a small initial footprint and a modular-based architecture with components that can scale independently in real-time, based on demand.
This enables the platform to be performant, cost-effective, and scale seamlessly as your platform and data volumes grow. With such an approach, you minimize run costs while the platform is in the early stages of maturity and can experiment with new features and capabilities without requiring significant capital expenditure.
Enhance, enrich, and innovate with ML
Establishing an EDP is an excellent way to consolidate, organize, and control your organization’s data. The next step is using this data to unlock business value and drive insights.
Traditional data analysis continues to be an excellent approach; however, leading organizations are taking this to the next level by applying ML against their data.
By applying ML to your data on Google Cloud, you can quickly and cost-effectively enable use cases such as personalizing customer experiences, extracting underlying trends in customer buying patterns, or building advanced models that proactively predict machine failure.
Several approaches exist to integrate and implement ML in your Enterprise Data Platform. The following are a few examples of how you can put this into practice:
- ML data enrichment
- Data enrichment is the process of augmenting, appending, and cleansing your initial dataset using additional or possibly third-party data to provide a more comprehensive, transformed dataset. ML data enrichment is the process of running inferences against ML models on your existing data to create new ML features. An example could be applying an Autoregression (AR) model against time series data with BigQuery ML to generate forecasting data.
- ML Ops workflows
- MLOps is an ML engineering culture and practice that aims to unify ML system development (Dev) and ML system operation (Ops) in ML workflows. In the context of an EDP, MLOps pipelines can automatically trigger ML workflows to train or retrain models based on new data in your data platform. This enables you to continuously integrate real-time or daily data updates into your ML training process, improving model accuracy and leading to better customer outcomes and experiences.
- Data scientist experimentation
- The data hosted in your platform is the key to unlocking innovation and identifying untapped business opportunities that can drive significant revenue growth. Once you have a consolidated view of your data and have implemented role-based access control to ensure security, you can securely enable data scientists to explore and experiment with these rich datasets.
In traditional data architectures, data scientists are usually limited to small silos of data on which to train models, limiting the potential of their insights. Giving them access to data from across your enterprise means tackling more use cases and unlocking more powerful insights.
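As an illustration of the ML data enrichment example above, BigQuery ML trains time-series forecasting models with an ordinary `CREATE MODEL` statement using the `ARIMA_PLUS` model type (its autoregressive forecasting model). The sketch below only assembles that SQL; the model name, source table, and column names are hypothetical, and the statement would be run via the BigQuery client or console.

```python
def create_forecast_model_sql(model: str, source_table: str,
                              ts_col: str, value_col: str) -> str:
    """Build a BigQuery ML statement that trains a time-series
    forecasting model (model type ARIMA_PLUS)."""
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        "OPTIONS(\n"
        "  model_type = 'ARIMA_PLUS',\n"
        f"  time_series_timestamp_col = '{ts_col}',\n"
        f"  time_series_data_col = '{value_col}'\n"
        ") AS\n"
        f"SELECT {ts_col}, {value_col} FROM `{source_table}`"
    )

sql = create_forecast_model_sql(
    model="ml.sales_forecast",             # hypothetical model name
    source_table="warehouse.daily_sales",  # hypothetical source table
    ts_col="sale_date",
    value_col="revenue",
)
print(sql)
```

Once trained, the model's forecasts can be joined back onto the original dataset, turning predictions into new ML features for downstream consumers.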
Allow for platform evolution
To make the most of this new data platform, new system and process tie-ins are required to integrate new data sources and users. In addition, the method of sending and receiving the data may change dramatically as the platform is opened to a wider audience. The initial architecture for the platform does not need to consider every unknown and potential use case. Still, it should be built to allow for new features and functionality to be more easily added. For example, as Google deploys new solutions, such as the Google Cloud Cortex Framework to enable SAP workload migrations, a flexibly designed platform can accommodate this new data without any major rework.
Some options to consider when designing the data platform include:
- Use a modular architecture based on microservices
- Use API calls for all communication
- Use a queuing mechanism between processing stages
- Allow for flexible data access at logical processing points
- Avoid hardcoding or strong dependencies whenever possible
- Align with industry norms when selecting technologies or formats (e.g., Kubernetes, Parquet)
- Expect new technologies to emerge and embrace them
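The "queuing mechanism between processing stages" point is worth a small illustration. In the toy sketch below, each stage knows only its input and output queues, so a stage can be replaced, scaled, or have new consumers attached without touching its neighbors. This uses Python's in-process `queue` purely for demonstration; in a real platform, a managed service such as Pub/Sub plays this role.

```python
import queue
import threading

# Two queues decouple the stages: the producer and the cleaning stage
# share no state beyond the queue contracts between them.
raw, cleaned = queue.Queue(), queue.Queue()

def clean_stage():
    """A processing stage: read raw records, emit normalized ones."""
    while True:
        record = raw.get()
        if record is None:       # sentinel: shut the stage down
            cleaned.put(None)
            return
        cleaned.put(record.strip().lower())

worker = threading.Thread(target=clean_stage)
worker.start()

for rec in ["  Alice ", "BOB", None]:
    raw.put(rec)
worker.join()

results = []
while (item := cleaned.get()) is not None:
    results.append(item)
print(results)  # ['alice', 'bob']
```

Because the only coupling is the message format on the queue, a new destination (say, an additional analytics sink) can subscribe to the same stream later without any rework of the upstream stages, which is exactly the flexibility this section argues for.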
For further reading on general best practices and design principles in designing applications in Google Cloud, refer to the Google Cloud Architecture Framework.
Building a data platform to serve the modern enterprise is a challenging but rewarding endeavor, given the significant business value it can unlock. Define a vision, establish a set of platform tenets aligned with best practices, then incrementally add capabilities as you integrate data sources across the enterprise.
Leveraging the broad capabilities of a cloud provider such as Google Cloud can rapidly accelerate such initiatives. Using managed services such as Google Cloud Storage, Dataflow, BigQuery, and Looker, you can quickly and cost-effectively develop end-to-end data workflows taking you from raw data to business insights.
Continue reading this blog series by reading part three, which offers an overview of Pythian’s EDP QuickStart.