Pythian’s QuickStart Solution Step-by-Step Overview: Part 3
Pythian’s EDP QuickStart for Google Cloud couples a modern, integrated, cloud-native analytics platform based on Google BigQuery with the professional services required to customize it to your needs. Pythian turns your data into insights by leveraging your data sources and modern business intelligence (BI) tools like Looker or Tableau.
At a high level, data processing is handled within two loops.
Initially, any data entering the system is processed by the ingestion engine. Depending on the file type, it may require preprocessing, in which case it is routed to the preprocessing tools. This loop repeats as many times as necessary: a file may need to be preprocessed by multiple scripts before reaching its final state, so it may complete the loop several times. Generally, we want to limit this looping as much as possible, so when we see the same preprocessing loops performed repeatedly, we introduce optimized preprocessing scripts that consolidate them.
This ingestion loop also provides the option of keeping the raw, highly sensitive source data in a separate project that is highly restricted and unavailable to anyone without administrator access.
Eventually, the file will be ready for the processing engine, which groups files into batches to process as much data as possible with each action. The batches are sent to the respective Data Processing Engine(s), such as Dataflow.
The data processing engine performs the required actions, and may write the file to the archive bucket for loading into BigQuery. If further actions are required on the data, the next action will also be queued up, and the loop will continue.
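The two loops described above can be sketched in plain Python. Everything here — the rule table, the step names, the string transformations — is a hypothetical stand-in, since in the real system these routing decisions live in the orchestration layer:

```python
# Hypothetical preprocessing rules per file type. An empty list means the
# file is already in its final state and skips the ingestion loop.
PREPROCESS_STEPS = {
    "xml": ["strip_namespaces", "convert_to_csv"],  # two trips through the loop
    "csv": [],                                      # ready for batching as-is
}

def ingest(filename: str, file_type: str) -> str:
    """Ingestion loop: apply preprocessing steps until the file is final."""
    steps = list(PREPROCESS_STEPS.get(file_type, []))
    while steps:  # each iteration models one trip through the ingestion loop
        step = steps.pop(0)
        filename = f"{filename}.{step}"  # stand-in for the real transformation
    return filename

def process(batch: list) -> list:
    """Processing loop: run the engine action for a batch of final files."""
    return [f"processed:{name}" for name in batch]
```

In this sketch, an `xml` file loops twice before entering the processing engine, while a `csv` file goes straight to batching — mirroring the goal of limiting loop iterations wherever possible.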
Step 1: The ingestion bucket where raw files are stored. The client is responsible for delivering the files to this bucket.
Step 1b: When a file lands in the bucket, a trigger publishes its details to a Pub/Sub topic. This topic acts as a staging area for the “New File” message in case the cloud function in Step 2 is unavailable for some reason.
Step 2: A cloud function is executed for every “New File” message in the staging topic. The cloud function performs simple validations of the file (for example, is this a supported format?) and extracts the domain, task group, and file type.
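A minimal sketch of the Step 2 validation might look like the following. The object naming convention `<domain>__<task_group>__<name>.<ext>` is purely an assumption for illustration — the real convention is whatever the client and Pythian agree on during onboarding:

```python
# Assumed set of supported formats; the real list is project-specific.
SUPPORTED_FORMATS = {"csv", "json", "avro", "parquet", "xml"}

def validate_new_file(object_name: str) -> dict:
    """Validate a new object and extract domain, task group, and file type.

    Assumes the hypothetical convention "<domain>__<task_group>__<name>.<ext>".
    """
    stem, _, ext = object_name.rpartition(".")
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {ext!r}")
    parts = stem.split("__")
    if len(parts) != 3:
        raise ValueError(f"unexpected object name: {object_name!r}")
    domain, task_group, _name = parts
    return {"domain": domain, "task_group": task_group, "file_type": ext}
```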
Step 2a: The cloud function queries the orchestration layer to detect if any preprocessing of the file is required.
Step 2b: If so, a message is written to the Preprocessing Router Pub/Sub topic.
Step 2c: The preprocessing router topic has an autoscaling Managed Instance Group (MIG) subscribed to it. A compute instance in the MIG will read the file and process it.
After preprocessing, the MIG will write new, processed file(s) to the ingestion bucket, which will start the loop again. Some files may require multiple rounds of preprocessing, but we want to limit this as much as possible.
Eventually, the file will have completed all preprocessing rounds, and the final version will be written to the ingestion bucket. This starts the process over again in Step 1.
Step 3: If the file does not require preprocessing, the cloud function will route the file to the data-batch Pub/Sub topic.
Step 4: A cloud function is executed, which:
- Pulls any new messages from the data-batch Pub/Sub topic.
- Queries the Orchestration Layer to identify if and how the files can be batched together (Step 4a)
- Updates the Orchestration Layer with the batch details such as filenames (Step 4a)
- Writes a message containing the job metadata to the batch-processing Pub/Sub topic
- An attribute of this message will be the Processing Engine
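The batching step can be sketched as grouping pending messages by a shared key and stamping each batch with its engine. Both the grouping key (domain + file type) and the engine mapping below are assumptions for illustration; in the real system the Orchestration Layer decides both:

```python
from collections import defaultdict

def pick_engine(file_type: str) -> str:
    # Hypothetical mapping; in practice the engine is set per job definition.
    return {"csv": "bigquery", "avro": "dataflow"}.get(file_type, "spark")

def build_batches(messages: list) -> list:
    """Group file messages into batches keyed by (domain, file_type)."""
    groups = defaultdict(list)
    for msg in messages:
        groups[(msg["domain"], msg["file_type"])].append(msg["filename"])
    return [
        {"domain": d, "file_type": t, "filenames": names, "engine": pick_engine(t)}
        for (d, t), names in groups.items()
    ]
```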
Step 5: At this point, the batch-processing Pub/Sub topic contains a message for each batch of data that should be processed.
Step 5a: The batch-processing-router Pub/Sub topic has three cloud functions subscribed to it.
- spark-batch: Run any batches against serverless Spark
- dataflow-batch: Run any batches against Dataflow
- bigquery-batch: Run any batches against BigQuery
Each cloud function subscribes to the Pub/Sub topic with a filter on a custom attribute, so it receives only the messages meant for it.
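Locally, this attribute-based fan-out can be modeled as a simple dispatch table. In Pub/Sub itself the filtering is done with subscription filters on message attributes, so each function never even receives the other engines' messages; the handler bodies below are hypothetical placeholders:

```python
# One handler per engine, mirroring the three subscribed cloud functions.
HANDLERS = {
    "spark":    lambda m: f"spark-batch handled {m['batch_id']}",
    "dataflow": lambda m: f"dataflow-batch handled {m['batch_id']}",
    "bigquery": lambda m: f"bigquery-batch handled {m['batch_id']}",
}

def dispatch(message: dict) -> str:
    """Route a batch message to its engine's handler via a custom attribute."""
    engine = message["attributes"]["engine"]
    return HANDLERS[engine](message)
```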
Step 5b: The cloud function executes an Argo workflow running on a GKE cluster. In addition, the workflow updates the orchestration layer with the status when the job is complete.
Step 6a/b/c: The data processing engine batch jobs are started by the Argo Workflows described in Step 5.
When a job is defined, the appropriate engine is associated with it. For example, if data needs to be aggregated after it has been loaded to BigQuery, then the engine will be set as Dataform or dbt. If a client needs to send data to a third-party API, and Dataflow is the best choice for that action, then the engine will be set to Dataflow.
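A job definition can be pictured as a small lookup table that binds each action to its engine at definition time. The job names and mappings below are illustrative only:

```python
# Hypothetical job definitions; the engine is associated with each action
# when the job is defined, not chosen at runtime.
JOB_DEFINITIONS = {
    "aggregate_daily_sales": {"engine": "dataform"},  # in-warehouse SQL transform
    "push_to_partner_api":   {"engine": "dataflow"},  # IO-heavy external delivery
    "sessionize_events":     {"engine": "spark"},     # general-purpose batch work
}

def engine_for(job_name: str) -> str:
    """Look up the engine bound to a job at definition time."""
    return JOB_DEFINITIONS[job_name]["engine"]
```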
Step 6d: When the data processing engine completes the batch, it writes new Avro/Parquet files to a data lake bucket, if necessary.
Step 7: The Argo workflow publishes the job metadata to the processed-data-router topic with an attribute of “bigquery_load.”
Step 7a: For any messages with a custom attribute of “bigquery_load,” the bigquery-load-processing-trigger cloud function will start. This cloud function will:
- Define a BigQuery Load Job
- Call an Argo workflow which runs and monitors the load job
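The shape of the load-job definition might look like the sketch below. With the real google-cloud-bigquery client this would be a `LoadJobConfig`; here it is modeled as a plain dict so the sketch stays self-contained, and the bucket name, table field, and format rule are all assumptions:

```python
def define_load_job(batch: dict) -> dict:
    """Build a BigQuery-load-style job definition from a batch message.

    Field names echo real BigQuery load options, but the values here
    (bucket path, destination naming, format rule) are hypothetical.
    """
    return {
        "source_uris": [f"gs://data-lake/{name}" for name in batch["filenames"]],
        "destination": f"{batch['domain']}.{batch['table']}",
        "source_format": "AVRO" if batch["file_type"] == "avro" else "PARQUET",
        "write_disposition": "WRITE_APPEND",  # append each batch to the table
    }
```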
Step 7b: The cloud function starts an Argo workflow. This workflow:
- Monitors the load job
- Updates the data lake file metadata attributes on the Data Lake bucket after the load job completes. These should include:
- status [loaded/failed/unknown]
- Load timestamp
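That metadata update can be sketched as follows. Real code would patch the object's custom metadata through the Cloud Storage API; this version models the metadata as a plain dict to stay self-contained:

```python
from datetime import datetime, timezone

def update_lake_metadata(blob_metadata: dict, load_state: str) -> dict:
    """Record the load status and timestamp on a data lake object's metadata."""
    if load_state not in {"loaded", "failed", "unknown"}:
        raise ValueError(f"unexpected load state: {load_state!r}")
    updated = dict(blob_metadata)  # copy rather than mutate the caller's dict
    updated["status"] = load_state
    updated["load_timestamp"] = datetime.now(timezone.utc).isoformat()
    return updated
```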
For any messages without an attribute of “bigquery_load,” the next-stage-processing-trigger cloud function will start. This cloud function will:
- Query the Orchestration Layer for the next action that needs to be run
- If a next action exists, it will create a batch for that action.
- Then it will publish the metadata to the batch-processing Pub/Sub topic. This is Step 5, and the processing engine will execute this new job.
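The next-stage trigger closes the processing loop, and can be sketched as a lookup against a chain of actions. The chain below is hypothetical; in the real system the next action comes from querying the Orchestration Layer:

```python
# Hypothetical action chain; None marks the end of the pipeline for a batch.
ACTION_CHAIN = {"load": "aggregate", "aggregate": "export", "export": None}

def next_action(job: dict):
    """Return a batch message for the next action, or None if the chain is done."""
    following = ACTION_CHAIN.get(job["action"])
    if following is None:
        return None  # pipeline complete for this batch of files
    # This message would be published back to the batch-processing topic (Step 5).
    return {"action": following, "filenames": job["filenames"]}
```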
Pythian’s EDP QuickStart Components
Now that you have the lay of the land, you might be curious about the components that make up Pythian’s EDP QuickStart. If so, continue to part four of this blog series.