Google Cloud Dataproc in ETL pipeline - part 1 (logging)

Google Cloud Dataproc, now generally available, provides access to fully managed Hadoop and Apache Spark clusters, and leverages open source data tools for querying, batch/stream processing, and at-scale machine learning. For more technical information on the specifics of the platform, refer to Google's original blog post and the product home page.
Having access to fully managed Hadoop/Spark technology and the powerful Machine Learning Library (MLlib) as part of Google Cloud Platform makes perfect sense: it allows you to reuse existing code and helps many overcome the fear of being "locked into" a single vendor while taking a step into big data processing in the cloud. That said, I would still recommend evaluating Google Cloud Dataflow first when implementing new projects and processes, for its efficiency, simplicity, and semantic-rich analytics capabilities, especially around stream processing.
When Cloud Dataproc was first released to the public, it received positive reviews. Many blogs were written on the subject, with a few taking it through some "tough" challenges of its promise to deliver cluster startup in "less than 90 seconds". In general, the product was well received, with the overall consensus that it is well positioned against the AWS EMR offering.
Being able to start a Spark cluster in a matter of minutes, without any knowledge of the Hadoop ecosystem, and having access to a powerful interactive shell such as Jupyter or Zeppelin is no doubt a Data Scientist's dream. But with extremely fast startup/shutdown, by-the-minute billing, and a widely adopted technology stack, Cloud Dataproc also appears to be a perfect candidate for a processing block in bigger ETL pipelines. Orchestration, workflow engines, and logging are all crucial aspects of such solutions, and I am planning to publish a few blog entries as I evaluate each of these areas, starting with logging in this post.
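To make the "processing block" idea concrete, here is a minimal sketch of the create/submit/delete cycle an ETL orchestrator might drive. The cluster name, region, and job path are hypothetical, and the exact gcloud flags may differ between SDK versions; the helper simply composes the command lines rather than asserting this is the canonical invocation:

```python
import shlex

def dataproc_commands(cluster, region, job_file):
    """Compose gcloud commands for an ephemeral create/submit/delete cycle.

    cluster, region and job_file are placeholders supplied by the caller;
    flags shown here may vary with the gcloud SDK version in use.
    """
    create = f"gcloud dataproc clusters create {cluster} --region {region}"
    submit = (f"gcloud dataproc jobs submit pyspark {job_file} "
              f"--cluster {cluster} --region {region}")
    delete = f"gcloud dataproc clusters delete {cluster} --region {region} --quiet"
    # Return argv lists, ready for subprocess.run() in a real pipeline step.
    return [shlex.split(cmd) for cmd in (create, submit, delete)]

# Hypothetical values for illustration only.
for argv in dataproc_commands("etl-cluster-1", "us-central1",
                              "gs://my-bucket/jobs/transform.py"):
    print(" ".join(argv))
```

Because the cluster is created and deleted around a single job, you pay only for the minutes the processing block actually runs.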
Cloud Dataproc Logging
The cluster's system and daemon logs are accessible through the cluster UIs as well as by SSH-ing into the cluster, but there is a much better way to do this. By default, these logs are also pushed to Google Cloud Logging, consolidating all logs in one place with a flexible Log Viewer UI and filtering. You can even create custom log-based metrics and use them for baselining and/or alerting purposes. All cluster logs are aggregated under the "dataproc-hadoop" tag, but the "structPayload.filename" field can be used as a filter for a specific log file.

In addition to relying on the Logs Viewer UI, you can export specific log messages to Cloud Storage or BigQuery for analysis. To get an idea of what logs are available by default, I exported all Cloud Dataproc messages into BigQuery and queried the new table with the following query:

    SELECT structPayload.filename AS file_name, count(*) AS cnt
    FROM [dataproc_logs.dataproc_hadoop_20160217]
    WHERE metadata.labels.key = 'dataproc.googleapis.com/cluster_id'
      AND metadata.labels.value = 'cluster-2:205c03ea-6bea-4c80-bdca-beb6b9ffb0d6'
    GROUP BY file_name

The query returned the following log files:
- hadoop-hdfs-namenode-cluster-2-m.log
- yarn-yarn-nodemanager-cluster-2-w-0.log
- container_1455740844290_0001_01_000004.stderr
- hadoop-hdfs-secondarynamenode-cluster-2-m.log
- hive-metastore.log
- hadoop-hdfs-datanode-cluster-2-w-1.log
- hive-server2.log
- container_1455740844290_0001_01_000001.stderr
- container_1455740844290_0001_01_000002.stderr
- hadoop-hdfs-datanode-cluster-2-w-0.log
- yarn-yarn-nodemanager-cluster-2-w-1.log
- yarn-yarn-resourcemanager-cluster-2-m.log
- container_1455740844290_0001_01_000003.stderr
- mapred-mapred-historyserver-cluster-2-m.log
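The same per-file narrowing can also be done programmatically by composing an advanced log filter string, for example to pass to the Cloud Logging API or to gcloud logging read. This is a sketch under assumptions: the field names come from the log entries above, the helper name is hypothetical, and the exact filter syntax may differ between Cloud Logging API versions:

```python
def dataproc_log_filter(cluster_id, filename=None):
    """Build a Cloud Logging filter for one Dataproc cluster's logs,
    optionally narrowed to a single file via structPayload.filename.

    Field names mirror the exported log entries; verify against the
    Cloud Logging filter syntax for the API version you are using.
    """
    clauses = [
        'log="dataproc-hadoop"',
        f'metadata.labels."dataproc.googleapis.com/cluster_id"="{cluster_id}"',
    ]
    if filename:
        clauses.append(f'structPayload.filename="{filename}"')
    return " AND ".join(clauses)

# Narrow to the HiveServer2 log of the cluster examined above.
print(dataproc_log_filter(
    "cluster-2:205c03ea-6bea-4c80-bdca-beb6b9ffb0d6",
    filename="hive-server2.log"))
```

Keeping the filter in one helper makes it easy to reuse the same clause both for interactive Log Viewer searches and for scripted log extraction in a pipeline.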