Google Cloud Dataproc in ETL pipeline - part 1 (logging)
Cloud Dataproc Logging
The cluster's system and daemon logs are accessible through the cluster UIs as well as by SSH-ing into the cluster, but there is a much better way to do this. By default these logs are also pushed to Google Cloud Logging, consolidating all logs in one place with a flexible Logs Viewer UI and filtering. You can even create custom log-based metrics and use them for baselining and/or alerting purposes. All cluster logs are aggregated under a "dataproc-hadoop" tag, but the "structPayload.filename" field can be used as a filter for a specific log file.

In addition to relying on the Logs Viewer UI, you can export specific log messages to Cloud Storage or BigQuery for analysis (a sample export command is sketched after the list below). Just to get an idea of what logs are available by default, I exported all Cloud Dataproc messages into BigQuery and queried the new table with the following query:

SELECT structPayload.filename AS file_name, count(*) AS cnt
FROM [dataproc_logs.dataproc_hadoop_20160217]
WHERE metadata.labels.key='dataproc.googleapis.com/cluster_id'
  AND metadata.labels.value = 'cluster-2:205c03ea-6bea-4c80-bdca-beb6b9ffb0d6'
GROUP BY file_name

This returned the following list of log files:

- hadoop-hdfs-namenode-cluster-2-m.log
- yarn-yarn-nodemanager-cluster-2-w-0.log
- container_1455740844290_0001_01_000004.stderr
- hadoop-hdfs-secondarynamenode-cluster-2-m.log
- hive-metastore.log
- hadoop-hdfs-datanode-cluster-2-w-1.log
- hive-server2.log
- container_1455740844290_0001_01_000001.stderr
- container_1455740844290_0001_01_000002.stderr
- hadoop-hdfs-datanode-cluster-2-w-0.log
- yarn-yarn-nodemanager-cluster-2-w-1.log
- yarn-yarn-resourcemanager-cluster-2-m.log
- container_1455740844290_0001_01_000003.stderr
- mapred-mapred-historyserver-cluster-2-m.log
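The same export can also be set up as a continuous log sink pointed at a BigQuery dataset rather than a one-off copy. A rough sketch with the current gcloud syntax (the sink name, project ID, dataset and filter below are placeholders, not values from the setup above, and the filter fields have changed since the v1 log format used in the query):

# Stream Dataproc cluster logs into an existing BigQuery dataset
gcloud logging sinks create dataproc-to-bq \
  bigquery.googleapis.com/projects/<YOUR_PROJECT>/datasets/dataproc_logs \
  --log-filter='resource.type="cloud_dataproc_cluster"'

Once the sink exists, its writer identity (a service account reported by the command) still needs edit access on the target dataset before entries start flowing.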
Application Logging
You can submit a job to the cluster using the Cloud Console, the Cloud SDK or the REST API. Cloud Dataproc automatically gathers the driver (console) output from all the workers and makes it available through the Cloud Console. Logs from the job are also uploaded to the staging bucket specified when starting the cluster and can be accessed from there.

Note: One thing I found confusing is that referencing the driver output directory in the Cloud Dataproc staging bucket requires the cluster ID (dataproc-cluster-uuid), which is not yet listed on the Cloud Dataproc Console. Having this ID, or a direct link to the directory, available from the cluster Overview page is especially important when starting/stopping many clusters as part of scheduled jobs. One way to get the dataproc-cluster-uuid and a few other useful references is to navigate from the cluster "Overview" section to "VM Instances", click on the master or any worker node, and scroll down to the "Custom metadata" section. You can also get it with the command

gcloud beta dataproc clusters describe <CLUSTER_NAME> | grep clusterUuid

but it would be nice to have it available through the console in the first place.

The job (driver) output, however, is currently dumped to the console ONLY (refer to /etc/spark/conf/log4j.properties on the master node) and, although accessible through the Dataproc Jobs interface, it is not currently available in Cloud Logging. The easiest way around this, which can be implemented as part of the cluster initialization actions (a sketch of such a script is included at the end of this section), is to modify /etc/spark/conf/log4j.properties: replace "log4j.rootCategory=INFO, console" with "log4j.rootCategory=INFO, console, file" and add the following appender:

# Adding file appender
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/var/log/spark/spark-log4j.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.conversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n

The existing Cloud Dataproc fluentd configuration automatically tails all files under the /var/log/spark directory, adding events to Cloud Logging, so it should pick up messages written to /var/log/spark/spark-log4j.log. You can verify that logs from the job start appearing in Cloud Logging by firing up one of the examples provided with Cloud Dataproc and filtering the Logs Viewer with the following rule:

node.metadata.serviceName="dataproc.googleapis.com"
structPayload.filename="spark-log4j.log"

If messages still do not appear in Cloud Logging after this change, try restarting the fluentd daemon by running "/etc/init.d/google-fluentd restart" on the master node.

Once the changes are implemented and the output is verified, you can declare a logger in your process as:

import pyspark
sc = pyspark.SparkContext()
logger = sc._jvm.org.apache.log4j.Logger.getLogger(__name__)

and submit the job, redefining the logging level (INFO by default) using "--driver-log-levels". Learn more here.
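For example, a PySpark job could be submitted with a more verbose driver log level roughly as follows; the cluster name and the script path are placeholders, not values from the setup above:

# Submit a PySpark job, raising the root log level passed to the driver
gcloud beta dataproc jobs submit pyspark \
  --cluster cluster-2 \
  --driver-log-levels root=DEBUG \
  gs://<YOUR_BUCKET>/jobs/wordcount.py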
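Finally, the log4j.properties edit and the fluentd restart described above can be bundled into an initialization action so that every new cluster comes up with job logging already wired into Cloud Logging. A minimal sketch, assuming the root-category line matches your image version exactly (the script name is arbitrary):

#!/bin/bash
# spark-file-logging-init.sh: send Spark driver output to a file that fluentd already tails
set -e

LOG4J=/etc/spark/conf/log4j.properties

# Route the root logger to a "file" appender in addition to the console
sed -i 's/^log4j.rootCategory=INFO, console$/log4j.rootCategory=INFO, console, file/' "$LOG4J"

# Append the rolling file appender writing under /var/log/spark
cat >> "$LOG4J" <<'EOF'
# Adding file appender
log4j.appender.file=org.apache.log4j.RollingFileAppender
log4j.appender.file.File=/var/log/spark/spark-log4j.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.conversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c: %m%n
EOF

# Make sure fluentd notices the new file
/etc/init.d/google-fluentd restart

The script would be staged in a Cloud Storage bucket and passed to the cluster at creation time with the --initialization-actions flag.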