ETL-99/spark UI #39
Conversation
@@ -6,7 +6,7 @@ dependencies:
 parameters:
   SourceBucketName: {{ stack_group_config.artifact_bucket_name }}
   JobRole: !stack_output_external glue-job-role::RoleArn
-  S3OutputBucketName: !stack_output_external bridge-downstream-dev-parquet-bucket::BucketName
+  S3BucketName: !stack_output_external bridge-downstream-dev-intermediate-bucket::BucketName
The name was changed because S3OutputBucketName is misleading -- this value is not actually what's used to write the output. That's passed (confusingly) at the workflows level instead. But we do pass in something here to write temp files to, as well as logs. It would be good eventually to be more consistent about this.
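For context, a sketch of how this parameter might be consumed in the job template's DefaultArguments. The --spark-event-logs-path line matches the change further down in this PR; the --TempDir wiring is hypothetical, shown only to illustrate the temp-file use mentioned above:

```yaml
# Sketch only: the renamed S3BucketName parameter feeding both scratch
# space and Spark event logs. The --TempDir line is an assumption, not
# taken from this PR; --spark-event-logs-path is from the actual diff.
DefaultArguments:
  --TempDir: !Sub s3://${S3BucketName}/tmp/${AWS::StackName}/
  --spark-event-logs-path: !Sub s3://${S3BucketName}/spark-logs/${AWS::StackName}/
```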
### Monitoring

Use a combination of Workflows and Glue Studio, both within the [Glue console](https://console.aws.amazon.com/glue/), for graphical monitoring. Workflows gives a good view of what's happening within a workflow, but for a better overview of all the job runs, Glue Studio is recommended. Either option will give you a link to the CloudWatch logs for each job.
A fourth monitoring option will be the CloudWatch metrics, but I'm still working on those.
--enable-continuous-cloudwatch-log: !Ref ContinuousLog
--enable-metrics: true
--enable-spark-ui: true
--spark-event-logs-path: !Sub s3://${S3BucketName}/spark-logs/${AWS::StackName}/
Lines 117 and 118 were the main reason for this PR
Are we writing the logs to the same bucket as our JSON datasets?
Yes, to the "intermediate" bucket, which is basically the work bucket. Do you think that's problematic?
No, that makes sense.
--job-bookmark-option: !Ref BookmarkOption
--job-language: !Ref JobLanguage
--table: !Ref GlueTableName
# --conf spark.sql.adaptive.enabled
This is something I am starting to tweak to control the number of partitions Spark uses, which determines the number of output files unless you coalesce.
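For illustration, here is roughly how that commented-out line might look when enabled -- a sketch, assuming a Glue version running Spark 3, where adaptive query execution can coalesce small shuffle partitions on its own. Glue accepts only one --conf default argument, so additional settings are conventionally chained inside its value:

```yaml
# Sketch only -- not part of this PR. Turns on adaptive query execution
# so Spark coalesces small shuffle partitions itself, reducing the number
# of output files without an explicit coalesce(). Extra settings are
# chained inside the single --conf argument that Glue allows.
--conf: spark.sql.adaptive.enabled=true --conf spark.sql.adaptive.coalescePartitions.enabled=true
```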
This enables the Spark UI for Spark jobs -- the config option is a bit misnamed. It actually just saves the Spark event logs so you can ingest them into a Spark History Server. I also added CONTRIBUTING.md to explain how to spin up a docker container for that.
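As a rough sketch of what that container setup might look like (a hypothetical docker-compose.yml -- the image name and credential wiring are assumptions; see CONTRIBUTING.md in this PR for the actual steps):

```yaml
# Hypothetical docker-compose.yml for a local Spark History Server.
# Assumes an image with Spark and the S3A connector that starts the
# history server on launch, e.g. one built per this PR's CONTRIBUTING.md.
version: "3"
services:
  spark-history:
    image: spark-history-server   # assumed image name
    ports:
      - "18080:18080"             # Spark History Server's default web UI port
    environment:
      # Point the server at the --spark-event-logs-path configured above;
      # <S3BucketName> and <stack-name> stand in for the real values.
      SPARK_HISTORY_OPTS: >-
        -Dspark.history.fs.logDirectory=s3a://<S3BucketName>/spark-logs/<stack-name>/
        -Dspark.hadoop.fs.s3a.access.key=${AWS_ACCESS_KEY_ID}
        -Dspark.hadoop.fs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY}
```

With something like this in place, `docker compose up` and a browser pointed at localhost:18080 should show the job runs whose event logs landed under the spark-logs/ prefix.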