ETL-99/spark UI #39
Conversation
@@ -6,7 +6,7 @@ dependencies:
 parameters:
   SourceBucketName: {{ stack_group_config.artifact_bucket_name }}
   JobRole: !stack_output_external glue-job-role::RoleArn
-  S3OutputBucketName: !stack_output_external bridge-downstream-dev-parquet-bucket::BucketName
+  S3BucketName: !stack_output_external bridge-downstream-dev-intermediate-bucket::BucketName
The name was changed because S3OutputBucketName is misleading -- this value is not actually what's used to write the output. That's passed (confusingly) at the workflows level instead. But we do pass in something here to write temp files to, as well as logs. It would be good eventually to be more consistent about this.
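For context, a sketch of how this parameter might be consumed in the job template's DefaultArguments. The --spark-event-logs-path line matches the change further down in this PR; the --TempDir wiring is hypothetical, shown only to illustrate the temp-file use mentioned above:

```yaml
# Sketch only: the renamed S3BucketName parameter feeding both scratch
# space and Spark event logs. The --TempDir line is an assumption, not
# taken from this PR; --spark-event-logs-path is from the actual diff.
DefaultArguments:
  --TempDir: !Sub s3://${S3BucketName}/tmp/${AWS::StackName}/
  --spark-event-logs-path: !Sub s3://${S3BucketName}/spark-logs/${AWS::StackName}/
```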
### Monitoring

Use a combination of Workflows and Glue Studio, both within the [Glue console](https://console.aws.amazon.com/glue/), for graphical monitoring. Workflows gives a good view of what's happening within a workflow, but for a better overview of all the job runs, Glue Studio is recommended. Either option will give you a link to the CloudWatch logs for each job.
A fourth monitoring option will be the CloudWatch metrics, but I'm still working on those.
--enable-continuous-cloudwatch-log: !Ref ContinuousLog
--enable-metrics: true
--enable-spark-ui: true
--spark-event-logs-path: !Sub s3://${S3BucketName}/spark-logs/${AWS::StackName}/
Lines 117 and 118 were the main reason for this PR
Are we writing the logs to the same bucket as our JSON datasets?
Yes, to the "intermediate" bucket, which is basically the work bucket. Do you think that's problematic?
No, that makes sense.
--job-bookmark-option: !Ref BookmarkOption
--job-language: !Ref JobLanguage
--table: !Ref GlueTableName
# --conf spark.sql.adaptive.enabled
This is something I am starting to tweak to control the number of partitions Spark uses, which determines the number of output files unless you coalesce.
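For illustration, here is roughly how that commented-out line might look when enabled -- a sketch, assuming a Glue version running Spark 3, where adaptive query execution can coalesce small shuffle partitions on its own. Glue accepts only one --conf default argument, so additional settings are conventionally chained inside its value:

```yaml
# Sketch only -- not part of this PR. Turns on adaptive query execution
# so Spark coalesces small shuffle partitions itself, reducing the number
# of output files without an explicit coalesce(). Extra settings are
# chained inside the single --conf argument that Glue allows.
--conf: spark.sql.adaptive.enabled=true --conf spark.sql.adaptive.coalescePartitions.enabled=true
```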
This enables the Spark UI for Spark jobs -- the config option is a bit misnamed. It actually just saves the Spark event logs so you can ingest them into a Spark History Server. I also added CONTRIBUTING.md to explain how to spin up a docker container for that.
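As a rough sketch of what that container setup might look like (a hypothetical docker-compose.yml -- the image name and credential wiring are assumptions; see CONTRIBUTING.md in this PR for the actual steps):

```yaml
# Hypothetical docker-compose.yml for a local Spark History Server.
# Assumes an image with Spark and the S3A connector that starts the
# history server on launch, e.g. one built per this PR's CONTRIBUTING.md.
version: "3"
services:
  spark-history:
    image: spark-history-server   # assumed image name
    ports:
      - "18080:18080"             # Spark History Server's default web UI port
    environment:
      # Point the server at the --spark-event-logs-path configured above;
      # <S3BucketName> and <stack-name> stand in for the real values.
      SPARK_HISTORY_OPTS: >-
        -Dspark.history.fs.logDirectory=s3a://<S3BucketName>/spark-logs/<stack-name>/
        -Dspark.hadoop.fs.s3a.access.key=${AWS_ACCESS_KEY_ID}
        -Dspark.hadoop.fs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY}
```

With something like this in place, `docker compose up` and a browser pointed at localhost:18080 should show the job runs whose event logs landed under the spark-logs/ prefix.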