From 54e894a0b2875bdb9f23821b51a15a85dff19efa Mon Sep 17 00:00:00 2001
From: Natalie White
Date: Wed, 19 Apr 2023 14:28:03 -0700
Subject: [PATCH 01/15] RFC for Glue L2 Construct

---
 text/0497-glue-l2-construct.md | 342 +++++++++++++++++++++++++++++++++
 1 file changed, 342 insertions(+)
 create mode 100644 text/0497-glue-l2-construct.md

diff --git a/text/0497-glue-l2-construct.md b/text/0497-glue-l2-construct.md
new file mode 100644
index 000000000..8bfb88b81
--- /dev/null
+++ b/text/0497-glue-l2-construct.md
@@ -0,0 +1,342 @@
+# RFC - Glue CDK L2 Construct
+https://github.com/aws/aws-cdk-rfcs/issues/497
+
+## L2 Construct for AWS Glue Connections, Jobs, and Workflows
+
+* Original Author(s): @natalie-white-aws, @mjanardhan, @parag-shah-aws
+* Tracking Issue:
+* API Bar Raiser: [Kendra Neil](https://quip-amazon.com/AZX9EAmb6vG)
+
+## Working Backwards - README
+
+[AWS Glue](https://aws.amazon.com/glue/) is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. Glue became generally available in August 2017. Launch: https://aws.amazon.com/blogs/aws/launch-aws-glue-now-generally-available/
+
+Today, customers define Glue data sources, connections, jobs, and workflows for their data and ETL solutions via the AWS console, the AWS CLI, and Infrastructure as Code tools like CloudFormation and the CDK. However, they face challenges defining the required and optional parameters for each job type, networking constraints for data source connections, secrets for JDBC connections, and least-privilege IAM Roles and Policies. We will build convenience methods working backwards from common use cases and default to recommended best practices.
+
+This RFC proposes updates to the L2 construct for Glue which will provide convenience features and abstractions for the existing [L1 (CloudFormation) Constructs](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_Glue.html), building on the functionality already supported in the [@aws-cdk/aws-glue-alpha module](https://github.com/aws/aws-cdk/blob/v2.51.1/packages/%40aws-cdk/aws-glue/README.md).
+
+## Create a Glue Job
+
+The glue-alpha-module already supports three of the four common types of Glue Jobs: Spark (ETL and Streaming), Python Shell, and Ray. This RFC will add the more recent Flex Job. The construct also implements AWS best-practice recommendations when creating a Glue Job, such as the use of Secrets Manager for Connection JDBC strings, Glue Job Autoscaling, least-privilege IAM permissions, and sane defaults for the Glue job specification (more details are mentioned below).
+
+This RFC will introduce breaking changes to the existing glue-alpha-module to streamline the developer experience and introduce new constants and validations. The L2 construct will determine the correct job configuration from the job type and language provided by the developer, rather than exposing a separate method for every permutation that Glue jobs allow.
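+As part of these defaults, each Job will get a least-privilege IAM role and policies created on the developer's behalf. Developers who need to bring their own role should still be able to pass one in explicitly, along these lines (a sketch assuming the new construct keeps the alpha module's optional `role` property):
+
+```
+declare const jobRole: iam.IRole;
+
+glue.Job(this, 'EtlJobWithExistingRole', {
+  jobType: glue.JobType.ETL,
+  jobLanguage: glue.JobLanguage.PYSPARK,
+  scriptS3Url: 's3://bucket-name/path-to-python-script',
+  // Supply a pre-existing least-privilege role instead of the generated one
+  role: jobRole,
+});
+```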
+### Spark Jobs
+
+1. **ETL Jobs**
+
+ETL jobs support the Python and Scala languages. The ETL job type supports the G1, G2, G4, and G8 worker types, defaulting to G2, which the customer can override. The preferred Glue version for ETL is 4.0, but the customer can override the version to 3.0. We enable several features for ETL jobs by default: `—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log.` We recommend using these features for ETL jobs. You can find more details about versions, worker types, and other features in the public documentation.
+
+```
+glue.Job(this, 'ScalaSparkEtlJob', {
+  jobType: glue.JobType.ETL,
+  jobLanguage: glue.JobLanguage.SCALA_SPARK,
+  scriptS3Url: 's3://bucket-name/path-to-scala-jar',
+  className: 'com.example.HelloWorld',
+});
+
+glue.Job(this, 'pySparkEtlJob', {
+  jobType: glue.JobType.ETL,
+  jobLanguage: glue.JobLanguage.PYSPARK,
+  scriptS3Url: 's3://bucket-name/path-to-python-script',
+});
+```
+
+Optionally, developers can override the glueVersion and add extra jars and a description:
+
+```
+glue.Job(this, 'ScalaSparkEtlJob', {
+  jobType: glue.JobType.ETL,
+  jobLanguage: glue.JobLanguage.SCALA_SPARK,
+  glueVersion: glue.GlueVersion.V3_0,
+  scriptS3Url: 's3://bucket-name/path-to-scala-jar',
+  className: 'com.example.HelloWorld',
+  extraJarsS3Url: ['s3://bucket-name/path-to-extra-scala-jar'],
+  description: 'an example Scala Spark ETL job',
+  numberOfWorkers: 20
+});
+
+glue.Job(this, 'pySparkEtlJob', {
+  jobType: glue.JobType.ETL,
+  jobLanguage: glue.JobLanguage.PYSPARK,
+  glueVersion: glue.GlueVersion.V3_0,
+  pythonVersion: glue.PythonVersion.3_6,
+  scriptS3Url: 's3://bucket-name/path-to-python-script',
+  extraJarsS3Url: ['s3://bucket-name/path-to-extra-scala-jar'],
+  description: 'an example pySpark ETL job',
+  numberOfWorkers: 20
+});
+```
+
+1. **Streaming Jobs**
+
+A Streaming job is similar to an ETL job, except that it performs ETL on data streams. It uses the Apache Spark Structured Streaming framework. Some Spark job features are not available to streaming ETL jobs. These jobs will default to use Python 3.6.
+
+Similar to ETL, streaming jobs support the Scala and Python languages. They support the G1 and G2 worker types and versions 2.0, 3.0, and 4.0. We’ll default to the G2 worker and version 4.0 for streaming jobs, which the customer can override. Some of the features we’ll enable are `—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log`.
+
+```
+new glue.Job(this, 'PythonSparkStreamingJob', {
+  jobType: glue.JobType.STREAMING,
+  jobLanguage: glue.JobLanguage.PYSPARK,
+  scriptS3Url: 's3://bucket-name/path-to-python-script',
+});
+
+new glue.Job(this, 'ScalaSparkStreamingJob', {
+  jobType: glue.JobType.STREAMING,
+  jobLanguage: glue.JobLanguage.SCALA_SPARK,
+  scriptS3Url: 's3://bucket-name/path-to-scala-jar',
+  className: 'com.example.HelloWorld',
+});
+```
+
+Optionally, developers can override the glueVersion and add extraJars and a description:
+
+```
+new glue.Job(this, 'PythonSparkStreamingJob', {
+  jobType: glue.JobType.STREAMING,
+  jobLanguage: glue.JobLanguage.PYSPARK,
+  glueVersion: glue.GlueVersion.V3_0,
+  pythonVersion: glue.PythonVersion.3_6,
+  scriptS3Url: 's3://bucket-name/path-to-python-script',
+  description: 'an example Python Streaming job',
+  numberOfWorkers: 20,
+});
+
+new glue.Job(this, 'ScalaSparkStreamingJob', {
+  jobType: glue.JobType.STREAMING,
+  jobLanguage: glue.JobLanguage.SCALA_SPARK,
+  glueVersion: glue.GlueVersion.V3_0,
+  scriptS3Url: 's3://bucket-name/path-to-scala-jar',
+  className: 'com.example.HelloWorld',
+  description: 'an example Scala Streaming job',
+  numberOfWorkers: 20,
+});
+```
+
+1. **Flex Jobs**
+
+The flexible execution class is appropriate for non-urgent jobs such as pre-production jobs, testing, and one-time data loads.
+Flexible job runs are supported for jobs using AWS Glue version 3.0 or later and `G.1X` or `G.2X` worker types, and will default to the latest supported version of Glue (currently Glue 3.0). Similar to ETL, we’ll enable these features: `—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log`.
+
+```
+glue.Job(this, 'ScalaSparkFlexEtlJob', {
+  jobType: glue.JobType.FLEX,
+  jobLanguage: glue.JobLanguage.SCALA_SPARK,
+  scriptS3Url: 's3://bucket-name/path-to-scala-jar',
+  className: 'com.example.HelloWorld',
+});
+
+glue.Job(this, 'pySparkFlexEtlJob', {
+  jobType: glue.JobType.FLEX,
+  jobLanguage: glue.JobLanguage.PYSPARK,
+  scriptS3Url: 's3://bucket-name/path-to-python-script',
+});
+```
+
+Optionally, developers can override the glue version, python version, provide extra jars, and a description:
+
+```
+glue.Job(this, 'pySparkFlexEtlJob', {
+  jobType: glue.JobType.FLEX,
+  jobLanguage: glue.JobLanguage.PYSPARK,
+  glueVersion: glue.GlueVersion.V3_0,
+  scriptS3Url: 's3://bucket-name/path-to-python-script',
+  extraJarsS3Url: ['s3://bucket-name/path-to-extra-jars'],
+  description: 'an example pySpark Flex ETL job',
+  numberOfWorkers: 20,
+});
+
+new glue.Job(this, 'FlexJob', {
+  jobType: glue.JobType.FLEX,
+  jobLanguage: glue.JobLanguage.PYSPARK,
+  glueVersion: glue.GlueVersion.V3_0,
+  pythonVersion: glue.PythonVersion.3_6,
+  scriptS3Url: 's3://bucket-name/path-to-python-script',
+  description: 'an example Flex job',
+  numberOfWorkers: 20,
+});
+```
+
+### Python Shell Jobs
+
+A Python shell job runs Python scripts as a shell and supports a Python version that depends on the AWS Glue version you are using. This can be used to schedule and run tasks that don't require an Apache Spark environment. Python 3.6 and 3.9 are supported.
+
+We’ll default to `PythonVersion.3_9`, which the customer can override. Python shell jobs don’t support different worker types; instead, they have a MaxDPU setting. The customer can choose Max DPU = `0.0625` or Max DPU = `1`. By default, MaxDPU will be set to `0.0625`. Also, `PythonVersion.3_9` supports preloaded analytics libraries using the flag `library-set=analytics`; this feature will be enabled by default.
+
+```
+new glue.Job(this, 'PythonShellJob', {
+  jobType: glue.JobType.PYTHON_SHELL,
+  jobLanguage: glue.JobLanguage.PYTHON,
+  scriptS3Url: 's3://bucket-name/path-to-python-script',
+});
+```
+
+Optional overrides:
+
+```
+new glue.Job(this, 'PythonShellJob', {
+  jobType: glue.JobType.PYTHON_SHELL,
+  jobLanguage: glue.JobLanguage.PYTHON,
+  glueVersion: glue.GlueVersion.V1_0,
+  pythonVersion: glue.PythonVersion.3_6,
+  scriptS3Url: 's3://bucket-name/path-to-python-script',
+  description: 'an example Python Shell job',
+  numberOfWorkers: 20,
+});
+```
+
+### Ray Jobs
+
+Glue Ray jobs only support the Z.2X worker type and Glue version 4.0. The runtime will default to `Ray2.3` and min workers will default to 3.
+
+```
+declare const bucket: s3.Bucket;
+new glue.Job(this, 'GlueRayJob', {
+  jobType: glue.JobType.GLUE_RAY,
+  jobLanguage: glue.JobLanguage.PYTHON,
+  scriptS3Url: 's3://bucket-name/path-to-python-script',
+});
+```
+
+Optionally, the customer can override min workers and other Glue job fields:
+
+```
+declare const bucket: s3.Bucket;
+new glue.Job(this, 'GlueRayJob', {
+  jobType: glue.JobType.GLUE_RAY,
+  jobLanguage: glue.JobLanguage.PYTHON,
+  runtime: glue.Runtime.RAY_2_2,
+  scriptS3Url: 's3://bucket-name/path-to-python-script',
+  minWorkers: 20,
+  numberOfWorkers: 50
+});
+```
+
+### Job Triggers
+
+We will add convenience functions for adding triggers to jobs.
+Standalone triggers are an anti-pattern, so we will only create triggers from within a workflow.
+
+1. **On Demand Triggers**
+
+On demand triggers can start glue jobs or crawlers. We’ll add convenience functions to create on-demand crawler or job triggers. The trigger method will take an optional description but abstract the requirement of an actions list using the job or crawler name.
+
+```
+myGlueJob.createOnDemandTrigger(this, 'MyJobTrigger', {
+  description: 'On demand run for ' + myGlueJob.name,
+});
+```
+
+```
+myGlueCrawler.createOnDemandTrigger(this, 'MyCrawlerTrigger');
+```
+
+1. **Scheduled Triggers**
+
+Schedule triggers are a way for customers to create jobs using cron expressions. We’ll provide daily, weekly, and hourly options, which the customer can override using a custom cron expression. The trigger method will take an optional description but abstract the requirement of an actions list using the job or crawler name.
+
+```
+myGlueJob.createDailyTrigger(this, 'MyDailyTrigger');
+
+myGlueJob.createHourlyTrigger(this, 'MyHourlyTrigger');
+
+myGlueJob.createWeeklyTrigger(this, 'MyWeeklyTrigger');
+
+myGlueJob.createScheduledTrigger(this, 'MyScheduledTrigger', {
+  description: 'Scheduled run for ' + crawler.name,
+  schedule: 'cron(15 12 * * ? *)', // every day at 12:15 UTC
+});
+```
+
+#### **3. Notify Event Trigger**
+
+This type of trigger is only supported with a Glue workflow. There are two types of notify event triggers: batching and non-batching. For a batching trigger, the customer has to specify `BatchSize`; for non-batching, `BatchSize` will be set to 1. For both trigger types, `BatchWindow` will default to 900 seconds.
+
+```
+myGlueJob.createNotifyEventBatchingTrigger(this, 'MyNotifyTrigger', {
+  workFlowName: workflow.name,
+  batchSize: batchSize,
+});
+
+myGlueCrawler.createNotifyEventBatchingTrigger(this, 'MyNotifyTrigger', {
+  workFlowName: workflow.name,
+  batchSize: batchSize,
+});
+
+myGlueJob.createNotifyEventNonBatchingTrigger(this, 'MyNotifyTrigger', {
+  workFlowName: workflow.name,
+});
+
+myGlueCrawler.createNotifyEventNonBatchingTrigger(this, 'MyNotifyTrigger', {
+  workFlowName: workflow.name,
+});
+```
+
+#### **4. Conditional Trigger**
+
+A conditional trigger has a predicate and actions associated with it. When the predicate is satisfied, the trigger's actions are executed.
+
+```
+// Triggers on Job and Crawler status
+myGlueJob.addConditionalTrigger({
+  jobs: [
+    { jobArn: 'job1-arn', status: glue.JobStatus.SUCCEEDED },
+    { jobArn: 'job2-arn', status: glue.JobStatus.FAILED },
+  ],
+  crawlers: [
+    { crawlerArn: 'crawler1-arn', status: glue.CrawlerStatus.SUCCEEDED },
+    { crawlerArn: 'crawler2-arn', status: glue.CrawlerStatus.TIMEOUT },
+  ],
+});
+```
+
+### Connection Properties
+
+A `Connection` allows Glue jobs, crawlers, and development endpoints to access certain types of data stores.
+
+* **Secrets Management**: The user specifies the JDBC connection credentials in Secrets Manager and provides the Secrets Manager key name as a property of the Job connection (see the sketch below).
+
+* **Networking - CDK determines the best-fit subnet for the Glue Connection configuration**: The current glue-alpha-module requires the developer to specify the subnet of the Connection when it’s defined. This L2 will make the best subnet choice by default: using the data source provided during Job provisioning, it will traverse the source’s existing networking configuration and determine the best subnet to provide to the Glue Job parameters so the Job can access the data source. The developer can override this subnet parameter, but no longer has to provide it directly.
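+Putting these together, a Connection definition might look roughly like this (an illustrative sketch of the proposed defaults; `secretName` and `subnet` are placeholder property names, not finalized API):
+
+```
+declare const vpc: ec2.Vpc;
+
+const jdbcConnection = new glue.Connection(this, 'JdbcConnection', {
+  type: glue.ConnectionType.JDBC,
+  // JDBC credentials live in Secrets Manager; only the secret's name is
+  // passed to the connection.
+  secretName: 'my-jdbc-credentials',
+  // No subnet is required: the construct inspects the data source's
+  // networking configuration and picks a best-fit subnet. It can still
+  // be overridden explicitly:
+  // subnet: vpc.privateSubnets[0],
+});
+```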
+## Public FAQ
+
+### What are we launching today?
+
+We’re launching new features to an AWS CDK Glue L2 Construct to provide best-practice defaults and convenience methods to create Glue Jobs, Connections, Triggers, Workflows, and the underlying permissions and configuration.
+
+### Why should I use this Construct?
+
+Developers should use this Construct to reduce the amount of boilerplate code and complexity each individual has to navigate, and to make it easier to create best-practice Glue resources.
+
+### What’s not in scope?
+
+Glue Crawlers and other resources that are now managed by the AWS LakeFormation team are not in scope for this effort. Developers should use existing methods to create these resources, and the new Glue L2 construct assumes they already exist as inputs. While best practice is for application and infrastructure code to live as close together as possible for teams using fully-implemented DevOps mechanisms, in practice these ETL scripts will likely be managed by a data science team who know Python or Scala and don’t necessarily own or manage their own infrastructure deployments. We want to meet developers where they are and not assume that all of the code resides in the same repository; developers who do own both can automate this themselves via the CDK.
+
+Uploading Job scripts to S3 buckets is also not in scope for this effort.
+
+Validating Glue version and feature availability per AWS region at synth time is also not in scope. AWS’ intention is for all features to eventually be propagated to all Global regions, so the complexity involved in creating and updating region-specific configuration to match shifting feature sets does not outweigh the low likelihood that a developer will use this construct to deploy a feature to a region that doesn’t yet support it without first researching regional availability. The developer will, of course, still get feedback from the underlying Glue APIs as CloudFormation deploys the resources, similar to the current CDK L1 Glue experience.
+ + From c8d0f22a94fbb2c8cc5b39843f85a59a8f23782e Mon Sep 17 00:00:00 2001 From: Natalie White Date: Wed, 19 Apr 2023 15:32:36 -0700 Subject: [PATCH 02/15] Fix markdown linter findings --- text/0497-glue-l2-construct.md | 180 ++++++++++++++++++++++++--------- 1 file changed, 135 insertions(+), 45 deletions(-) diff --git a/text/0497-glue-l2-construct.md b/text/0497-glue-l2-construct.md index 8bfb88b81..ff94ca4b8 100644 --- a/text/0497-glue-l2-construct.md +++ b/text/0497-glue-l2-construct.md @@ -1,26 +1,51 @@ # RFC - Glue CDK L2 Construct + https://github.com/aws/aws-cdk-rfcs/issues/497 ## L2 Construct for AWS Glue Connections, Jobs, and Workflows -* Original Author(s): @natalie-white-aws, @mjanardhan @parag-shah-aws -* Tracking Issue: -* API Bar Raiser: [Kendra Neil](https://quip-amazon.com/AZX9EAmb6vG) +* **Original Author(s):** @natalie-white-aws, @mjanardhan @parag-shah-aws +* **Tracking Issue:** +* **API Bar Raiser:** @TheRealAmazonKendra ## Working Backwards - README -[AWS Glue](https://aws.amazon.com/glue/) is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development. Glue was released on 2017/08. Launch: https://aws.amazon.com/blogs/aws/launch-aws-glue-now-generally-available/ +[AWS Glue](https://aws.amazon.com/glue/) is a serverless data integration +service that makes it easier to discover, prepare, move, and integrate data +from multiple sources for analytics, machine learning (ML), and application +development. Glue was released on 2017/08. +[Launch](https://aws.amazon.com/blogs/aws/launch-aws-glue-now-generally-available/) -Today, customers define Glue data sources, connections, jobs, and workflows to define their data and ETL solutions via the AWS console, the AWS CLI, and Infrastructure as Code tools like CloudFormation and the CDK. However, they have challenges defining the required and optional parameters depending on job type, networking constraints for data source connections, secrets for JDBC connections, and least-privilege IAM Roles and Policies. We will build convenience methods working backwards from common use cases and default to recommended best practices. +Today, customers define Glue data sources, connections, jobs, and workflows +to define their data and ETL solutions via the AWS console, the AWS CLI, and +Infrastructure as Code tools like CloudFormation and the CDK. However, they +have challenges defining the required and optional parameters depending on +job type, networking constraints for data source connections, secrets for +JDBC connections, and least-privilege IAM Roles and Policies. We will build +convenience methods working backwards from common use cases and default to +recommended best practices. -This RFC proposes updates to the L2 construct for Glue which will provide convenience features and abstractions for the existing [L1 (CloudFormation) Constructs](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_Glue.html) building on the functionality already supported in the [@aws-cdk/aws-glue-alpha module](https://github.com/aws/aws-cdk/blob/v2.51.1/packages/%40aws-cdk/aws-glue/README.md). 
+This RFC proposes updates to the L2 construct for Glue which will provide +convenience features and abstractions for the existing +[L1 (CloudFormation) Constructs](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_Glue.html) building on the +functionality already supported in the [@aws-cdk/aws-glue-alpha module](https://github.com/aws/aws-cdk/blob/v2.51.1/packages/%40aws-cdk/aws-glue/README.md). ## Create a Glue Job -The glue-alpha-module already supports three of the four common types of Glue Jobs: Spark (ETL and Streaming), Python Shell, Ray. This RFC will add the more recent Flex Job. The construct also implements AWS practice recommendations when creating a Glue Job such use of Secrets Management for Connection JDBC strings, Glue Job Autoscaling, least privileges in terms of IAM permissions and also sane defaults for Glue job specification (more details are mentioned in the below table). +The glue-alpha-module already supports three of the four common types of Glue +Jobs: Spark (ETL and Streaming), Python Shell, Ray. This RFC will add the +more recent Flex Job. The construct also implements AWS practice +recommendations when creating a Glue Job such use of Secrets Management for +Connection JDBC strings, Glue Job Autoscaling, least privileges in terms of +IAM permissions and also sane defaults for Glue job specification (more details +are mentioned in the below table). -This RFC will introduce breaking changes to the existing glue-alpha-module to streamline the developer experience and introduce new constants and validations. The L2 construct will determine the job type by the job type and language provided by the developer, rather than having separate methods in every permutation that Glue jobs allow. +This RFC will introduce breaking changes to the existing glue-alpha-module to +streamline the developer experience and introduce new constants and validations. +The L2 construct will determine the job type by the job type and language +provided by the developer, rather than having separate methods in every +permutation that Glue jobs allow. ### Spark Jobs @@ -28,7 +53,13 @@ This RFC will introduce breaking changes to the existing glue-alpha-module to st 1. **ETL Jobs ** -ETL jobs supports python and Scala language. ETL job type supports G1, G2, G4 and G8 worker type default as G2 which customer can override. Also preferred version for ETL is 4.0 but customer can override the version to 3.0. We by default enable several features for ETL jobs these are` —enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log.` We recommend to use these feature for ETL jobs. You can find more details about version, worker type and other feature on public documentation. +ETL jobs supports python and Scala language. ETL job type supports G1, G2, G4 +and G8 worker type default as G2 which customer can override. Also preferred +version for ETL is 4.0 but customer can override the version to 3.0. We by +default enable several features for ETL jobs these are +` —enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log.` We +recommend to use these feature for ETL jobs. You can find more details about +version, worker type and other feature on public documentation. 
``` glue.Job(this, 'ScalaSparkEtlJob', { @@ -45,7 +76,8 @@ glue.Job(this, 'pySparkEtlJob', { }); ``` -Optionally, developers can override the glueVersion and add extra jars and a description: +Optionally, developers can override the glueVersion and add extra jars and a +description: ``` glue.Job(this, 'ScalaSparkEtlJob', { @@ -73,9 +105,16 @@ glue.Job(this, 'pySparkEtlJob', { 1. **Streaming Jobs** -A Streaming job is similar to an ETL job, except that it performs ETL on data streams. It uses the Apache Spark Structured Streaming framework. Some Spark job features are not available to streaming ETL jobs. These jobs will default to use Python 3.6. +A Streaming job is similar to an ETL job, except that it performs ETL on data +streams. It uses the Apache Spark Structured Streaming framework. Some Spark +job features are not available to streaming ETL jobs. These jobs will default +to use Python 3.6. -Similar to ETL streaming job supports Scala and python language. Similar to ETL, it supports G1 and G2 worker type and 2.0, 3.0 and 4.0 version. We’ll default to G2 worker and 4.0 version for streaming jobs which customer can override. Some of the feature we’ll enable are `—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log` +Similar to ETL streaming job supports Scala and python language. Similar to ETL, +it supports G1 and G2 worker type and 2.0, 3.0 and 4.0 version. We’ll default +to G2 worker and 4.0 version for streaming jobs which customer can override. +Some of the feature we’ll enable are +`—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log` ``` new glue.Job(this, 'PythonSparkStreamingJob', { @@ -84,7 +123,6 @@ new glue.Job(this, 'PythonSparkStreamingJob', { scriptS3Url: 's3://bucket-name/path-to-python-script', }); - new glue.Job(this, 'ScalaSparkStreamingJob', { jobType: glue.JobType.STREAMING, jobLanguage: glue.JobLanguage.SCALA_SPARK, @@ -94,7 +132,8 @@ new glue.Job(this, 'ScalaSparkStreamingJob', { ``` -Optionally, developers can override the glueVersion and add extraJars and a description: +Optionally, developers can override the glueVersion and add extraJars and a +description: ``` new glue.Job(this, 'PythonSparkStreamingJob', { @@ -107,7 +146,6 @@ new glue.Job(this, 'PythonSparkStreamingJob', { numberOfWorkers: 20, }); - new glue.Job(this, 'ScalaSparkStreamingJob', { jobType: glue.JobType.STREAMING, jobLanguage: glue.JobLanguage.SCALA_SPARK, @@ -122,7 +160,12 @@ new glue.Job(this, 'ScalaSparkStreamingJob', { 1. **Flex Jobs** -The flexible execution class is appropriate for non-urgent jobs such as pre-production jobs, testing, and one-time data loads. Flexible job runs are supported for jobs using AWS Glue version 3.0 or later and `G.1X` or `G.2X` worker types but will default to the latest version of Glue (currently Glue 3.0.) Also similar to ETL we’ll enable these feature `—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log` +The flexible execution class is appropriate for non-urgent jobs such as +pre-production jobs, testing, and one-time data loads. Flexible job runs +are supported for jobs using AWS Glue version 3.0 or later and `G.1X` or +`G.2X` worker types but will default to the latest version of Glue +(currently Glue 3.0.) 
Also similar to ETL we’ll enable these feature +`—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log` ``` glue.Job(this, 'ScalaSparkFlexEtlJob', { @@ -139,7 +182,8 @@ glue.Job(this, 'pySparkFlexEtlJob', { }); ``` -Optionally, developers can override the glue version, python version, provide extra jars, and a description +Optionally, developers can override the glue version, python version, +provide extra jars, and a description ``` glue.Job(this, 'pySparkFlexEtlJob', { @@ -165,9 +209,17 @@ new glue.Job(this, 'FlexJob', { ### Python Shell Jobs -A Python shell job runs Python scripts as a shell and supports a Python version that depends on the AWS Glue version you are using. This can be used to schedule and run tasks that don't require an Apache Spark environment. Python 3.6 and 3.9 are supported. +A Python shell job runs Python scripts as a shell and supports a Python +version that depends on the AWS Glue version you are using. This can be used +to schedule and run tasks that don't require an Apache Spark environment. +Python 3.6 and 3.9 are supported. -We’ll default to `PythonVersion.3_9` which customer can override. Python shell jobs doesn’t support different worker type, instead it has MaxDPU feature. Customer can choose Max DPU = `0.0625` or Max DPU = `1`. By default MaxDPU will be set `0.0625`. Also `PythonVersion.3_9` supports preloaded analytics libraries using flag `library-set=analytics` , this feature will be enable by default. +We’ll default to `PythonVersion.3_9` which customer can override. Python +shell jobs doesn’t support different worker type, instead it has MaxDPU +feature. Customer can choose Max DPU = `0.0625` or Max DPU = `1`. By default +MaxDPU will be set `0.0625`. Also `PythonVersion.3_9` supports preloaded +analytics libraries using flag `library-set=analytics` , this feature will +be enable by default. ``` @@ -196,7 +248,8 @@ new glue.Job(this, 'PythonShellJob', { ### Ray Jobs -Glue ray only supports Z.2X worker type and 4.0 Glue version. Runtime will default to `Ray2.3` and min workers will default to 3. +Glue ray only supports Z.2X worker type and 4.0 Glue version. Runtime +will default to `Ray2.3` and min workers will default to 3. ``` @@ -225,12 +278,17 @@ new glue.Job(this, 'GlueRayJob', { ### Job Triggers -We will add convenience functions for adding triggers to jobs. Standalone triggers are an anti-pattern, so we will only create triggers from within a workflow. +We will add convenience functions for adding triggers to jobs. Standalone +triggers are an anti-pattern, so we will only create triggers from within a +workflow. 1. **On Demand Triggers** -On demand triggers can start glue jobs or crawlers. We’ll add convenience functions to create on-demand crawler or job triggers. The trigger method will take an optional description but abstract the requirement of an actions list using the job or crawler name. +On demand triggers can start glue jobs or crawlers. We’ll add convenience +functions to create on-demand crawler or job triggers. The trigger method +will take an optional description but abstract the requirement of an actions +list using the job or crawler name. ``` myGlueJob.createOnDemandTrigger(this, 'MyJobTrigger', { @@ -246,7 +304,11 @@ myGlueCrawler.createOnDemandTrigger(this, 'MyCrawlerTrigger'); 1. **Scheduled Triggers** -Schedule triggers are a way for customers to create jobs using cron expressions. We’ll provide daily, weekly and hourly options which customer can override using custom cron expression. 
The trigger method will take an optional description but abstract the requirement of an actions list using the job or crawler name. +Schedule triggers are a way for customers to create jobs using cron +expressions. We’ll provide daily, weekly and hourly options which customer +can override using custom cron expression. The trigger method will take an +optional description but abstract the requirement of an actions list using +the job or crawler name. ``` myGlueJob.createDailyTrigger(this, 'MyDailyTrigger'); @@ -265,7 +327,10 @@ myGlueJob.createScheduledTrigger(this, 'MyScheduledTrigger', { #### **3. Notify Event Trigger** -This type of trigger is only supported with Glue workflow. There are two types of notify event triggers, batching and non-batching trigger. For batching trigger `BatchSize` customer has to specify but for non-batching `BatchSize` will be set to 1. For both trigger type `BatchWindow will be default to 900 seconds` +This type of trigger is only supported with Glue workflow. There are two types +of notify event triggers, batching and non-batching trigger. For batching trigger +`BatchSize` customer has to specify but for non-batching `BatchSize` will be set +to 1. For both trigger type `BatchWindow will be default to 900 seconds` ``` myGlueJob.createNotifyEventBatchingTrigger(this, 'MyNotifyTrigger', batchSize, @@ -278,7 +343,7 @@ myGlueCrawler.createNotifyEventBatchingTrigger(this, 'MyNotifyTrigger', batchSiz batchSize: batchSize ); -myGlueJob.createNotifyEventNonBatchingTrigger(this, 'MyNotifyTrigger', +myGlueJob.createNotifyEventNonBatchingTrigger(this, 'MyNotifyTrigger', workFlowName: workflow.name ); @@ -290,7 +355,8 @@ myGlueCrawler.createNotifyEventNonBatchingTrigger(this, 'MyNotifyTrigger', #### **4. Conditional Trigger** -Conditional trigger has predicate and action associated with it. Based on predicate, trigger action will be executed. +Conditional trigger has predicate and action associated with it. Based on +predicate, trigger action will be executed. ``` // Triggers on Job and Crawler status @@ -305,38 +371,62 @@ myGlueJob.addConditionalTrigger( ``` - - ### Connection Properties -A `Connection` allows Glue jobs, crawlers and development endpoints to access certain types of data stores. - +A `Connection` allows Glue jobs, crawlers and development endpoints to access +certain types of data stores. * **Secrets Management - **User needs to specify JDBC connection credentials in Secrets Manager and provide the Secrets Manager Key name as a property to the Job connection property. + **User needs to specify JDBC connection credentials in Secrets Manager and + provide the Secrets Manager Key name as a property to the Job connection + property. -* **Networking - CDK determines the best fit subnet for Glue Connection configuration - **The current glue-alpha-module requires the developer to specify the subnet of the Connection when it’s defined. This L2 RFC will make the best choice selection for subnet by default by using the data source provided during Job provisioning, traverse the source’s existing networking configuration, and determine the best subnet to provide to the Glue Job parameters to allow the Job to access the data source. The developer can override this subnet parameter, but no longer has to provide it directly. - - - +* **Networking - CDK determines the best fit subnet for Glue Connection +configuration + **The current glue-alpha-module requires the developer to specify the + subnet of the Connection when it’s defined. 
This L2 RFC will make the + best choice selection for subnet by default by using the data source + provided during Job provisioning, traverse the source’s existing networking + configuration, and determine the best subnet to provide to the Glue Job + parameters to allow the Job to access the data source. The developer can + override this subnet parameter, but no longer has to provide it directly. ## Public FAQ ### What are we launching today? -We’re launching new features to an AWS CDK Glue L2 Construct to provide best-practice defaults and convenience methods to create Glue Jobs, Connections, Triggers, Workflows, and the underlying permissions and configuration. +We’re launching new features to an AWS CDK Glue L2 Construct to provide +best-practice defaults and convenience methods to create Glue Jobs, Connections, +Triggers, Workflows, and the underlying permissions and configuration. ### Why should I use this Construct? -Developers should use this Construct to reduce the amount of boilerplate code and complexity each individual has to navigate, and make it easier to create best-practice Glue resources. +Developers should use this Construct to reduce the amount of boilerplate +code and complexity each individual has to navigate, and make it easier to +create best-practice Glue resources. ### What’s not in scope? -Glue Crawlers and other resources that are now managed by the AWS LakeFormation team are not in scope for this effort. Developers should use existing methods to create these resources, and the new Glue L2 construct assumes they already exist as inputs. While best practice is for application and infrastructure code to be as close as possible for teams using fully-implemented DevOps mechanisms, in practice these ETL scripts will likely be managed by a data science team who know Python or Scala and don’t necessarily own or manage their own infrastructure deployments. We want to meet developers where they are, and not assume that all of the code resides in the same repository, Developers can automate this themselves via the CDK, however, if they do own both. - -Uploading Job scripts to S3 buckets is also not in scope for this effort. - -Validating Glue version and feature use per AWS region at synth time is also not in scope. AWS’ intention is for all features to eventually be propagated to all Global regions, so the complexity involved in creating and updating region-specific configuration to match shifting feature sets does not out-weigh the likelihood that a developer will use this construct to deploy resources to a region without a particular new feature to a region that doesn’t yet support it without researching or manually attempting to use that feature before developing it via IaC. The developer will, of course, still get feedback from the underlying Glue APIs as CloudFormation deploys the resources similar to the current CDK L1 Glue experience. - - +Glue Crawlers and other resources that are now managed by the AWS LakeFormation +team are not in scope for this effort. Developers should use existing methods +to create these resources, and the new Glue L2 construct assumes they already +exist as inputs. While best practice is for application and infrastructure code +to be as close as possible for teams using fully-implemented DevOps mechanisms, +in practice these ETL scripts will likely be managed by a data science team who +know Python or Scala and don’t necessarily own or manage their own +infrastructure deployments. 
We want to meet developers where they are, and not +assume that all of the code resides in the same repository, Developers can +automate this themselves via the CDK, however, if they do own both. + +Uploading Job scripts to S3 buckets is also not in scope for this effort. + +Validating Glue version and feature use per AWS region at synth time is also +not in scope. AWS’ intention is for all features to eventually be propagated to +all Global regions, so the complexity involved in creating and updating region- +specific configuration to match shifting feature sets does not out-weigh the +likelihood that a developer will use this construct to deploy resources to a +region without a particular new feature to a region that doesn’t yet support +it without researching or manually attempting to use that feature before +developing it via IaC. The developer will, of course, still get feedback from +the underlying Glue APIs as CloudFormation deploys the resources similar to the +current CDK L1 Glue experience. From f0f57a6582ad666cbdf5bd4b7b5df11ab1df341e Mon Sep 17 00:00:00 2001 From: Natalie White Date: Wed, 19 Apr 2023 15:46:38 -0700 Subject: [PATCH 03/15] Fix the rest of the markdown linter findings --- text/0497-glue-l2-construct.md | 74 ++++++++++++++-------------------- 1 file changed, 30 insertions(+), 44 deletions(-) diff --git a/text/0497-glue-l2-construct.md b/text/0497-glue-l2-construct.md index ff94ca4b8..4982d008b 100644 --- a/text/0497-glue-l2-construct.md +++ b/text/0497-glue-l2-construct.md @@ -1,63 +1,59 @@ # RFC - Glue CDK L2 Construct -https://github.com/aws/aws-cdk-rfcs/issues/497 - ## L2 Construct for AWS Glue Connections, Jobs, and Workflows * **Original Author(s):** @natalie-white-aws, @mjanardhan @parag-shah-aws -* **Tracking Issue:** +* **Tracking Issue:** * **API Bar Raiser:** @TheRealAmazonKendra +[Link to RFC Issue](https://github.com/aws/aws-cdk-rfcs/issues/497) ## Working Backwards - README -[AWS Glue](https://aws.amazon.com/glue/) is a serverless data integration -service that makes it easier to discover, prepare, move, and integrate data -from multiple sources for analytics, machine learning (ML), and application -development. Glue was released on 2017/08. +[AWS Glue](https://aws.amazon.com/glue/) is a serverless data integration +service that makes it easier to discover, prepare, move, and integrate data +from multiple sources for analytics, machine learning (ML), and application +development. Glue was released on 2017/08. [Launch](https://aws.amazon.com/blogs/aws/launch-aws-glue-now-generally-available/) -Today, customers define Glue data sources, connections, jobs, and workflows -to define their data and ETL solutions via the AWS console, the AWS CLI, and -Infrastructure as Code tools like CloudFormation and the CDK. However, they -have challenges defining the required and optional parameters depending on -job type, networking constraints for data source connections, secrets for -JDBC connections, and least-privilege IAM Roles and Policies. We will build -convenience methods working backwards from common use cases and default to +Today, customers define Glue data sources, connections, jobs, and workflows +to define their data and ETL solutions via the AWS console, the AWS CLI, and +Infrastructure as Code tools like CloudFormation and the CDK. 
However, they +have challenges defining the required and optional parameters depending on +job type, networking constraints for data source connections, secrets for +JDBC connections, and least-privilege IAM Roles and Policies. We will build +convenience methods working backwards from common use cases and default to recommended best practices. -This RFC proposes updates to the L2 construct for Glue which will provide -convenience features and abstractions for the existing -[L1 (CloudFormation) Constructs](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_Glue.html) building on the -functionality already supported in the [@aws-cdk/aws-glue-alpha module](https://github.com/aws/aws-cdk/blob/v2.51.1/packages/%40aws-cdk/aws-glue/README.md). - +This RFC proposes updates to the L2 construct for Glue which will provide +convenience features and abstractions for the existing +[L1 (CloudFormation) Constructs](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_Glue.html) building on the +functionality already supported in the [@aws-cdk/aws-glue-alpha module](https://github.com/aws/aws-cdk/blob/v2.51.1/packages/%40aws-cdk/aws-glue/README.md). ## Create a Glue Job The glue-alpha-module already supports three of the four common types of Glue -Jobs: Spark (ETL and Streaming), Python Shell, Ray. This RFC will add the -more recent Flex Job. The construct also implements AWS practice -recommendations when creating a Glue Job such use of Secrets Management for -Connection JDBC strings, Glue Job Autoscaling, least privileges in terms of +Jobs: Spark (ETL and Streaming), Python Shell, Ray. This RFC will add the +more recent Flex Job. The construct also implements AWS practice +recommendations when creating a Glue Job such use of Secrets Management for +Connection JDBC strings, Glue Job Autoscaling, least privileges in terms of IAM permissions and also sane defaults for Glue job specification (more details are mentioned in the below table). -This RFC will introduce breaking changes to the existing glue-alpha-module to -streamline the developer experience and introduce new constants and validations. -The L2 construct will determine the job type by the job type and language +This RFC will introduce breaking changes to the existing glue-alpha-module to +streamline the developer experience and introduce new constants and validations. +The L2 construct will determine the job type by the job type and language provided by the developer, rather than having separate methods in every -permutation that Glue jobs allow. - +permutation that Glue jobs allow. ### Spark Jobs -1. **ETL Jobs - ** +1. **ETL Jobs** ETL jobs supports python and Scala language. ETL job type supports G1, G2, G4 and G8 worker type default as G2 which customer can override. Also preferred version for ETL is 4.0 but customer can override the version to 3.0. We by default enable several features for ETL jobs these are -` —enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log.` We +`—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log.` We recommend to use these feature for ETL jobs. You can find more details about version, worker type and other feature on public documentation. @@ -221,7 +217,6 @@ MaxDPU will be set `0.0625`. Also `PythonVersion.3_9` supports preloaded analytics libraries using flag `library-set=analytics` , this feature will be enable by default. 
- ``` new glue.Job(this, 'PythonShellJob', { jobType: glue.JobType.PYTHON_SHELL, @@ -244,14 +239,11 @@ new glue.Job(this, 'PythonShellJob', { }); ``` - - ### Ray Jobs Glue ray only supports Z.2X worker type and 4.0 Glue version. Runtime will default to `Ray2.3` and min workers will default to 3. - ``` declare const bucket: s3.Bucket; new glue.Job(this, 'GlueRayJob', { @@ -263,7 +255,6 @@ new glue.Job(this, 'GlueRayJob', { Optionally customer can override min workers and other Glue job fields - ``` declare const bucket: s3.Bucket; new glue.Job(this, 'GlueRayJob', { @@ -282,13 +273,12 @@ We will add convenience functions for adding triggers to jobs. Standalone triggers are an anti-pattern, so we will only create triggers from within a workflow. - 1. **On Demand Triggers** On demand triggers can start glue jobs or crawlers. We’ll add convenience functions to create on-demand crawler or job triggers. The trigger method will take an optional description but abstract the requirement of an actions -list using the job or crawler name. +list using the job or crawler name. ``` myGlueJob.createOnDemandTrigger(this, 'MyJobTrigger', { @@ -300,15 +290,13 @@ myGlueJob.createOnDemandTrigger(this, 'MyJobTrigger', { myGlueCrawler.createOnDemandTrigger(this, 'MyCrawlerTrigger'); ``` - - 1. **Scheduled Triggers** Schedule triggers are a way for customers to create jobs using cron expressions. We’ll provide daily, weekly and hourly options which customer can override using custom cron expression. The trigger method will take an optional description but abstract the requirement of an actions list using -the job or crawler name. +the job or crawler name. \ ``` myGlueJob.createDailyTrigger(this, 'MyDailyTrigger'); @@ -323,8 +311,6 @@ myGlueJob.createScheduledTrigger(this, 'MyScheduledTrigger', { }); ``` - - #### **3. Notify Event Trigger** This type of trigger is only supported with Glue workflow. There are two types @@ -380,7 +366,7 @@ certain types of data stores. **User needs to specify JDBC connection credentials in Secrets Manager and provide the Secrets Manager Key name as a property to the Job connection property. - + * **Networking - CDK determines the best fit subnet for Glue Connection configuration **The current glue-alpha-module requires the developer to specify the From e12ab035d3238e81913b26399bc2583264a29c90 Mon Sep 17 00:00:00 2001 From: Natalie White Date: Tue, 23 May 2023 08:30:51 -0700 Subject: [PATCH 04/15] Updates after initial review - script files, job patterns, workflow and trigger definitions --- text/0497-glue-l2-construct.md | 300 ++++++++++++++++----------------- 1 file changed, 147 insertions(+), 153 deletions(-) diff --git a/text/0497-glue-l2-construct.md b/text/0497-glue-l2-construct.md index 4982d008b..d40cd8288 100644 --- a/text/0497-glue-l2-construct.md +++ b/text/0497-glue-l2-construct.md @@ -43,32 +43,30 @@ This RFC will introduce breaking changes to the existing glue-alpha-module to streamline the developer experience and introduce new constants and validations. The L2 construct will determine the job type by the job type and language provided by the developer, rather than having separate methods in every -permutation that Glue jobs allow. +permutation that Glue jobs allow. As an opinionated construct, it will enforce +best practices and not allow developers to create resources that use deprecated +libraries and tool sets (e.g. deprecated versions of Python). ### Spark Jobs 1. **ETL Jobs** ETL jobs supports python and Scala language. 
The ETL job type supports the G1, G2, G4, and G8 worker types, and defaults
to G2, which customers can override. It will default to the best-practice
version of ETL 4.0, but allows developers to override it to 3.0. We will also
enable the following ETL features by default, in line with best practice:
`—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log.`
You can find more details about versions, worker types, and other features in
Glue's public documentation.

```
glue.ScalaSparkEtlJob(this, 'ScalaSparkEtlJob', {
  script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'),
  className: 'com.example.HelloWorld',
});

glue.pySparkEtlJob(this, 'pySparkEtlJob', {
  script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
});
```

Optionally, developers can override the glueVersion and add extra jars and a
description:

```
glue.ScalaSparkEtlJob(this, 'ScalaSparkEtlJob', {
  glueVersion: glue.GlueVersion.V3_0,
  script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'),
  className: 'com.example.HelloWorld',
  extraJarsS3Url: [glue.Code.fromBucket('bucket-name', 'path-to-extra-scala-jar')],
  description: 'an example Scala Spark ETL job',
  numberOfWorkers: 20
});

glue.pySparkEtlJob(this, 'pySparkEtlJob', {
  glueVersion: glue.GlueVersion.V3_0,
  pythonVersion: glue.PythonVersion.3_9,
  script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
  extraJarsS3Url: [glue.Code.fromBucket('bucket-name', 'path-to-extra-scala-jar')],
  description: 'an example pySpark ETL job',
  numberOfWorkers: 20
});
```

2. **Streaming Jobs**

A Streaming job is similar to an ETL job, except that it performs ETL on data
streams using the Apache Spark Structured Streaming framework. Some Spark
job features are not available to streaming ETL jobs. These jobs will default
to use Python 3.9.

Similar to ETL, streaming jobs support the Scala and Python languages. They
also support the G1 and G2 worker types and versions 2.0, 3.0, and 4.0. We’ll
default to the G2 worker and version 4.0 for streaming jobs, which developers
can override.
-Some of the feature we’ll enable are -`—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log` +to G2 worker and 4.0 version for streaming jobs which developers can override. +We will enable `—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log`. ``` -new glue.Job(this, 'PythonSparkStreamingJob', { - jobType: glue.JobType.STREAMING, - jobLanguage: glue.JobLanguage.PYSPARK, - scriptS3Url: 's3://bucket-name/path-to-python-script', +new glue.PythonSparkStreamingJob(this, 'PythonSparkStreamingJob', { + script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), }); -new glue.Job(this, 'ScalaSparkStreamingJob', { - jobType: glue.JobType.STREAMING, - jobLanguage: glue.JobLanguage.SCALA_SPARK, - scriptS3Url: 's3://bucket-name/path-to-scala-jar', + +new glue.ScalaSparkStreamingJob(this, 'ScalaSparkStreamingJob', { + script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'), className: 'com.example.HelloWorld', }); @@ -132,49 +123,42 @@ Optionally, developers can override the glueVersion and add extraJars and a description: ``` -new glue.Job(this, 'PythonSparkStreamingJob', { - jobType: glue.JobType.STREAMING, - jobLanguage: glue.JobLanguage.PYSPARK, +new glue.PythonSparkStreamingJob(this, 'PythonSparkStreamingJob', { glueVersion: glue.GlueVersion.V3_0, pythonVersion: glue.PythonVersion.3_6, - scriptS3Url: 's3://bucket-name/path-to-python-script', + script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), description: 'an example Python Streaming job', numberOfWorkers: 20, }); -new glue.Job(this, 'ScalaSparkStreamingJob', { - jobType: glue.JobType.STREAMING, - jobLanguage: glue.JobLanguage.SCALA_SPARK, + +new glue.ScalaSparkStreamingJob(this, 'ScalaSparkStreamingJob', { glueVersion: glue.GlueVersion.V3_0, pythonVersion: glue.PythonVersion.3_6, - scriptS3Url: 's3://bucket-name/path-to-scala-jar', + script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'), className: 'com.example.HelloWorld', description: 'an example Python Streaming job', numberOfWorkers: 20, }); ``` -1. **Flex Jobs** +3. **Flex Jobs** The flexible execution class is appropriate for non-urgent jobs such as pre-production jobs, testing, and one-time data loads. Flexible job runs are supported for jobs using AWS Glue version 3.0 or later and `G.1X` or `G.2X` worker types but will default to the latest version of Glue -(currently Glue 3.0.) Also similar to ETL we’ll enable these feature +(currently Glue 3.0.) 
Similar to ETL, we’ll enable these features:
`—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log`

```
glue.ScalaSparkFlexEtlJob(this, 'ScalaSparkFlexEtlJob', {
  script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'),
  className: 'com.example.HelloWorld',
});

glue.pySparkFlexEtlJob(this, 'pySparkFlexEtlJob', {
  script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
});
```

Optionally, developers can override the glue version, python version,
extra jars, and description:

```
glue.ScalaSparkFlexEtlJob(this, 'ScalaSparkFlexEtlJob', {
  glueVersion: glue.GlueVersion.V3_0,
  script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'),
  className: 'com.example.HelloWorld',
  extraJarsS3Url: [glue.Code.fromBucket('bucket-name', 'path-to-extra-jars')],
  description: 'an example Scala Spark Flex ETL job',
  numberOfWorkers: 20,
});

new glue.pySparkFlexEtlJob(this, 'pySparkFlexEtlJob', {
  glueVersion: glue.GlueVersion.V3_0,
  pythonVersion: glue.PythonVersion.3_9,
  script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
  description: 'an example pySpark Flex job',
  numberOfWorkers: 20,
});
```

### Python Shell Jobs

A Python shell job runs Python scripts as a shell and supports a Python
version that depends on the AWS Glue version you are using. This can be used
to schedule and run tasks that don't require an Apache Spark environment.

We’ll default to `PythonVersion.3_9`. Python shell jobs don't support
different worker types, but they do have a MaxDPU setting. Developers can
choose MaxDPU = `0.0625` or MaxDPU = `1`. By default, MaxDPU will be set to
`0.0625`. Python 3.9 supports preloaded analytics libraries using the
`library-set=analytics` flag, and this feature will be enabled by default.
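For instance, selecting the larger DPU cap might look something like this
(a sketch only; `maxDpu` is an illustrative name for the MaxDPU setting
described above, not finalized API):

```
new glue.PythonShellJob(this, 'PythonShellJobFullDpu', {
  script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
  // Raise the cap from the 0.0625 default to a full DPU
  maxDpu: 1,
});
```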
``` -new glue.Job(this, 'PythonShellJob', { - jobType: glue.JobType.PYTHON_SHELL, - jobLanguage: glue.JobLanguage.PYSPARK, - scriptS3Url: 's3://bucket-name/path-to-python-script', +new glue.PythonShellJob(this, 'PythonShellJob', { + script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), }); ``` Optional overrides: ``` -new glue.Job(this, 'PythonShellJob', { - jobType: glue.JobType.PYTHON_SHELL, - jobLanguage: glue.JobLanguage.PYSPARK, +new glue.PythonShellJob(this, 'PythonShellJob', { glueVersion: glue.GlueVersion.V1_0, pythonVersion: glue.PythonVersion.3_6, - scriptS3Url: 's3://bucket-name/path-to-python-script', + script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), description: 'an example Python Shell job', numberOfWorkers: 20, }); @@ -241,120 +216,141 @@ new glue.Job(this, 'PythonShellJob', { ### Ray Jobs -Glue ray only supports Z.2X worker type and 4.0 Glue version. Runtime +Glue ray only supports worker type Z.2X and Glue version 4.0. Runtime will default to `Ray2.3` and min workers will default to 3. ``` -declare const bucket: s3.Bucket; -new glue.Job(this, 'GlueRayJob', { - jobType: glue.JobType.GLUE_RAY, - jobLanguage: glue.JobLanguage.PYTHON, - scriptS3Url: 's3://bucket-name/path-to-python-script', +new glue.GlueRayJob(this, 'GlueRayJob', { + script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), }); ``` -Optionally customer can override min workers and other Glue job fields +Developers can override min workers and other Glue job fields ``` -declare const bucket: s3.Bucket; -new glue.Job(this, 'GlueRayJob', { - jobType: glue.JobType.GLUE_RAY, - jobLanguage: glue.JobLanguage.PYTHON, - runtime: glue.Runtime.RAY_2_2 - scriptS3Url: 's3://bucket-name/path-to-python-script', +new glue.GlueRayJob(this, 'GlueRayJob', { + runtime: glue.Runtime.RAY_2_2, + script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), minWorkers: 20, numberOfWorkers: 50 }); ``` -### Job Triggers +### Uploading scripts from the same repo to S3 + +Similar to other L2 constructs, the Glue L2 will automate uploading / updating +scripts to S3 via an optional fromAsset parameter pointing to a script +in the local file structure. Developers will provide an existing S3 bucket and +the path to which they'd like the script to be uploaded. + +``` +glue.ScalaSparkEtlJob(this, 'ScalaSparkEtlJob', { + script: glue.Code.fromAsset('bucket-name', 'local/path/to/scala-jar'), + className: 'com.example.HelloWorld', +}); +``` + +### Workflow Triggers + +In AWS Glue, developers can use workflows to create and visualize complex +extract, transform, and load (ETL) activities involving multiple crawlers, +jobs, and triggers. Standalone triggers are an anti-pattern, so we will +only create triggers from within a workflow. + +Within the workflow object, there will be functions to create different +types of triggers with actions and predicates. Those triggers can then be +added to jobs. -We will add convenience functions for adding triggers to jobs. Standalone -triggers are an anti-pattern, so we will only create triggers from within a -workflow. +For all trigger types, the StartOnCreation property will be set to true by +default, but developers will have the option to override it. 1. **On Demand Triggers** On demand triggers can start glue jobs or crawlers. We’ll add convenience functions to create on-demand crawler or job triggers. 
The trigger method will take an optional description but abstract the requirement of an actions
-list using the job or crawler name.
+list using the job or crawler objects using conditional types.

```
-myGlueJob.createOnDemandTrigger(this, 'MyJobTrigger', {
-  description: 'On demand run for ' + myGlueJob.name,
+myWorkflow = new glue.Workflow(this, "GlueWorkflow", {
+  name: "MyWorkflow",
+  description: "New Workflow",
+  properties: {'key': 'value'},
 });
-```

-```
-myGlueCrawler.createOnDemandTrigger(this, 'MyCrawlerTrigger');
+myWorkflow.createOnDemandTrigger(this, 'TriggerJobOnDemand', {
+  description: 'On demand run for ' + myGlueJob.name,
+  actions: [glueJob1, glueJob2, glueJob3, glueCrawler]
+});
+
 ```

1. **Scheduled Triggers**

-Schedule triggers are a way for customers to create jobs using cron
-expressions. We’ll provide daily, weekly and hourly options which customer
-can override using custom cron expression. The trigger method will take an
-optional description but abstract the requirement of an actions list using
-the job or crawler name. \
+Schedule triggers are a way for developers to create jobs using cron
+expressions. We’ll provide daily, weekly, and monthly convenience functions,
+as well as a custom function that will allow developers to create their own
+custom timing using the [existing event Schedule object]
+(https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_events.Schedule.html)
+without having to build their own cron expressions. (The L2 will extract
+the expression that Glue requires from the Schedule object). The trigger method will
+take an optional description and list of Actions which can refer to Jobs or
+crawlers via conditional types.

```
-myGlueJob.createDailyTrigger(this, 'MyDailyTrigger');
-
-myGlueJob.createHourlyTrigger(this, 'MyHourlyTrigger');
+// Create Daily Schedule at 00 UTC
+myWorkflow.createDailyScheduleTrigger(this, 'TriggerCrawlerOnDailySchedule', {
+  description: 'Scheduled run for ' + myGlueJob.name,
+  actions: [ myGlueCrawler]
+});

-myGlueJob.createWeeklyTrigger(this, 'MyWeeklyTrigger');
+// Create Weekly schedule at 00 UTC on Sunday
+myWorkflow.createWeeklyScheduleTrigger(this, 'TriggerJobOnWeeklySchedule', {
+  description: 'Scheduled run for ' + myGlueJob.name,
+  actions: [glueJob1, glueJob2, glueJob3, glueCrawler]
+});

-myGlueJob.createScheduledTrigger(this, 'MyScheduledTrigger', {
-  description: 'Scheduled run for ' + crawler.name,
-  schedule: '`cron(15 12 * * ? *)'`` //``every day at 12:15 UTC`
+// Create Custom schedule, e.g. Monthly on the 7th day at 15:30 UTC
+myWorkflow.createCustomScheduleJobTrigger((this, 'TriggerCrawlerOnCustomSchedule', {
+  description: 'Scheduled run for ' + myGlueJob.name,
+  actions: [glueJob1, glueJob2, glueJob3, glueCrawler]
+  schedule: events.Schedule.cron(day: '7', hour: '15', minute: '30')
 });
 ```

-#### **3. Notify Event Trigger**
+#### **3. Notify Event Triggers**

-This type of trigger is only supported with Glue workflow. There are two types
-of notify event triggers, batching and non-batching trigger. For batching trigger
-`BatchSize` customer has to specify but for non-batching `BatchSize` will be set
-to 1. For both trigger type `BatchWindow will be default to 900 seconds`
+This type of trigger is only supported with Glue workflows. There are two types
+of notify event triggers: batching and non-batching. For batching triggers,
+developers must specify `BatchSize` but for non-batching `BatchSize` will be set
+to 1. For both triggers, `BatchWindow` will default to 900 seconds.
``` -myGlueJob.createNotifyEventBatchingTrigger(this, 'MyNotifyTrigger', batchSize, - workFlowName: workflow.name, - batchSize: batchSize -); - -myGlueCrawler.createNotifyEventBatchingTrigger(this, 'MyNotifyTrigger', batchSize, - workFlowName: workflow.name, - batchSize: batchSize -); - -myGlueJob.createNotifyEventNonBatchingTrigger(this, 'MyNotifyTrigger', - workFlowName: workflow.name -); - -myGlueCrawler.createNotifyEventNonBatchingTrigger(this, 'MyNotifyTrigger', - workFlowName: workflow.name -); +myWorkflow.createNotifyEventTrigger(this, 'MyNotifyTriggerBatching', { + batchSize: batchSize, + jobActions: [glueJob1, glueJob2, glueJob3], + actions: [glueJob1, glueJob2, glueJob3, glueCrawler] +}); +myWorkflow.createNotifyEventTrigger(this, 'MyNotifyTriggerNonBatching', { + actions: [glueJob1, glueJob2, glueJob3] +}); ``` -#### **4. Conditional Trigger** +#### **4. Conditional Triggers** -Conditional trigger has predicate and action associated with it. Based on -predicate, trigger action will be executed. +Conditional triggers have a predicate and actions associated with them. +When the predicateCondition is true, the trigger actions will be executed. ``` // Triggers on Job and Crawler status -myGlueJob.addConditionalTrigger( - jobs: [ - {jobArn: "job1-arn", status: glue.JobStatus.SUCCEEDED}, - {jobArn: "job2-arn", status: glue.JobStatus.FAILED}, - ], crawlers: [ - {crawlerArn: "crawler1-arn", status: glue.CrawlerStatus.SUCCEEDED}, - {crawlerArn: "crawler2-arn", status: glue.CrawlerStatus.TIMEOUT}, -]); - +myWorkflow.createConditionalTrigger(this, 'conditionalTrigger', { + description: 'Conditional trigger for ' + myGlueJob.name, + actions: [glueJob1, glueJob2, glueJob3, glueCrawler] + predicateCondition: glue.TriggerPredicateCondition.AND, + jobPredicates: [{'job': glueJobPred, 'state': glue.JobRunState.FAILED}, + {'job': glueJobPred1, 'state' : glue.JobRunState.SUCCEEDED}] +}); ``` ### Connection Properties @@ -404,8 +400,6 @@ infrastructure deployments. We want to meet developers where they are, and not assume that all of the code resides in the same repository, Developers can automate this themselves via the CDK, however, if they do own both. -Uploading Job scripts to S3 buckets is also not in scope for this effort. - Validating Glue version and feature use per AWS region at synth time is also not in scope. AWS’ intention is for all features to eventually be propagated to all Global regions, so the complexity involved in creating and updating region- From 6e59e57f2f1b766c85182b65c34656ef9d3130c0 Mon Sep 17 00:00:00 2001 From: Natalie White Date: Tue, 23 May 2023 08:35:49 -0700 Subject: [PATCH 05/15] Forgot to run the linter again --- text/0497-glue-l2-construct.md | 52 ++++++++++++++++------------------ 1 file changed, 25 insertions(+), 27 deletions(-) diff --git a/text/0497-glue-l2-construct.md b/text/0497-glue-l2-construct.md index d40cd8288..9e04eb684 100644 --- a/text/0497-glue-l2-construct.md +++ b/text/0497-glue-l2-construct.md @@ -43,19 +43,19 @@ This RFC will introduce breaking changes to the existing glue-alpha-module to streamline the developer experience and introduce new constants and validations. The L2 construct will determine the job type by the job type and language provided by the developer, rather than having separate methods in every -permutation that Glue jobs allow. As an opinionated construct, it will enforce -best practices and not allow developers to create resources that use deprecated -libraries and tool sets (e.g. deprecated versions of Python). 
+permutation that Glue jobs allow. As an opinionated construct, it will enforce
+best practices and not allow developers to create resources that use deprecated
+libraries and tool sets (e.g. deprecated versions of Python).

### Spark Jobs

1. **ETL Jobs**

ETL jobs support python and Scala language. ETL job type supports G1, G2, G4
-and G8 worker type default as G2, which customer can override. It will default to
-the best practice version of ETL 4.0, but allow developers to override to 3.0.
+and G8 worker type default as G2, which customer can override. It will default to
+the best practice version of ETL 4.0, but allow developers to override to 3.0.
We will also default to best practice enablement the following ETL features:
-`—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log.`
+`—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log.`
You can find more details about version, worker type and other features in Glue's
public documentation.

@@ -190,9 +190,9 @@ A Python shell job runs Python scripts as a shell and supports a Python
version that depends on the AWS Glue version you are using. This can be
used to schedule and run tasks that don't require an Apache Spark environment.

-We’ll default to `PythonVersion.3_9`. Python shell jobs don't support different
-worker types, but they do have a MaxDPU feature. Developers can choose
-MaxDPU = `0.0625` or MaxDPU = `1`. By default, MaxDPU will be set to `0.0625`.
+We’ll default to `PythonVersion.3_9`. Python shell jobs don't support different
+worker types, but they do have a MaxDPU feature. Developers can choose
+MaxDPU = `0.0625` or MaxDPU = `1`. By default, MaxDPU will be set to `0.0625`.
Python 3.9 supports preloaded analytics libraries using the `library-set=analytics`
flag, and this feature will be enabled by default.

@@ -240,8 +240,8 @@ Similar to other L2 constructs, the Glue L2 will automate uploading / updating
scripts to S3 via an optional fromAsset parameter pointing to a script
-in the local file structure. Developers will provide an existing S3 bucket and
-the path to which they'd like the script to be uploaded.
+in the local file structure. Developers will provide an existing S3 bucket and
+the path to which they'd like the script to be uploaded.

@@ -252,16 +252,16 @@

### Workflow Triggers

-In AWS Glue, developers can use workflows to create and visualize complex
-extract, transform, and load (ETL) activities involving multiple crawlers,
-jobs, and triggers. Standalone triggers are an anti-pattern, so we will
+In AWS Glue, developers can use workflows to create and visualize complex
+extract, transform, and load (ETL) activities involving multiple crawlers,
+jobs, and triggers. Standalone triggers are an anti-pattern, so we will
only create triggers from within a workflow.

-Within the workflow object, there will be functions to create different
+Within the workflow object, there will be functions to create different
types of triggers with actions and predicates. Those triggers can then be
added to jobs.

-For all trigger types, the StartOnCreation property will be set to true by
+For all trigger types, the StartOnCreation property will be set to true by
default, but developers will have the option to override it.

1.
**On Demand Triggers** @@ -282,19 +282,17 @@ myWorkflow.createOnDemandTrigger(this, 'TriggerJobOnDemand', { description: 'On demand run for ' + myGlueJob.name, actions: [glueJob1, glueJob2, glueJob3, glueCrawler] }); - ``` 1. **Scheduled Triggers** -Schedule triggers are a way for developers to create jobs using cron +Schedule triggers are a way for developers to create jobs using cron expressions. We’ll provide daily, weekly, and monthly convenience functions, -as well as a custom function that will allow developers to create their own -custom timing using the [existing event Schedule object] -(https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_events.Schedule.html) -without having to build their own cron expressions. (The L2 will extract -the expression that Glue requires from the Schedule object). The trigger method will -take an optional description and list of Actions which can refer to Jobs or +as well as a custom function that will allow developers to create their own +custom timing using the [existing event Schedule object](https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_events.Schedule.html) +without having to build their own cron expressions. (The L2 will extract +the expression that Glue requires from the Schedule object). The trigger method will +take an optional description and list of Actions which can refer to Jobs or crawlers via conditional types. ``` @@ -339,7 +337,7 @@ myWorkflow.createNotifyEventTrigger(this, 'MyNotifyTriggerNonBatching', { #### **4. Conditional Triggers** -Conditional triggers have a predicate and actions associated with them. +Conditional triggers have a predicate and actions associated with them. When the predicateCondition is true, the trigger actions will be executed. ``` @@ -348,7 +346,7 @@ myWorkflow.createConditionalTrigger(this, 'conditionalTrigger', { description: 'Conditional trigger for ' + myGlueJob.name, actions: [glueJob1, glueJob2, glueJob3, glueCrawler] predicateCondition: glue.TriggerPredicateCondition.AND, - jobPredicates: [{'job': glueJobPred, 'state': glue.JobRunState.FAILED}, + jobPredicates: [{'job': glueJobPred, 'state': glue.JobRunState.FAILED}, {'job': glueJobPred1, 'state' : glue.JobRunState.SUCCEEDED}] }); ``` @@ -358,7 +356,7 @@ myWorkflow.createConditionalTrigger(this, 'conditionalTrigger', { A `Connection` allows Glue jobs, crawlers and development endpoints to access certain types of data stores. -* **Secrets Management +***Secrets Management **User needs to specify JDBC connection credentials in Secrets Manager and provide the Secrets Manager Key name as a property to the Job connection property. 
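+
+For illustration, a minimal sketch of a JDBC connection that references a
+Secrets Manager secret by name (the exact property keys below are assumptions
+for this sketch, not a finalized part of the L2 interface):
+
+```
+// Sketch only: credentials live in Secrets Manager, never in code.
+const jdbcConnection = new glue.Connection(this, 'JdbcConnection', {
+  type: glue.ConnectionType.JDBC,
+  properties: {
+    JDBC_CONNECTION_URL: 'jdbc:mysql://hostname:3306/databasename',
+    SECRET_ID: 'my-jdbc-credentials', // Secrets Manager key name
+  },
+});
+```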
From 19cb1029f76c324fe0f1ed8175ad452b5399023b Mon Sep 17 00:00:00 2001 From: Janardhan Molumuri Date: Tue, 6 Jun 2023 08:33:31 -0700 Subject: [PATCH 06/15] Do not use Deprecated Python versions --- text/0497-glue-l2-construct.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/text/0497-glue-l2-construct.md b/text/0497-glue-l2-construct.md index 9e04eb684..a061f468a 100644 --- a/text/0497-glue-l2-construct.md +++ b/text/0497-glue-l2-construct.md @@ -125,7 +125,7 @@ description: ``` new glue.PythonSparkStreamingJob(this, 'PythonSparkStreamingJob', { glueVersion: glue.GlueVersion.V3_0, - pythonVersion: glue.PythonVersion.3_6, + pythonVersion: glue.PythonVersion.3_9, script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), description: 'an example Python Streaming job', numberOfWorkers: 20, @@ -134,7 +134,7 @@ new glue.PythonSparkStreamingJob(this, 'PythonSparkStreamingJob', { new glue.ScalaSparkStreamingJob(this, 'ScalaSparkStreamingJob', { glueVersion: glue.GlueVersion.V3_0, - pythonVersion: glue.PythonVersion.3_6, + pythonVersion: glue.PythonVersion.3_9, script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'), className: 'com.example.HelloWorld', description: 'an example Python Streaming job', @@ -177,7 +177,7 @@ glue.ScalaSparkFlexEtlJob(this, 'ScalaSparkFlexEtlJob', { new glue.pySparkFlexEtlJob(this, 'pySparkFlexEtlJob', { glueVersion: glue.GlueVersion.V3_0, - pythonVersion: glue.PythonVersion.3_6, + pythonVersion: glue.PythonVersion.3_9, script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), description: 'an example Flex job', numberOfWorkers: 20, @@ -207,7 +207,7 @@ Optional overrides: ``` new glue.PythonShellJob(this, 'PythonShellJob', { glueVersion: glue.GlueVersion.V1_0, - pythonVersion: glue.PythonVersion.3_6, + pythonVersion: glue.PythonVersion.3_9, script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), description: 'an example Python Shell job', numberOfWorkers: 20, From e7c7c3c775f7366b1f2ac9e69c938687890b4a9f Mon Sep 17 00:00:00 2001 From: Natalie White Date: Fri, 9 Jun 2023 11:46:39 -0700 Subject: [PATCH 07/15] Added parameter table, incorporating other review changes for Glue L2 --- text/0497-glue-l2-construct.md | 62 +++++++++++++++++++++++----------- 1 file changed, 43 insertions(+), 19 deletions(-) diff --git a/text/0497-glue-l2-construct.md b/text/0497-glue-l2-construct.md index a061f468a..91fd5ec94 100644 --- a/text/0497-glue-l2-construct.md +++ b/text/0497-glue-l2-construct.md @@ -41,9 +41,7 @@ are mentioned in the below table). This RFC will introduce breaking changes to the existing glue-alpha-module to streamline the developer experience and introduce new constants and validations. -The L2 construct will determine the job type by the job type and language -provided by the developer, rather than having separate methods in every -permutation that Glue jobs allow. As an opinionated construct, it will enforce +As an opinionated construct, the Glue L2 construct will enforce best practices and not allow developers to create resources that use deprecated libraries and tool sets (e.g. deprecated versions of Python). @@ -56,8 +54,8 @@ and G8 worker type default as G2, which customer can override. It wil default to the best practice version of ETL 4.0, but allow developers to override to 3.0. 
We will also default to best practice enablement the following ETL features: `—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log.` -You can find more details about version, worker type and other features in Glue's -public documentation. +You can find more details about version, worker type and other features in +[Glue's public documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html). ``` glue.ScalaSparkEtlJob(this, 'ScalaSparkEtlJob', { @@ -190,11 +188,11 @@ A Python shell job runs Python scripts as a shell and supports a Python version that depends on the AWS Glue version you are using. This can be used to schedule and run tasks that don't require an Apache Spark environment. -We’ll default to `PythonVersion.3_9`. Python shell jobs don't support different -worker typesbut do have they have a MaxDPU feature. Developers can choose -MaxDPU = `0.0625` or MaxDPU = `1`. By default, axDPU will be set `0.0625`. -Python 3.9 supports preloaded analytics libraries using the `library-set=analytics` -flag, and this feature will be enabled by default. +We’ll default to `PythonVersion.3_9`. Python shell jobs have a MaxCapacity feature. +Developers can choose MaxCapacity = `0.0625` or MaxCapacity = `1`. By default, +MaxCapacity will be set `0.0625`. Python 3.9 supports preloaded analytics +libraries using the `library-set=analytics` flag, and this feature will +be enabled by default. ``` new glue.PythonShellJob(this, 'PythonShellJob', { @@ -231,11 +229,38 @@ Developers can override min workers and other Glue job fields new glue.GlueRayJob(this, 'GlueRayJob', { runtime: glue.Runtime.RAY_2_2, script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), - minWorkers: 20, numberOfWorkers: 50 }); ``` +### Required, Optional, and Overridable Parameters + +Each of these parameters are documented in [Glue's public documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api.html); +this table lists the parameters that will or will not be required, optional, +and/or overridable for the Glue L2 construct. + +|Parameter|Required|Default Value|Overridable| +|Command ScriptLocation|Yes: S3 location|CommandName|yes S3 location| +|NumberOfWorkers|No|10|Yes for ETL, STREAMING, RAY jobs| +|Name|Yes|Default: Auto Generated name|Yes| +|Role|Yes|None|Yes| +|Description|No|None|Yes| +|MaxConcurrentRuns|No|1|Yes| +|DefaultArguments|No|None|Yes| +|Connections|No|None|Yes| +|MaxRetries|No|0|Yes| +|Timeout|No|2 Days for non-streaming jobs|Yes| +|SecurityConfiguration|No|None|Yes| +|Tags|No|None|Yes| +|GlueVersion|No|Default: 3.0 for ETL, RAY: 4.0|Yes| +|Command name|No|ETL, PythonShell, Streaming and GlueRay|No| +|Command Runtime|No|GlueRay: Ray2.4|No| +|NonOverridableArguments|No|None|No| +|ExecutionClass|No|STANDARD / FLEX|No| +|Command PythonVersion|No|3 for ETL/Streaming, 3.9 for PythonShell| +|LogUri|No|None|This option is not widely used| +|MaxCapacity (Python Shell Jobs only)|No|0.0625|Yes| + ### Uploading scripts from the same repo to S3 Similar to other L2 constructs, the Glue L2 will automate uploading / updating @@ -309,7 +334,7 @@ myWorkflow.createWeeklyScheduleTrigger(this, 'TriggerJobOnWeeklySchedule', { }); // Create Custom schedule, e.g. 
Monthly on the 7th day at 15:30 UTC -myWorkflow.createCustomScheduleJobTrigger((this, 'TriggerCrawlerOnCustomSchedule', { +myWorkflow.createCustomScheduleJobTrigger(this, 'TriggerCrawlerOnCustomSchedule', { description: 'Scheduled run for ' + myGlueJob.name, actions: [glueJob1, glueJob2, glueJob3, glueCrawler] schedule: events.Schedule.cron(day: '7', hour: '15', minute: '30') @@ -325,7 +350,7 @@ to 1. For both triggers, `BatchWindow` will be default to 900 seconds. ``` myWorkflow.createNotifyEventTrigger(this, 'MyNotifyTriggerBatching', { - batchSize: batchSize, + batchSize: int, jobActions: [glueJob1, glueJob2, glueJob3], actions: [glueJob1, glueJob2, glueJob3, glueCrawler] }); @@ -364,12 +389,11 @@ certain types of data stores. * **Networking - CDK determines the best fit subnet for Glue Connection configuration **The current glue-alpha-module requires the developer to specify the - subnet of the Connection when it’s defined. This L2 RFC will make the - best choice selection for subnet by default by using the data source - provided during Job provisioning, traverse the source’s existing networking - configuration, and determine the best subnet to provide to the Glue Job - parameters to allow the Job to access the data source. The developer can - override this subnet parameter, but no longer has to provide it directly. + subnet of the Connection when it’s defined. The developer can still specify the + specific subnet they want to use, but no longer have to. This Glue L2 RFC will + allow developers to provide only a VPC and either a public or private subnet + selection. The L2 will then leverage the existing [EC2 Subnet Selection](https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_ec2/SubnetSelection.html) + library to make the best choice selection for the subnet. ## Public FAQ From bb0bce45ebcad4461f12e6e1de56c1672b1d683a Mon Sep 17 00:00:00 2001 From: Natalie White Date: Fri, 9 Jun 2023 12:39:39 -0700 Subject: [PATCH 08/15] Fix table renderig for Glue L2 --- text/0497-glue-l2-construct.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/text/0497-glue-l2-construct.md b/text/0497-glue-l2-construct.md index 91fd5ec94..c5591d75e 100644 --- a/text/0497-glue-l2-construct.md +++ b/text/0497-glue-l2-construct.md @@ -239,8 +239,10 @@ Each of these parameters are documented in [Glue's public documentation](https:/ this table lists the parameters that will or will not be required, optional, and/or overridable for the Glue L2 construct. +/md |Parameter|Required|Default Value|Overridable| -|Command ScriptLocation|Yes: S3 location|CommandName|yes S3 location| +|---|---|---|---| +|ScriptLocation|Yes: S3 location|CommandName|yes S3 location| |NumberOfWorkers|No|10|Yes for ETL, STREAMING, RAY jobs| |Name|Yes|Default: Auto Generated name|Yes| |Role|Yes|None|Yes| @@ -250,8 +252,7 @@ and/or overridable for the Glue L2 construct. 
|Connections|No|None|Yes| |MaxRetries|No|0|Yes| |Timeout|No|2 Days for non-streaming jobs|Yes| -|SecurityConfiguration|No|None|Yes| -|Tags|No|None|Yes| +|SecurityConfiguration|No|None|Yes| |Tags|No|None|Yes| |GlueVersion|No|Default: 3.0 for ETL, RAY: 4.0|Yes| |Command name|No|ETL, PythonShell, Streaming and GlueRay|No| |Command Runtime|No|GlueRay: Ray2.4|No| From 73d427b96f2c0a646d73cd5c5e3782bb97071057 Mon Sep 17 00:00:00 2001 From: Janardhan Molumuri Date: Fri, 9 Jun 2023 13:57:43 -0700 Subject: [PATCH 09/15] remove extra chars, fix table alignment --- text/0497-glue-l2-construct.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/text/0497-glue-l2-construct.md b/text/0497-glue-l2-construct.md index c5591d75e..a7893af48 100644 --- a/text/0497-glue-l2-construct.md +++ b/text/0497-glue-l2-construct.md @@ -239,9 +239,8 @@ Each of these parameters are documented in [Glue's public documentation](https:/ this table lists the parameters that will or will not be required, optional, and/or overridable for the Glue L2 construct. -/md |Parameter|Required|Default Value|Overridable| -|---|---|---|---| +|:---|:---|:---|:---| |ScriptLocation|Yes: S3 location|CommandName|yes S3 location| |NumberOfWorkers|No|10|Yes for ETL, STREAMING, RAY jobs| |Name|Yes|Default: Auto Generated name|Yes| From 60544aeea51ce8728f1e20a6c3da701478ae0d5d Mon Sep 17 00:00:00 2001 From: Natalie White Date: Fri, 23 Jun 2023 13:55:30 -0700 Subject: [PATCH 10/15] Removed action verbs, added role to each example, included ts code syntax highlighting, updated first section title. --- text/0497-glue-l2-construct.md | 79 +++++++++++++++++++++------------- 1 file changed, 48 insertions(+), 31 deletions(-) diff --git a/text/0497-glue-l2-construct.md b/text/0497-glue-l2-construct.md index a7893af48..e0b35116e 100644 --- a/text/0497-glue-l2-construct.md +++ b/text/0497-glue-l2-construct.md @@ -7,7 +7,7 @@ * **API Bar Raiser:** @TheRealAmazonKendra [Link to RFC Issue](https://github.com/aws/aws-cdk-rfcs/issues/497) -## Working Backwards - README +## Overview [AWS Glue](https://aws.amazon.com/glue/) is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data @@ -33,11 +33,12 @@ functionality already supported in the [@aws-cdk/aws-glue-alpha module](https:// The glue-alpha-module already supports three of the four common types of Glue Jobs: Spark (ETL and Streaming), Python Shell, Ray. This RFC will add the -more recent Flex Job. The construct also implements AWS practice -recommendations when creating a Glue Job such use of Secrets Management for -Connection JDBC strings, Glue Job Autoscaling, least privileges in terms of -IAM permissions and also sane defaults for Glue job specification (more details -are mentioned in the below table). +more recent Flex Job. The construct also implements AWS best practice +recommendations, such as: + +* use of Secrets Management for Connection JDBC strings +* Glue Job Autoscaling +* defaults for Glue job specification This RFC will introduce breaking changes to the existing glue-alpha-module to streamline the developer experience and introduce new constants and validations. @@ -57,28 +58,31 @@ We will also default to best practice enablement the following ETL features: You can find more details about version, worker type and other features in [Glue's public documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html). 
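+
+For illustration, this is roughly what the defaulted feature flags would look
+like as Glue job arguments (the argument keys come from Glue's special job
+parameters; how the L2 merges them with a developer-supplied `defaultArguments`
+map is an assumption of this sketch):
+
+```ts
+// Sketch: defaults the construct would apply; a developer-supplied
+// defaultArguments map would be merged over these values.
+const bestPracticeDefaults = {
+  '--enable-metrics': 'true',
+  '--enable-spark-ui': 'true',
+  '--enable-continuous-cloudwatch-log': 'true',
+};
+```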
-``` +```ts glue.ScalaSparkEtlJob(this, 'ScalaSparkEtlJob', { script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'), className: 'com.example.HelloWorld', + role: iam.IRole, }); glue.pySparkEtlJob(this, 'pySparkEtlJob', { script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), + role: iam.IRole, }); ``` Optionally, developers can override the glueVersion and add extra jars and a description: -``` +```ts glue.ScalaSparkEtlJob(this, 'ScalaSparkEtlJob', { glueVersion: glue.GlueVersion.V3_0, script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'), className: 'com.example.HelloWorld', extraJarsS3Url: [glue.Code.fromBucket('bucket-name', 'path-to-extra-scala-jar'),], description: 'an example Scala Spark ETL job', - numberOfWorkers: 20 + numberOfWorkers: 20, + role: iam.IRole, }); glue.pySparkEtlJob(this, 'pySparkEtlJob', { @@ -88,7 +92,8 @@ glue.pySparkEtlJob(this, 'pySparkEtlJob', { script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), extraJarsS3Url: [glue.Code.fromBucket('bucket-name', 'path-to-extra-scala-jar')], description: 'an example pySpark ETL job', - numberOfWorkers: 20 + numberOfWorkers: 20, + role: iam.IRole, }); ``` @@ -104,15 +109,17 @@ it supports G1 and G2 worker type and 2.0, 3.0 and 4.0 version. We’ll default to G2 worker and 4.0 version for streaming jobs which developers can override. We will enable `—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log`. -``` +```ts new glue.PythonSparkStreamingJob(this, 'PythonSparkStreamingJob', { script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), + role: iam.IRole, }); new glue.ScalaSparkStreamingJob(this, 'ScalaSparkStreamingJob', { script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'), className: 'com.example.HelloWorld', + role: iam.IRole, }); ``` @@ -120,13 +127,14 @@ new glue.ScalaSparkStreamingJob(this, 'ScalaSparkStreamingJob', { Optionally, developers can override the glueVersion and add extraJars and a description: -``` +```ts new glue.PythonSparkStreamingJob(this, 'PythonSparkStreamingJob', { glueVersion: glue.GlueVersion.V3_0, pythonVersion: glue.PythonVersion.3_9, script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), description: 'an example Python Streaming job', numberOfWorkers: 20, + role: iam.IRole, }); @@ -137,6 +145,7 @@ new glue.ScalaSparkStreamingJob(this, 'ScalaSparkStreamingJob', { className: 'com.example.HelloWorld', description: 'an example Python Streaming job', numberOfWorkers: 20, + role: iam.IRole, }); ``` @@ -149,21 +158,23 @@ are supported for jobs using AWS Glue version 3.0 or later and `G.1X` or (currently Glue 3.0.) 
Similar to ETL, we’ll enable these features: `—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log` -``` +```ts glue.ScalaSparkFlexEtlJob(this, 'ScalaSparkFlexEtlJob', { script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'), className: 'com.example.HelloWorld', + role: iam.IRole, }); glue.pySparkFlexEtlJob(this, 'pySparkFlexEtlJob', { script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), + role: iam.IRole, }); ``` Optionally, developers can override the glue version, python version, provide extra jars, and a description -``` +```ts glue.ScalaSparkFlexEtlJob(this, 'ScalaSparkFlexEtlJob', { glueVersion: glue.GlueVersion.V3_0, script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'), @@ -171,6 +182,7 @@ glue.ScalaSparkFlexEtlJob(this, 'ScalaSparkFlexEtlJob', { extraJarsS3Url: [glue.Code.fromBucket('bucket-name', 'path-to-extra-python-scripts')], description: 'an example pySpark ETL job', numberOfWorkers: 20, + role: iam.IRole, }); new glue.pySparkFlexEtlJob(this, 'pySparkFlexEtlJob', { @@ -179,6 +191,7 @@ new glue.pySparkFlexEtlJob(this, 'pySparkFlexEtlJob', { script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), description: 'an example Flex job', numberOfWorkers: 20, + role: iam.IRole, }); ``` @@ -194,21 +207,23 @@ MaxCapacity will be set `0.0625`. Python 3.9 supports preloaded analytics libraries using the `library-set=analytics` flag, and this feature will be enabled by default. -``` +```ts new glue.PythonShellJob(this, 'PythonShellJob', { script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), + role: iam.IRole, }); ``` Optional overrides: -``` +```ts new glue.PythonShellJob(this, 'PythonShellJob', { glueVersion: glue.GlueVersion.V1_0, pythonVersion: glue.PythonVersion.3_9, script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), description: 'an example Python Shell job', numberOfWorkers: 20, + role: iam.IRole, }); ``` @@ -217,19 +232,21 @@ new glue.PythonShellJob(this, 'PythonShellJob', { Glue ray only supports worker type Z.2X and Glue version 4.0. Runtime will default to `Ray2.3` and min workers will default to 3. -``` +```ts new glue.GlueRayJob(this, 'GlueRayJob', { script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), + role: iam.IRole, }); ``` Developers can override min workers and other Glue job fields -``` +```ts new glue.GlueRayJob(this, 'GlueRayJob', { runtime: glue.Runtime.RAY_2_2, script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), - numberOfWorkers: 50 + numberOfWorkers: 50, + role: iam.IRole, }); ``` @@ -268,7 +285,7 @@ scripts to S3 via an optional fromAsset parameter pointing to a script in the local file structure. Developers will provide an existing S3 bucket and the path to which they'd like the script to be uploaded. -``` +```ts glue.ScalaSparkEtlJob(this, 'ScalaSparkEtlJob', { script: glue.Code.fromAsset('bucket-name', 'local/path/to/scala-jar'), className: 'com.example.HelloWorld', @@ -296,14 +313,14 @@ functions to create on-demand crawler or job triggers. The trigger method will take an optional description but abstract the requirement of an actions list using the job or crawler objects using conditional types. 
-``` +```ts myWorkflow = new glue.Workflow(this, "GlueWorkflow", { name: "MyWorkflow"; description: "New Workflow"; properties: {'key', 'value'}; }); -myWorkflow.createOnDemandTrigger(this, 'TriggerJobOnDemand', { +myWorkflow.onDemandTrigger(this, 'TriggerJobOnDemand', { description: 'On demand run for ' + myGlueJob.name, actions: [glueJob1, glueJob2, glueJob3, glueCrawler] }); @@ -320,21 +337,21 @@ the expression that Glue requires from the Schedule object). The trigger method take an optional description and list of Actions which can refer to Jobs or crawlers via conditional types. -``` +```ts // Create Daily Schedule at 00 UTC -myWorkflow.createDailyScheduleTrigger(this, 'TriggerCrawlerOnDailySchedule', { +myWorkflow.dailyScheduleTrigger(this, 'TriggerCrawlerOnDailySchedule', { description: 'Scheduled run for ' + myGlueJob.name, actions: [ myGlueCrawler] }); // Create Weekly schedule at 00 UTC on Sunday -myWorkflow.createWeeklyScheduleTrigger(this, 'TriggerJobOnWeeklySchedule', { +myWorkflow.weeklyScheduleTrigger(this, 'TriggerJobOnWeeklySchedule', { description: 'Scheduled run for ' + myGlueJob.name, actions: [glueJob1, glueJob2, glueJob3, glueCrawler] }); // Create Custom schedule, e.g. Monthly on the 7th day at 15:30 UTC -myWorkflow.createCustomScheduleJobTrigger(this, 'TriggerCrawlerOnCustomSchedule', { +myWorkflow.customScheduleJobTrigger(this, 'TriggerCrawlerOnCustomSchedule', { description: 'Scheduled run for ' + myGlueJob.name, actions: [glueJob1, glueJob2, glueJob3, glueCrawler] schedule: events.Schedule.cron(day: '7', hour: '15', minute: '30') @@ -348,14 +365,14 @@ of notify event triggers, batching and non-batching trigger. For batching trigge developers must specify `BatchSize` but for non-batching `BatchSize` will be set to 1. For both triggers, `BatchWindow` will be default to 900 seconds. -``` -myWorkflow.createNotifyEventTrigger(this, 'MyNotifyTriggerBatching', { +```ts +myWorkflow.notifyEventTrigger(this, 'MyNotifyTriggerBatching', { batchSize: int, jobActions: [glueJob1, glueJob2, glueJob3], actions: [glueJob1, glueJob2, glueJob3, glueCrawler] }); -myWorkflow.createNotifyEventTrigger(this, 'MyNotifyTriggerNonBatching', { +myWorkflow.notifyEventTrigger(this, 'MyNotifyTriggerNonBatching', { actions: [glueJob1, glueJob2, glueJob3] }); ``` @@ -365,9 +382,9 @@ myWorkflow.createNotifyEventTrigger(this, 'MyNotifyTriggerNonBatching', { Conditional triggers have a predicate and actions associated with them. When the predicateCondition is true, the trigger actions will be executed. 
-``` +```ts // Triggers on Job and Crawler status -myWorkflow.createConditionalTrigger(this, 'conditionalTrigger', { +myWorkflow.conditionalTrigger(this, 'conditionalTrigger', { description: 'Conditional trigger for ' + myGlueJob.name, actions: [glueJob1, glueJob2, glueJob3, glueCrawler] predicateCondition: glue.TriggerPredicateCondition.AND, From 054b8798e3027f6f63be889e0eea335ec89b9431 Mon Sep 17 00:00:00 2001 From: Natalie White Date: Mon, 3 Jul 2023 15:06:42 -0700 Subject: [PATCH 11/15] Add comprehensive parameter interfaces to each job type per RFC review --- text/0497-glue-l2-construct.md | 846 +++++++++++++++++++++++++++++++-- 1 file changed, 801 insertions(+), 45 deletions(-) diff --git a/text/0497-glue-l2-construct.md b/text/0497-glue-l2-construct.md index e0b35116e..483b5d11c 100644 --- a/text/0497-glue-l2-construct.md +++ b/text/0497-glue-l2-construct.md @@ -46,6 +46,10 @@ As an opinionated construct, the Glue L2 construct will enforce best practices and not allow developers to create resources that use deprecated libraries and tool sets (e.g. deprecated versions of Python). +Optional and required parameters for each job will be enforced via interface +rather than validation; see [Glue's public documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api.html) +for more granular details. + ### Spark Jobs 1. **ETL Jobs** @@ -90,13 +94,211 @@ glue.pySparkEtlJob(this, 'pySparkEtlJob', { glueVersion: glue.GlueVersion.V3_0, pythonVersion: glue.PythonVersion.3_9, script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), - extraJarsS3Url: [glue.Code.fromBucket('bucket-name', 'path-to-extra-scala-jar')], description: 'an example pySpark ETL job', numberOfWorkers: 20, role: iam.IRole, }); ``` +Scala Spark ETL Job Property Interface: + +```ts +ScalaSparkEtlJobProps{ + /** + * Script Code Location (required) + * Script to run when the Glue Job executes. Can be uploaded + * from the local directory structure using fromAsset + * or referenced via S3 location using fromBucket + * */ + script: glue.Code; + + /** + * Class name (required for Scala) + * Package and class name for the entry point of Glue Job execution for + * Java scripts + * */ + className: string; + + /** + * Extra Jars S3 URL (optional) + * S3 URL where additional jar dependencies are located + */ + extraJarsS3Url?: string[]; + + /** + * IAM Role (required) + * IAM Role to use for Glue Job execution + * */ + role: iam.IRole; + + /** + * Name of the Glue Job (optional) + * Developer-specified name of the Glue Job + * */ + name?: string; + + /** + * Description (optional) + * Developer-specified description of the Glue Job + * */ + description?: string; + + /** + * Number of Workers (optional) + * Number of workers for Glue to use during Job execution + * @default 10 + * */ + numberOrWorkers?: int; + + /** + * Max Concurrent Runs (optional) + * The maximum number of runs this Glue Job cna concurrently run + * @default 1 + * */ + maxConcurrentRuns?: int; + + /** + * Default Arguments (optional) + * The default arguments for every run of this Glue Job, + * specified as name-value pairs. 
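+   * Example (illustrative only): { '--enable-metrics': 'true' }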
+ * */ + defaultArguments?: {[key: string], string }[]; + + /** + * Connections (optional) + * List of connections to use for this Glue Job + * */ + connections?: IConnection[]; + + /** + * Max Retries (optional) + * Maximum number of retry attempts Glue will perform + * if the Job fails + * @default 0 + * */ + maxRetries?: int; + + /** + * Timeout (optional) + * Timeout for the Glue Job, specified in minutes + * @default 2880 (2 days for non-streaming) + * */ + timeout?: int; + + /** + * Security Configuration (optional) + * Defines the encryption options for the Glue Job + * */ + securityConfiguration?: ISecurityConfiguration; + + /** + * Tags (optional) + * A list of key:value pairs of tags to apply to this Glue Job resource + * */ + tags?: {[key: string], string }[]; + + /** + * Glue Version + * The version of Glue to use to execute this Job + * @default 3.0 for ETL + * */ + glueVersion?: glue.GlueVersion; +} +``` + +pySpark ETL Job Property Interface: + +```ts +pySparkEtlJobProps{ + /** + * Script Code Location (required) + * Script to run when the Glue Job executes. Can be uploaded + * from the local directory structure using fromAsset + * or referenced via S3 location using fromBucket + * */ + script: glue.Code; + + /** + * IAM Role (required) + * IAM Role to use for Glue Job execution + * */ + role: iam.IRole; + + /** + * Name of the Glue Job (optional) + * Developer-specified name of the Glue Job + * */ + name?: string; + + /** + * Description (optional) + * Developer-specified description of the Glue Job + * */ + description?: string; + + /** + * Number of Workers (optional) + * Number of workers for Glue to use during Job execution + * @default 10 + * */ + numberOrWorkers?: int; + + /** + * Max Concurrent Runs (optional) + * The maximum number of runs this Glue Job cna concurrently run + * @default 1 + * */ + maxConcurrentRuns?: int; + + /** + * Default Arguments (optional) + * The default arguments for every run of this Glue Job, + * specified as name-value pairs. + * */ + defaultArguments?: {[key: string], string }[]; + + /** + * Connections (optional) + * List of connections to use for this Glue Job + * */ + connections?: IConnection[]; + + /** + * Max Retries (optional) + * Maximum number of retry attempts Glue will perform + * if the Job fails + * @default 0 + * */ + maxRetries?: int; + + /** + * Timeout (optional) + * Timeout for the Glue Job, specified in minutes + * @default 2880 (2 days for non-streaming) + * */ + timeout?: int; + + /** + * Security Configuration (optional) + * Defines the encryption options for the Glue Job + * */ + securityConfiguration?: ISecurityConfiguration; + + /** + * Tags (optional) + * A list of key:value pairs of tags to apply to this Glue Job resource + * */ + tags?: {[key: string], string }[]; + + /** + * Glue Version + * The version of Glue to use to execute this Job + * @default 3.0 for ETL + * */ + glueVersion?: glue.GlueVersion; +} +``` + 2. **Streaming Jobs** A Streaming job is similar to an ETL job, except that it performs ETL on data @@ -110,7 +312,7 @@ to G2 worker and 4.0 version for streaming jobs which developers can override. We will enable `—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log`. 
```ts -new glue.PythonSparkStreamingJob(this, 'PythonSparkStreamingJob', { +new glue.pySparkStreamingJob(this, 'pySparkStreamingJob', { script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), role: iam.IRole, }); @@ -128,7 +330,7 @@ Optionally, developers can override the glueVersion and add extraJars and a description: ```ts -new glue.PythonSparkStreamingJob(this, 'PythonSparkStreamingJob', { +new glue.pySparkStreamingJob(this, 'pySparkStreamingJob', { glueVersion: glue.GlueVersion.V3_0, pythonVersion: glue.PythonVersion.3_9, script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), @@ -137,11 +339,11 @@ new glue.PythonSparkStreamingJob(this, 'PythonSparkStreamingJob', { role: iam.IRole, }); - new glue.ScalaSparkStreamingJob(this, 'ScalaSparkStreamingJob', { glueVersion: glue.GlueVersion.V3_0, pythonVersion: glue.PythonVersion.3_9, script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'), + extraJarsS3Url: [glue.Code.fromBucket('bucket-name', 'path-to-extra-scala-jar'),], className: 'com.example.HelloWorld', description: 'an example Python Streaming job', numberOfWorkers: 20, @@ -149,6 +351,203 @@ new glue.ScalaSparkStreamingJob(this, 'ScalaSparkStreamingJob', { }); ``` +Scala Spark Streaming Job Property Interface: + +```ts +ScalaSparkStreamingJobProps{ + /** + * Script Code Location (required) + * Script to run when the Glue Job executes. Can be uploaded + * from the local directory structure using fromAsset + * or referenced via S3 location using fromBucket + * */ + script: glue.Code; + + /** + * Class name (required for Scala scripts) + * Package and class name for the entry point of Glue Job execution for + * Java scripts + * */ + className: string; + + /** + * IAM Role (required) + * IAM Role to use for Glue Job execution + * */ + role: iam.IRole; + + /** + * Name of the Glue Job (optional) + * Developer-specified name of the Glue Job + * */ + name?: string; + + /** + * Extra Jars S3 URL (optional) + * S3 URL where additional jar dependencies are located + */ + extraJarsS3Url?: string[]; + + /** + * Description (optional) + * Developer-specified description of the Glue Job + * */ + description?: string; + + /** + * Number of Workers (optional) + * Number of workers for Glue to use during Job execution + * @default 10 + * */ + numberOrWorkers?: int; + + /** + * Max Concurrent Runs (optional) + * The maximum number of runs this Glue Job cna concurrently run + * @default 1 + * */ + maxConcurrentRuns?: int; + + /** + * Default Arguments (optional) + * The default arguments for every run of this Glue Job, + * specified as name-value pairs. 
+ * */ + defaultArguments?: {[key: string], string }[]; + + /** + * Connections (optional) + * List of connections to use for this Glue Job + * */ + connections?: IConnection[]; + + /** + * Max Retries (optional) + * Maximum number of retry attempts Glue will perform + * if the Job fails + * @default 0 + * */ + maxRetries?: int; + + /** + * Timeout (optional) + * Timeout for the Glue Job, specified in minutes + * */ + timeout?: int; + + /** + * Security Configuration (optional) + * Defines the encryption options for the Glue Job + * */ + securityConfiguration?: ISecurityConfiguration; + + /** + * Tags (optional) + * A list of key:value pairs of tags to apply to this Glue Job resource + * */ + tags?: {[key: string], string }[]; + + /** + * Glue Version + * The version of Glue to use to execute this Job + * @default 3.0 + * */ + glueVersion?: glue.GlueVersion; +} +``` + +pySpark Streaming Job Property Interface: + +```ts +pySparkStreamingJobProps{ + /** + * Script Code Location (required) + * Script to run when the Glue Job executes. Can be uploaded + * from the local directory structure using fromAsset + * or referenced via S3 location using fromBucket + * */ + script: glue.Code; + + /** + * IAM Role (required) + * IAM Role to use for Glue Job execution + * */ + role: iam.IRole; + + /** + * Name of the Glue Job (optional) + * Developer-specified name of the Glue Job + * */ + name?: string; + + /** + * Description (optional) + * Developer-specified description of the Glue Job + * */ + description?: string; + + /** + * Number of Workers (optional) + * Number of workers for Glue to use during Job execution + * @default 10 + * */ + numberOrWorkers?: int; + + /** + * Max Concurrent Runs (optional) + * The maximum number of runs this Glue Job cna concurrently run + * @default 1 + * */ + maxConcurrentRuns?: int; + + /** + * Default Arguments (optional) + * The default arguments for every run of this Glue Job, + * specified as name-value pairs. + * */ + defaultArguments?: {[key: string], string }[]; + + /** + * Connections (optional) + * List of connections to use for this Glue Job + * */ + connections?: IConnection[]; + + /** + * Max Retries (optional) + * Maximum number of retry attempts Glue will perform + * if the Job fails + * @default 0 + * */ + maxRetries?: int; + + /** + * Timeout (optional) + * Timeout for the Glue Job, specified in minutes + * */ + timeout?: int; + + /** + * Security Configuration (optional) + * Defines the encryption options for the Glue Job + * */ + securityConfiguration?: ISecurityConfiguration; + + /** + * Tags (optional) + * A list of key:value pairs of tags to apply to this Glue Job resource + * */ + tags?: {[key: string], string }[]; + + /** + * Glue Version + * The version of Glue to use to execute this Job + * @default 3.0 + * */ + glueVersion?: glue.GlueVersion; +} +``` + 3. **Flex Jobs** The flexible execution class is appropriate for non-urgent jobs such as @@ -195,6 +594,205 @@ new glue.pySparkFlexEtlJob(this, 'pySparkFlexEtlJob', { }); ``` +Scala Spark Flex Job Property Interface: + +```ts +ScalaSparkFlexJobProps{ + /** + * Script Code Location (required) + * Script to run when the Glue Job executes. 
Can be uploaded + * from the local directory structure using fromAsset + * or referenced via S3 location using fromBucket + * */ + script: glue.Code; + + /** + * Class name (required for Scala scripts) + * Package and class name for the entry point of Glue Job execution for + * Java scripts + * */ + className: string; + + /** + * Extra Jars S3 URL (optional) + * S3 URL where additional jar dependencies are located + */ + extraJarsS3Url?: string[]; + + /** + * IAM Role (required) + * IAM Role to use for Glue Job execution + * */ + role: iam.IRole; + + /** + * Name of the Glue Job (optional) + * Developer-specified name of the Glue Job + * */ + name?: string; + + /** + * Description (optional) + * Developer-specified description of the Glue Job + * */ + description?: string; + + /** + * Number of Workers (optional) + * Number of workers for Glue to use during Job execution + * @default 10 + * */ + numberOrWorkers?: int; + + /** + * Max Concurrent Runs (optional) + * The maximum number of runs this Glue Job cna concurrently run + * @default 1 + * */ + maxConcurrentRuns?: int; + + /** + * Default Arguments (optional) + * The default arguments for every run of this Glue Job, + * specified as name-value pairs. + * */ + defaultArguments?: {[key: string], string }[]; + + /** + * Connections (optional) + * List of connections to use for this Glue Job + * */ + connections?: IConnection[]; + + /** + * Max Retries (optional) + * Maximum number of retry attempts Glue will perform + * if the Job fails + * @default 0 + * */ + maxRetries?: int; + + /** + * Timeout (optional) + * Timeout for the Glue Job, specified in minutes + * @default 2880 (2 days for non-streaming) + * */ + timeout?: int; + + /** + * Security Configuration (optional) + * Defines the encryption options for the Glue Job + * */ + securityConfiguration?: ISecurityConfiguration; + + /** + * Tags (optional) + * A list of key:value pairs of tags to apply to this Glue Job resource + * */ + tags?: {[key: string], string }[]; + + /** + * Glue Version + * The version of Glue to use to execute this Job + * @default 3.0 + * */ + glueVersion?: glue.GlueVersion; +} +``` + +pySpark Flex Job Property Interface: + +```ts +PySparkFlexJobProps{ + /** + * Script Code Location (required) + * Script to run when the Glue Job executes. Can be uploaded + * from the local directory structure using fromAsset + * or referenced via S3 location using fromBucket + * */ + script: glue.Code; + + /** + * IAM Role (required) + * IAM Role to use for Glue Job execution + * */ + role: iam.IRole; + + /** + * Name of the Glue Job (optional) + * Developer-specified name of the Glue Job + * */ + name?: string; + + /** + * Description (optional) + * Developer-specified description of the Glue Job + * */ + description?: string; + + /** + * Number of Workers (optional) + * Number of workers for Glue to use during Job execution + * @default 10 + * */ + numberOrWorkers?: int; + + /** + * Max Concurrent Runs (optional) + * The maximum number of runs this Glue Job cna concurrently run + * @default 1 + * */ + maxConcurrentRuns?: int; + + /** + * Default Arguments (optional) + * The default arguments for every run of this Glue Job, + * specified as name-value pairs. 
+ * */ + defaultArguments?: {[key: string], string }[]; + + /** + * Connections (optional) + * List of connections to use for this Glue Job + * */ + connections?: IConnection[]; + + /** + * Max Retries (optional) + * Maximum number of retry attempts Glue will perform + * if the Job fails + * @default 0 + * */ + maxRetries?: int; + + /** + * Timeout (optional) + * Timeout for the Glue Job, specified in minutes + * @default 2880 (2 days for non-streaming) + * */ + timeout?: int; + + /** + * Security Configuration (optional) + * Defines the encryption options for the Glue Job + * */ + securityConfiguration?: ISecurityConfiguration; + + /** + * Tags (optional) + * A list of key:value pairs of tags to apply to this Glue Job resource + * */ + tags?: {[key: string], string }[]; + + /** + * Glue Version + * The version of Glue to use to execute this Job + * @default 3.0 + * */ + glueVersion?: glue.GlueVersion; +} +``` + ### Python Shell Jobs A Python shell job runs Python scripts as a shell and supports a Python @@ -227,6 +825,99 @@ new glue.PythonShellJob(this, 'PythonShellJob', { }); ``` +Python Shell Job Property Interface: + +```ts +PythonShellJobProps{ + /** + * Script Code Location (required) + * Script to run when the Glue Job executes. Can be uploaded + * from the local directory structure using fromAsset + * or referenced via S3 location using fromBucket + * */ + script: glue.Code; + + /** + * IAM Role (required) + * IAM Role to use for Glue Job execution + * */ + role: iam.IRole; + + /** + * Name of the Glue Job (optional) + * Developer-specified name of the Glue Job + * */ + name?: string; + + /** + * Description (optional) + * Developer-specified description of the Glue Job + * */ + description?: string; + + /** + * Number of Workers (optional) + * Number of workers for Glue to use during Job execution + * @default 10 + * */ + numberOrWorkers?: int; + + /** + * Max Concurrent Runs (optional) + * The maximum number of runs this Glue Job cna concurrently run + * @default 1 + * */ + maxConcurrentRuns?: int; + + /** + * Default Arguments (optional) + * The default arguments for every run of this Glue Job, + * specified as name-value pairs. + * */ + defaultArguments?: {[key: string], string }[]; + + /** + * Connections (optional) + * List of connections to use for this Glue Job + * */ + connections?: IConnection[]; + + /** + * Max Retries (optional) + * Maximum number of retry attempts Glue will perform + * if the Job fails + * @default 0 + * */ + maxRetries?: int; + + /** + * Timeout (optional) + * Timeout for the Glue Job, specified in minutes + * @default 2880 (2 days for non-streaming) + * */ + timeout?: int; + + /** + * Security Configuration (optional) + * Defines the encryption options for the Glue Job + * */ + securityConfiguration?: ISecurityConfiguration; + + /** + * Tags (optional) + * A list of key:value pairs of tags to apply to this Glue Job resource + * */ + tags?: {[key: string], string }[]; + + /** + * Glue Version + * The version of Glue to use to execute this Job + * @default 3.0 for ETL + * */ + glueVersion?: glue.GlueVersion; +} +``` + ### Ray Jobs Glue ray only supports worker type Z.2X and Glue version 4.0. 
Runtime @@ -250,33 +941,98 @@ new glue.GlueRayJob(this, 'GlueRayJob', { }); ``` -### Required, Optional, and Overridable Parameters - -Each of these parameters are documented in [Glue's public documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api.html); -this table lists the parameters that will or will not be required, optional, -and/or overridable for the Glue L2 construct. - -|Parameter|Required|Default Value|Overridable| -|:---|:---|:---|:---| -|ScriptLocation|Yes: S3 location|CommandName|yes S3 location| -|NumberOfWorkers|No|10|Yes for ETL, STREAMING, RAY jobs| -|Name|Yes|Default: Auto Generated name|Yes| -|Role|Yes|None|Yes| -|Description|No|None|Yes| -|MaxConcurrentRuns|No|1|Yes| -|DefaultArguments|No|None|Yes| -|Connections|No|None|Yes| -|MaxRetries|No|0|Yes| -|Timeout|No|2 Days for non-streaming jobs|Yes| -|SecurityConfiguration|No|None|Yes| |Tags|No|None|Yes| -|GlueVersion|No|Default: 3.0 for ETL, RAY: 4.0|Yes| -|Command name|No|ETL, PythonShell, Streaming and GlueRay|No| -|Command Runtime|No|GlueRay: Ray2.4|No| -|NonOverridableArguments|No|None|No| -|ExecutionClass|No|STANDARD / FLEX|No| -|Command PythonVersion|No|3 for ETL/Streaming, 3.9 for PythonShell| -|LogUri|No|None|This option is not widely used| -|MaxCapacity (Python Shell Jobs only)|No|0.0625|Yes| +Ray Job Property Interface: + +```ts +RayJobProps{ + /** + * Script Code Location (required) + * Script to run when the Glue Job executes. Can be uploaded + * from the local directory structure using fromAsset + * or referenced via S3 location using fromBucket + * */ + script: glue.Code; + + /** + * IAM Role (required) + * IAM Role to use for Glue Job execution + * */ + role: iam.IRole; + + /** + * Name of the Glue Job (optional) + * Developer-specified name of the Glue Job + * */ + name?: string; + + /** + * Description (optional) + * Developer-specified description of the Glue Job + * */ + description?: string; + + /** + * Number of Workers (optional) + * Number of workers for Glue to use during Job execution + * @default 10 + * */ + numberOrWorkers?: int; + + /** + * Max Concurrent Runs (optional) + * The maximum number of runs this Glue Job cna concurrently run + * @default 1 + * */ + maxConcurrentRuns?: int; + + /** + * Default Arguments (optional) + * The default arguments for every run of this Glue Job, + * specified as name-value pairs. 
+ * */ + defaultArguments?: {[key: string], string }[]; + + /** + * Connections (optional) + * List of connections to use for this Glue Job + * */ + connections?: IConnection[]; + + /** + * Max Retries (optional) + * Maximum number of retry attempts Glue will perform + * if the Job fails + * @default 0 + * */ + maxRetries?: int; + + /** + * Timeout (optional) + * Timeout for the Glue Job, specified in minutes + * @default 2880 (2 days for non-streaming) + * */ + timeout?: int; + + /** + * Security Configuration (optional) + * Defines the encryption options for the Glue Job + * */ + securityConfiguration?: ISecurityConfiguration; + + /** + * Tags (optional) + * A list of key:value pairs of tags to apply to this Glue Job resource + * */ + tags?: {[key: string], string }[]; + + /** + * Glue Version + * The version of Glue to use to execute this Job + * @default 4.0 + * */ + glueVersion?: glue.GlueVersion; +} +``` ### Uploading scripts from the same repo to S3 @@ -321,8 +1077,8 @@ myWorkflow = new glue.Workflow(this, "GlueWorkflow", { }); myWorkflow.onDemandTrigger(this, 'TriggerJobOnDemand', { - description: 'On demand run for ' + myGlueJob.name, - actions: [glueJob1, glueJob2, glueJob3, glueCrawler] + description: 'On demand run for ' + glue.JobExecutable.name, + actions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...] }); ``` @@ -340,20 +1096,20 @@ crawlers via conditional types. ```ts // Create Daily Schedule at 00 UTC myWorkflow.dailyScheduleTrigger(this, 'TriggerCrawlerOnDailySchedule', { - description: 'Scheduled run for ' + myGlueJob.name, - actions: [ myGlueCrawler] + description: 'Scheduled run for ' + glue.JobExecutable.name, + actions: [ jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...] }); // Create Weekly schedule at 00 UTC on Sunday myWorkflow.weeklyScheduleTrigger(this, 'TriggerJobOnWeeklySchedule', { - description: 'Scheduled run for ' + myGlueJob.name, - actions: [glueJob1, glueJob2, glueJob3, glueCrawler] + description: 'Scheduled run for ' + glue.JobExecutable.name, + actions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...] }); // Create Custom schedule, e.g. Monthly on the 7th day at 15:30 UTC myWorkflow.customScheduleJobTrigger(this, 'TriggerCrawlerOnCustomSchedule', { - description: 'Scheduled run for ' + myGlueJob.name, - actions: [glueJob1, glueJob2, glueJob3, glueCrawler] + description: 'Scheduled run for ' + glue.JobExecutable.name, + actions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...] schedule: events.Schedule.cron(day: '7', hour: '15', minute: '30') }); ``` @@ -368,12 +1124,12 @@ to 1. For both triggers, `BatchWindow` will be default to 900 seconds. ```ts myWorkflow.notifyEventTrigger(this, 'MyNotifyTriggerBatching', { batchSize: int, - jobActions: [glueJob1, glueJob2, glueJob3], - actions: [glueJob1, glueJob2, glueJob3, glueCrawler] + jobActions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...], + actions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ... ] }); myWorkflow.notifyEventTrigger(this, 'MyNotifyTriggerNonBatching', { - actions: [glueJob1, glueJob2, glueJob3] + actions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...] }); ``` @@ -386,10 +1142,10 @@ When the predicateCondition is true, the trigger actions will be executed. 
// Triggers on Job and Crawler status myWorkflow.conditionalTrigger(this, 'conditionalTrigger', { description: 'Conditional trigger for ' + myGlueJob.name, - actions: [glueJob1, glueJob2, glueJob3, glueCrawler] + actions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...] predicateCondition: glue.TriggerPredicateCondition.AND, - jobPredicates: [{'job': glueJobPred, 'state': glue.JobRunState.FAILED}, - {'job': glueJobPred1, 'state' : glue.JobRunState.SUCCEEDED}] + jobPredicates: [{'job': JobExecutable, 'state': glue.JobState.FAILED}, + {'job': JobExecutable, 'state' : glue.JobState.SUCCEEDED}] }); ``` From fa6c36ab90a307aadbbb3404d3d57cdd8cd69e57 Mon Sep 17 00:00:00 2001 From: Natalie White Date: Mon, 3 Jul 2023 15:17:08 -0700 Subject: [PATCH 12/15] Add mandatory workflow language for Notify Event Triggers --- text/0497-glue-l2-construct.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/0497-glue-l2-construct.md b/text/0497-glue-l2-construct.md index 483b5d11c..1707054c9 100644 --- a/text/0497-glue-l2-construct.md +++ b/text/0497-glue-l2-construct.md @@ -1116,7 +1116,7 @@ myWorkflow.customScheduleJobTrigger(this, 'TriggerCrawlerOnCustomSchedule', { #### **3. Notify Event Triggers** -This type of trigger is only supported with Glue workflows. There are two types +Workflows are mandatory for this trigger type. There are two types of notify event triggers, batching and non-batching trigger. For batching triggers, developers must specify `BatchSize` but for non-batching `BatchSize` will be set to 1. For both triggers, `BatchWindow` will be default to 900 seconds. From f5030f64ddfaa3ec0cd55f64648760c457d4c2c0 Mon Sep 17 00:00:00 2001 From: Natalie White Date: Mon, 31 Jul 2023 12:32:48 -0700 Subject: [PATCH 13/15] Added worker type to parameter lists, updated timeout Duration type, added note about why IAM Role is required from the developer rather than being auto-generated by the L2 --- text/0497-glue-l2-construct.md | 110 ++++++++++++++++++++++++++++++--- 1 file changed, 102 insertions(+), 8 deletions(-) diff --git a/text/0497-glue-l2-construct.md b/text/0497-glue-l2-construct.md index 1707054c9..ec4e2a194 100644 --- a/text/0497-glue-l2-construct.md +++ b/text/0497-glue-l2-construct.md @@ -86,6 +86,8 @@ glue.ScalaSparkEtlJob(this, 'ScalaSparkEtlJob', { extraJarsS3Url: [glue.Code.fromBucket('bucket-name', 'path-to-extra-scala-jar'),], description: 'an example Scala Spark ETL job', numberOfWorkers: 20, + workerType: glue.WorkerType.G8X, + timeout: cdk.Duration.minutes(15), role: iam.IRole, }); @@ -96,6 +98,8 @@ glue.pySparkEtlJob(this, 'pySparkEtlJob', { script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), description: 'an example pySpark ETL job', numberOfWorkers: 20, + workerType: glue.WorkerType.G8X, + timeout: cdk.Duration.minutes(15), role: iam.IRole, }); ``` @@ -128,6 +132,8 @@ ScalaSparkEtlJobProps{ /** * IAM Role (required) * IAM Role to use for Glue Job execution + * Must be specified by the developer because the L2 doesn't have visibility + * into the actions the script(s) will take during the Job Execution * */ role: iam.IRole; @@ -150,6 +156,14 @@ ScalaSparkEtlJobProps{ * */ numberOrWorkers?: int; + /** + * Worker Type (optional) + * Type of Worker for Glue to use during Job execution + * Enum options: Standard, G.1X, G.2X, G.025X. 
G.4X, G.8X, Z.2X + * @default G.2X + * */ + workerType?: glue.WorkerType; + /** * Max Concurrent Runs (optional) * The maximum number of runs this Glue Job cna concurrently run @@ -183,7 +197,7 @@ ScalaSparkEtlJobProps{ * Timeout for the Glue Job, specified in minutes * @default 2880 (2 days for non-streaming) * */ - timeout?: int; + timeout?: cdk.Duration; /** * Security Configuration (optional) @@ -221,6 +235,8 @@ pySparkEtlJobProps{ /** * IAM Role (required) * IAM Role to use for Glue Job execution + * Must be specified by the developer because the L2 doesn't have visibility + * into the actions the script(s) will take during the Job Execution * */ role: iam.IRole; @@ -243,6 +259,14 @@ pySparkEtlJobProps{ * */ numberOrWorkers?: int; + /** + * Worker Type (optional) + * Type of Worker for Glue to use during Job execution + * Enum options: Standard, G.1X, G.2X, G.025X. G.4X, G.8X, Z.2X + * @default G.2X + * */ + workerType?: glue.WorkerType; + /** * Max Concurrent Runs (optional) * The maximum number of runs this Glue Job cna concurrently run @@ -276,7 +300,7 @@ pySparkEtlJobProps{ * Timeout for the Glue Job, specified in minutes * @default 2880 (2 days for non-streaming) * */ - timeout?: int; + timeout?: cdk.Duration; /** * Security Configuration (optional) @@ -336,6 +360,8 @@ new glue.pySparkStreamingJob(this, 'pySparkStreamingJob', { script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), description: 'an example Python Streaming job', numberOfWorkers: 20, + workerType: glue.WorkerType.G8X, + timeout: cdk.Duration.minutes(15), role: iam.IRole, }); @@ -347,6 +373,8 @@ new glue.ScalaSparkStreamingJob(this, 'ScalaSparkStreamingJob', { className: 'com.example.HelloWorld', description: 'an example Python Streaming job', numberOfWorkers: 20, + workerType: glue.WorkerType.G8X, + timeout: cdk.Duration.minutes(15), role: iam.IRole, }); ``` @@ -373,6 +401,8 @@ ScalaSparkStreamingJobProps{ /** * IAM Role (required) * IAM Role to use for Glue Job execution + * Must be specified by the developer because the L2 doesn't have visibility + * into the actions the script(s) will take during the Job Execution * */ role: iam.IRole; @@ -401,6 +431,14 @@ ScalaSparkStreamingJobProps{ * */ numberOrWorkers?: int; + /** + * Worker Type (optional) + * Type of Worker for Glue to use during Job execution + * Enum options: Standard, G.1X, G.2X, G.025X. G.4X, G.8X, Z.2X + * @default G.2X + * */ + workerType?: glue.WorkerType; + /** * Max Concurrent Runs (optional) * The maximum number of runs this Glue Job cna concurrently run @@ -433,7 +471,7 @@ ScalaSparkStreamingJobProps{ * Timeout (optional) * Timeout for the Glue Job, specified in minutes * */ - timeout?: int; + timeout?: cdk.Duration; /** * Security Configuration (optional) @@ -471,6 +509,8 @@ pySparkStreamingJobProps{ /** * IAM Role (required) * IAM Role to use for Glue Job execution + * Must be specified by the developer because the L2 doesn't have visibility + * into the actions the script(s) will take during the Job Execution * */ role: iam.IRole; @@ -493,6 +533,14 @@ pySparkStreamingJobProps{ * */ numberOrWorkers?: int; + /** + * Worker Type (optional) + * Type of Worker for Glue to use during Job execution + * Enum options: Standard, G.1X, G.2X, G.025X. 
G.4X, G.8X, Z.2X + * @default G.2X + * */ + workerType?: glue.WorkerType; + /** * Max Concurrent Runs (optional) * The maximum number of runs this Glue Job cna concurrently run @@ -525,7 +573,7 @@ pySparkStreamingJobProps{ * Timeout (optional) * Timeout for the Glue Job, specified in minutes * */ - timeout?: int; + timeout?: cdk.Duration; /** * Security Configuration (optional) @@ -581,6 +629,8 @@ glue.ScalaSparkFlexEtlJob(this, 'ScalaSparkFlexEtlJob', { extraJarsS3Url: [glue.Code.fromBucket('bucket-name', 'path-to-extra-python-scripts')], description: 'an example pySpark ETL job', numberOfWorkers: 20, + workerType: glue.WorkerType.G8X, + timeout: cdk.Duration.minutes(15), role: iam.IRole, }); @@ -590,6 +640,8 @@ new glue.pySparkFlexEtlJob(this, 'pySparkFlexEtlJob', { script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), description: 'an example Flex job', numberOfWorkers: 20, + workerType: glue.WorkerType.G8X, + timeout: cdk.Duration.minutes(15), role: iam.IRole, }); ``` @@ -622,6 +674,8 @@ ScalaSparkFlexJobProps{ /** * IAM Role (required) * IAM Role to use for Glue Job execution + * Must be specified by the developer because the L2 doesn't have visibility + * into the actions the script(s) will take during the Job Execution * */ role: iam.IRole; @@ -644,6 +698,14 @@ ScalaSparkFlexJobProps{ * */ numberOrWorkers?: int; + /** + * Worker Type (optional) + * Type of Worker for Glue to use during Job execution + * Enum options: Standard, G.1X, G.2X, G.025X. G.4X, G.8X, Z.2X + * @default G.2X + * */ + workerType?: glue.WorkerType; + /** * Max Concurrent Runs (optional) * The maximum number of runs this Glue Job cna concurrently run @@ -677,7 +739,7 @@ ScalaSparkFlexJobProps{ * Timeout for the Glue Job, specified in minutes * @default 2880 (2 days for non-streaming) * */ - timeout?: int; + timeout?: cdk.Duration; /** * Security Configuration (optional) @@ -715,6 +777,8 @@ PySparkFlexJobProps{ /** * IAM Role (required) * IAM Role to use for Glue Job execution + * Must be specified by the developer because the L2 doesn't have visibility + * into the actions the script(s) will take during the Job Execution * */ role: iam.IRole; @@ -737,6 +801,14 @@ PySparkFlexJobProps{ * */ numberOrWorkers?: int; + /** + * Worker Type (optional) + * Type of Worker for Glue to use during Job execution + * Enum options: Standard, G.1X, G.2X, G.025X. 
G.4X, G.8X, Z.2X + * @default G.2X + * */ + workerType?: glue.WorkerType; + /** * Max Concurrent Runs (optional) * The maximum number of runs this Glue Job cna concurrently run @@ -770,7 +842,7 @@ PySparkFlexJobProps{ * Timeout for the Glue Job, specified in minutes * @default 2880 (2 days for non-streaming) * */ - timeout?: int; + timeout?: cdk.Duration; /** * Security Configuration (optional) @@ -821,6 +893,8 @@ new glue.PythonShellJob(this, 'PythonShellJob', { script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'), description: 'an example Python Shell job', numberOfWorkers: 20, + workerType: glue.WorkerType.G8X, + timeout: cdk.Duration.minutes(15), role: iam.IRole, }); ``` @@ -840,6 +914,8 @@ PythonShellJobProps{ /** * IAM Role (required) * IAM Role to use for Glue Job execution + * Must be specified by the developer because the L2 doesn't have visibility + * into the actions the script(s) will take during the Job Execution * */ role: iam.IRole; @@ -862,6 +938,14 @@ PythonShellJobProps{ * */ numberOrWorkers?: int; + /** + * Worker Type (optional) + * Type of Worker for Glue to use during Job execution + * Enum options: Standard, G.1X, G.2X, G.025X. G.4X, G.8X, Z.2X + * @default G.2X + * */ + workerType?: glue.WorkerType; + /** * Max Concurrent Runs (optional) * The maximum number of runs this Glue Job cna concurrently run @@ -895,7 +979,7 @@ PythonShellJobProps{ * Timeout for the Glue Job, specified in minutes * @default 2880 (2 days for non-streaming) * */ - timeout?: int; + timeout?: cdk.Duration; /** * Security Configuration (optional) @@ -956,6 +1040,8 @@ RayJobProps{ /** * IAM Role (required) * IAM Role to use for Glue Job execution + * Must be specified by the developer because the L2 doesn't have visibility + * into the actions the script(s) will take during the Job Execution * */ role: iam.IRole; @@ -978,6 +1064,14 @@ RayJobProps{ * */ numberOrWorkers?: int; + /** + * Worker Type (optional) + * Type of Worker for Glue to use during Job execution + * Enum options: Standard, G.1X, G.2X, G.025X. 
G.4X, G.8X, Z.2X + * @default G.2X + * */ + workerType?: glue.WorkerType; + /** * Max Concurrent Runs (optional) * The maximum number of runs this Glue Job cna concurrently run @@ -1011,7 +1105,7 @@ RayJobProps{ * Timeout for the Glue Job, specified in minutes * @default 2880 (2 days for non-streaming) * */ - timeout?: int; + timeout?: cdk.Duration; /** * Security Configuration (optional) From 5a92978d5c4720acd0286fb3919c4f77246d8cc3 Mon Sep 17 00:00:00 2001 From: Natalie White Date: Tue, 8 Aug 2023 09:02:06 -0700 Subject: [PATCH 14/15] Addressing @humanzz comments on Glue L2 extrajars parameters --- text/0497-glue-l2-construct.md | 24 +++++++++++++++--------- 1 file changed, 15 insertions(+), 9 deletions(-) diff --git a/text/0497-glue-l2-construct.md b/text/0497-glue-l2-construct.md index ec4e2a194..a7567d03b 100644 --- a/text/0497-glue-l2-construct.md +++ b/text/0497-glue-l2-construct.md @@ -81,9 +81,9 @@ description: ```ts glue.ScalaSparkEtlJob(this, 'ScalaSparkEtlJob', { glueVersion: glue.GlueVersion.V3_0, - script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'), + script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-script'), className: 'com.example.HelloWorld', - extraJarsS3Url: [glue.Code.fromBucket('bucket-name', 'path-to-extra-scala-jar'),], + extraJars: [glue.Code.fromBucket('bucket-name', 'path-to-extra-jars'),], description: 'an example Scala Spark ETL job', numberOfWorkers: 20, workerType: glue.WorkerType.G8X, @@ -127,7 +127,7 @@ ScalaSparkEtlJobProps{ * Extra Jars S3 URL (optional) * S3 URL where additional jar dependencies are located */ - extraJarsS3Url?: string[]; + extraJars?: string[]; /** * IAM Role (required) @@ -252,6 +252,12 @@ pySparkEtlJobProps{ * */ description?: string; + /** + * Extra Jars S3 URL (optional) + * S3 URL where additional jar dependencies are located + */ + extraJars?: string[]; + /** * Number of Workers (optional) * Number of workers for Glue to use during Job execution @@ -368,8 +374,8 @@ new glue.pySparkStreamingJob(this, 'pySparkStreamingJob', { new glue.ScalaSparkStreamingJob(this, 'ScalaSparkStreamingJob', { glueVersion: glue.GlueVersion.V3_0, pythonVersion: glue.PythonVersion.3_9, - script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'), - extraJarsS3Url: [glue.Code.fromBucket('bucket-name', 'path-to-extra-scala-jar'),], + script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-script'), + extraJars: [glue.Code.fromBucket('bucket-name', 'path-to-extra-jars'),], className: 'com.example.HelloWorld', description: 'an example Python Streaming job', numberOfWorkers: 20, @@ -416,7 +422,7 @@ ScalaSparkStreamingJobProps{ * Extra Jars S3 URL (optional) * S3 URL where additional jar dependencies are located */ - extraJarsS3Url?: string[]; + extraJars?: string[]; /** * Description (optional) @@ -624,9 +630,9 @@ provide extra jars, and a description ```ts glue.ScalaSparkFlexEtlJob(this, 'ScalaSparkFlexEtlJob', { glueVersion: glue.GlueVersion.V3_0, - script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-jar'), + script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-script'), className: 'com.example.HelloWorld', - extraJarsS3Url: [glue.Code.fromBucket('bucket-name', 'path-to-extra-python-scripts')], + extraJars: [glue.Code.fromBucket('bucket-name', 'path-to-extra-jars')], description: 'an example pySpark ETL job', numberOfWorkers: 20, workerType: glue.WorkerType.G8X, @@ -669,7 +675,7 @@ ScalaSparkFlexJobProps{ * Extra Jars S3 URL (optional) * S3 URL where 
additional jar dependencies are located
   */
-  extraJarsS3Url?: string[];
+  extraJars?: string[];

   /**
    * IAM Role (required)

From 2b3e5fef6a67dd68939ff7255ca8b3e31d896c57 Mon Sep 17 00:00:00 2001
From: Natalie White
Date: Tue, 15 Aug 2023 13:53:01 -0700
Subject: [PATCH 15/15] Added README section back to doc, changed from 3rd to
 first person, changed passive verbs to active, added more README style
 details, added runtime override for Ray jobs

---
 text/0497-glue-l2-construct.md | 547 +++++++++++++++++----------------
 1 file changed, 284 insertions(+), 263 deletions(-)

diff --git a/text/0497-glue-l2-construct.md b/text/0497-glue-l2-construct.md
index a7567d03b..f16ca9418 100644
--- a/text/0497-glue-l2-construct.md
+++ b/text/0497-glue-l2-construct.md
@@ -7,46 +7,65 @@
 * **API Bar Raiser:** @TheRealAmazonKendra

 [Link to RFC Issue](https://github.com/aws/aws-cdk-rfcs/issues/497)

-## Overview
+## README

 [AWS Glue](https://aws.amazon.com/glue/) is a serverless data integration
 service that makes it easier to discover, prepare, move, and integrate data
 from multiple sources for analytics, machine learning (ML), and application
-development. Glue was released on 2017/08.
-[Launch](https://aws.amazon.com/blogs/aws/launch-aws-glue-now-generally-available/)
-
-Today, customers define Glue data sources, connections, jobs, and workflows
-to define their data and ETL solutions via the AWS console, the AWS CLI, and
-Infrastructure as Code tools like CloudFormation and the CDK. However, they
-have challenges defining the required and optional parameters depending on
-job type, networking constraints for data source connections, secrets for
-JDBC connections, and least-privilege IAM Roles and Policies. We will build
-convenience methods working backwards from common use cases and default to
-recommended best practices.
-
-This RFC proposes updates to the L2 construct for Glue which will provide
-convenience features and abstractions for the existing
-[L1 (CloudFormation) Constructs](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_Glue.html) building on the
-functionality already supported in the [@aws-cdk/aws-glue-alpha module](https://github.com/aws/aws-cdk/blob/v2.51.1/packages/%40aws-cdk/aws-glue/README.md).
+development.
+
+Without an L2 construct, developers define Glue data sources, connections,
+jobs, and workflows for their data and ETL solutions via the AWS console,
+the AWS CLI, and Infrastructure as Code tools like CloudFormation and the
+CDK. However, there are several challenges to defining Glue resources at
+scale that an L2 construct can resolve. First, developers must reference
+documentation to determine the valid combinations of job type, Glue version,
+worker type, language versions, and other parameters that are required for specific
+job types. Additionally, developers must already know or look up the
+networking constraints for data source connections, and there is ambiguity
+around how to securely store secrets for JDBC connections. Finally,
+developers want prescriptive guidance via best practice defaults for
+throughput parameters like number of workers and batching.
+
+The Glue L2 construct has convenience methods working backwards from common
+use cases and sets required parameters to defaults that align with recommended
+best practices for each job type.
It also provides customers with a balance
+between flexibility via optional parameter overrides, and opinionated
+interfaces that discourage anti-patterns, resulting in reduced time to develop
+and deploy new resources.
+
+### References
+
+* [Glue Launch Announcement](https://aws.amazon.com/blogs/aws/launch-aws-glue-now-generally-available/)
+* [Glue Documentation](https://docs.aws.amazon.com/glue/index.html)
+* [Glue L1 (CloudFormation) Constructs](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/AWS_Glue.html)
+* Prior version of the [@aws-cdk/aws-glue-alpha module](https://github.com/aws/aws-cdk/blob/v2.51.1/packages/%40aws-cdk/aws-glue/README.md)

 ## Create a Glue Job

-The glue-alpha-module already supports three of the four common types of Glue
-Jobs: Spark (ETL and Streaming), Python Shell, Ray. This RFC will add the
-more recent Flex Job. The construct also implements AWS best practice
-recommendations, such as:
-
-* use of Secrets Management for Connection JDBC strings
-* Glue Job Autoscaling
-* defaults for Glue job specification
-
-This RFC will introduce breaking changes to the existing glue-alpha-module to
-streamline the developer experience and introduce new constants and validations.
-As an opinionated construct, the Glue L2 construct will enforce
-best practices and not allow developers to create resources that use deprecated
-libraries and tool sets (e.g. deprecated versions of Python).
-
-Optional and required parameters for each job will be enforced via interface
+A Job encapsulates a script that connects to data sources, processes
+them, and then writes output to a data target. There are four types of Glue
+Jobs: Spark (ETL and Streaming), Python Shell, Ray, and Flex Jobs. Most
+of the required parameters for these jobs are common across all types,
+but there are a few differences depending on the languages supported
+and features provided by each type. For all job types, the L2 defaults
+to AWS best practice recommendations, such as:
+
+* Use of Secrets Manager for Connection JDBC strings
+* Glue job autoscaling
+* Default parameter values for Glue job creation
+
+This iteration of the L2 construct introduces breaking changes to
+the existing glue-alpha-module, but these changes streamline the developer
+experience, introduce new constants for defaults, and replace synth-time
+validations with interface contracts that enforce the parameter combinations
+that Glue supports. As an opinionated construct, the Glue L2 construct does
+not allow developers to create resources that use non-current versions
+of Glue or deprecated language dependencies (e.g. deprecated versions of Python).
+As always, L1s allow you to specify a wider range of parameters if you need
+or want to use alternative configurations.
+
+Optional and required parameters for each job are enforced via interface
 rather than validation; see
 [Glue's public documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api.html)
 for more granular details.

 ### Spark Jobs

 1. **ETL Jobs**

-ETL jobs supports python and Scala language. ETL job type supports G1, G2, G4
-and G8 worker type default as G2, which customer can override. It wil default to
-the best practice version of ETL 4.0, but allow developers to override to 3.0.
-We will also default to best practice enablement the following ETL features:
-`—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log.`
-You can find more details about version, worker type and other features in
-[Glue's public documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html).
+ETL jobs support pySpark and Scala languages, for which there are separate but
+similar constructors.
ETL jobs default to the G2 worker type, but you can
+override this default with other supported worker type values (G1, G2, G4
+and G8). ETL jobs default to Glue version 4.0, which you can override to 3.0.
+The following ETL features are enabled by default:
+`—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log.`
+You can find more details about version, worker type and other features in
+[Glue's public documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-job.html).

 ```ts
 glue.ScalaSparkEtlJob(this, 'ScalaSparkEtlJob', {
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-script'),
   className: 'com.example.HelloWorld',
 });

 glue.pySparkEtlJob(this, 'pySparkEtlJob', {
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
 });
 ```

-Optionally, developers can override the glueVersion and add extra jars and a
-description:
+Optional override examples:

 ```ts
 glue.ScalaSparkEtlJob(this, 'ScalaSparkEtlJob', {
   glueVersion: glue.GlueVersion.V3_0,
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-script'),
   className: 'com.example.HelloWorld',
   extraJars: [glue.Code.fromBucket('bucket-name', 'path-to-extra-jars'),],
   description: 'an example Scala Spark ETL job',
   numberOfWorkers: 20,
   workerType: glue.WorkerType.G8X,
   timeout: cdk.Duration.minutes(15),
   role: iam.IRole,
 });

 glue.pySparkEtlJob(this, 'pySparkEtlJob', {
   glueVersion: glue.GlueVersion.V3_0,
   pythonVersion: glue.PythonVersion.3_9,
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
   description: 'an example pySpark ETL job',
   numberOfWorkers: 20,
   workerType: glue.WorkerType.G8X,
   timeout: cdk.Duration.minutes(15),
   role: iam.IRole,
 });
 ```

 Scala Spark ETL Job Property Interface:

 ```ts
 ScalaSparkEtlJobProps{
   /**
    * Script Code Location (required)
-   * Script to run when the Glue Job executes. Can be uploaded
+   * Script to run when the Glue job executes. Can be uploaded
    * from the local directory structure using fromAsset
    * or referenced via S3 location using fromBucket
    * */
   script: glue.Code;

   /**
    * Class name (required for Scala)
-   * Package and class name for the entry point of Glue Job execution for
+   * Package and class name for the entry point of Glue job execution for
    * Java scripts
    * */
   className: string;

   /**
    * Extra Jars S3 URL (optional)
    * S3 URL where additional jar dependencies are located
    */
   extraJars?: string[];

   /**
    * IAM Role (required)
-   * IAM Role to use for Glue Job execution
+   * IAM Role to use for Glue job execution
    * Must be specified by the developer because the L2 doesn't have visibility
-   * into the actions the script(s) will take during the Job Execution
+   * into the actions the script(s) take during the job execution
    * */
   role: iam.IRole;

   /**
-   * Name of the Glue Job (optional)
-   * Developer-specified name of the Glue Job
+   * Name of the Glue job (optional)
+   * Developer-specified name of the Glue job
    * */
   name?: string;

   /**
    * Description (optional)
-   * Developer-specified description of the Glue Job
+   * Developer-specified description of the Glue job
    * */
   description?: string;

   /**
    * Number of Workers (optional)
-   * Number of workers for Glue to use during Job execution
+   * Number of workers for Glue to use during job execution
    * @default 10
    * */
   numberOrWorkers?: int;

   /**
    * Max Concurrent Runs (optional)
-   * The maximum number of runs this Glue Job cna concurrently run
+   * The maximum number of runs this Glue job can concurrently run
    * @default 1
    * */
   maxConcurrentRuns?: int;

   /**
    * Default Arguments (optional)
-   * The default arguments for every run of this Glue Job,
+   * The default arguments for every run of this Glue job,
    * specified as name-value pairs.
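+   * e.g. { '--TempDir': 's3://bucket-name/tmp/' } (an illustrative
+   * example value, not an RFC-mandated default)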
* */ defaultArguments?: {[key: string], string }[]; /** * Connections (optional) - * List of connections to use for this Glue Job + * List of connections to use for this Glue job * */ connections?: IConnection[]; /** * Max Retries (optional) - * Maximum number of retry attempts Glue will perform - * if the Job fails + * Maximum number of retry attempts Glue performs + * if the job fails * @default 0 * */ maxRetries?: int; /** * Timeout (optional) - * Timeout for the Glue Job, specified in minutes + * Timeout for the Glue job, specified in minutes * @default 2880 (2 days for non-streaming) * */ timeout?: cdk.Duration; /** * Security Configuration (optional) - * Defines the encryption options for the Glue Job + * Defines the encryption options for the Glue job * */ securityConfiguration?: ISecurityConfiguration; /** * Tags (optional) - * A list of key:value pairs of tags to apply to this Glue Job resource + * A list of key:value pairs of tags to apply to this Glue job resource * */ tags?: {[key: string], string }[]; /** * Glue Version - * The version of Glue to use to execute this Job + * The version of Glue to use to execute this job * @default 3.0 for ETL * */ glueVersion?: glue.GlueVersion; @@ -226,7 +245,7 @@ pySpark ETL Job Property Interface: pySparkEtlJobProps{ /** * Script Code Location (required) - * Script to run when the Glue Job executes. Can be uploaded + * Script to run when the Glue job executes. Can be uploaded * from the local directory structure using fromAsset * or referenced via S3 location using fromBucket * */ @@ -234,21 +253,21 @@ pySparkEtlJobProps{ /** * IAM Role (required) - * IAM Role to use for Glue Job execution + * IAM Role to use for Glue job execution * Must be specified by the developer because the L2 doesn't have visibility - * into the actions the script(s) will take during the Job Execution + * into the actions the script(s) takes during the job execution * */ role: iam.IRole; /** - * Name of the Glue Job (optional) - * Developer-specified name of the Glue Job + * Name of the Glue job (optional) + * Developer-specified name of the Glue job * */ name?: string; /** * Description (optional) - * Developer-specified description of the Glue Job + * Developer-specified description of the Glue job * */ description?: string; @@ -260,69 +279,69 @@ pySparkEtlJobProps{ /** * Number of Workers (optional) - * Number of workers for Glue to use during Job execution + * Number of workers for Glue to use during job execution * @default 10 * */ numberOrWorkers?: int; /** * Worker Type (optional) - * Type of Worker for Glue to use during Job execution - * Enum options: Standard, G.1X, G.2X, G.025X. G.4X, G.8X, Z.2X - * @default G.2X + * Type of Worker for Glue to use during job execution + * Enum options: Standard, G_1X, G_2X, G_025X. G_4X, G_8X, Z_2X + * @default G_2X * */ workerType?: glue.WorkerType; /** * Max Concurrent Runs (optional) - * The maximum number of runs this Glue Job cna concurrently run + * The maximum number of runs this Glue job can concurrently run * @default 1 * */ maxConcurrentRuns?: int; /** * Default Arguments (optional) - * The default arguments for every run of this Glue Job, + * The default arguments for every run of this Glue job, * specified as name-value pairs. 
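+   * e.g. { '--additional-python-modules': 'pandas==1.5.3' } (an
+   * illustrative example value, not an RFC-mandated default)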
* */
   defaultArguments?: {[key: string], string }[];

   /**
    * Connections (optional)
-   * List of connections to use for this Glue Job
+   * List of connections to use for this Glue job
    * */
   connections?: IConnection[];

   /**
    * Max Retries (optional)
-   * Maximum number of retry attempts Glue will perform
-   * if the Job fails
+   * Maximum number of retry attempts Glue performs
+   * if the job fails
    * @default 0
    * */
   maxRetries?: int;

   /**
    * Timeout (optional)
-   * Timeout for the Glue Job, specified in minutes
+   * Timeout for the Glue job, specified in minutes
    * @default 2880 (2 days for non-streaming)
    * */
   timeout?: cdk.Duration;

   /**
    * Security Configuration (optional)
-   * Defines the encryption options for the Glue Job
+   * Defines the encryption options for the Glue job
    * */
   securityConfiguration?: ISecurityConfiguration;

   /**
    * Tags (optional)
-   * A list of key:value pairs of tags to apply to this Glue Job resource
+   * A list of key:value pairs of tags to apply to this Glue job resource
    * */
   tags?: {[key: string], string }[];

   /**
    * Glue Version
-   * The version of Glue to use to execute this Job
+   * The version of Glue to use to execute this job
    * @default 3.0 for ETL
    * */
   glueVersion?: glue.GlueVersion;
 }
 ```

 2. **Streaming Jobs**

-A Streaming job is similar to an ETL job, except that it performs ETL on data
-streams using the Apache Spark Structured Streaming framework. Some Spark
-job features are not available to streaming ETL jobs. These jobs will default
-to use Python 3.9.
-
-Similar to ETL streaming job supports Scala and Python languages. Similar to ETL,
-it supports G1 and G2 worker type and 2.0, 3.0 and 4.0 version. We’ll default
-to G2 worker and 4.0 version for streaming jobs which developers can override.
-We will enable `—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log`.
+Streaming jobs are similar to ETL jobs, except that they perform ETL on data
+streams using the Apache Spark Structured Streaming framework. Some Spark
+job features are not available to Streaming ETL jobs. They support Scala
+and pySpark languages. PySpark streaming jobs default to Python 3.9,
+which you can override with any non-deprecated version of Python. It
+defaults to the G2 worker type and Glue 4.0, both of which you can override.
+The following best practice features are enabled by default:
+`—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log`.

 ```ts
 new glue.pySparkStreamingJob(this, 'pySparkStreamingJob', {
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
 });

 new glue.ScalaSparkStreamingJob(this, 'ScalaSparkStreamingJob', {
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-script'),
   className: 'com.example.HelloWorld',
 });
 ```

-Optionally, developers can override the glueVersion and add extraJars and a
-description:
+Optional override examples:

 ```ts
 new glue.pySparkStreamingJob(this, 'pySparkStreamingJob', {
   glueVersion: glue.GlueVersion.V3_0,
   pythonVersion: glue.PythonVersion.3_9,
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
   description: 'an example Python Streaming job',
   numberOfWorkers: 20,
   workerType: glue.WorkerType.G8X,
   timeout: cdk.Duration.minutes(15),
   role: iam.IRole,
 });

 new glue.ScalaSparkStreamingJob(this, 'ScalaSparkStreamingJob', {
   glueVersion: glue.GlueVersion.V3_0,
   pythonVersion: glue.PythonVersion.3_9,
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-script'),
   extraJars: [glue.Code.fromBucket('bucket-name', 'path-to-extra-jars'),],
   className: 'com.example.HelloWorld',
   description: 'an example Python Streaming job',
   numberOfWorkers: 20,
   workerType: glue.WorkerType.G8X,
   timeout: cdk.Duration.minutes(15),
   role: iam.IRole,
 });
 ```

 Scala Spark Streaming Job Property Interface:

 ```ts
 ScalaSparkStreamingJobProps{
   /**
    * Script Code Location (required)
-   * Script to run when the Glue Job executes. Can be uploaded
+   * Script to run when the Glue job executes.
Can be uploaded * from the local directory structure using fromAsset * or referenced via S3 location using fromBucket * */ @@ -399,22 +416,22 @@ ScalaSparkStreamingJobProps{ /** * Class name (required for Scala scripts) - * Package and class name for the entry point of Glue Job execution for + * Package and class name for the entry point of Glue job execution for * Java scripts * */ className: string; /** * IAM Role (required) - * IAM Role to use for Glue Job execution + * IAM Role to use for Glue job execution * Must be specified by the developer because the L2 doesn't have visibility - * into the actions the script(s) will take during the Job Execution + * into the actions the script(s) take during the job execution * */ role: iam.IRole; /** - * Name of the Glue Job (optional) - * Developer-specified name of the Glue Job + * Name of the Glue job (optional) + * Developer-specified name of the Glue job * */ name?: string; @@ -426,74 +443,74 @@ ScalaSparkStreamingJobProps{ /** * Description (optional) - * Developer-specified description of the Glue Job + * Developer-specified description of the Glue job * */ description?: string; /** * Number of Workers (optional) - * Number of workers for Glue to use during Job execution + * Number of workers for Glue to use during job execution * @default 10 * */ numberOrWorkers?: int; /** * Worker Type (optional) - * Type of Worker for Glue to use during Job execution - * Enum options: Standard, G.1X, G.2X, G.025X. G.4X, G.8X, Z.2X - * @default G.2X + * Type of Worker for Glue to use during job execution + * Enum options: Standard, G_1X, G_2X, G_025X. G_4X, G_8X, Z_2X + * @default G_2X * */ workerType?: glue.WorkerType; /** * Max Concurrent Runs (optional) - * The maximum number of runs this Glue Job cna concurrently run + * The maximum number of runs this Glue job can concurrently run * @default 1 * */ maxConcurrentRuns?: int; /** * Default Arguments (optional) - * The default arguments for every run of this Glue Job, + * The default arguments for every run of this Glue job, * specified as name-value pairs. * */ defaultArguments?: {[key: string], string }[]; /** * Connections (optional) - * List of connections to use for this Glue Job + * List of connections to use for this Glue job * */ connections?: IConnection[]; /** * Max Retries (optional) - * Maximum number of retry attempts Glue will perform - * if the Job fails + * Maximum number of retry attempts Glue performs + * if the job fails * @default 0 * */ maxRetries?: int; /** * Timeout (optional) - * Timeout for the Glue Job, specified in minutes + * Timeout for the Glue job, specified in minutes * */ timeout?: cdk.Duration; /** * Security Configuration (optional) - * Defines the encryption options for the Glue Job + * Defines the encryption options for the Glue job * */ securityConfiguration?: ISecurityConfiguration; /** * Tags (optional) - * A list of key:value pairs of tags to apply to this Glue Job resource + * A list of key:value pairs of tags to apply to this Glue job resource * */ tags?: {[key: string], string }[]; /** * Glue Version - * The version of Glue to use to execute this Job + * The version of Glue to use to execute this job * @default 3.0 * */ glueVersion?: glue.GlueVersion; @@ -506,7 +523,7 @@ pySpark Streaming Job Property Interface: pySparkStreamingJobProps{ /** * Script Code Location (required) - * Script to run when the Glue Job executes. Can be uploaded + * Script to run when the Glue job executes. 
Can be uploaded
    * from the local directory structure using fromAsset
    * or referenced via S3 location using fromBucket
    * */
   script: glue.Code;

   /**
    * IAM Role (required)
-   * IAM Role to use for Glue Job execution
+   * IAM Role to use for Glue job execution
    * Must be specified by the developer because the L2 doesn't have visibility
-   * into the actions the script(s) will take during the Job Execution
+   * into the actions the script(s) take during the job execution
    * */
   role: iam.IRole;

   /**
-   * Name of the Glue Job (optional)
-   * Developer-specified name of the Glue Job
+   * Name of the Glue job (optional)
+   * Developer-specified name of the Glue job
    * */
   name?: string;

   /**
    * Description (optional)
-   * Developer-specified description of the Glue Job
+   * Developer-specified description of the Glue job
    * */
   description?: string;

   /**
    * Number of Workers (optional)
-   * Number of workers for Glue to use during Job execution
+   * Number of workers for Glue to use during job execution
    * @default 10
    * */
   numberOrWorkers?: int;

   /**
    * Max Concurrent Runs (optional)
-   * The maximum number of runs this Glue Job cna concurrently run
+   * The maximum number of runs this Glue job can concurrently run
    * @default 1
    * */
   maxConcurrentRuns?: int;

   /**
    * Default Arguments (optional)
-   * The default arguments for every run of this Glue Job,
+   * The default arguments for every run of this Glue job,
    * specified as name-value pairs.
    * */
   defaultArguments?: {[key: string], string }[];

   /**
    * Connections (optional)
-   * List of connections to use for this Glue Job
+   * List of connections to use for this Glue job
    * */
   connections?: IConnection[];

   /**
    * Max Retries (optional)
-   * Maximum number of retry attempts Glue will perform
-   * if the Job fails
+   * Maximum number of retry attempts Glue performs
+   * if the job fails
    * @default 0
    * */
   maxRetries?: int;

   /**
    * Timeout (optional)
-   * Timeout for the Glue Job, specified in minutes
+   * Timeout for the Glue job, specified in minutes
    * */
   timeout?: cdk.Duration;

   /**
    * Security Configuration (optional)
-   * Defines the encryption options for the Glue Job
+   * Defines the encryption options for the Glue job
    * */
   securityConfiguration?: ISecurityConfiguration;

   /**
    * Tags (optional)
-   * A list of key:value pairs of tags to apply to this Glue Job resource
+   * A list of key:value pairs of tags to apply to this Glue job resource
    * */
   tags?: {[key: string], string }[];

   /**
    * Glue Version
-   * The version of Glue to use to execute this Job
+   * The version of Glue to use to execute this job
    * @default 3.0
    * */
   glueVersion?: glue.GlueVersion;
 }
 ```

 3. **Flex Jobs**

 The flexible execution class is appropriate for non-urgent jobs such as
-pre-production jobs, testing, and one-time data loads. Flexible job runs
-are supported for jobs using AWS Glue version 3.0 or later and `G.1X` or
-`G.2X` worker types but will default to the latest version of Glue
-(currently Glue 3.0.) Similar to ETL, we’ll enable these features:
+pre-production jobs, testing, and one-time data loads. Flexible jobs default
+to Glue version 3.0 and worker type `G_2X`.
The following best practice
+features are enabled by default:
 `—enable-metrics, —enable-spark-ui, —enable-continuous-cloudwatch-log`

 ```ts
 glue.ScalaSparkFlexEtlJob(this, 'ScalaSparkFlexEtlJob', {
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-script'),
   className: 'com.example.HelloWorld',
 });

 glue.pySparkFlexEtlJob(this, 'pySparkFlexEtlJob', {
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
 });
 ```

-Optionally, developers can override the glue version, python version,
-provide extra jars, and a description
+Optional override examples:

 ```ts
 glue.ScalaSparkFlexEtlJob(this, 'ScalaSparkFlexEtlJob', {
   glueVersion: glue.GlueVersion.V3_0,
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-scala-script'),
   className: 'com.example.HelloWorld',
   extraJars: [glue.Code.fromBucket('bucket-name', 'path-to-extra-jars')],
   description: 'an example Scala Spark Flex ETL job',
   numberOfWorkers: 20,
   workerType: glue.WorkerType.G8X,
   timeout: cdk.Duration.minutes(15),
   role: iam.IRole,
 });

 new glue.pySparkFlexEtlJob(this, 'pySparkFlexEtlJob', {
   glueVersion: glue.GlueVersion.V3_0,
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
   description: 'an example Flex job',
   numberOfWorkers: 20,
   workerType: glue.WorkerType.G8X,
   timeout: cdk.Duration.minutes(15),
   role: iam.IRole,
 });
 ```

 Scala Spark Flex Job Property Interface:

 ```ts
 ScalaSparkFlexJobProps{
   /**
    * Script Code Location (required)
-   * Script to run when the Glue Job executes. Can be uploaded
+   * Script to run when the Glue job executes. Can be uploaded
    * from the local directory structure using fromAsset
    * or referenced via S3 location using fromBucket
    * */
   script: glue.Code;

   /**
    * Class name (required for Scala scripts)
-   * Package and class name for the entry point of Glue Job execution for
+   * Package and class name for the entry point of Glue job execution for
    * Java scripts
    * */
   className: string;

   /**
    * Extra Jars S3 URL (optional)
    * S3 URL where additional jar dependencies are located
    */
   extraJars?: string[];

   /**
    * IAM Role (required)
-   * IAM Role to use for Glue Job execution
+   * IAM Role to use for Glue job execution
    * Must be specified by the developer because the L2 doesn't have visibility
-   * into the actions the script(s) will take during the Job Execution
+   * into the actions the script(s) take during the job execution
    * */
   role: iam.IRole;

   /**
-   * Name of the Glue Job (optional)
-   * Developer-specified name of the Glue Job
+   * Name of the Glue job (optional)
+   * Developer-specified name of the Glue job
    * */
   name?: string;

   /**
    * Description (optional)
-   * Developer-specified description of the Glue Job
+   * Developer-specified description of the Glue job
    * */
   description?: string;

   /**
    * Number of Workers (optional)
-   * Number of workers for Glue to use during Job execution
+   * Number of workers for Glue to use during job execution
    * @default 10
    * */
   numberOrWorkers?: int;

   /**
    * Max Concurrent Runs (optional)
-   * The maximum number of runs this Glue Job cna concurrently run
+   * The maximum number of runs this Glue job can concurrently run
    * @default 1
    * */
   maxConcurrentRuns?: int;

   /**
    * Default Arguments (optional)
-   * The default arguments for every run of this Glue Job,
+   * The default arguments for every run of this Glue job,
    * specified as name-value pairs.
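+   * e.g. { '--job-bookmark-option': 'job-bookmark-enable' } (an
+   * illustrative example value, not an RFC-mandated default)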
* */
   defaultArguments?: {[key: string], string }[];

   /**
    * Connections (optional)
-   * List of connections to use for this Glue Job
+   * List of connections to use for this Glue job
    * */
   connections?: IConnection[];

   /**
    * Max Retries (optional)
-   * Maximum number of retry attempts Glue will perform
-   * if the Job fails
+   * Maximum number of retry attempts Glue performs
+   * if the job fails
    * @default 0
    * */
   maxRetries?: int;

   /**
    * Timeout (optional)
-   * Timeout for the Glue Job, specified in minutes
+   * Timeout for the Glue job, specified in minutes
    * @default 2880 (2 days for non-streaming)
    * */
   timeout?: cdk.Duration;

   /**
    * Security Configuration (optional)
-   * Defines the encryption options for the Glue Job
+   * Defines the encryption options for the Glue job
    * */
   securityConfiguration?: ISecurityConfiguration;

   /**
    * Tags (optional)
-   * A list of key:value pairs of tags to apply to this Glue Job resource
+   * A list of key:value pairs of tags to apply to this Glue job resource
    * */
   tags?: {[key: string], string }[];

   /**
    * Glue Version
-   * The version of Glue to use to execute this Job
+   * The version of Glue to use to execute this job
    * @default 3.0
    * */
   glueVersion?: glue.GlueVersion;
 }
 ```

 pySpark Flex Job Property Interface:

 ```ts
 PySparkFlexJobProps{
   /**
    * Script Code Location (required)
-   * Script to run when the Glue Job executes. Can be uploaded
+   * Script to run when the Glue job executes. Can be uploaded
    * from the local directory structure using fromAsset
    * or referenced via S3 location using fromBucket
    * */
   script: glue.Code;

   /**
    * IAM Role (required)
-   * IAM Role to use for Glue Job execution
+   * IAM Role to use for Glue job execution
    * Must be specified by the developer because the L2 doesn't have visibility
-   * into the actions the script(s) will take during the Job Execution
+   * into the actions the script(s) take during the job execution
    * */
   role: iam.IRole;

   /**
-   * Name of the Glue Job (optional)
-   * Developer-specified name of the Glue Job
+   * Name of the Glue job (optional)
+   * Developer-specified name of the Glue job
    * */
   name?: string;

   /**
    * Description (optional)
-   * Developer-specified description of the Glue Job
+   * Developer-specified description of the Glue job
    * */
   description?: string;

   /**
    * Number of Workers (optional)
-   * Number of workers for Glue to use during Job execution
+   * Number of workers for Glue to use during job execution
    * @default 10
    * */
   numberOrWorkers?: int;

   /**
    * Max Concurrent Runs (optional)
-   * The maximum number of runs this Glue Job cna concurrently run
+   * The maximum number of runs this Glue job can concurrently run
    * @default 1
    * */
   maxConcurrentRuns?: int;

   /**
    * Default Arguments (optional)
-   * The default arguments for every run of this Glue Job,
+   * The default arguments for every run of this Glue job,
    * specified as name-value pairs.
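+   * e.g. { '--enable-auto-scaling': 'true' } (an illustrative
+   * example value, not an RFC-mandated default)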
* */
   defaultArguments?: {[key: string], string }[];

   /**
    * Connections (optional)
-   * List of connections to use for this Glue Job
+   * List of connections to use for this Glue job
    * */
   connections?: IConnection[];

   /**
    * Max Retries (optional)
-   * Maximum number of retry attempts Glue will perform
-   * if the Job fails
+   * Maximum number of retry attempts Glue performs
+   * if the job fails
    * @default 0
    * */
   maxRetries?: int;

   /**
    * Timeout (optional)
-   * Timeout for the Glue Job, specified in minutes
+   * Timeout for the Glue job, specified in minutes
    * @default 2880 (2 days for non-streaming)
    * */
   timeout?: cdk.Duration;

   /**
    * Security Configuration (optional)
-   * Defines the encryption options for the Glue Job
+   * Defines the encryption options for the Glue job
    * */
   securityConfiguration?: ISecurityConfiguration;

   /**
    * Tags (optional)
-   * A list of key:value pairs of tags to apply to this Glue Job resource
+   * A list of key:value pairs of tags to apply to this Glue job resource
    * */
   tags?: {[key: string], string }[];

   /**
    * Glue Version
-   * The version of Glue to use to execute this Job
+   * The version of Glue to use to execute this job
    * @default 3.0
    * */
   glueVersion?: glue.GlueVersion;
 }
 ```

 ### Python Shell Jobs

-A Python shell job runs Python scripts as a shell and supports a Python
-version that depends on the AWS Glue version you are using. This can be used
-to schedule and run tasks that don't require an Apache Spark environment.
-
-We’ll default to `PythonVersion.3_9`. Python shell jobs have a MaxCapacity feature.
-Developers can choose MaxCapacity = `0.0625` or MaxCapacity = `1`. By default,
-MaxCapacity will be set `0.0625`. Python 3.9 supports preloaded analytics
-libraries using the `library-set=analytics` flag, and this feature will
-be enabled by default.
+Python shell jobs support a Python version that depends on the AWS Glue
+version you use. These can be used to schedule and run tasks that don't
+require an Apache Spark environment. Python shell jobs default to
+Python 3.9 and a MaxCapacity of `0.0625`. Python 3.9 supports pre-loaded
+analytics libraries using the `library-set=analytics` flag, which is
+enabled by default.

 ```ts
 new glue.PythonShellJob(this, 'PythonShellJob', {
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
   role: iam.IRole,
 });
 ```

-Optional overrides:
+Optional override examples:

 ```ts
 new glue.PythonShellJob(this, 'PythonShellJob', {
   glueVersion: glue.GlueVersion.V3_0,
   pythonVersion: glue.PythonVersion.3_9,
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
   description: 'an example Python Shell job',
   numberOfWorkers: 20,
   workerType: glue.WorkerType.G8X,
   timeout: cdk.Duration.minutes(15),
   role: iam.IRole,
 });
 ```

 Python Shell Job Property Interface:

 ```ts
 PythonShellJobProps{
   /**
    * Script Code Location (required)
-   * Script to run when the Glue Job executes. Can be uploaded
+   * Script to run when the Glue job executes.
Can be uploaded
    * from the local directory structure using fromAsset
    * or referenced via S3 location using fromBucket
    * */
   script: glue.Code;

   /**
    * IAM Role (required)
-   * IAM Role to use for Glue Job execution
+   * IAM Role to use for Glue job execution
    * Must be specified by the developer because the L2 doesn't have visibility
-   * into the actions the script(s) will take during the Job Execution
+   * into the actions the script(s) take during the job execution
    * */
   role: iam.IRole;

   /**
-   * Name of the Glue Job (optional)
-   * Developer-specified name of the Glue Job
+   * Name of the Glue job (optional)
+   * Developer-specified name of the Glue job
    * */
   name?: string;

   /**
    * Description (optional)
-   * Developer-specified description of the Glue Job
+   * Developer-specified description of the Glue job
    * */
   description?: string;

   /**
    * Number of Workers (optional)
-   * Number of workers for Glue to use during Job execution
+   * Number of workers for Glue to use during job execution
    * @default 10
    * */
   numberOrWorkers?: int;

   /**
    * Max Concurrent Runs (optional)
-   * The maximum number of runs this Glue Job cna concurrently run
+   * The maximum number of runs this Glue job can concurrently run
    * @default 1
    * */
   maxConcurrentRuns?: int;

   /**
    * Default Arguments (optional)
-   * The default arguments for every run of this Glue Job,
+   * The default arguments for every run of this Glue job,
    * specified as name-value pairs.
    * */
   defaultArguments?: {[key: string], string }[];

   /**
    * Connections (optional)
-   * List of connections to use for this Glue Job
+   * List of connections to use for this Glue job
    * */
   connections?: IConnection[];

   /**
    * Max Retries (optional)
-   * Maximum number of retry attempts Glue will perform
-   * if the Job fails
+   * Maximum number of retry attempts Glue performs
+   * if the job fails
    * @default 0
    * */
   maxRetries?: int;

   /**
    * Timeout (optional)
-   * Timeout for the Glue Job, specified in minutes
+   * Timeout for the Glue job, specified in minutes
    * @default 2880 (2 days for non-streaming)
    * */
   timeout?: cdk.Duration;

   /**
    * Security Configuration (optional)
-   * Defines the encryption options for the Glue Job
+   * Defines the encryption options for the Glue job
    * */
   securityConfiguration?: ISecurityConfiguration;

   /**
    * Tags (optional)
-   * A list of key:value pairs of tags to apply to this Glue Job resource
+   * A list of key:value pairs of tags to apply to this Glue job resource
    * */
   tags?: {[key: string], string }[];

   /**
    * Glue Version
-   * The version of Glue to use to execute this Job
+   * The version of Glue to use to execute this job
    * @default 3.0 for ETL
    * */
   glueVersion?: glue.GlueVersion;
 }
 ```

 ### Ray Jobs

-Glue ray only supports worker type Z.2X and Glue version 4.0. Runtime
-will default to `Ray2.3` and min workers will default to 3.
+Glue Ray jobs use worker type Z.2X and Glue version 4.0. These are not
+overrideable, since this is the only configuration that Glue Ray jobs
+currently support. The runtime defaults to Ray2.4 and min workers defaults to 3.
```ts
 new glue.GlueRayJob(this, 'GlueRayJob', {
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
   role: iam.IRole,
 });
 ```

-Developers can override min workers and other Glue job fields
+Optional override example:

 ```ts
 new glue.GlueRayJob(this, 'GlueRayJob', {
-  runtime: glue.Runtime.RAY_2_2,
   script: glue.Code.fromBucket('bucket-name', 's3prefix/path-to-python-script'),
   numberOfWorkers: 50,
+  minWorkers: 25,
   role: iam.IRole,
 });
 ```

 Ray Job Property Interface:

 ```ts
 RayJobProps{
   /**
    * Script Code Location (required)
-   * Script to run when the Glue Job executes. Can be uploaded
+   * Script to run when the Glue job executes. Can be uploaded
    * from the local directory structure using fromAsset
    * or referenced via S3 location using fromBucket
    * */
   script: glue.Code;

   /**
    * IAM Role (required)
-   * IAM Role to use for Glue Job execution
+   * IAM Role to use for Glue job execution
    * Must be specified by the developer because the L2 doesn't have visibility
-   * into the actions the script(s) will take during the Job Execution
+   * into the actions the script(s) take during the job execution
    * */
   role: iam.IRole;

   /**
-   * Name of the Glue Job (optional)
-   * Developer-specified name of the Glue Job
+   * Name of the Glue job (optional)
+   * Developer-specified name of the Glue job
    * */
   name?: string;

   /**
    * Description (optional)
-   * Developer-specified description of the Glue Job
+   * Developer-specified description of the Glue job
    * */
   description?: string;

   /**
    * Number of Workers (optional)
-   * Number of workers for Glue to use during Job execution
+   * Number of workers for Glue to use during job execution
    * @default 10
    * */
   numberOrWorkers?: int;

+  /**
+   * Runtime (optional)
+   * Ray runtime version for Glue to use during job execution
+   * Enum options: Ray2_2, Ray2_3, Ray2_4
+   * @default Ray2_4
+   * */
+  runtime?: glue.RayRuntime;
+
   /**
    * Max Concurrent Runs (optional)
-   * The maximum number of runs this Glue Job cna concurrently run
+   * The maximum number of runs this Glue job can concurrently run
    * @default 1
    * */
   maxConcurrentRuns?: int;

   /**
    * Default Arguments (optional)
-   * The default arguments for every run of this Glue Job,
+   * The default arguments for every run of this Glue job,
    * specified as name-value pairs.
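+   * e.g. { '--pip-install': 'pandas==1.5.3' } (an illustrative
+   * example value, not an RFC-mandated default)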
* */
   defaultArguments?: {[key: string], string }[];

   /**
    * Connections (optional)
-   * List of connections to use for this Glue Job
+   * List of connections to use for this Glue job
    * */
   connections?: IConnection[];

   /**
    * Max Retries (optional)
-   * Maximum number of retry attempts Glue will perform
-   * if the Job fails
+   * Maximum number of retry attempts Glue performs
+   * if the job fails
    * @default 0
    * */
   maxRetries?: int;

   /**
    * Timeout (optional)
-   * Timeout for the Glue Job, specified in minutes
+   * Timeout for the Glue job, specified in minutes
    * @default 2880 (2 days for non-streaming)
    * */
   timeout?: cdk.Duration;

   /**
    * Security Configuration (optional)
-   * Defines the encryption options for the Glue Job
+   * Defines the encryption options for the Glue job
    * */
   securityConfiguration?: ISecurityConfiguration;

   /**
    * Tags (optional)
-   * A list of key:value pairs of tags to apply to this Glue Job resource
+   * A list of key:value pairs of tags to apply to this Glue job resource
    * */
   tags?: {[key: string], string }[];

   /**
    * Glue Version
-   * The version of Glue to use to execute this Job
+   * The version of Glue to use to execute this job
    * @default 4.0
    * */
   glueVersion?: glue.GlueVersion;
 }
 ```

-### Uploading scripts from the same repo to S3
+### Uploading scripts from the CDK app repository to S3

-Similar to other L2 constructs, the Glue L2 will automate uploading / updating
+Similar to other L2 constructs, the Glue L2 automates uploading / updating
 scripts to S3 via an optional fromAsset parameter pointing to a script
-in the local file structure. Developers will provide an existing S3 bucket and
-the path to which they'd like the script to be uploaded.
+in the local file structure. You provide the existing S3 bucket and
+path to which you'd like the script to be uploaded.

 ```ts
 glue.ScalaSparkEtlJob(this, 'ScalaSparkEtlJob', {

 ### Workflow Triggers

-In AWS Glue, developers can use workflows to create and visualize complex
+You can use Glue workflows to create and visualize complex
 extract, transform, and load (ETL) activities involving multiple crawlers,
-jobs, and triggers. Standalone triggers are an anti-pattern, so we will
-only create triggers from within a workflow.
+jobs, and triggers. Standalone triggers are an anti-pattern, so you must
+create triggers from within a workflow using the L2 construct.

-Within the workflow object, there will be functions to create different
-types of triggers with actions and predicates. Those triggers can then be
-added to jobs.
+Within a workflow object, there are functions to create different
+types of triggers with actions and predicates. You then add those triggers
+to jobs.

-For all trigger types, the StartOnCreation property will be set to true by
-default, but developers will have the option to override it.
+StartOnCreation defaults to true for all trigger types, but you can
+override it if you prefer for your trigger not to start on creation.

-1. **On Demand Triggers**
+1. **On-Demand Triggers**

-On demand triggers can start glue jobs or crawlers. We’ll add convenience
-functions to create on-demand crawler or job triggers. The trigger method
-will take an optional description but abstract the requirement of an actions
-list using the job or crawler objects using conditional types.
+On-demand triggers can start glue jobs or crawlers. This construct provides
+convenience functions to create on-demand crawler or job triggers.
The constructor
+takes an optional description parameter, but abstracts the requirement of an
+actions list using the job or crawler objects using conditional types.

 ```ts
 myWorkflow = new glue.Workflow(this, "GlueWorkflow", {
   name: "MyWorkflow";
   description: "New Workflow";
   properties: {'key', 'value'};
 });

 myWorkflow.onDemandTrigger(this, 'TriggerJobOnDemand', {
   description: 'On demand run for ' + glue.JobExecutable.name,
   actions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...]
 });
 ```

 1. **Scheduled Triggers**

-Schedule triggers are a way for developers to create jobs using cron
-expressions. We’ll provide daily, weekly, and monthly convenience functions,
-as well as a custom function that will allow developers to create their own
-custom timing using the [existing event Schedule object](https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_events.Schedule.html)
-without having to build their own cron expressions. (The L2 will extract
-the expression that Glue requires from the Schedule object). The trigger method will
-take an optional description and list of Actions which can refer to Jobs or
-crawlers via conditional types.
+You can create scheduled triggers using cron expressions. This construct
+provides daily, weekly, and monthly convenience functions,
+as well as a custom function that allows you to create your own
+custom timing using the [existing event Schedule class](https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_events.Schedule.html)
+without having to build your own cron expressions. The L2 extracts
+the expression that Glue requires from the Schedule object. The constructor
+takes an optional description and a list of jobs or crawlers as actions.

 ```ts
 // Create Daily Schedule at 00 UTC
 myWorkflow.dailyScheduleTrigger(this, 'TriggerCrawlerOnDailySchedule', {
   description: 'Scheduled run for ' + glue.JobExecutable.name,
   actions: [ jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...]
 });

 // Create Weekly schedule at 00 UTC on Sunday
 myWorkflow.weeklyScheduleTrigger(this, 'TriggerJobOnWeeklySchedule', {
   description: 'Scheduled run for ' + glue.JobExecutable.name,
   actions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...]
 });

 // Create Custom schedule, e.g. Monthly on the 7th day at 15:30 UTC
 myWorkflow.customScheduleJobTrigger(this, 'TriggerCrawlerOnCustomSchedule', {
   description: 'Scheduled run for ' + glue.JobExecutable.name,
   actions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...]
   schedule: events.Schedule.cron(day: '7', hour: '15', minute: '30')
 });
 ```

 #### **3. Notify Event Triggers**

-Workflows are mandatory for this trigger type. There are two types
-of notify event triggers, batching and non-batching trigger. For batching triggers,
-developers must specify `BatchSize` but for non-batching `BatchSize` will be set
-to 1. For both triggers, `BatchWindow` will be default to 900 seconds.
+There are two types of notify event triggers: batching and non-batching.
+For batching triggers, you must specify `BatchSize`. For non-batching
+triggers, `BatchSize` defaults to 1. For both triggers, `BatchWindow`
+defaults to 900 seconds, but you can override the window to align with
+your workload's requirements.

 ```ts
 myWorkflow.notifyEventTrigger(this, 'MyNotifyTriggerBatching', {
   batchSize: int,
   jobActions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...],
   actions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ... ]
 });

 myWorkflow.notifyEventTrigger(this, 'MyNotifyTriggerNonBatching', {
   actions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...]
 });
 ```

 #### **4. Conditional Triggers**

 Conditional triggers have a predicate and actions associated with them.
-When the predicateCondition is true, the trigger actions will be executed.
+The trigger actions are executed when the predicateCondition is true.

 ```ts
 // Triggers on Job and Crawler status
 myWorkflow.conditionalTrigger(this, 'conditionalTrigger', {
   description: 'Conditional trigger for ' + myGlueJob.name,
   actions: [jobOrCrawler: glue.JobExecutable | cdk.CfnCrawler?, ...]
   predicateCondition: glue.TriggerPredicateCondition.AND,
   jobPredicates: [{'job': JobExecutable, 'state': glue.JobState.FAILED},
                   {'job': JobExecutable, 'state' : glue.JobState.SUCCEEDED}]
 });
 ```

 ### Connection Properties

 A `Connection` allows Glue jobs, crawlers and development endpoints to access
 certain types of data stores.

* **Secrets Management** -
-  **User needs to specify JDBC connection credentials in Secrets Manager and
-  provide the Secrets Manager Key name as a property to the Job connection
-  property.
+  You must specify JDBC connection credentials in Secrets Manager and
+  provide the Secrets Manager Key name as a property to the job connection.

-* **Networking - CDK determines the best fit subnet for Glue Connection
+* **Networking** - the CDK determines the best fit subnet for Glue connection
   configuration
-  **The current glue-alpha-module requires the developer to
-  specify the subnet of the Connection when it’s defined. The developer can still specify the
The developer can still specify the - specific subnet they want to use, but no longer have to. This Glue L2 RFC will - allow developers to provide only a VPC and either a public or private subnet - selection. The L2 will then leverage the existing [EC2 Subnet Selection](https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_ec2/SubnetSelection.html) + **The prior version of the glue-alpha-module requires the developer to + specify the subnet of the Connection when it’s defined. Now, you can still + specify the specific subnet you want to use, but are no longer required + to. You are only required to provide a VPC and either a public or private + subnet selection. Without a specific subnet provided, the L2 leverages the + existing [EC2 Subnet Selection](https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_ec2/SubnetSelection.html) library to make the best choice selection for the subnet. ## Public FAQ @@ -1289,7 +1310,7 @@ team are not in scope for this effort. Developers should use existing methods to create these resources, and the new Glue L2 construct assumes they already exist as inputs. While best practice is for application and infrastructure code to be as close as possible for teams using fully-implemented DevOps mechanisms, -in practice these ETL scripts will likely be managed by a data science team who +in practice these ETL scripts are likely managed by a data science team who know Python or Scala and don’t necessarily own or manage their own infrastructure deployments. We want to meet developers where they are, and not assume that all of the code resides in the same repository, Developers can