
Spark ETL FLEX jobs - L2 Constructs #2

Merged: 4 commits into main on Jun 18, 2024
Conversation

pras-b (Collaborator) commented Apr 12, 2024

Issue # (if applicable)

https://app.asana.com/0/1204099321156022/1204826115759041/f

Description of changes

Per RFC#498, added L2 constructs for Spark FLEX ETL jobs (Python & Scala).

Added integration and unit tests for both constructs (Python & Scala ETL Flex jobs).

Refer to the RFC for more details (Spark ETL FLEX jobs): aws/aws-cdk-rfcs#497

Description of how you validated changes

Validated through Integration & Unit testing

Checklist


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

command: {
name: JobType.ETL,
scriptLocation: this.codeS3ObjectUrl(props.script),
pythonVersion: PythonVersion.THREE,
Owner:
Is this required given that it is a Scala Flex job?

pras-b (Author):
Removed it...
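The fix discussed in this thread can be sketched as follows: only PySpark jobs should carry a `pythonVersion` in the job `command` block, while Scala Flex jobs omit it. The types below are simplified stand-ins for the real `aws-glue-alpha` enums (`JobType`, `PythonVersion`), not the actual construct code.

```typescript
// Simplified model of the Glue job `command` block from this diff.
type JobLanguage = 'python' | 'scala';

interface JobCommand {
  name: string;           // e.g. 'glueetl'
  scriptLocation: string; // S3 URL of the job script
  pythonVersion?: string; // only meaningful for Python jobs
}

function buildCommand(language: JobLanguage, scriptLocation: string): JobCommand {
  const command: JobCommand = { name: 'glueetl', scriptLocation };
  // Omit pythonVersion entirely for Scala Flex jobs, per the review comment.
  if (language === 'python') {
    command.pythonVersion = '3';
  }
  return command;
}
```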


export interface PySparkFlexEtlJobProps extends JobProperties {

/**
Collaborator:
Are these in the RFC? Cannot see them

pras-b (Author):
These are in the Props interface; they are not required to be passed during invocation.

const args: { [key: string]: string } = {};
args['--job-language'] = JobLanguage.PYTHON;

// TODO: Confirm with Glue service team what the mapping is from extra-x to job language, if any
Collaborator:
Is this TODO still valid?

pras-b (Author) commented Apr 16, 2024:
unsure...
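The snippet under discussion builds the job's default-argument map, starting with the documented Glue special parameter `--job-language`. A minimal sketch of that pattern (the exact mapping from the legacy `--extra-*` flags to job language, raised in the TODO above, is left open here as well):

```typescript
// Start the Glue default-arguments map with the job language.
// '--job-language' is a documented AWS Glue special job parameter.
function baseDefaultArguments(jobLanguage: 'python' | 'scala'): { [key: string]: string } {
  const args: { [key: string]: string } = {};
  args['--job-language'] = jobLanguage;
  return args;
}
```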

return args;
}

private setupSparkUI(role: iam.IRole, sparkUiProps: SparkUIProps) {
Collaborator:
Can we add some comments here describing the function's purpose, including the arguments and return types?

pras-b (Author):
done
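The kind of documentation requested for `setupSparkUI` might look like the sketch below. The argument and return shapes are illustrative stand-ins (the real method takes an `iam.IRole` and `SparkUIProps`); the two default arguments are the documented Glue parameters for enabling the Spark UI.

```typescript
interface SparkUISetupResult {
  location: { bucketName: string; prefix?: string };
  args: { [key: string]: string };
}

/**
 * Configures the Glue Spark UI for this job.
 *
 * @param bucketName - S3 bucket that will receive Spark event logs
 * @param prefix - optional key prefix inside the bucket
 * @returns the resolved logging location plus the default arguments
 *          ('--enable-spark-ui', '--spark-event-logs-path') to merge
 *          into the job's defaultArguments
 */
function setupSparkUI(bucketName: string, prefix?: string): SparkUISetupResult {
  const path = `s3://${bucketName}/${prefix ?? ''}`;
  return {
    location: { bucketName, prefix },
    args: {
      '--enable-spark-ui': 'true',
      '--spark-event-logs-path': path,
    },
  };
}
```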

* S3 URL where additional python dependencies are located
* @default - no extra files
*/
readonly extraPythonFiles?: string[];
Collaborator:
Should this be in this class? This is for Scala

pras-b (Author):
Removed it from the Scala props interface.

return args;
}

private setupSparkUI(role: iam.IRole, sparkUiProps: SparkUIProps) {
Collaborator:
Can we add a comment describing what this function does?

pras-b (Author):
Done

role: iam_role,
});

/*new glue.PySparkFlexEtlJob(stack, 'BasicPySparkFlexEtlJobv3', {
Collaborator:
Does this need to be commented out?

pras-b (Author):
This is in the test file. I left it there since it can be reused for future test cases.

natalie-white-aws (Collaborator) left a comment:
A few small questions / removals

...this.checkNoReservedArgs(props.defaultArguments),
};

/*if ((!props.workerType && props.numberOrWorkers !== undefined) || (props.workerType && props.numberOrWorkers === undefined)) {
Collaborator:
Remove. We're enforcing contractual defaults so both will always be set whether they're overridden or not.

pras-b (Author):
Yes, removed it.
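The "contractual defaults" decision above replaces the removed validation: instead of checking that `workerType` and `numberOfWorkers` are both set or both unset, the construct always resolves a default for each. A sketch under assumed names (the `'G.1X'` default is illustrative; the fallback of 10 workers matches this diff):

```typescript
// Simplified stand-in for the worker-related props on JobProperties.
interface WorkerProps {
  workerType?: string;      // e.g. 'G.1X'
  numberOfWorkers?: number;
}

// Both values are always resolved, whether or not the caller overrides them,
// so no both-or-neither validation is needed.
function resolveWorkers(props: WorkerProps): Required<WorkerProps> {
  return {
    workerType: props.workerType ?? 'G.1X',       // assumed default type
    numberOfWorkers: props.numberOfWorkers ?? 10, // fallback of 10 from the diff
  };
}
```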

//args['--extra-py-files'] = props.extraPythonFiles.map(code => this.codeS3ObjectUrl(code)).join(',');
}

// if (props.extraJars && props.extraJars?.length > 0) {
Collaborator:
Clarify whether extraFiles is used for all languages, but definitely don't need extraJars in the pyspark job.

pras-b (Author):
Removed extraJars from the PySpark job. Added --extra-py-files and --extra-files to the PySpark Flex ETL job. The glueVersion property is defined in interfaces in both job-execution.ts (SharedJobExecutableProps) and job.ts (JobProperties); the two definitions conflict, so the SharedJobExecutableProps interface could not be used.


}

/**
Collaborator:
copy-paste?

pras-b (Author):
Fixed.

public readonly sparkUILoggingLocation?: SparkUILoggingLocation;

/**
* PySparkFlexEtlJob constructor
Collaborator:
copy-paste?

pras-b (Author):
Fixed.

numberOfWorkers: props.numberOrWorkers ? props.numberOrWorkers : 10,
maxRetries: props.maxRetries,
executionProperty: props.maxConcurrentRuns ? { maxConcurrentRuns: props.maxConcurrentRuns } : undefined,
//notificationProperty: props.notifyDelayAfter ? { notifyDelayAfter: props.notifyDelayAfter.toMinutes() } : undefined,
Collaborator:
Why is this commented out?

pras-b (Author):
It is optional; it was also commented out in the PySpark ETL class. I have now uncommented it, so it is an optional parameter to pass.
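The now-uncommented mapping is a common optional-prop pattern: render the CloudFormation notification property only when the caller supplies a delay. A sketch with a plain minutes number standing in for the real `Duration` prop:

```typescript
// Render Glue's notificationProperty only when notifyDelayAfter is supplied;
// otherwise leave it undefined so CloudFormation omits the block entirely.
function renderNotificationProperty(
  notifyDelayAfterMinutes?: number,
): { notifyDelayAfter: number } | undefined {
  return notifyDelayAfterMinutes !== undefined
    ? { notifyDelayAfter: notifyDelayAfterMinutes }
    : undefined;
}
```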

numberOfWorkers: props.numberOrWorkers ? props.numberOrWorkers : 10,
maxRetries: props.maxRetries,
executionProperty: props.maxConcurrentRuns ? { maxConcurrentRuns: props.maxConcurrentRuns } : undefined,
//notificationProperty: props.notifyDelayAfter ? { notifyDelayAfter: props.notifyDelayAfter.toMinutes() } : undefined,
Collaborator:
Why is this commented out?

pras-b (Author):
It is optional; it was also commented out in the PySpark ETL class. I have now uncommented it, so it is an optional parameter to pass.

//args['--extra-py-files'] = props.extraPythonFiles.map(code => this.codeS3ObjectUrl(code)).join(',');
}

// if (props.extraJars && props.extraJars?.length > 0) {
Collaborator:
Confirm extraFiles requirements

pras-b (Author):
Added --extra-jars, --extra-files, and --user-jars-first to the Scala Spark Flex job; these are optional. The glueVersion property conflicts between the interfaces in job-execution.ts (SharedJobExecutableProps) and job.ts (JobProperties), so the properties from job-execution.ts could not be used.
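The optional Scala Flex job arguments described above can be sketched like this. `--extra-jars`, `--extra-files`, and `--user-jars-first` are documented Glue special parameters; the props shape is a simplified stand-in for the real construct props.

```typescript
// Simplified stand-in for the Scala Flex ETL job props in this PR.
interface ScalaExtraProps {
  extraJars?: string[];      // S3 URLs of additional jars
  extraFiles?: string[];     // S3 URLs of additional files
  extraJarsFirst?: boolean;  // put user jars ahead of Glue's on the classpath
}

// Render only the arguments the caller actually supplied; all are optional.
function renderScalaExtraArgs(props: ScalaExtraProps): { [key: string]: string } {
  const args: { [key: string]: string } = {};
  if (props.extraJars && props.extraJars.length > 0) {
    args['--extra-jars'] = props.extraJars.join(',');
  }
  if (props.extraFiles && props.extraFiles.length > 0) {
    args['--extra-files'] = props.extraFiles.join(',');
  }
  if (props.extraJarsFirst) {
    args['--user-jars-first'] = 'true';
  }
  return args;
}
```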

@natalie-white-aws natalie-white-aws merged commit f976c88 into main Jun 18, 2024