Feature/1416 S3 input/output for Enceladus on EMR (S2.4, H2.8.5) #1483

Merged
merged 7 commits into aws-poc from feature/1416-aws-emr-poc
Aug 24, 2020

Conversation

dk1844 (Contributor) commented Aug 13, 2020

This PoC allows Enceladus to run using S3. The following preconditions apply:

  • Have a local environment variable with your Menas hostname (e.g. a load-balanced DNS record) defined:
export MENAS_URL="http://menas-fargate-elb.ctodatadev.aws.dsarena.com"

(This is a more durable way than defining the hostname in /etc/hosts as e.g. 20.0.62.173 menas-load-balanced.)

Also, a couple of things have been disabled (commented out) so that Enceladus does not fail when HDFS-bound code attempts to use paths starting with s3:// (see the commented-out calls quoted in the review discussion below):

  • Atum integration (checkpoints, ...)
  • Performance metrics
  • INFO files
  • Existing output checking

Standardization

To run standardization:

spark-submit --class za.co.absa.enceladus.standardization.StandardizationJob \
--conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=$MENAS_URL -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3} -Dspline.producer.url=http://spline-rest-gateway-fargate.ctodatadev.aws.dsarena.com:8080/producer" \
~/enceladusPoc/spark-jobs-2.12.0-AWS-SNAPSHOT.jar \
--menas-credentials-file ~/enceladusPoc/menas-credentials.properties \
--dataset-name dk_test1_emr285 --raw-format json --dataset-version 1 --report-date 2019-11-27 --report-version 1
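
For reference, the {0}-{3} placeholders in -Dstandardized.hdfs.path expand to dataset name, dataset version, report date, and report version. The sketch below illustrates the expansion using java.text.MessageFormat (whether Enceladus literally uses MessageFormat internally is an assumption here; the result simply matches the stdOutput listing in the test-run output below):

import java.text.MessageFormat

// Placeholder order: {0} = dataset name, {1} = dataset version,
// {2} = report date, {3} = report version
val template = "standardized-{0}-{1}-{2}-{3}"
val expanded = MessageFormat.format(template, "dk_test1_emr285", "1", "2019-11-27", "1")
// expanded == "standardized-dk_test1_emr285-1-2019-11-27-1"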

Conformance

To run conformance:

spark-submit --class za.co.absa.enceladus.conformance.DynamicConformanceJob \
--conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=$MENAS_URL -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3} -Dspline.producer.url=http://spline-rest-gateway-fargate.ctodatadev.aws.dsarena.com:8080/producer" \
~/enceladusPoc/spark-jobs-2.12.0-AWS-SNAPSHOT.jar \
--menas-credentials-file ~/enceladusPoc/menas-credentials.properties \
--dataset-name dk_test1_emr285 --dataset-version 1 --report-date 2019-11-27 --report-version 1

Test-run output

A successful standardization & conformance run outputs the data onto S3:

$ aws --profile saml s3 ls --recursive s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc
2020-08-17 13:47:59          0 enceladusPoc/ao-hdfs-data/confOutput/enceladus_info_date=2019-11-27/enceladus_info_version=1/_SUCCESS
2020-08-17 13:47:59      23307 enceladusPoc/ao-hdfs-data/confOutput/enceladus_info_date=2019-11-27/enceladus_info_version=1/part-00000-c12ec4db-7ce9-473c-9ac2-c8f8e5775e5c-c000.snappy.parquet
2020-08-17 13:46:25          0 enceladusPoc/ao-hdfs-data/stdOutput/standardized-dk_test1_emr285-1-2019-11-27-1/_SUCCESS
2020-08-17 13:46:25      22377 enceladusPoc/ao-hdfs-data/stdOutput/standardized-dk_test1_emr285-1-2019-11-27-1/part-00000-ae351131-b2dd-4144-8539-d5379c263fd2-c000.snappy.parquet
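
A quick way to sanity-check the written data is to read it back with Spark; a minimal sketch, assuming a spark-shell on the same EMR cluster (where the s3:// scheme resolves via EMRFS):

// Read the conformed output back and inspect it
val conformed = spark.read.parquet(
  "s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/confOutput")
conformed.printSchema()
println(s"conformed rows: ${conformed.count()}")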

 - all directly HDFS-touching features disabled (Atum, performance measurements, INFO files, output path checking)

# Add menasfargate into hosts
sudo nano /etc/hosts
# paste
20.0.63.69 menasfargate
# save & exit (ctrl+O, ctrl+X)

# Running standardization works via:
spark-submit --class za.co.absa.enceladus.standardization.StandardizationJob --conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=http://menasfargate:8080 -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3}" ~/enceladusPoc/spark-jobs-2.11.0-SNAPSHOT.jar --menas-credentials-file ~/enceladusPoc/menas-credentials.properties --dataset-name dk_test1_emr285 --raw-format json --dataset-version 1 --report-date 2019-11-27 --report-version 1 2> ~/enceladusPoc/stderr.txt
…ut, s3 conf output)

 - all directly HDFS-touching features disabled (Atum, performance measurements, INFO files, output path checking)

# Add menasfargate into hosts
sudo nano /etc/hosts
# paste
20.0.63.69 menasfargate
# save & exit (ctrl+O, ctrl+X)

# Running conformance works via:
spark-submit --class za.co.absa.enceladus.conformance.DynamicConformanceJob --conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=http://menasfargate:8080 -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3}" ~/enceladusPoc/spark-jobs-2.11.0-SNAPSHOT.jar --menas-credentials-file ~/enceladusPoc/menas-credentials.properties --dataset-name dk_test1_emr285 --dataset-version 1 --report-date 2019-11-27 --report-version 1 2> ~/enceladusPoc/conf-log.txt
<dependency>
    <groupId>software.amazon.awssdk</groupId>
    <artifactId>bom</artifactId>
    <version>${aws.java.sdk.version}</version>
    <type>pom</type>   <!-- standard BOM import; assumed from the usual dependencyManagement pattern -->
    <scope>import</scope>
</dependency>
dk1844 (Contributor, Author) commented:

Note that the SDK is not being used yet, but support is added now because its usage is expected based on the S3 PoC.
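
A minimal sketch of the kind of SDK usage expected later (illustrative only: nothing in this PR calls the SDK yet, and the bucket/prefix below are just the PoC values from above):

import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request

// List the PoC output objects with the AWS SDK v2 client; region and
// credentials are taken from the default provider chains.
val s3 = S3Client.create()
val request = ListObjectsV2Request.builder()
  .bucket("euw1-ctodatadev-dev-bigdatarnd-s3-poc")
  .prefix("enceladusPoc/")
  .build()
s3.listObjectsV2(request).contents().forEach(o => println(o.key()))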

AdrianOlosutean (Contributor) left a comment:

I read the code and don't have more to add, given that it works on EMR.


// TODO fix for s3 [ref issue #1416]
//withPartCols.writeInfoFile(preparationResult.pathCfg.publishPath)
//writePerformanceMetrics(preparationResult.performance, cmd)

if (conformanceReader.isAutocleanStdFolderEnabled()) {
A contributor commented:

I think this should also be commented out

dk1844 (Contributor, Author) replied:

Thanks, done.

Zejnilovic (Contributor) left a comment:

If I saw correctly, getting spark-jobs to run on the cloud is, for now, mostly about stripping support features, and we will look into #1416 again in wave 2.
Anyway, could you write some conclusion/documentation along these lines in #1416?

case Some(p) => p
}
// TODO fix for s3 [ref issue #1416]
val recordCount = 100
A contributor commented:

-1 would be better if there is no check for it to be > 0, because 100 looks like a valid value.

dk1844 (Contributor, Author) commented Aug 17, 2020:

Thanks, took your advice (=> -1; #1416 (comment))
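
For clarity, the agreed end state is a sketch like this (not the literal diff):

// -1 is an out-of-range sentinel meaning "record count not determined",
// whereas the earlier hardcoded 100 could pass for a genuine count.
// TODO fix for s3 [ref issue #1416]
val recordCount = -1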

Zejnilovic previously approved these changes Aug 17, 2020
Zejnilovic (Contributor) left a comment:

I wanted to approve. My one comment is non-blocking.

dk1844 marked this pull request as ready for review August 17, 2020 12:50
lokm01 previously approved these changes Aug 17, 2020
lokm01 (Collaborator) commented Aug 17, 2020:

What's the Menas load-balanced IP for?

Can we use the internal ELB hostname instead?

dk1844 (Contributor, Author) commented Aug 17, 2020:

> What's the Menas load-balanced IP for?
> Can we use the internal ELB hostname instead?

Your question is spot on, so I replaced it with the internal ELB hostname as suggested. The IP would work too, but the hostname is a more durable solution, I agree.

sonarcloud bot commented Aug 24, 2020:

Kudos, SonarCloud Quality Gate passed!

Bugs: A (0 Bugs)
Vulnerabilities: A (0 Vulnerabilities, 0 Security Hotspots to review)
Code Smells: A (42 Code Smells)

No coverage information
Duplication: 0.0%

dk1844 merged commit 0b60b1a into aws-poc on Aug 24, 2020
dk1844 deleted the feature/1416-aws-emr-poc branch August 24, 2020 15:20