Feature/1416 S3 input/output for Enceladus on EMR (S2.4, H2.8.5) #1483
- all directly-hdfs touching stuff disabled (atum, performance measurements, info files, output path checking)

# Add menasfargate into hosts
sudo nano /etc/hosts
# paste
20.0.63.69 menasfargate
# save & exit (ctrl+O, ctrl+X)

# Running standardization works via:
spark-submit --class za.co.absa.enceladus.standardization.StandardizationJob \
  --conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=http://menasfargate:8080 -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3}" \
  ~/enceladusPoc/spark-jobs-2.11.0-SNAPSHOT.jar \
  --menas-credentials-file ~/enceladusPoc/menas-credentials.properties \
  --dataset-name dk_test1_emr285 \
  --raw-format json \
  --dataset-version 1 \
  --report-date 2019-11-27 \
  --report-version 1 2> ~/enceladusPoc/stderr.txt
…ut, s3 conf output)

- all directly-hdfs touching stuff disabled (atum, performance measurements, info files, output path checking)

# Add menasfargate into hosts
sudo nano /etc/hosts
# paste
20.0.63.69 menasfargate
# save & exit (ctrl+O, ctrl+X)

# Running conformance works via:
spark-submit --class za.co.absa.enceladus.conformance.DynamicConformanceJob \
  --conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=http://menasfargate:8080 -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3}" \
  ~/enceladusPoc/spark-jobs-2.11.0-SNAPSHOT.jar \
  --menas-credentials-file ~/enceladusPoc/menas-credentials.properties \
  --dataset-name dk_test1_emr285 \
  --dataset-version 1 \
  --report-date 2019-11-27 \
  --report-version 1 2> ~/enceladusPoc/conf-log.txt
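The `standardized-{0}-{1}-{2}-{3}` suffix in `-Dstandardized.hdfs.path` looks like a `java.text.MessageFormat` pattern. A minimal sketch of how such a template could be resolved is shown below. It assumes the placeholders stand for dataset name, dataset version, report date, and report version, in that order; this ordering, the object name `StdPathTemplate`, and the bucket name in the usage line are illustrative assumptions, not taken from the Enceladus code.

```scala
import java.text.MessageFormat

// Hypothetical helper, for illustration only (not part of Enceladus).
object StdPathTemplate {
  // Fills {0}..{3} in the template. Numbers are passed as strings so that
  // MessageFormat does not apply locale-specific number formatting.
  def resolve(template: String,
              datasetName: String,
              datasetVersion: Int,
              reportDate: String,
              reportVersion: Int): String =
    MessageFormat.format(
      template,
      datasetName,
      datasetVersion.toString,
      reportDate,
      reportVersion.toString
    )
}
```

With the arguments used in the commands above, `StdPathTemplate.resolve("s3://some-bucket/stdOutput/standardized-{0}-{1}-{2}-{3}", "dk_test1_emr285", 1, "2019-11-27", 1)` would yield a path ending in `standardized-dk_test1_emr285-1-2019-11-27-1`.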
da8d4b8 to 67e4012
<dependency>
    <groupId>software.amazon.awssdk</groupId>
    <artifactId>bom</artifactId>
    <version>${aws.java.sdk.version}</version>
Note that the SDK is not used yet; the dependency is added now because its use is expected based on the S3 PoC.
I read the code and have nothing more to add, given that it works on EMR.
// TODO fix for s3 [ref issue #1416]
//withPartCols.writeInfoFile(preparationResult.pathCfg.publishPath)
//writePerformanceMetrics(preparationResult.performance, cmd)

if (conformanceReader.isAutocleanStdFolderEnabled()) {
I think this should also be commented out.
Thanks, done.
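As a possible alternative to commenting out the HDFS-only steps, they could be skipped at run time when the target path is on S3. The sketch below is only an illustration of that idea: `PathSchemes`, `isS3Path`, and `runIfNotS3` are hypothetical names, and the list of prefixes treated as S3 is an assumption.

```scala
// Hypothetical helper, for illustration only (not part of Enceladus).
object PathSchemes {
  // Prefixes treated as S3 in this sketch (an assumption; adjust as needed).
  private val s3Prefixes = Seq("s3://", "s3a://", "s3n://")

  // True when the path starts with one of the S3 prefixes (case-insensitive).
  def isS3Path(path: String): Boolean = {
    val lower = path.toLowerCase
    s3Prefixes.exists(prefix => lower.startsWith(prefix))
  }

  // Runs the given HDFS-only step unless the path is on S3.
  // Returns true if the step ran, false if it was skipped.
  def runIfNotS3(path: String)(step: => Unit): Boolean =
    if (isS3Path(path)) {
      false
    } else {
      step
      true
    }
}
```

A call such as `PathSchemes.runIfNotS3(publishPath) { /* HDFS-only cleanup */ }` would then leave the HDFS behaviour unchanged while avoiding failures on `s3://` paths; `publishPath` here is a placeholder for whatever path the job is writing to.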
  case Some(p) => p
}
// TODO fix for s3 [ref issue #1416]
val recordCount = 100
-1 would be better if there is no check that the value is > 0, because 100 looks like a valid value.
Thanks, took your advice (=> -1; #1416 (comment))
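To illustrate the point about a placeholder that cannot be mistaken for real data, here is a minimal sketch of treating -1 as "unknown" at the point of use. The names `RecordCounts`, `UnknownRecordCount`, `toOption`, and `describe` are hypothetical and not taken from the PR.

```scala
// Hypothetical helper, for illustration only (not part of Enceladus).
object RecordCounts {
  // Sentinel meaning "count not available" (e.g. while the S3 path is not yet supported).
  val UnknownRecordCount: Long = -1L

  // Converts a raw count into an Option; any negative value is treated as unknown.
  def toOption(count: Long): Option[Long] =
    if (count >= 0) Some(count) else None

  // Renders a count for logs without presenting the sentinel as a real number.
  def describe(count: Long): String = toOption(count) match {
    case Some(n) => s"$n records"
    case None    => "record count unavailable"
  }
}
```

With this, `RecordCounts.describe(RecordCounts.UnknownRecordCount)` gives "record count unavailable", whereas a placeholder of 100 would be reported as "100 records" and be indistinguishable from a real measurement.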
I wanted to approve. My one comment is non-blocking.
85570d7
What's the Menas load-balanced IP for? Can we use the internal ELB hostname instead?
Your question is spot on, so I replaced the IP with the internal ELB hostname as suggested. The IP would work too, but I agree the hostname should be a more durable solution.
# Conflicts:
#	pom.xml
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No Coverage information.
This PoC allows running Enceladus on S3. The following preconditions apply:
(This is more durable than defining it in /etc/hosts, e.g. as 20.0.62.173 menas-load-balanced.)
Also, a couple of things have been disabled (commented out) so that Enceladus does not fail when HDFS code attempts to use paths starting with s3://:
:Standardization
To run standardization
Conformance
To run conformance
Test-run output
The standardization and conformance runs successfully output the data to S3: