
Aws-POC merge to develop #1632

Merged
merged 38 commits into from
Jan 29, 2021
Conversation

benedeki
Collaborator

@benedeki benedeki commented Jan 11, 2021

  • Upgrade to Atum 3.2.2 // @dk1844: I think it is 3.2.0
  • Menas Docker Setup
  • Kept spline 0.3
  • Restore/Kept HDFS (file) browser in Menas
  • Restore/Kept Oozie in Menas
  • Preserving all other changes from AWS-PoC branch

Closes #1622

Zejnilovic and others added 29 commits July 2, 2020 19:48
% Conflicts:
%	dao/pom.xml
%	data-model/pom.xml
%	examples/pom.xml
%	menas/pom.xml
%	menas/src/main/resources/application.properties.template
%	migrations-cli/pom.xml
%	migrations/pom.xml
%	plugins-api/pom.xml
%	plugins-builtin/pom.xml
%	pom.xml
%	spark-jobs/pom.xml
%	utils/pom.xml
…2-snapshot

Aws poc update from develop 2.12 snapshot
 - all directly-hdfs touching stuff disabled (atum, performance measurements, info files, output path checking)

# Add menasfargate into hosts
sudo nano /etc/hosts
# paste
20.0.63.69 menasfargate
# save & exit (ctrl+O, ctrl+X)

# Running standardization works via:
spark-submit --class za.co.absa.enceladus.standardization.StandardizationJob --conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=http://menasfargate:8080 -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3}" ~/enceladusPoc/spark-jobs-2.11.0-SNAPSHOT.jar --menas-credentials-file ~/enceladusPoc/menas-credentials.properties --dataset-name dk_test1_emr285 --raw-format json --dataset-version 1 --report-date 2019-11-27 --report-version 1 2> ~/enceladusPoc/stderr.txt
…ut, s3 conf output)

 - all directly-hdfs touching stuff disabled (atum, performance measurements, info files, output path checking)

# Add menasfargate into hosts
sudo nano /etc/hosts
# paste
20.0.63.69 menasfargate
# save & exit (ctrl+O, ctrl+X)

# Running conformance works via:
spark-submit --class za.co.absa.enceladus.conformance.DynamicConformanceJob --conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=http://menasfargate:8080 -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3}" ~/enceladusPoc/spark-jobs-2.11.0-SNAPSHOT.jar --menas-credentials-file ~/enceladusPoc/menas-credentials.properties --dataset-name dk_test1_emr285 --dataset-version 1 --report-date 2019-11-27 --report-version 1 2> ~/enceladusPoc/conf-log.txt
Feature/1416 S3 input/output for Enceladus on EMR (S2.4, H2.8.5)
This is a temporary solution. We are currently experimenting with
many forms of URLs, and having a regex there now slows us down.
#1526 
* FsUtils divided into LocalFsUtils & HdfsUtils
* PathConfigSuite update
* S3FsUtils with tail-recursive pagination accumulation - now generic with optional short-circuit breakOut
TestRunnerJob updated to manually cover the cases - should serve as a basis for tests
* HdfsUtils replaced by trait DistributedFsUtils (except for MenasCredentials loading & nonSplittable splitting)
* using final version of s3-powered Atum (3.0.0)
* mockito-update version update, scalatest version update
* S3FsUtilsSuite: exists, read, sizeDir (hidden, non-hidden, recursive), non-splittable (simple, recursive with breakOut), delete (recursive), version find (simple - empty, recursive)
* explicit stubbing fix for hyperdrive
* s3 using hadoop fs api
* s3 sdk usage removed (pom, classes, tests)
* atum final version 3.1.0 used
* readStandardizationInputData(... path: String)(implicit ... fs: FileSystem) -> readStandardizationInputData(input: PathWithFs)
* #1554 Tomcat with TLS container

* #1554 Added envoy config + enabling running unencrypted container
…e commons library version. Test-ran on EMR. (#1619)
* Tests pass
* Depends on ATUM 3.1.1-SNAPSHOT (the bugfix for AbsaOSS/atum#48)
* put back Oozie
* downgraded Spline
@benedeki benedeki self-assigned this Jan 11, 2021
@benedeki benedeki added the work in progress Work on this item is not yet finished (mainly intended for PRs) label Jan 11, 2021
Conflicts addressed:
#	spark-jobs/src/test/scala/za/co/absa/enceladus/conformance/datasource/DatasourceSuite.scala
#	spark-jobs/src/test/scala/za/co/absa/enceladus/conformance/interpreter/rules/RuleOptimizationSuite.scala
#	spark-jobs/src/test/scala/za/co/absa/enceladus/conformance/interpreter/rules/RulesSuite.scala
#	spark-jobs/src/test/scala/za/co/absa/enceladus/conformance/interpreter/rules/mapping/JoinMappingRuleInterpreterSuite.scala
@benedeki benedeki removed the work in progress Work on this item is not yet finished (mainly intended for PRs) label Jan 11, 2021
jobConfig.reportVersion match {
case Some(version) => version
case None =>

// publishFs for this specific feature (needed for missing reportVersion until reusable
Contributor

Is this comment needed anymore?

Collaborator Author

I am not even sure what the comment means. 🤔 @dk1844 ?

Contributor

@dk1844 dk1844 Jan 21, 2021

I believe the comment is needed (or at least that it is not meaningless). Maybe it should be reworded though if it is unclear. The gist of the meaning is this:

Everywhere else, one would want to reuse the existing "publish FS" that resides in pathCfg.publish.fileSystem (to get an instance of HadoopFsUtils bound to that FS). Here, however, you cannot, because the pathCfg object is not yet available at this point. So the would-be publishFs is obtained temporarily from dataset.hdfsPublishPath.

That's all. The comment is just meant to explain why the call to dataset.hdfsPublishPath is seemingly repeated pointlessly here and again later in the factory method of PathConfig. It simply could not be reused reasonably.

Contributor

@dk1844 dk1844 Jan 22, 2021

My suggestion for the comment here is:

// Since "pathConfig.publish.fileSystem" is not available at this point yet, a temporary publish-FS is used here instead

This should explain the intention quite a bit and not be too complicated.
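The idea behind that suggested comment could be sketched roughly as follows. This is a hypothetical illustration only, not the actual Enceladus code: the helper name `temporaryPublishFs` is invented, and it assumes the standard Hadoop FileSystem API with a Hadoop `Configuration` in scope.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Hypothetical sketch: since "pathConfig.publish.fileSystem" is not available
// at this point yet, obtain a temporary publish FileSystem directly from the
// dataset's publish path instead of reusing the (not-yet-built) PathConfig one.
def temporaryPublishFs(hdfsPublishPath: String, hadoopConf: Configuration): FileSystem =
  FileSystem.get(new URI(hdfsPublishPath), hadoopConf)
```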

val in = fs.open(new Path(hdfsPath))
val content = Array.fill(in.available())(0.toByte)
in.readFully(content)
val tmpFile = File.createTempFile("enceladusFSUtils", "hdfsFileToLocalTemp")
tmpFile.deleteOnExit()
FileUtils.writeByteArrayToFile(tmpFile, content)
tmpFile.getAbsolutePath

// why not use
Contributor

Can this be removed?

Collaborator Author

Seems so. @dk1844 can you confirm, please?

Contributor

The comment still makes sense to me. Why, in this case, do we not use the simpler call outlined in the comment:

fs.copyToLocalFile(false, new Path(hdfsPath), new Path("someLocalName"), true)
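For comparison, the simpler variant proposed in that inline comment might look like the sketch below. It is an illustration, not the project's actual code; it assumes Hadoop's `FileSystem.copyToLocalFile(delSrc, src, dst, useRawLocalFileSystem)` overload and the same temp-file convention as the quoted snippet.

```scala
import java.io.File
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch of the simpler alternative discussed above: let Hadoop copy the
// remote file to a local temp file in one call, instead of reading all bytes
// into memory and writing them out with commons-io.
def hdfsFileToLocalTempFile(fs: FileSystem, hdfsPath: String): String = {
  val tmpFile = File.createTempFile("enceladusFSUtils", "hdfsFileToLocalTemp")
  tmpFile.deleteOnExit()
  // delSrc = false keeps the source; useRawLocalFileSystem = true avoids .crc files
  fs.copyToLocalFile(false, new Path(hdfsPath), new Path(tmpFile.getAbsolutePath), true)
  tmpFile.getAbsolutePath
}
```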

pom.xml Outdated
<scala.version>2.11.12</scala.version>
<scalatest.maven.version>2.0.0</scalatest.maven.version>
<scalatest.version>3.2.2</scalatest.version>
<scopt.version>4.0.0-RC2</scopt.version>
Contributor

Can scopt be updated to the 4.0.0 release?

dk1844 previously approved these changes Jan 13, 2021
Contributor

@dk1844 dk1844 left a comment

(just read through, just small non-blocking comments)

menas/ui/css/style.css (Outdated, resolved)
menas/ui/css/style.css (Outdated, resolved)
margin-top: 1rem !important;
}

h5[id$=datasetDetailView--scheduleTimingTitle] {
Contributor

Possible, but quite ugly. Couldn't the element just have a custom CSS class instead of relying on the generated id?

Collaborator Author

I agree it's not nice. But I don't want to add improvements (not required by the merge) into the PR. Please consider fixing it, and creating a dedicated PR for it.

menas/ui/service/RunService.js (Outdated, resolved)
menas/pom.xml (resolved)
README.md (Outdated, resolved)
scripts/bash/enceladus_env.template.sh (Outdated, resolved)
scripts/bash/run_enceladus.sh (Outdated, resolved)
* Addressed issues identified by reviewers
Collaborator Author

@benedeki benedeki left a comment

Sorry for this PR being so big. 😊

README.md (Outdated, resolved)
menas/pom.xml (resolved)
menas/ui/css/style.css (Outdated, resolved)
menas/ui/css/style.css (Outdated, resolved)
margin-top: 1rem !important;
}

h5[id$=datasetDetailView--scheduleTimingTitle] {
Collaborator Author

I agree it's not nice. But I don't want to add improvements (not required by the merge) into the PR. Please consider fixing it, and creating a dedicated PR for it.

menas/ui/service/RunService.js (Outdated, resolved)
Conflicts addressed:
#	pom.xml
#	spark-jobs/src/test/scala/za/co/absa/enceladus/conformance/interpreter/InterpreterSuite.scala
#	utils/src/test/scala/za/co/absa/enceladus/utils/config/ConfigReaderSuite.scala
HuvarVer previously approved these changes Jan 22, 2021
Contributor

@HuvarVer HuvarVer left a comment

I did some tests

spark jobs

  • complex standardization
  • conformance with almost all of the rules
  • standardization & conformance with a mapping table rule

UI

  • create and update schema
  • create and update mapping table
  • create and update dataset and conformance rules

Compared results with the data on the current develop branch.

Zejnilovic previously approved these changes Jan 29, 2021
Contributor

@Zejnilovic Zejnilovic left a comment

Code reviewed
No functional changes after Veronika's approval

@sonarqubecloud

Kudos, SonarCloud Quality Gate passed!

Bugs: A (0 bugs)
Vulnerabilities: A (0 vulnerabilities)
Security Hotspots: A (0 security hotspots)
Code Smells: A (3 code smells)

No coverage information
1.1% duplication

Contributor

@AdrianOlosutean AdrianOlosutean left a comment

Code reviewed

@benedeki benedeki merged commit 722ff05 into develop Jan 29, 2021
@benedeki benedeki deleted the aws-merge-to-develop branch January 29, 2021 15:47
Development

Successfully merging this pull request may close these issues.

Merge of aws-poc to develop branch
7 participants