
s3 persistence (atum, sdk fs usage, ...) #1526

Merged · merged 27 commits into aws-poc from feature/1498-atum-s3 on Oct 16, 2020

Conversation

@dk1844 (Contributor) commented Sep 18, 2020

Implementation of Enceladus on S3 wave 2:

  • includes Atum for s3 v3.0.0
  • FileSystemVersionUtils was renamed to HdfsUtils, and direct use of HdfsUtils was then replaced by a trait DistributedFsUtils backed by the HdfsUtils and S3FsUtils implementations (see the sketch below)
    - the AWS S3 SDK is used for the S3FsUtils implementation (measurements, cleanup, guards), with unit tests for it
  • scalaTest version updated, mockito-scala added
  • the non-splittable-to-temporary splittable conversion is not used for S3.
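
For context, a minimal sketch of the trait split described above (the method set is illustrative, not the PR's exact interface):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Callers depend on this abstraction instead of a concrete HdfsUtils.
trait DistributedFsUtils {
  def exists(path: String): Boolean
  def getDirectorySize(path: String): Long
  def deleteRecursively(path: String): Unit
}

// HDFS-backed implementation delegating to the Hadoop FileSystem API;
// S3FsUtils implements the same trait on top of the AWS S3 SDK.
class HdfsUtils(conf: Configuration) extends DistributedFsUtils {
  private val fs = FileSystem.get(conf)
  override def exists(path: String): Boolean = fs.exists(new Path(path))
  override def getDirectorySize(path: String): Long =
    fs.getContentSummary(new Path(path)).getLength
  override def deleteRecursively(path: String): Unit =
    fs.delete(new Path(path), true)
}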

The S3 SDK usage relies on the following config values being present (an illustrative config sketch follows the list):

  • s3.region
  • s3.kmsKeyId (expected to be supplied externally as an override)
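
For illustration, a sketch of how these might be supplied (the region value and the override mechanism shown are examples, not the project's actual defaults):

# reference.conf / application.conf (HOCON)
s3.region = "eu-west-1"   # example region
# s3.kmsKeyId deliberately has no default here; it is supplied externally, e.g.:
#   spark-submit --conf "spark.driver.extraJavaOptions=-Ds3.kmsKeyId=<kms-key-arn>" ...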

HdfsUtils is still used directly (rather than via the trait, which would make it substitutable by S3FsUtils) in HyperConformance and in MenasCredentials (whether this is desired can be discussed).

Test run and its effect: a diff of the _INFO file for raw vs post-standardization-post-conformance:

$ diff <(cat raw_info3 | jq ".") <(cat publish_info3 | jq ".")
10c10,60
<     "additionalInfo": {}
---
>     "additionalInfo": {
>       "conform_size_ratio": "98.86 %",
>       "std_yarn_deploy_mode": "client",
>       "conform_username": "user",
>       "conform_input_data_size": "45330",
>       "raw_format": "csv",
>       "conform_output_data_size": "47020",
>       "std_application_id": "application_1590755537141_0258",
>       "source_record_count": "223929",
>       "std_spark_master": "yarn",
>       "conform_output_dir_size": "47020",
>       "conform_records_failed": "131",
>       "std_executor_memory": "4743M",
>       "std_executors_num": "2",
>       "conform_errors_count": "131",
>       "std_username": "user",
>       "conform_driver_memory": "2048M",
>       "std_output_data_size": "45330",
>       "std_records_succeeded": "512",
>       "std_records_failed": "131",
>       "std_executor_cores": "2",
>       "conform_enceladus_version": "2.12.0-AWS-SNAPSHOT",
>       "conform_spark_master": "yarn",
>       "std_errors_count": "131",
>       "conform_executor_memory": "4743M",
>       "std_output_dir_size": "47337",
>       "std_input_dir_size": "82922",
>       "conform_executors_num": "2",
>       "std_output_dir": "s3://MY_BUCKET/superhero2/std/standardized-dk_test1_emr285-2-2020-08-06-1",
>       "conform_input_dir": "s3://MY_BUCKET/superhero2/std/standardized-dk_test1_emr285-2-2020-08-06-1",
>       "enceladus_info_version": "1",
>       "std_input_dir": "s3://MY_BUCKET/superhero/raw/2020/08/06/v1",
>       "std_enceladus_version": "2.12.0-AWS-SNAPSHOT",
>       "conform_records_succeeded": "512",
>       "conform_data_size_ratio": "98.86 %",
>       "conform_cmd_line_args": "--menas-credentials-file /home/ec2-user/enceladusPoc/menas-credentials.properties --dataset-name dk_test1_emr285 --dataset-version 2 --report-date 2020-08-06 --report-version 1",
>       "std_input_data_size": "81865",
>       "std_cmd_line_args": "--menas-credentials-file /home/ec2-user/enceladusPoc/menas-credentials.properties --dataset-name dk_test1_emr285 --dataset-version 2 --report-date 2020-08-06 --report-version 1 --raw-format csv --header true",
>       "conform_record_count": "643",
>       "conform_executor_cores": "2",
>       "conform_input_dir_size": "47564",
>       "std_driver_memory": "2048M",
>       "std_size_ratio": "54.67 %",
>       "conform_yarn_deploy_mode": "client",
>       "std_record_count": "643",
>       "conform_output_dir": "s3://MY_BUCKET/superhero2/publish/enceladus_info_date=2020-08-06/enceladus_info_version=1",
>       "conform_application_id": "application_1590755537141_0259",
>       "raw_record_count": "223929",
>       "enceladus_info_date": "2020-08-06",
>       "std_data_size_ratio": "54.67 %"
>     }
11a62
>   "runUniqueId": "3270723c-5a80-43b1-ab0a-cb206b6e1559",
22c73
<           "controlType": "controlType.count",
---
>           "controlType": "count",
24c75
<           "controlValue": 223929
---
>           "controlValue": "223929"
37c88
<           "controlType": "controlType.count",
---
>           "controlType": "count",
39c90,141
<           "controlValue": 223929
---
>           "controlValue": "223929"
>         }
>       ]
>     },
>     {
>       "name": "Standardization - End",
>       "software": "Atum",
>       "version": "0.3.0-SNAPSHOT",
>       "processStartTime": "18-09-2020 10:03:49 +0000",
>       "processEndTime": "18-09-2020 10:04:01 +0000",
>       "workflowName": "Standardization",
>       "order": 3,
>       "controls": [
>         {
>           "controlName": "recordcount",
>           "controlType": "count",
>           "controlCol": "*",
>           "controlValue": "643"
>         }
>       ]
>     },
>     {
>       "name": "Conformance - Start",
>       "software": "Atum",
>       "version": "0.3.0-SNAPSHOT",
>       "processStartTime": "18-09-2020 10:09:56 +0000",
>       "processEndTime": "18-09-2020 10:10:06 +0000",
>       "workflowName": "Conformance",
>       "order": 4,
>       "controls": [
>         {
>           "controlName": "recordcount",
>           "controlType": "count",
>           "controlCol": "*",
>           "controlValue": "643"
>         }
>       ]
>     },
>     {
>       "name": "Conformance - End",
>       "software": "Atum",
>       "version": "0.3.0-SNAPSHOT",
>       "processStartTime": "18-09-2020 10:10:06 +0000",
>       "processEndTime": "18-09-2020 10:10:08 +0000",
>       "workflowName": "Conformance",
>       "order": 5,
>       "controls": [
>         {
>           "controlName": "recordcount",
>           "controlType": "count",
>           "controlCol": "*",
>           "controlValue": "643"

(bucket name is redacted and replaced by MY_BUCKET)

Sorry for the PR being that large.

…za.co.absa.atum.core.ControlFrameworkState.storeCurrentInfoFile - hdfs dependent, yet

TODO = fsUtils replacement for dir/files sizing

Successfully ran on EMR; the following measurement was added to the _INFO file:
 {
      "name": "Standardization - End",
      "software": "Atum",
      "version": "0.3.0-SNAPSHOT",
      "processStartTime": "03-09-2020 10:50:35 +0000",
      "processEndTime": "03-09-2020 10:50:49 +0000",
      "workflowName": "Standardization",
      "order": 3,
      "controls": [
        {
          "controlName": "recordcount",
          "controlType": "count",
          "controlCol": "*",
          "controlValue": "643"
        }
      ]
    }
…on over 1000 objects.

TestRunnerJob to naively test the s3 impl.
…ith optional short-circuit breakOut

TestRunnerJob updated to manually cover the cases - should serve as a basis for tests
…he recursive-accumulative method (+continue control measure removed)
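
(For context on the pagination commits above, here is a sketch of the tail-recursive, accumulating listing with an optional short-circuit breakOut, assuming the AWS SDK v2 S3 client; names and signatures are illustrative, not the PR's exact code.)

import scala.annotation.tailrec
import scala.collection.JavaConverters._
import software.amazon.awssdk.services.s3.S3Client
import software.amazon.awssdk.services.s3.model.{ListObjectsV2Request, S3Object}

object S3PaginationSketch {
  // Folds over all pages of a listing (S3 returns at most 1000 keys per call),
  // stopping early as soon as `breakOut` declares the accumulator final.
  @tailrec
  def foldPages[A](s3: S3Client, bucket: String, prefix: String, acc: A,
                   token: Option[String] = None)
                  (fold: (A, Seq[S3Object]) => A)(breakOut: A => Boolean): A = {
    val base = ListObjectsV2Request.builder().bucket(bucket).prefix(prefix)
    val request = token.fold(base)(base.continuationToken).build()
    val response = s3.listObjectsV2(request)
    val nextAcc = fold(acc, response.contents.asScala)
    if (breakOut(nextAcc) || !response.isTruncated) nextAcc
    else foldPages(s3, bucket, prefix, nextAcc,
      Option(response.nextContinuationToken))(fold)(breakOut)
  }
}

A caller that only needs to know whether any object matches (e.g. a non-splittable file guard) can pass a breakOut that returns true as soon as the accumulator flips, so the remaining pages are never fetched.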
@dk1844 added the 'work in progress' label (work on this item is not yet finished; mainly intended for PRs) on Sep 18, 2020

- val localKeyTabPath = fsUtils.getLocalPathToFile(path)
+ val localKeyTabPath = fsUtils.getLocalPathToFileOrCopyToLocal(path)
@dk1844 (Contributor, Author) commented Sep 29, 2020:

The original logic is to load from a local path and then fall back to the same path on HDFS. Do we want the option to fall back to a menas-credentials file on S3?
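
(For context, a sketch of the local-then-HDFS fallback under discussion, assuming Hadoop's FileSystem API; this is illustrative, not the exact implementation:)

import java.io.File
import java.nio.file.Files
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object CredentialsPathSketch {
  private val fs = FileSystem.get(new Configuration())

  // Prefer the file at the local path; otherwise try the same path on HDFS
  // and copy it to a local temp file.
  def getLocalPathToFileOrCopyToLocal(path: String): String =
    if (new File(path).exists()) {
      path
    } else {
      val localTmp = Files.createTempFile("menas-credentials", ".properties")
      fs.copyToLocalFile(new Path(path), new Path(localTmp.toString))
      localTmp.toString
    }
}

With S3 the second branch is what gets awkward: a plain local path has no bucket, so there is nothing equivalent to fall back to.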

Contributor:

Are you asking if we want the ability to read the menas-credentials file from S3 in general? The answer is yes.

Or in some special case?

@dk1844 (Contributor, Author) commented Sep 30, 2020:

Thanks, yes, the former (the ability to read the menas-credentials file from S3 in general).

However, the slight problem with the current flow is the fallback itself: the path is looked up locally, and if it doesn't exist, the same path is tried on HDFS and copied to a local temp location. That can reasonably be done with the local fs and HDFS; but with S3, how do you fall back from a local path that does not exist on S3? (You need a bucket, for example.)

We can of course make it work with a file from S3 too, but the fallback is awkward here.

What is the point of such a fallback anyway? Isn't it more confusing than helpful?

Contributor:

I think the fallback is not necessary.


// todo: remove, or create an integration test like this instead.
// the implementation is directly suited to be runnable locally with a SAML profile.
object S3FsUtilsTestJob {
@dk1844 (Contributor, Author) commented Sep 29, 2020:

This entrypoint is designed entirely for test-running the S3FsUtils on a local machine against real S3 using the local SAML profile. A future integration test can be heavily inspired by this class, so maybe it could stay until then?

If agreed, we could create such an issue and point the todo at it? (discussion pending)

Contributor:

I would say it can stay, but use different paths in S3.

@dk1844 (Contributor, Author):

I have changed the base path; what is left can reasonably serve as a generic example.

@dk1844 changed the title from 'Feature/1498 atum s3' to 's3 persistence (atum, sdk fs usage, ...)' on Sep 30, 2020
spark-jobs/src/main/resources/reference.conf (outdated; resolved)
)

cmd.rowTag.foreach(rowTag => Atum.setAdditionalInfo("xml_row_tag" -> rowTag))
cmd.csvDelimiter.foreach(delimiter => Atum.setAdditionalInfo("csv_delimiter" -> delimiter))
Contributor:

Why do we only provide these 2 as additional info? This might be a mistake even in the master branch.

@dk1844 (Contributor, Author):

This was in the code before; I haven't changed it, so I don't know.

Contributor:

I wanted to tag @benedeki and somehow failed.

Contributor:

I think the others are always added, but these ones are added only if they are set.

Contributor:

These are the only reader options provided to additional info.
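
(Sketched, for the distinction discussed above; "raw_format" is taken from the _INFO diff, while cmd.rawFormat is an illustrative field name:)

// unconditional entries are always written to additionalInfo:
Atum.setAdditionalInfo("raw_format" -> cmd.rawFormat)
// Option-backed reader options are written only when defined
// (e.g. only XML jobs carry a row tag):
cmd.rowTag.foreach(rowTag => Atum.setAdditionalInfo("xml_row_tag" -> rowTag))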

@dk1844 removed the 'work in progress' label on Oct 1, 2020

log.info(s"standardization path: ${preparationResult.pathCfg.standardizationPath}")
log.info(s"publish path: ${preparationResult.pathCfg.publishPath}")

// Enable Control Framework
Contributor:

So this is not needed anymore?

@dk1844 (Contributor, Author):

I moved the comment closer to where it makes more sense.

Contributor:

Ah, so this way that import is not needed anymore.


sonarcloud (bot) commented Oct 5, 2020:

Kudos, SonarCloud Quality Gate passed!

Bugs: A (0 bugs)
Vulnerabilities: A (0 vulnerabilities; 0 security hotspots to review)
Code Smells: A (4 code smells)
Coverage: no coverage information
Duplication: 1.0%

@Zejnilovic (Contributor) left a review:

Code reviewed

@AdrianOlosutean (Contributor) left a review:

Code reviewed

@dk1844 merged commit 5dfc40f into aws-poc on Oct 16, 2020
@dk1844 deleted the feature/1498-atum-s3 branch on October 16, 2020 12:20
benedeki added a commit that referenced this pull request Jan 29, 2021
#1422 and #1423 Remove HDFS and Oozie from Menas

#1422 Fix HDFS location validation

#1424 Add Menas Dockerfile

#1416 hadoop-aws 2.8.5 + s3 aws sdk 2.13.65 compiles.

#1416 - enceladus on S3:
* - all directly-hdfs touching stuff disabled (atum, performance measurements, info files, output path checking)
# Add menasfargate into hosts
# paste
# save & exit (ctrl+O, ctrl+X)

#1416 - enceladus on S3 - (crude) conformance works on s3 (s3 std input, s3 conf output)
* Merge spline 0.5.3 into aws-poc
* Update spline to 0.5.4 for AWS PoC

#1503 Remove HDFS url Validation
* New dockerfile - smaller image
* s3 persistence (atum, sdk fs usage, ...) (#1526)

#1526 
* FsUtils divided into LocalFsUtils & HdfsUtils
* PathConfigSuite update
* S3FsUtils with tail-recursive pagination accumulation - now generic with optional short-circuit breakOut
* TestRunnerJob updated to manually cover the cases - should serve as a basis for tests
* HdfsUtils replace by trait DistributedFsUtils (except for MenasCredentials loading & nonSplittable splitting)
* using final version of s3-powered Atum (3.0.0)
* mockito-update version update, scalatest version update
* S3FsUtilsSuite: exists, read, sizeDir (hidden, non-hidden, recursive), non-splittable (simple, recursive with breakOut), delete (recursive), version find (simple - empty, recursive)
* explicit stubbing fix for hyperdrive

#1556 file access PoC using Hadoop FS API (#1586)
* s3 using hadoop fs api
* s3 sdk usage removed (pom, classes, tests)
* atum final version 3.1.0 used
* readStandardizationInputData(... path: String)(implicit ... fs: FileSystem) -> readStandardizationInputData(input: PathWithFs)


#1554 Tomcat with TLS container in Docker container

#1554 Added envoy config + enabling running unencrypted container

#1499 Add authentication to /lineage + update spline to 0.5.5

#1618 - fixes failing spline 0.5.5 integration by providing compatible commons library version. Test-ran on EMR. (#1619)

#1434 Add new way of serving properties to Docker

#1622: Merge of aws-poc to develop branch
* put back HDFS browser
* put back Oozie
* downgraded Spline
* Scopt 4.0.0
* AWS SDK Exclusion
* ATUM version 3.2.2

Co-authored-by: Saša Zejnilović <zejnils@gmail.com>
Co-authored-by: Daniel Kavan <dk1844@gmail.com>
Co-authored-by: Adrian Olosutean <adi.olosutean@gmail.com>
Co-authored-by: Adrian Olosutean <adrian.olosutean@absa.africa>
Co-authored-by: Jan Scherbaum <kmoj02@gmail.com>
@benedeki mentioned this pull request Feb 3, 2021
AdrianOlosutean added a commit that referenced this pull request Apr 12, 2021
* #1422 and #1423 Remove HDFS and Oozie from Menas

* #1422 Fix HDFS location validation

* #1424 Add Menas Dockerfile

* #1416 hadoop-aws 2.8.5 + s3 aws sdk 2.13.65 compiles.

* #1416 - enceladus on S3:

 - all directly-hdfs touching stuff disabled (atum, performance measurements, info files, output path checking)

# Add menasfargate into hosts
sudo nano /etc/hosts
# paste
20.0.63.69 menasfargate
# save & exit (ctrl+O, ctrl+X)

# Running standardization works via:
spark-submit --class za.co.absa.enceladus.standardization.StandardizationJob --conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=http://menasfargate:8080 -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3}" ~/enceladusPoc/spark-jobs-2.11.0-SNAPSHOT.jar --menas-credentials-file ~/enceladusPoc/menas-credentials.properties --dataset-name dk_test1_emr285 --raw-format json --dataset-version 1 --report-date 2019-11-27 --report-version 1 2> ~/enceladusPoc/stderr.txt

* #1416 - enceladus on S3 - (crude) conformance works on s3 (s3 std input, s3 conf output)

 - all directly-hdfs touching stuff disabled (atum, performance measurements, info files, output path checking)

# Add menasfargate into hosts
sudo nano /etc/hosts
# paste
20.0.63.69 menasfargate
# save & exit (ctrl+O, ctrl+X)

# Running conformance works via:
spark-submit --class za.co.absa.enceladus.conformance.DynamicConformanceJob --conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=http://menasfargate:8080 -Dstandardized.hdfs.path=s3://euw1-ctodatadev-dev-bigdatarnd-s3-poc/enceladusPoc/ao-hdfs-data/stdOutput/standardized-{0}-{1}-{2}-{3}" ~/enceladusPoc/spark-jobs-2.11.0-SNAPSHOT.jar --menas-credentials-file ~/enceladusPoc/menas-credentials.properties --dataset-name dk_test1_emr285 --dataset-version 1 --report-date 2019-11-27 --report-version 1 2> ~/enceladusPoc/conf-log.txt

* ref issue = 1416

* related test cases ignored (issue reference added)

* PR updates

* Merge spline 0.5.3 into aws-poc

* Update spline to 0.5.4 for AWS PoC

* #1503 Remove HDFS url Validation

This is a temporary solution. We currently experiment with
many forms of URLs, and having a regex there now slows us down.

* New dockerfile - smaller image

* s3 persistence (atum, sdk fs usage, ...) (#1526)

#1526 
* FsUtils divided into LocalFsUtils & HdfsUtils
* PathConfigSuite update
* S3FsUtils with tail-recursive pagination accumulation - now generic with optional short-circuit breakOut
TestRunnerJob updated to manually cover the cases - should serve as a basis for tests
* HdfsUtils replace by trait DistributedFsUtils (except for MenasCredentials loading & nonSplittable splitting)
* using final version of s3-powered Atum (3.0.0)
* mockito-update version update, scalatest version update
* S3FsUtilsSuite: exists, read, sizeDir (hidden, non-hidden, recursive), non-splittable (simple, recursive with breakOut), delete (recursive), version find (simple - empty, recursive)
* explicit stubbing fix for hyperdrive

* Feature/1556 file access PoC using Hadoop FS API (#1586)

* s3 using hadoop fs api
* s3 sdk usage removed (pom, classes, tests)
* atum final version 3.1.0 used
* readStandardizationInputData(... path: String)(implicit ... fs: FileSystem) -> readStandardizationInputData(input: PathWithFs)

* 1554 Tomcat with TLS in Docker container (#1585)

* #1554 Tomcat with TLS container

* #1554 Added envoy config + enabling running unencrypted container

* #1499 Add authentication to /lineage + update spline to 0.5.5

* #1618 - fixes failing spline 0.5.5 integration by providing compatible commons library version. Test-ran on EMR. (#1619)

* #1612 Separation start

* #1612 Updated DAO for spark-jobs

* #1612 Fixed spline integration and schema, removed redundant code

* #1612 Fixed tests, removed unused dependency

* #1612 Added back dependency

* WIP fixing merge issues

* * Merge compiles
* Tests pass
* Depends on ATUM 3.1.1-SNAPSHOT (the bugfix for AbsaOSS/atum#48)

* #1612 Removed Spring from menas-web, enabled building war and static resources. Removed version subpath in menas-web + added theme dependencies in repo

* #1612 Cookies + updated lineage

* * put back HDFS browser
* put back Oozie
* downgraded Spline

* * AWS SDK Exclusion

* #1612 Included HDFSFolder + missing Oozie parts

* * New ATUM version

* * Adding missing files

* #1612 menas-web on nginx container and passing API_URL

* #1612 Working TLS on nginx, resources not included in code

* 1622: Merge of aws-poc to develop branch
* Addressed issues identified by reviewers

* * comments improvement

* 1434 Add new way of serving properties to Docker

* #1612 Building using ui5 + reused /api route

* #1612 Project version

* #713 Add favicon

* #1612 Merges

* #1612 pom parent version

* #1648 Fix war deployment + adding back spline to menas

* #1612 other fixes

* #1612 added pom package.json version sync

* #1612 newline

* #1612 fix version sync + cleaning dist

* 1648 merge to develop

* 1648 merge fix

* 1648 Fixes schema upload

* 1648 Fixes schema registry request

* 1648 pom version

* 1612 add docker build

* #601 Swagger 2 PoC

* #601 Swagger 2 PoC

* #601 Swagger 2 PoC

* #1648 Updating menas-web to 3.0

* #1612 Updated npm project versions + mvn plugin

* #1612 license_check.yml

* #1612 licence check fix

Co-authored-by: Saša Zejnilović <zejnils@gmail.com>
Co-authored-by: Daniel Kavan <dk1844@gmail.com>
Co-authored-by: Jan Scherbaum <kmoj02@gmail.com>
Co-authored-by: David Benedeki <benedeki@volny.cz>