Feature/1733 lineage dumper #1742

Closed
wants to merge 26 commits into from


dk1844 (Contributor) commented Mar 29, 2021

Blocked for the time being by the SNAPSHOT versions of the Spline agent and Absa commons.

Using a SNAPSHOT spline agent from AbsaOSS/spline-spark-agent#197, Enceladus's spark-jobs run is able to produce a _LINEAGE file on S3.

  • To produce a _LINEAGE file, the hdfs dispatcher must be enabled explicitly, e.g. `spark-submit ... --conf "spark.driver.extraJavaOptions=-Dspark.spline.lineageDispatcher=hdfs ..."`.
  • In practice, we do not want to replace the default http dispatcher (which sends data to the Spline server/UI), so the composite dispatcher can be used to get both: `... -Dspark.spline.lineageDispatcher=composite -Dspline.lineageDispatcher.composite.dispatchers=http,hdfs`
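
As a rough sketch, the two setups described above (a single dispatcher versus the composite one) could be assembled by a small shell helper. The function name `spline_dispatcher_opts` is hypothetical and not part of Enceladus or Spline; the property names are copied from the test-run command in this description.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Enceladus or Spline): builds the -D options
# that select the Spline lineage dispatcher(s). The property names are copied
# from the test-run command in this PR description.
spline_dispatcher_opts() {
  if [ "$#" -eq 0 ]; then
    echo "usage: spline_dispatcher_opts DISPATCHER..." >&2
    return 1
  fi
  if [ "$#" -eq 1 ]; then
    # A single dispatcher (e.g. hdfs) replaces the default one.
    printf '%s' "-Dspline.lineageDispatcher=$1"
  else
    # Several dispatchers are combined through the composite dispatcher.
    local joined
    joined=$(IFS=,; printf '%s' "$*")
    printf '%s' "-Dspline.lineageDispatcher=composite -Dspline.lineageDispatcher.composite.dispatchers=$joined"
  fi
}

# Example: keep the default http dispatcher and add hdfs next to it.
JAVA_OPTS="$(spline_dispatcher_opts http hdfs)"
echo "spark.driver.extraJavaOptions=$JAVA_OPTS"
```

The printed value is meant to be passed to `spark-submit` via `--conf`, together with any other `-D` options the job needs.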

Test run

This was built into an Enceladus spark-jobs fat jar and run on EMR for Standardization:

export MENAS_URL="https://menas.reachable.from.emr.example.com"
export SPLINE_URL="https://spline056.compatbile.with060..example.com/producer"

spark-submit --class za.co.absa.enceladus.standardization.StandardizationJob \
--conf "spark.driver.extraJavaOptions=-Dmenas.rest.uri=$MENAS_URL -Dstandardized.hdfs.path=s3://my-example-bucket123/superhero2/std/standardized-{0}-{1}-{2}-{3} -Dspline.producer.url=$SPLINE_URL -Dspline.lineageDispatcher=composite -Dspline.lineageDispatcher.composite.dispatchers=http,hdfs " \
~/enceladus/spark-jobs-2.22.0-SNAPSHOT.jar \
--menas-credentials-file ~/enceladus/svc-aws-credentials.properties \
--dataset-name SuperheroesS3 --dataset-version 1 --report-date 2020-08-06 --report-version 1 \
--raw-format csv --header true

$ aws --profile saml s3 ls --recursive s3://my-example-bucket123/superhero2/std
2021-03-29 12:44:12       2226 superhero2/std/standardized-SuperheroesS3-1-2020-08-06-1/_INFO
2021-03-29 12:44:07     127428 superhero2/std/standardized-SuperheroesS3-1-2020-08-06-1/_LINEAGE
2021-03-29 12:44:01          0 superhero2/std/standardized-SuperheroesS3-1-2020-08-06-1/_SUCCESS
2021-03-29 12:44:01      45177 superhero2/std/standardized-SuperheroesS3-1-2020-08-06-1/part-00000-7f53ff53-8713-4340-943d-af390f32eac9-c000.snappy.parquet

(The values of MENAS_URL, SPLINE_URL, and my-example-bucket123 are placeholders, not the actual values used.)
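
The manual check of the listing could also be scripted. The following is only a sketch: the function name `has_lineage_file` is hypothetical, and it assumes the four-column `aws s3 ls --recursive` layout seen in the test run (date, time, size, key) and keys without spaces.

```shell
#!/usr/bin/env bash
# Hypothetical check (not part of Enceladus): reads `aws s3 ls --recursive`
# output on stdin and succeeds only if it lists a non-empty _LINEAGE file.
# Assumes the columns are date, time, size, key and that keys contain no spaces.
has_lineage_file() {
  awk '$4 ~ /(^|\/)_LINEAGE$/ && $3 > 0 { found = 1 } END { exit !found }'
}

# Sample listing copied from the test run (placeholder bucket contents).
listing='2021-03-29 12:44:12       2226 superhero2/std/standardized-SuperheroesS3-1-2020-08-06-1/_INFO
2021-03-29 12:44:07     127428 superhero2/std/standardized-SuperheroesS3-1-2020-08-06-1/_LINEAGE
2021-03-29 12:44:01          0 superhero2/std/standardized-SuperheroesS3-1-2020-08-06-1/_SUCCESS'

if printf '%s\n' "$listing" | has_lineage_file; then
  echo "_LINEAGE present"
else
  echo "_LINEAGE missing"
fi
```

Against a real bucket, the output of the `aws ... s3 ls --recursive` command from the test run would be piped into `has_lineage_file` instead of the sample text.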

Closes #1733.

GeorgiChochov and others added 22 commits March 19, 2020 23:02
* #1126: Spline 0.4.x in Oozie
* new setting for Spline used by Oozie
* changed Oozie template
* optional spline.mode setting support
* applied some code hints
#1110: Make Standardization and Conformance use Spline 0.4.x
* Change Spline version
* Make **Standardization** and **Conformance** depend on and initialize the new Spline module instead of the old ones
* Replacing Spline properties in templates
* Using a different JSON library (instead of the one transitively pulled in from old Spline)
* Application properties templates adjusted
* Helper scripts updated
* README.md updated (addressing also changes in Menas and Oozie from other PR)
* Missing copyright headers
Co-authored-by: Georgi Chochov <g.chochov@gmail.com>
* Server-side api endpoints to provide new configuration
* Server-side controller to serve the Spline WebJar
* Client-side changes to gather the lineage id and display the lineage in the new UI
* Added dependency to include the webjar
* Header to allow iframe display `<frame-options policy="SAMEORIGIN"/>`
* fixed `style.css`
* Rebasing to current `Develop`
* Unused import removed
…ncies

* Version names sorted alphabetically
# Conflicts (resolved):
#	pom.xml
#	scripts/bash/run_enceladus.sh
#	spark-jobs/pom.xml
* Create license_check.yml
% Conflicts:
%	menas/src/main/scala/za/co/absa/enceladus/menas/MvcConfig.scala
%	menas/src/main/scala/za/co/absa/enceladus/menas/services/RunService.scala
%	menas/src/test/resources/application.properties
%	menas/ui/css/style.css
%	pom.xml
%	spark-jobs/pom.xml
%	spark-jobs/src/main/scala/za/co/absa/enceladus/conformance/DynamicConformanceJob.scala
%	spark-jobs/src/main/scala/za/co/absa/enceladus/standardization/StandardizationJob.scala
%	spark-jobs/src/test/scala/za/co/absa/enceladus/conformance/interpreter/InterpreterSuite.scala
%	utils/src/test/scala/za/co/absa/enceladus/utils/validation/field/ScalarFieldValidatorSuite.scala
…za/co/absa/commons/config/ConfigurationImplicits$ConfigurationRequiredWrapper$)
* Add e2e tests into examples

* Fix style

* Examples - Fix comments

* Apply suggestions from code review

Co-authored-by: veprokendlo <68275331+veprokendlo@users.noreply.github.com>

* Update vars in test JSONs and other small fixes

* new test for MappingCR and other small fixes

* repair key address

* typo remove

* add recordId conf into test JSONs and update readme

Co-authored-by: veprokendlo <68275331+veprokendlo@users.noreply.github.com>
Co-authored-by: Veronika Huvarova <vhuvarova@gmail.com>
Co-authored-by: Veronika <56251097+HuvarVer@users.noreply.github.com>

sonarqubecloud bot commented Apr 8, 2021

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 2 (rating A)

Coverage: no coverage information
Duplication: 0.0%

@dk1844 dk1844 changed the base branch from develop to develop-ver-3.0 April 13, 2021 11:39
dk1844 (Contributor, Author) commented May 4, 2021

Closed in favor of #1766

@dk1844 dk1844 closed this May 4, 2021
@dk1844 dk1844 deleted the feature/1733-lineage-dumper branch March 23, 2022 10:26
Development

Successfully merging this pull request may close these issues.

Spline _LINEAGE dumper to work on S3
4 participants