
#1894 HadoopFsPersistenceFactory - adding Spline S3 write support #1912

Merged

Conversation

Contributor

@dk1844 dk1844 commented Sep 10, 2021

Problem summary

Vanilla Spline 0.3.9 is not able to write the _LINEAGE file to an S3 location: #1896

Solution

An initial, naive approach sketched S3 support directly into Spline 0.3.x; however, that was undesirable for multiple reasons. Instead, @wajda suggested overriding Spline's default configuration

spline.persistence.composition.factories=za.co.absa.spline.persistence.mongo.MongoPersistenceFactory,za.co.absa.spline.persistence.hdfs.HdfsPersistenceFactory

to use a custom PersistenceFactory instead (note the last entry):

spline.persistence.composition.factories=za.co.absa.spline.persistence.mongo.MongoPersistenceFactory,za.co.absa.enceladus.spline.persistence.HadoopFsPersistenceFactory

With this approach, such a factory can be packaged with Enceladus instead of Spline, and the original Spline 0.3.9 can remain untouched. Thanks, @wajda!

Test run

With all necessary prerequisites provided (Menas with the dataset's conformed publish path located on S3, data, and configs), the test run's output indeed includes the _LINEAGE file in the S3 location:

$ aws --profile saml s3 ls s3://<my-bucket-name-here>/publish/superhero/ --recursive
2021-09-10 08:17:30       4048 publish/superhero/enceladus_info_date=2020-08-06/enceladus_info_version=1/_INFO
2021-09-10 08:17:28      10600 publish/superhero/enceladus_info_date=2020-08-06/enceladus_info_version=1/_LINEAGE
2021-09-10 08:17:28          0 publish/superhero/enceladus_info_date=2020-08-06/enceladus_info_version=1/_SUCCESS
2021-09-10 08:17:28      46426 publish/superhero/enceladus_info_date=2020-08-06/enceladus_info_version=1/part-00000-398b99ca-ee85-4c48-bc75-7120307f73a2-c000.snappy.parquet

Release note suggestion

Enceladus can write Spline's _LINEAGE file in an S3 location.


@AdrianOlosutean AdrianOlosutean left a comment


Code reviewed

@@ -0,0 +1,98 @@
/*
* Copyright 2017 ABSA Group Limited
Contributor


how does this pass the check? Don't we have a rule to say it should be 2018?

Also in JSONSerialization.scala. And HadoopFsPersistenceFactory.scala has a year range.

Collaborator


I guess the apache-rat-plugin matchers don't include the year (which is kind of logical), and it's not possible to specify one. 🤔
On the other hand, I remember Atum failing on a wrong year, but it's not using the plugin.

Contributor Author


Thanks for noticing; it must have happened when I copied some of the files from Spline before changing them. Fixed.

I think the licence check only verifies that there is a "valid" licence; for the ABSA one, it does not check the year.


@wajda wajda Sep 13, 2021


Don't we have a rule to say it should be 2018?

@Zejnilovic what is that rule and why?

@benedeki benedeki added the PR:reviewing Only for PR - PR is being reviewed by somebody; blocks merging label Sep 13, 2021

override def store(lineage: DataLineage)(implicit ec: ExecutionContext): Future[Unit] = Future {
  val pathOption = getPath(lineage)
  import JSONSerialization._
  for (path <- pathOption) {
Collaborator


Ha, interesting construction, a map replacement



foreach replacement to be precise.
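(The point is easy to verify in plain Scala: a for-comprehension over an Option with no yield desugars to foreach, so the body runs once for Some and not at all for None. A minimal standalone sketch; the path value is illustrative:)

```scala
// A for-comprehension without `yield` desugars to foreach, not map.
val pathOption: Option[String] = Some("/publish/superhero/_LINEAGE")

val visited = scala.collection.mutable.ListBuffer.empty[String]

for (path <- pathOption) { // desugars to: pathOption.foreach(path => ...)
  visited += path
}

pathOption.foreach(path => visited += path) // the equivalent explicit form

// Both forms ran once each, so the path was recorded twice;
// with None, neither body would run at all.
```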


private def getPath(lineage: DataLineage): Option[Path] =
  lineage.rootOperation match {
    case dn: Write => Some(new Path(dn.path, fileName))
Collaborator


Could the dn have a more explanatory name?

Contributor Author


Thanks, adjusted.
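(The suggested rename can be illustrated in isolation. A minimal sketch with stand-in types — Operation, Write, Read, and the string-based path handling are simplified placeholders, not Spline's actual model:)

```scala
// Stand-in model, simplified for illustration only
sealed trait Operation
final case class Write(path: String) extends Operation
final case class Read(path: String) extends Operation

val fileName = "_LINEAGE"

// A descriptive binding (`writeOperation`) documents intent better than `dn`
def getPath(rootOperation: Operation): Option[String] =
  rootOperation match {
    case writeOperation: Write => Some(s"${writeOperation.path}/$fileName")
    case _                     => None // only Write operations have an output location
  }
```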

 * @return FS + relative path
 */
def pathStringToFsWithPath(pathString: String): (FileSystem, Path) = {
  pathString.toS3Location match {
Collaborator


This implicit function comes from Atum. That seems like a weird dependency. Wasn't it part of commons too?

Contributor Author


At this point of develop (using Atum 3.3.0), it is still part of Atum. Later, in Atum 3.5+, it moved to commons.

Since this PR does not update Atum, the import change will have to be resolved in the future.
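(Since the implicit comes from Atum here, its exact behaviour is a dependency detail. The general shape of such a string enrichment can be sketched in plain Scala — the names S3Location and StringS3Ops and the URI pattern below are illustrative assumptions, not Atum's actual API:)

```scala
// Hypothetical sketch of an Atum-style `toS3Location` enrichment (names illustrative)
final case class S3Location(bucket: String, key: String)

implicit class StringS3Ops(private val pathString: String) {
  // Matches s3://, s3a:// and s3n:// URIs; anything else (e.g. hdfs://) is not S3
  private val S3Pattern = "^s3[an]?://([^/]+)/(.*)$".r

  def toS3Location: Option[S3Location] = pathString match {
    case S3Pattern(bucket, key) => Some(S3Location(bucket, key))
    case _                      => None
  }
}
```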

@benedeki benedeki removed the PR:reviewing Only for PR - PR is being reviewed by somebody; blocks merging label Sep 13, 2021
@benedeki
Collaborator

Btw, shouldn't the new factory class be added to spline.properties configuration?

…oopFsPersistenceFactory` added to the spline.properties.template
@dk1844
Contributor Author

dk1844 commented Sep 14, 2021

Btw, shouldn't the new factory class be added to spline.properties configuration?

Yes, it must be changed in the properties in order to be used, as I noted in the PR description. However, spline.properties is not part of the repo, so I have reflected the change in spline.properties.template instead. Good idea, prompting me to provide this as a default/template. 👍
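(For reference, the resulting factories line in spline.properties.template, reproducing the override from the PR description; the rest of the template file is not shown:)

```properties
spline.persistence.composition.factories=za.co.absa.spline.persistence.mongo.MongoPersistenceFactory,za.co.absa.enceladus.spline.persistence.HadoopFsPersistenceFactory
```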

 * [[za.co.absa.spline.persistence.hdfs.HdfsPersistenceFactory]].
 */
object HadoopFsPersistenceFactory {
  private val fileNameKey = "spline.hdfs.file.name"

@wajda wajda Sep 14, 2021


I think this was copied from the old Spline. I'm not sure what naming convention for constants is set in Enceladus, but I remember a conversation on CQC some time ago: constant names should generally be written in upper camel case, see https://docs.scala-lang.org/style/naming-conventions.html#constants-values-and-variables

Collaborator


Usually we adhere to capitalised constants too, but we're not very pedantic about it in code reviews. 😉
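(The convention under discussion, applied to the quoted snippet — the object name below is a stand-in for illustration, and `FileNameKey` is the suggested upper-camel-case rename of `fileNameKey`:)

```scala
// UpperCamelCase constant names, per the Scala style guide
// (illustrative object; the real constant lives in HadoopFsPersistenceFactory)
object HadoopFsPersistenceFactoryConstants {
  val FileNameKey = "spline.hdfs.file.name"
}
```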


@AdrianOlosutean AdrianOlosutean left a comment


Code review


@benedeki benedeki left a comment


  • code reviewed
  • pulled
  • built
  • run
    Ran it locally, want to run on AWS too, to verify S3 functionality

* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
Collaborator


For some reason, this empty line makes Scalastyle prevent my build.

@benedeki benedeki added the PR:tested Only for PR - PR was tested by a tester (person) label Sep 17, 2021
@benedeki
Collaborator

✔️ Tested both locally and in AWS. The _LINEAGE file was created on both HDFS and S3.
The requested PR change is not in the code area, so the fix won't affect the test.

…ladus/spline/persistence/HadoopFsDataLineageWriter.scala

Co-authored-by: David Benedeki <14905969+benedeki@users.noreply.github.com>
@sonarcloud

sonarcloud bot commented Sep 17, 2021

Kudos, SonarCloud Quality Gate passed!

Bugs: A (0)
Vulnerabilities: A (0)
Security Hotspots: A (0)
Code Smells: A (0)
Coverage: no coverage information
Duplication: 0.0%

@dk1844 dk1844 merged commit 92d15d4 into develop Sep 17, 2021
@dk1844 dk1844 deleted the feature/1894-s3-support-for-spline0.3-persistence-factory branch September 17, 2021 19:35
dk1844 added a commit that referenced this pull request Nov 4, 2021
* Update for next development version 2.24.0-SNAPSHOT

* Suppress download noise in license check

* Suppress compiler warning of obsolete Java (#1892)

* 1868 statistics with missing counts and datasets missing properties (#1873)

* 1868 statistics with missing counts and datasets missing properties

* 1843 Summary page for properties (#1880)

* 1843 Home page with properties,  side panel with missing counts and summary page for properties with tab containing datasets missing that particular property

* Feature/1603 mapping table filtering general (#1879)

* #1603 serde tests for CR and MT DataFrameFilters
(mongo-bson-based serde tests for CR and MT DataFrameFilters, mongo-bson-based serde tests extended for CR with a blank mappingTableFilter)

* #1909 Increase the limit of columns shown in menas column selection

* 1903 Add validation for complex default values in mapping tables on import

* Project config and management updates (#1908)

Project config and management updates
* poc issue template
* CODEOWNERS update
* developers update
* Badges to README.md

* 1881 HyperConformance enceladus_info_version from payload  (#1896)

1881 HyperConformance enceladus_info_version from payload

* #1887 defaultTimestampTimeZone can be source type specific (#1899)

#1887 defaultTimestampTimeZone can be source type specific
* `DefaultsByFormat` extends the `Defaults` trait, being able to read defaults from configuration files
* `DefaultsByFormat` offers further granularity by first checking the format specific setting only then taking the global one
* Basic `GlobalDefaults` are not configuration dependent anymore
* Standardization now uses `DefaultsByFormat` for its defaults, where rawFormat is used for the format parameter
* Switched to configuration path to be `enceladus.defaultTimestampTimeZone.default` and `enceladus.defaultTimestampTimeZone.[rawFormat]` respectively
* `defaultTimestampTimeZone` is still supported/read as an obsolete fallback
Co-authored-by: Daniel K <dk1844@gmail.com>

* #1887 defaultTimestampTimeZone can be source type specific (#1916)

#1887 defaultTimestampTimeZone can be source type specific
* rename of the configuration prefix from `enceladus.` to `standardization.`

* #172 Save original timezone information in metadata file (#1900)

* Upgrade of Atum to 3.6.0
* Writing the default time zones for timestamps and dates into _INFO file

* #1894 `HadoopFsPersistenceFactory` - adding Spline S3 write support (#1912)

* #1894 Spline S3 support via custom persistence factory `HadoopFsPersistenceFactory`.
Co-authored-by: David Benedeki <14905969+benedeki@users.noreply.github.com>

* Update versions for release v2.24.0

* Update for next development version 2.25.0-SNAPSHOT

* #1926 Add executor extra java opts to helper scripts

* #1931 Add switch for running kinit in helper scripts

* #1882 Update Cobrix dependency to v.2.3.0

* #1882 Remove explicit "collapse_root" since it is the default since Cobrix 2.3.0

* #1882 Update Cobrix to 2.4.1 and update Cobol test suite for ASCII files.

* #1882 Bump up Cobrix version to 2.4.2.

* #1927 Spline _LINEAGE and Atum _INFO files permission alignment (#1934)

* #1927 - testing setup: set both spline _LINEAGE and atum _INFO to hdfs file permissions 733 -> the result on EMR HDFS was 711 (due to 022 umask there) -> evidence of working

* #1927 - cleanup of test settings of 733 fs permissions

* #1927 Atum final version 3.7.0 used instead of the snapshot (same code)

* #1927 comment change

* #1927 - default 644 FS permissions for both _INFO and _LINEAGE files.

* 1937 limit output file size (#1941)

* 1937 limit output file size

* 1937 limit output file size

* 1937 renamings + constants

* 1937 more conditions

* 1937 rename params

* 1937 feedback + script params

* 1937 more feedback

* 1937 final feedback

* #1951: Windows Helper scripts - add missing features
* `ADDITIONAL_JVM_EXECUTOR_CONF`
* Kerberos configuration
* Trust store configuration
* kinit execution option
* `--min-processing-block-size` & `--max-processing-block-size`
* logo improvement

* * --min-processing-block-size -> --min-processing-partition-size
* --max-processing-block-size -> --max-processing-partition-size

* #1869: SparkJobs working with LoadBalanced Menas (#1935)

* `menas.rest.retryCount` - configuration, how many times an url should be retried if failing with retry-able error implemented
* `menas.rest.availability.setup` - configuration, how the url list should be handled
* _Standardization_, _Conformance_ and _HyperConformance_ changed to provide retry count and availability setup to Dao, read from configuration
* `ConfigReader` enhanced and unified to read configurations more easily and universally
* Mockito upgraded to 1.16.42

Co-authored-by: Daniel K <dk1844@gmail.com>

* Feature/1863 mapping table filtering (#1929)

* #1863 mapping CR & MT filter successfully reuses the same fragment (both using the same named model)
 - todo: reuse validation, reuse manipulation methods

* #1863 FilterEdit.js allows reusing filterEdit TreeTable logic between mCR and MT editings

* #1863 mCT editing validation enabled (commons from FilterEdit.js)

* #1863 mCT datatype hinting enabled (commons from DataTypeUtils.js)

* #1863 mCR/MT edit dialog default width=950px, some cleanup
* #1863 bugfixes: directly creating MT with filter (fix on accepting the field), UI fix for MT filter model initialization

* #1863 npm audit fix

* #1863 bugfix: adding new mCR (when no edit MCR dialog has been opened yet) did not work - fixed

* #1863 selecting mapping column from MT schema works (for all schema levels) for edit. TODO = Schema type support

 #1863 mCR - schema-based columns suggested for filter, value types filled in silently during submit, too.

* #1863 bugfix: empty MT - schema may be empty

* #1863 bugfix: removing a filter left a null node - cleanup was needed (otherwise view would fail)
logging cleanup

* #1863 select list item now shows valueType as additionalText, cleanup

* #1863 nonEmptyAndNonNullFilled - map->filter bug fixed.

* #1863 typo for null filter

Co-authored-by: David Benedeki <14905969+benedeki@users.noreply.github.com>

* Update versions for release v2.25.0

* [merge] build fix

* [merge] npm audit fix

* [merge] npm audit fix

* [merge] buildfix (menas->rest_api packaging fix)

* [merge] review updates

Co-authored-by: David Benedeki <benedeki@volny.cz>
Co-authored-by: Saša Zejnilović <zejnils@gmail.com>
Co-authored-by: David Benedeki <14905969+benedeki@users.noreply.github.com>
Co-authored-by: Adrian Olosutean <adi.olosutean@gmail.com>
Co-authored-by: Ruslan Iushchenko <yruslan@gmail.com>
Labels
PR:tested Only for PR - PR was tested by a tester (person)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Spline 0.3 doesn't write lineage file to S3
5 participants