
#1015 Refactoring Conformance and Standardization #1377

Merged

Conversation

@AdrianOlosutean (Contributor) commented Jun 8, 2020

Closes #1015
Closes #1231
Closes #1232

Main changes:

  1. Extracted common configuration fields into JobCmdConfig, which is contained in StdCmdConfig and ConfCmdConfig
  2. Moved functionality common to Standardization and Conformance into CommonJobExecution
  3. Moved job-specific logic into (Standardization/Conformance)Reader, which reads job-specific configuration, and (Standardization/Conformance)Execution, which contains the execution logic

One open question I am not sure how to tackle: how should the configuration be handled for the future job that runs Standardization and Conformance together?
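To make point 1 concrete, here is a rough sketch of how the composition could look. The field names are only illustrative, taken from the CLI options shown further down in this conversation; they are not the exact case classes in this PR.

// Illustrative sketch only; field names are assumptions based on the CLI options below.
case class JobCmdConfig(
  datasetName: String = "",
  datasetVersion: Int = 1,
  reportDate: String = "",
  reportVersion: Option[Int] = None,
  menasCredentialsFile: Option[String] = None,
  menasAuthKeytab: Option[String] = None
)

// Each job-specific config embeds the common part instead of duplicating its fields.
case class StdCmdConfig(
  jobConfig: JobCmdConfig = JobCmdConfig(),
  rawFormat: String = "csv"
)

case class ConfCmdConfig(
  jobConfig: JobCmdConfig = JobCmdConfig(),
  experimentalMappingRule: Option[Boolean] = None
)

With this shape, CommonJobExecution (point 2) presumably only needs to know about JobCmdConfig, while the (Standardization/Conformance)Execution traits work with their own config types.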

@Zejnilovic (Contributor) left a comment:

Current code is not runnable. CmdConfig cannot be extended like this (see this comment).

@AdrianOlosutean changed the title from "#1307 Refactoring Conformance and Standardization" to "#1015 Refactoring Conformance and Standardization" on Jun 10, 2020
@AdrianOlosutean marked this pull request as ready for review on June 15, 2020 11:12
@yruslan (Contributor) left a comment:

Looks good. Let's try this on the cluster.

Would the program still complain if an invalid argument is passed?


import scala.collection.immutable.HashMap

class StandardizationReader(log: Logger) {
Contributor:

Maybe get a logger from the factory?

private val log = LoggerFactory.getLogger(this.getClass)

Then we can convert this class to an object.

Contributor (Author):

It can be an object, yes. I wanted it to be consistent with ConformanceReader, which I think should be a class.

Contributor:

Yeah, it makes sense to leave it as a class. But what about the logger creation?

Contributor:

I am with @yruslan on this one. I cannot wrap my head around having to pass a logger to this class.
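A minimal sketch of what this thread converges on: keep StandardizationReader a class, but let it create its own logger via the factory instead of receiving one through the constructor.

import org.slf4j.{Logger, LoggerFactory}

// The logger is created via the factory, so nothing needs to be passed in.
class StandardizationReader {
  private val log: Logger = LoggerFactory.getLogger(this.getClass)

  // ... reading of Standardization-specific configuration goes here
}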

@Zejnilovic (Contributor) left a comment:

Notice --dataset-versiona vs --dataset-version: I mistyped one parameter. An error was printed, but the job did not die on it; it only failed later, on authentication.

spark-submit \
--class za.co.absa.enceladus.standardization.StandardizationJob \
spark-jobs-2.7.0-SNAPSHOT.jar \
--menas-credentials-file menas-credential.properties \
--dataset-name SmallCSV \
--dataset-versiona 10 \
--report-date 2020-01-01 \
--report-version 1 \
--raw-format csv --header true
Error: Missing option --dataset-version
Usage: spark-submit [spark options] StandardizationBundle.jar [options]

  -D, --dataset-name <value>
                           Dataset name
  -d, --dataset-version <value>
                           Dataset version
  -R, --report-date <value>
                           Report date in 'yyyy-MM-dd' format
  -r, --report-version <value>
                           Report version. If not provided, it is inferred based on the publish path (it's an EXPERIMENTAL feature)
  --menas-auth-keytab <value>
                           Path to keytab file used for authenticating to menas
  --performance-file <value>
                           Produce a performance metrics file at the given location (local filesystem)
  --folder-prefix <value>  Adds a folder prefix before the infoDateColumn
  --persist-storage-level <value>
                           Specifies persistence storage level to use when processing data. Spark's default is MEMORY_AND_DISK.
20/06/15 20:21:32 INFO StandardizationJob$: Enceladus version 2.7.0-SNAPSHOT
20/06/15 20:21:32 INFO SparkContext: Running Spark version 2.4.4
20/06/15 20:21:32 INFO SparkContext: Submitted application: Standardisation 2.7.0-SNAPSHOT  1
20/06/15 20:21:32 INFO SecurityManager: Changing view acls to: XXX
20/06/15 20:21:32 INFO SecurityManager: Changing modify acls to: XXX
20/06/15 20:21:32 INFO SecurityManager: Changing view acls groups to:
20/06/15 20:21:32 INFO SecurityManager: Changing modify acls groups to:
...
20/06/15 20:21:34 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
Exception in thread "main" za.co.absa.enceladus.dao.UnauthorizedException: No Menas credentials provided
	at za.co.absa.enceladus.dao.rest.AuthClient$.apply(AuthClient.scala:32)
	at za.co.absa.enceladus.dao.rest.RestDaoFactory$.getInstance(RestDaoFactory.scala:26)
	at za.co.absa.enceladus.standardization.StandardizationJob$.main(StandardizationJob.scala:47)
	at za.co.absa.enceladus.standardization.StandardizationJob.main(StandardizationJob.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
20/06/15 20:21:37 INFO SparkContext: Invoking stop() from shutdown hook
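For illustration, one way to make a mistyped option fatal with scopt 4 is to treat a failed parse as a hard stop before any Menas call is attempted. This is only a sketch with a hypothetical, minimal option set, not the PR's actual parser.

import scopt.OParser

final case class Cmd(datasetName: String = "", datasetVersion: Int = 1)

object FailFastParsing {
  private val builder = OParser.builder[Cmd]
  private val parser = {
    import builder._
    OParser.sequence(
      programName("StandardizationBundle.jar"),
      opt[String]('D', "dataset-name").required().action((v, c) => c.copy(datasetName = v)),
      opt[Int]('d', "dataset-version").required().action((v, c) => c.copy(datasetVersion = v))
    )
  }

  // scopt has already printed the error and usage text at this point;
  // the job just has to stop instead of carrying on to authentication.
  def parse(args: Array[String]): Cmd =
    OParser.parse(parser, args, Cmd()).getOrElse(sys.exit(1))
}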

@Zejnilovic (Contributor) left a comment:

Here is the first part. I like the usage of scopt.

// Mutually exclusive parameters don't work in scopt 4
if (args.contains("--menas-credentials-file") && args.contains("--menas-auth-keytab")) {
  println("ERROR: Only one authentication method is allowed at a time")
  System.exit(1)
}
Contributor:

Throw some exception or return a Try[ConformanceCmdConfig] (that was a CQC consensus for this in Herme)
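A sketch of that suggestion, written generically so the same check could serve either config type; everything except the argument check itself is an assumption, not the PR's code.

import scala.util.{Failure, Success, Try}

// Surface the conflict as a Failure instead of printing and calling System.exit,
// so the caller decides whether to abort, log, or recover.
def checkAuthMethods[C](args: Array[String], parsed: C): Try[C] =
  if (args.contains("--menas-credentials-file") && args.contains("--menas-auth-keytab")) {
    Failure(new IllegalArgumentException("Only one authentication method is allowed at a time"))
  } else {
    Success(parsed)
  }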

Contributor (Author):

I would do this in #1392.

Contributor:

ok

Collaborator:

Sorry, why only there? This is a new class. I don't think it adds much to do it right from the start.

Contributor:

@benedeki sorry, is this still valid?

@Zejnilovic (Contributor) left a comment:

Code reviewed. Nice work. I have mostly small comments.

@yruslan (Contributor) left a comment:

Looked through the changes in the last commits. Looks good, if we are fine with this approach in general.

I had a couple of doubts about how this approach will work for the new job that combines both Standardization and Conformance. But after a discussion with @AdrianOlosutean, I can see it should work.

dk1844 previously approved these changes Jul 7, 2020
@dk1844 (Contributor) left a comment:

(read the code, checked out, built, did not run)

val enabled = cmdParameterOpt match {
  case Some(b) => b
  case None =>
    if (conf.hasPath(configKey)) {
Contributor:

I think the current version of the ConfigReader.readStringConfig can be reused here.


private def isExperimentalRuleEnabled()(implicit cmd: ConformanceConfig): Boolean = {
  val enabled = getCmdOrConfigBoolean(cmd.experimentalMappingRule,
    "conformance.mapping.rule.experimental.implementation",
Contributor:

How about extracting these keys into some constants? (And the same below.)
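One possible shape for that, only as a sketch: the key is copied from the snippet above, while the object name is made up.

object ConformanceConfigKeys {
  // Collecting the Typesafe Config keys in one place avoids repeating string literals.
  val ExperimentalMappingRule: String = "conformance.mapping.rule.experimental.implementation"
}

// The call above would then read:
// getCmdOrConfigBoolean(cmd.experimentalMappingRule, ConformanceConfigKeys.ExperimentalMappingRule, ...)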

@@ -159,7 +159,7 @@
   <mongo.java.driver.version>3.6.4</mongo.java.driver.version>
   <mockito.version>2.10.0</mockito.version>
   <spark.xml.version>0.5.0</spark.xml.version>
-  <scopt.version>3.7.0</scopt.version>
+  <scopt.version>4.0.0-RC2</scopt.version>
Contributor:

BTW, are any of the v4+ features actually being used?

Contributor (Author):

Yes, indeed: the feature for combining parsers.
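For context, a minimal sketch of that scopt 4 feature: OParser values compose, so a parser for the shared options can be built once and combined with each job-specific parser. The option set and names here are illustrative, not the PR's full parser.

import scopt.OParser

final case class StdConfig(datasetName: String = "", rawFormat: String = "")

object StdParsers {
  private val builder = OParser.builder[StdConfig]
  import builder._

  // Options shared by Standardization and Conformance.
  private val jobParser = OParser.sequence(
    opt[String]('D', "dataset-name").required().action((v, c) => c.copy(datasetName = v))
  )

  // Options specific to Standardization.
  private val standardizationParser = OParser.sequence(
    opt[String]("raw-format").required().action((v, c) => c.copy(rawFormat = v))
  )

  // The combined parser accepts both groups of options.
  val combined = OParser.sequence(jobParser, standardizationParser)
}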

…ct-common-standardization-conformance

# Conflicts:
#	spark-jobs/src/main/scala/za/co/absa/enceladus/conformance/DynamicConformanceJob.scala
#	spark-jobs/src/main/scala/za/co/absa/enceladus/standardization/StandardizationJob.scala
private val confReader: ConfigReader = new ConfigReader(conf)
private val menasBaseUrls = MenasConnectionStringParser.parse(conf.getString("menas.rest.uri"))
private final val SparkCSVReaderMaxColumnsDefault: Int = 20480
object StandardizationJob extends StandardizationExecution {
Contributor:

I am trying to run the PR, but there is a problem with running Standardization:

20/07/13 13:17:54 INFO core.SparkLineageInitializer$: Spline successfully initialized. Spark Lineage tracking is ENABLED.
Exception in thread "main" java.lang.IllegalStateException: Control framework tracking is not initialized.
	at za.co.absa.atum.core.Atum$.preventNotInitialized(Atum.scala:228)
	at za.co.absa.atum.core.Atum$.setAllowUnpersistOldDatasets(Atum.scala:122)
	at za.co.absa.enceladus.common.CommonJobExecution$class.prepareJob(CommonJobExecution.scala:100)
	at za.co.absa.enceladus.standardization.StandardizationJob$.prepareJob(StandardizationJob.scala:26)
	at za.co.absa.enceladus.standardization.StandardizationJob$.main(StandardizationJob.scala:38)
	at za.co.absa.enceladus.standardization.StandardizationJob.main(StandardizationJob.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

Contributor (Author):

True, found the missing line.

fsUtils: FileSystemVersionUtils,
spark: SparkSession): StructType = {
// Enable Menas plugin for Control Framework
MenasPlugin.enableMenas(
Contributor:

Another problem with the Standardization job.

20/07/13 14:27:12 INFO core.SparkLineageInitializer$: Spline successfully initialized. Spark Lineage tracking is ENABLED.
Exception in thread "main" java.lang.NullPointerException
	at za.co.absa.atum.core.Atum$.addEventListener(Atum.scala:223)
	at za.co.absa.atum.plugins.PluginManager$.loadPlugin(PluginManager.scala:22)
	at za.co.absa.enceladus.common.plugin.menas.MenasPlugin$.enableMenas(MenasPlugin.scala:55)
	at za.co.absa.enceladus.standardization.StandardizationExecution$class.prepareStandardization(StandardizationExecution.scala:56)
	at za.co.absa.enceladus.standardization.StandardizationJob$.prepareStandardization(StandardizationJob.scala:26)
	at za.co.absa.enceladus.standardization.StandardizationJob$.main(StandardizationJob.scala:39)
	at za.co.absa.enceladus.standardization.StandardizationJob.main(StandardizationJob.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

HuvarVer previously approved these changes Jul 14, 2020
@HuvarVer (Contributor) left a comment:

Pulled, built, ran, and ran the e2e tests.
Now it looks like it works correctly.

Zejnilovic previously approved these changes Jul 14, 2020
@Zejnilovic (Contributor) left a comment:

Comments implemented

@benedeki (Collaborator) left a comment:

Just a few less important remarks.

import za.co.absa.enceladus.utils.config.ConfigUtils.ConfigImplicits
import za.co.absa.enceladus.conformance.interpreter.{FeatureSwitches, ThreeStateSwitch}

class PropertiesProvider {
Collaborator:

What's the purpose of this class?

Contributor (Author):

It reads the properties from the configuration.

Collaborator:

With that I meant: maybe add a short comment on the purpose of the class (not to be confused with ConformanceConfig, for example) 😉
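Something along these lines, for example; the wording is only a suggestion based on the keys and imports visible in this PR.

/**
  * Reads Conformance-related settings (feature switches, autoclean options, ...)
  * from the application configuration.
  * Not to be confused with ConformanceConfig, which holds the parsed command-line arguments.
  */
class PropertiesProvider {
  // ...
}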

private val log: Logger = LoggerFactory.getLogger(this.getClass)
private implicit val conf: Config = ConfigFactory.load()

private val standardizedHdfsFolderKey = "conformance.autoclean.standardized.hdfs.folder"
Collaborator:

IMHO these private vals belong in the companion object.
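A sketch of that move, assuming the surrounding class is the PropertiesProvider shown above; the method name is made up, purely to show where the constant ends up being used.

import com.typesafe.config.{Config, ConfigFactory}
import org.slf4j.{Logger, LoggerFactory}

object PropertiesProvider {
  // Constants that do not depend on instance state can live in the companion object.
  private val StandardizedHdfsFolderKey = "conformance.autoclean.standardized.hdfs.folder"
}

class PropertiesProvider {
  private val log: Logger = LoggerFactory.getLogger(this.getClass)
  private implicit val conf: Config = ConfigFactory.load()

  // Instance code then refers to the constant through the companion object.
  def hasStandardizedHdfsFolderSetting: Boolean =
    conf.hasPath(PropertiesProvider.StandardizedHdfsFolderKey)
}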


import scala.collection.immutable.HashMap

class PropertiesProvider {
Collaborator:

Same thing.
It's also confusing to have two classes with the same name, just a different path and similar function, but no relationship.

Contributor (Author):

They used to be called Readers, and it seems they can have the same name in different packages because each is only used within its own package.

Collaborator:

I know it does work, but it's still confusing IMHO. 🤔

@AdrianOlosutean dismissed stale reviews from Zejnilovic and HuvarVer via a0a2be5 on July 14, 2020 15:12

sonarcloud bot commented Jul 14, 2020

Kudos, SonarCloud Quality Gate passed!

Bugs: A (0 bugs)
Vulnerabilities: A (0 vulnerabilities, 0 security hotspots to review)
Code smells: A (16 code smells)

No coverage information
0.0% duplication

@AdrianOlosutean merged commit f925efe into develop on Jul 14, 2020
@AdrianOlosutean deleted the feature/1015-extract-common-standardization-conformance branch on July 14, 2020 15:58