
Conversation

@s1ck (Contributor) commented Apr 29, 2019

What changes were proposed in this pull request?

This PR introduces the necessary Maven modules for the new Spark Graph (https://issues.apache.org/jira/browse/SPARK-25994) feature for Spark 3.0.

  • spark-graph is a parent module that users depend on to get all graph functionalities (Cypher and Graph Algorithms)
  • spark-graph-api defines the Property Graph API (https://docs.google.com/document/d/1Wxzghj0PvpOVu7XD1iA8uonRYhexwn18utdcTxtkxlI) that is shared between Cypher and Algorithms
  • spark-cypher contains a Cypher query engine implementation

Both spark-graph-api and spark-cypher depend on Spark SQL.
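For illustration, once these artifacts are published, a user would pull in the full graph functionality with a single dependency along these lines (a sketch; the coordinates follow the module names and version shown in this PR):

```
<!-- Sketch: one entry point for Spark Graph (Cypher and Graph Algorithms). -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-graph_2.12</artifactId>
  <version>3.0.0-SNAPSHOT</version>
</dependency>
```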

Note that the Maven module for Graph Algorithms is not part of this PR; it will be introduced in https://issues.apache.org/jira/browse/SPARK-27302

A PoC for a running Cypher implementation can be found in this WIP PR #24297

How was this patch tested?

Passed Jenkins with all profiles; also built manually and checked the following:

```
$ ls assembly/target/scala-2.12/jars/spark-cypher*
assembly/target/scala-2.12/jars/spark-cypher_2.12-3.0.0-SNAPSHOT.jar

$ ls assembly/target/scala-2.12/jars/spark-graph* | grep -v graphx
assembly/target/scala-2.12/jars/spark-graph-api_2.12-3.0.0-SNAPSHOT.jar
assembly/target/scala-2.12/jars/spark-graph_2.12-3.0.0-SNAPSHOT.jar
```

@s1ck (Contributor, Author) commented Apr 29, 2019

@mengxr it would be great if you could have a look.

@dongjoon-hyun (Member) commented

ok to test

SparkQA commented Apr 30, 2019

Test build #105018 has finished for PR 24490 at commit 02e803a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DummySuite extends SparkFunSuite

SparkQA commented Apr 30, 2019

Test build #105019 has finished for PR 24490 at commit 5708086.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment

Hi, @s1ck. This PR doesn't include the dependency changes. I opened a PR against your branch. Could you review and merge that?

@dongjoon-hyun changed the title from "[SPARK-27300][GRAPH] Add Spark Graph modules and dependencies" to "[SPARK-27300][GRAPH][test-maven] Add Spark Graph modules and dependencies" on Apr 30, 2019
@dongjoon-hyun (Member) commented

For now, I added [test-maven] tag to the title in order to trigger Maven build.
To move forward, we also need to update the SBT build file.

@dongjoon-hyun (Member) commented

Retest this please.

@dongjoon-hyun (Member) commented Apr 30, 2019

@s1ck. In general, this PR seems to introduce many transitive dependencies. I'm wondering if we can reduce some of these transitive dependencies by adding exclusion rules?

cc @mengxr
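For illustration, an exclusion rule of the kind suggested here would live in the spark-cypher pom and look roughly like this (a sketch; the okapi artifact name, the version property, and the chosen exclusion are assumptions, not the actual pom):

```
<!-- Sketch: trimming one transitive dependency of the Cypher engine.
     Artifact names and the version property are illustrative. -->
<dependency>
  <groupId>org.opencypher</groupId>
  <artifactId>okapi-relational_2.12</artifactId>
  <version>${okapi.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-compiler</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Anything excluded this way would have to be supplied by the user explicitly if the engine actually needs it at runtime, which is the trade-off discussed below.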

@s1ck (Contributor, Author) replied

Those dependencies are part of the OKAPI stack, which is responsible for translating Cypher into relational operations. I understand that adding a lot of dependencies in a single PR is bad practice. But wouldn't excluding them basically require the user to add explicit dependencies instead? If so, that doesn't sound like an elegant solution to me.

@dongjoon-hyun (Member) commented Apr 30, 2019

@s1ck. The alternative in my mind is making these into external modules, like external/avro and external/kafka-0-10.

Hi, @rxin , @mengxr , @gatorsmile . Can we make new Graph modules as external modules instead? For now, this PR looks too intrusive to me.

A commenter replied:

@dongjoon-hyun This module adds a significant feature in a major Spark release. Isn't that a reason for a more relaxed qualification of dependent libraries?

A Member replied:

Kafka and Avro also do.

@s1ck (Contributor, Author) commented Apr 30, 2019

@dongjoon-hyun What are the necessary steps to update the SBT build file?

@dongjoon-hyun (Member) commented

For the Scala build, we need to update project/SparkBuild.scala and related files. Before updating the project directory, we need to decide on the Maven project structure first: external or not. Let's wait for the PMC's decision.

SparkQA commented Apr 30, 2019

Test build #105043 has finished for PR 24490 at commit b65a7b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 1, 2019

Test build #105047 has finished for PR 24490 at commit b65a7b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor) commented May 2, 2019

@dongjoon-hyun I'm working with @s1ck to create a shaded jar of okapi-relational with its transitive dependencies relocated, so the graph component only has this one external dependency. The size is about 18 MB.

@dongjoon-hyun (Member) commented

That's nice. Thank you for updating, @mengxr!

@mengxr (Contributor) commented May 2, 2019

Here is a flattened list of the runtime transitive dependencies of okapi-relational:

com.lihaoyi:fastparse_2.12:
com.lihaoyi:sourcecode_2.12:
com.lihaoyi:ujson_2.12:
com.lihaoyi:upack_2.12:
com.lihaoyi:upickle-core_2.12:
com.lihaoyi:upickle-implicits_2.12:
com.lihaoyi:upickle_2.12:
org.apache.commons:commons-lang3:
org.apache.logging.log4j:log4j-api-scala_2.12:
org.apache.logging.log4j:log4j-api:
org.atnos:eff_2.12:
org.opencypher:ast-9.0:
org.opencypher:expressions-9.0:
org.opencypher:front-end-9.0:
org.opencypher:parser-9.0:
org.opencypher:rewriting-9.0:
org.opencypher:util-9.0:
org.parboiled:parboiled-core:
org.parboiled:parboiled-scala_2.12:
org.scala-lang.modules:scala-xml_2.12:
org.scala-lang:scala-compiler:
org.scala-lang:scala-library:
org.scala-lang:scala-reflect:
org.typelevel:cats-core_2.12:
org.typelevel:cats-kernel_2.12:
org.typelevel:cats-macros_2.12:
org.typelevel:machinist_2.12:

All are licensed under either Apache or MIT. Some test packages got pulled into the runtime scope; I submitted a clean-up PR here: opencypher/morpheus#907

@dongjoon-hyun (Member) commented

Hi, @s1ck. Could you update this PR once the dependencies are reduced?

@s1ck (Contributor, Author) commented May 15, 2019

@dongjoon-hyun We are currently looking into an issue around shading Scala libs; @mengxr described it here: https://contributors.scala-lang.org/t/tools-for-shading-a-scala-library/3317
This blocks us from introducing just one dependency that contains the needed Scala libs.

Do you have experience in shading Scala libs without running into downstream issues?

@felixcheung (Member) left a comment

At the least, the bundled dependencies would need to be listed in the license.

@s1ck (Contributor, Author) commented May 28, 2019

@dongjoon-hyun We have successfully reduced the dependencies by using a shaded jar that includes the relocated dependencies. Please have a second look.
@felixcheung The META-INF folder within the shaded jar lists all licenses (Apache 2.0 / MIT) and the libs that use them. Are there specific requirements for where the license list should be stored?
@mengxr The shaded jar has been produced solely with Gradle Shadow; we did not fix annotations with ScalaShade or any other tooling. Instead, we removed leaking references to relocated libraries from our APIs, which allows us to compile and run the spark-cypher tests successfully. I can summarize the relocation story in another PR comment, but would not involve the mailing list, as this is basically unblocked now.
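For readers who want to reproduce the relocation with Maven rather than Gradle Shadow, the equivalent maven-shade-plugin configuration would look roughly like this (a sketch; the shaded package prefix is an assumption, not the name used in the published jar):

```
<!-- Sketch: relocating the opencypher packages inside a shaded jar.
     The target package prefix is illustrative. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>org.opencypher</pattern>
        <shadedPattern>org.apache.spark.graph.shaded.org.opencypher</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```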

@s1ck (Contributor, Author) commented May 28, 2019

> For the Scala build, we need to update project/SparkBuild.scala and related files. Before updating the project directory, we need to decide on the Maven project structure first: external or not. Let's wait for the PMC's decision.

I updated project/SparkBuild.scala and can successfully run sbt compile. @dongjoon-hyun which additional locations do we need to update?
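For context, the sbt side of the change amounts to registering the new modules alongside the existing ones in project/SparkBuild.scala, roughly like this (a sketch; the surrounding project list is abbreviated and the exact project identifiers are assumptions, not the merged diff):

```
// project/SparkBuild.scala (sketch): the new modules join the list of
// sub-projects that sbt resolves from the corresponding Maven modules.
// The elided entries and exact identifiers are illustrative.
val allProjects@Seq(
  core, graphx, mllib, /* ...existing projects... */ graphApi, cypher, graph
) = Seq(
  "core", "graphx", "mllib", /* ...existing projects... */
  "graph-api", "cypher", "graph"
).map(ProjectRef(buildLocation, _))
```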

SparkQA commented May 28, 2019

Test build #105860 has finished for PR 24490 at commit 3af6ce7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 28, 2019

Test build #105862 has finished for PR 24490 at commit bf7db46.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Mats-SX commented May 28, 2019

Looks like the unit tests are failing in the Hive Thrift Server module. I'm not sure this PR affects that module.

@s1ck (Contributor, Author) commented May 31, 2019

@dongjoon-hyun Could you please have another look at the PR? Thanks!

@dongjoon-hyun (Member) commented

Retest this please.

@dongjoon-hyun changed the title from "[SPARK-27300][GRAPH][test-maven] Add Spark Graph modules and dependencies" to "[SPARK-27300][GRAPH] Add Spark Graph modules and dependencies" on Jun 2, 2019
@dongjoon-hyun (Member) commented

Retest this please.

SparkQA commented Jun 6, 2019

Test build #106237 has finished for PR 24490 at commit 47bd5bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@s1ck (Contributor, Author) commented Jun 6, 2019

@zsxwing Thanks! That fixed the build.

@srowen @dongjoon-hyun Could you please have another look and finish your reviews? Thanks.

@dongjoon-hyun (Member) commented

Retest this please.

@dongjoon-hyun changed the title from "[SPARK-27300][GRAPH] Add Spark Graph modules and dependencies" to "[SPARK-27300][GRAPH][test-maven] Add Spark Graph modules and dependencies" on Jun 9, 2019
@dongjoon-hyun (Member) commented

Retest this please.

@dongjoon-hyun changed the title from "[SPARK-27300][GRAPH][test-maven] Add Spark Graph modules and dependencies" to "[SPARK-27300][GRAPH][test-maven][test-hadoop3.2] Add Spark Graph modules and dependencies" on Jun 9, 2019
@dongjoon-hyun (Member) commented

Retest this please.

@dongjoon-hyun changed the title from "[SPARK-27300][GRAPH][test-maven][test-hadoop3.2] Add Spark Graph modules and dependencies" to "[SPARK-27300][GRAPH] Add Spark Graph modules and dependencies" on Jun 9, 2019
SparkQA commented Jun 9, 2019

Test build #106315 has finished for PR 24490 at commit 47bd5bc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented

Retest this please.

SparkQA commented Jun 9, 2019

Test build #106321 has finished for PR 24490 at commit 47bd5bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 9, 2019

Test build #106316 has finished for PR 24490 at commit 47bd5bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 9, 2019

Test build #106317 has finished for PR 24490 at commit 47bd5bc.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you so much for preparing this over such a long time, @s1ck.

Thank you for the reviews, @srowen, @mengxr, @zsxwing, @felixcheung, @wangyum.

I also built and tested the artifact generation locally, and Jenkins tested sbt/hadoop-2.7 and maven/hadoop-2.7. Additionally, maven/hadoop-3.2 was also triggered because it's a running profile on the Jenkins farm. Unfortunately, it stopped at the R tests due to the midnight reset; however, since this PR doesn't deliver the real code yet, the test results look sufficient for verification.

As the project structure for the new Graph component, this PR completes its scope. I'm going to merge this to the master branch for the next steps.

@gatorsmile (Member) commented

Checkstyle checks failed at the following occurrences:

```
[ERROR] Failed to execute goal on project spark-cypher_2.12: Could not resolve dependencies for project org.apache.spark:spark-cypher_2.12:jar:3.0.0-SNAPSHOT: Could not find artifact org.apache.spark:spark-graph-api_2.12:jar:3.0.0-SNAPSHOT in apache.snapshots (https://repository.apache.org/snapshots) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :spark-cypher_2.12
Build step 'Execute shell' marked build as failure
Finished: FAILURE
```

This also breaks the style check. Could one of you investigate it ASAP? Otherwise we might need to revert this PR.

@rxin (Contributor) commented Jun 9, 2019

If you can't figure it out in a few minutes, the right way is to revert, fix, and then re-merge.

@s1ck (Contributor, Author) commented Jun 9, 2019

I can have a look, but I'm also surprised that CI didn't complain before the merge. Any ideas, @gatorsmile?

@s1ck (Contributor, Author) commented Jun 9, 2019

@gatorsmile Where did you find that log output? I can only see one failing test:

org.apache.spark.sql.kafka010.KafkaContinuousSourceSuite.subscribing topic by name from latest offsets (failOnDataLoss: true)

@dongjoon-hyun (Member) commented

We don't have a style issue.

```
$ dev/scalastyle
Scalastyle checks passed.

$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```

I'm going to monitor the ongoing Jenkins builds.

@dongjoon-hyun (Member) commented Jun 9, 2019

@s1ck. You can check the master lint result here.

It's now green. We are okay.

@dongjoon-hyun (Member) commented

The master branch shows green in the relevant Jenkins jobs, and Maven/Hadoop 2.7 also looks good so far. This commit is safe.

The root cause of the problem is not inside the Apache Spark repository: the AMPLab Jenkins script is still using the --force option, which was deprecated in Apache Spark 2.0.0. We should clean that up now; I can volunteer for it if I get access to the repo.

@srowen (Member) commented Jun 9, 2019

@dongjoon-hyun I have admin access. I think @shaneknapp can grant it, and interested committers should probably have it. Are there any more jobs to change? That one looks OK now.

@dongjoon-hyun (Member) commented

Oh, could you give me access too, please? I sent an email to Shane yesterday, but unfortunately he is off until the 13th.

We need to update the old Jenkins script and bring back #24824.

@srowen (Member) commented Jun 10, 2019

I just mean that I think I can modify the Jenkins job configs if that's what's needed. Do you know which ones need modifying, or are they not in the Jenkins job configs in the UI? I don't think I have the ability to grant admin access, or at least I don't know how to do that.

@dongjoon-hyun (Member) commented

Thanks. Then, at least for the three master-branch profiles, could you remove --force from the following MVN declaration? I'm not sure of the script's name, but the following is from the Jenkins output.

```
+ [[ master < branch-1.5 ]]
+ MVN='build/mvn --force -DzincPort=3428'
+ set +e
+ [[ hadoop-2.7 == hadoop-1 ]]
+ build/mvn --force -DzincPort=3428 -DskipTests -Phadoop-2.7 -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl -Pmesos clean package
```

FYI, you can remove --force from all active branches, because --force has been deprecated since 2.0.0.

@srowen (Member) commented Jun 10, 2019

OK, I see it now; I removed this from both of the mvn master builds.

@zsxwing (Member) commented Jun 10, 2019

IIUC, this looks like just the known issue with the mvn checkstyle:check command (called by the lint-java script): it requires that all dependencies of a module (even when those dependencies are just other sub-modules) be resolvable from the local repo (you can run mvn install to install the sub-modules) or from a remote repo.
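In other words, the local workaround is to install the new sub-modules into the local repo before linting, e.g.:

```
$ build/mvn -DskipTests install   # makes spark-graph-api etc. resolvable locally
$ dev/lint-java                   # checkstyle can now resolve the new modules
```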

It's green now because this Jenkins job (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-maven-snapshots/) pushed the artifacts to https://repository.apache.org/content/repositories/snapshots/, so Maven can now download all sub-module artifacts from that repo.

In general, the Java style check may fail when someone adds multiple Maven projects, until the artifacts are pushed to the Apache snapshot repo. But this should be fine, since that's rare and we push snapshots every day.

By the way, I also noticed that the Java style check is enabled in the PR build now:

`if not changed_files or any(f.endswith(".java")`

This issue may therefore happen in a PR build if the PR also adds Java files, which triggers the Java style check. The Java style check then needs to be disabled until the artifacts of this project are pushed to the Apache snapshot repo. I'm not sure where to document this. (I learned about this issue in one of my old PRs, #10744 (comment), and disabled the Java style check in 4bcea1b#diff-9657d18e98a7dc82ca214359cfd6bdc4.)

emanuelebardelli pushed a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019

Closes apache#24490 from s1ck/SPARK-27300.

Lead-authored-by: Martin Junghanns <martin.junghanns@neotechnology.com>
Co-authored-by: Max Kießling <max@kopfueber.org>
Co-authored-by: Martin Junghanns <martin.junghanns@neo4j.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>