
Conversation

@s1ck (Contributor) commented Apr 29, 2019

What changes were proposed in this pull request?

This PR introduces the necessary Maven modules for the new Spark Graph (https://issues.apache.org/jira/browse/SPARK-25994) feature for Spark 3.0.

  • spark-graph is a parent module that users depend on to get all graph functionalities (Cypher and Graph Algorithms)
  • spark-graph-api defines the Property Graph API (https://docs.google.com/document/d/1Wxzghj0PvpOVu7XD1iA8uonRYhexwn18utdcTxtkxlI) that is shared between Cypher and Algorithms
  • spark-cypher contains a Cypher query engine implementation

Both spark-graph-api and spark-cypher depend on Spark SQL.
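For illustration, once these artifacts are published, a user would pull in the full graph functionality with a single dependency along these lines (a sketch; the coordinates follow the module names and version shown in this PR):

```
<!-- Sketch: one entry point for Spark Graph (Cypher and Graph Algorithms). -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-graph_2.12</artifactId>
  <version>3.0.0-SNAPSHOT</version>
</dependency>
```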

Note that the Maven module for Graph Algorithms is not part of this PR; it will be introduced in https://issues.apache.org/jira/browse/SPARK-27302

A PoC for a running Cypher implementation can be found in this WIP PR #24297

How was this patch tested?

Passed Jenkins with all profiles; also built manually and checked the following:

```
$ ls assembly/target/scala-2.12/jars/spark-cypher*
assembly/target/scala-2.12/jars/spark-cypher_2.12-3.0.0-SNAPSHOT.jar

$ ls assembly/target/scala-2.12/jars/spark-graph* | grep -v graphx
assembly/target/scala-2.12/jars/spark-graph-api_2.12-3.0.0-SNAPSHOT.jar
assembly/target/scala-2.12/jars/spark-graph_2.12-3.0.0-SNAPSHOT.jar
```

@s1ck (Contributor, Author) commented Apr 29, 2019

@mengxr it would be great if you could have a look.

@dongjoon-hyun (Member) commented

ok to test

SparkQA commented Apr 30, 2019

Test build #105018 has finished for PR 24490 at commit 02e803a.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DummySuite extends SparkFunSuite

SparkQA commented Apr 30, 2019

Test build #105019 has finished for PR 24490 at commit 5708086.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment

Hi, @s1ck. This PR doesn't include the dependency changes. I opened a PR against your branch. Could you review and merge that?

@dongjoon-hyun changed the title from "[SPARK-27300][GRAPH] Add Spark Graph modules and dependencies" to "[SPARK-27300][GRAPH][test-maven] Add Spark Graph modules and dependencies" on Apr 30, 2019
@dongjoon-hyun (Member) commented

For now, I added [test-maven] tag to the title in order to trigger Maven build.
To move forward, we also need to update the SBT build file.

@dongjoon-hyun (Member) commented

Retest this please.

@dongjoon-hyun (Member) commented Apr 30, 2019

@s1ck. In general, this PR seems to introduce many transitive dependencies. I'm wondering if we can reduce some of these transitive dependencies by adding exclusion rules?

cc @mengxr
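For illustration, an exclusion rule of the kind suggested here would live in the spark-cypher pom and look roughly like this (a sketch; the okapi artifact name, the version property, and the chosen exclusion are assumptions, not the actual pom):

```
<!-- Sketch: trimming one transitive dependency of the Cypher engine.
     Artifact names and the version property are illustrative. -->
<dependency>
  <groupId>org.opencypher</groupId>
  <artifactId>okapi-relational_2.12</artifactId>
  <version>${okapi.version}</version>
  <exclusions>
    <exclusion>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-compiler</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

Anything excluded this way would have to be supplied by the user explicitly if the engine actually needs it at runtime, which is the trade-off discussed below.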

@s1ck (Contributor, Author) replied

Those dependencies are part of the OKAPI stack, which is responsible for translating Cypher into relational operations. I understand that adding a lot of dependencies in a single PR is bad practice. But wouldn't excluding them basically require the user to add explicit dependencies instead? If so, that doesn't sound like an elegant solution to me.

@dongjoon-hyun (Member) commented Apr 30, 2019

@s1ck. The alternative in my mind is making these into external modules, like external/avro and external/kafka-0-10.

Hi, @rxin , @mengxr , @gatorsmile . Can we make new Graph modules as external modules instead? For now, this PR looks too intrusive to me.

A commenter replied:

@dongjoon-hyun This module adds a significant feature in a major Spark release. Isn't that a reason for a more relaxed qualification of dependent libraries?

A Member replied:

Kafka and Avro also do.

@s1ck (Contributor, Author) commented Apr 30, 2019

@dongjoon-hyun What are the necessary steps to update the SBT build file?

@dongjoon-hyun (Member) commented

For the Scala build, we need to update project/SparkBuild.scala and related files. Before updating the project directory, we need to decide on the Maven project structure first: external or not. Let's wait for the PMC's decision.

SparkQA commented Apr 30, 2019

Test build #105043 has finished for PR 24490 at commit b65a7b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 1, 2019

Test build #105047 has finished for PR 24490 at commit b65a7b0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr (Contributor) commented May 2, 2019

@dongjoon-hyun I'm working with @s1ck to create a shaded jar of okapi-relational with its transitive dependencies relocated, so the graph component only has this one external dependency. The size is about 18 MB.

@dongjoon-hyun (Member) commented

That's nice. Thank you for updating, @mengxr!

@mengxr (Contributor) commented May 2, 2019

Here is a flattened list of the runtime transitive dependencies of okapi-relational:

com.lihaoyi:fastparse_2.12:
com.lihaoyi:sourcecode_2.12:
com.lihaoyi:ujson_2.12:
com.lihaoyi:upack_2.12:
com.lihaoyi:upickle-core_2.12:
com.lihaoyi:upickle-implicits_2.12:
com.lihaoyi:upickle_2.12:
org.apache.commons:commons-lang3:
org.apache.logging.log4j:log4j-api-scala_2.12:
org.apache.logging.log4j:log4j-api:
org.atnos:eff_2.12:
org.opencypher:ast-9.0:
org.opencypher:expressions-9.0:
org.opencypher:front-end-9.0:
org.opencypher:parser-9.0:
org.opencypher:rewriting-9.0:
org.opencypher:util-9.0:
org.parboiled:parboiled-core:
org.parboiled:parboiled-scala_2.12:
org.scala-lang.modules:scala-xml_2.12:
org.scala-lang:scala-compiler:
org.scala-lang:scala-library:
org.scala-lang:scala-reflect:
org.typelevel:cats-core_2.12:
org.typelevel:cats-kernel_2.12:
org.typelevel:cats-macros_2.12:
org.typelevel:machinist_2.12:

All are licensed under either Apache or MIT. Some test packages got pulled into the runtime scope; I submitted a clean-up PR here: opencypher/morpheus#907

@dongjoon-hyun (Member) commented

Hi, @s1ck. Could you update this PR once the dependencies are reduced?

@s1ck (Contributor, Author) commented May 15, 2019

@dongjoon-hyun We are currently looking into an issue around shading Scala libs; @mengxr described it here: https://contributors.scala-lang.org/t/tools-for-shading-a-scala-library/3317
This blocks us from introducing just one dependency that contains the needed Scala libs.

Do you have experience in shading Scala libs without running into downstream issues?

@felixcheung (Member) left a comment

At the least, the bundled dependencies would need to be listed in the license.

@s1ck (Contributor, Author) commented May 28, 2019

@dongjoon-hyun We have successfully reduced the dependencies by using a shaded jar that includes the relocated dependencies. Please have a second look.
@felixcheung The META-INF folder within the shaded jar lists all licenses (Apache 2.0 / MIT) and the libs that use them. Are there specific requirements for where the license list should be stored?
@mengxr The shaded jar has been produced solely with Gradle Shadow; we did not fix annotations with ScalaShade or any other tooling. Instead, we removed leaking references to relocated libraries from our APIs, which allows us to compile and run the spark-cypher tests successfully. I can summarize the relocation story in another PR comment, but would not involve the mailing list, as this is basically unblocked now.
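For readers who want to reproduce the relocation with Maven rather than Gradle Shadow, the equivalent maven-shade-plugin configuration would look roughly like this (a sketch; the shaded package prefix is an assumption, not the name used in the published jar):

```
<!-- Sketch: relocating the opencypher packages inside a shaded jar.
     The target package prefix is illustrative. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>org.opencypher</pattern>
        <shadedPattern>org.apache.spark.graph.shaded.org.opencypher</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```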

@s1ck (Contributor, Author) commented May 28, 2019

> For the Scala build, we need to update project/SparkBuild.scala and related files. Before updating the project directory, we need to decide on the Maven project structure first: external or not. Let's wait for the PMC's decision.

I updated project/SparkBuild.scala and can successfully run sbt compile. @dongjoon-hyun which additional locations do we need to update?
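For context, the sbt side of the change amounts to registering the new modules alongside the existing ones in project/SparkBuild.scala, roughly like this (a sketch; the surrounding project list is abbreviated and the exact project identifiers are assumptions, not the merged diff):

```
// project/SparkBuild.scala (sketch): the new modules join the list of
// sub-projects that sbt resolves from the corresponding Maven modules.
// The elided entries and exact identifiers are illustrative.
val allProjects@Seq(
  core, graphx, mllib, /* ...existing projects... */ graphApi, cypher, graph
) = Seq(
  "core", "graphx", "mllib", /* ...existing projects... */
  "graph-api", "cypher", "graph"
).map(ProjectRef(buildLocation, _))
```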

SparkQA commented May 28, 2019

Test build #105860 has finished for PR 24490 at commit 3af6ce7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 28, 2019

Test build #105862 has finished for PR 24490 at commit bf7db46.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@Mats-SX commented May 28, 2019

Looks like the unit tests are failing in the Hive Thrift Server module. I'm not sure this PR affects that module.

@s1ck (Contributor, Author) commented May 31, 2019

@dongjoon-hyun Could you please have another look at the PR? Thanks!

@dongjoon-hyun (Member) commented

Retest this please.

@dongjoon-hyun changed the title from "[SPARK-27300][GRAPH][test-maven] Add Spark Graph modules and dependencies" to "[SPARK-27300][GRAPH] Add Spark Graph modules and dependencies" on Jun 2, 2019
@dongjoon-hyun (Member) commented

Retest this please.

SparkQA commented Jun 6, 2019

Test build #106237 has finished for PR 24490 at commit 47bd5bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@s1ck (Contributor, Author) commented Jun 6, 2019

@zsxwing Thanks! That fixed the build.

@srowen @dongjoon-hyun Could you please have another look and finish your reviews? Thanks.

@dongjoon-hyun (Member) commented

Retest this please.

@dongjoon-hyun changed the title from "[SPARK-27300][GRAPH] Add Spark Graph modules and dependencies" to "[SPARK-27300][GRAPH][test-maven] Add Spark Graph modules and dependencies" on Jun 9, 2019
@dongjoon-hyun (Member) commented

Retest this please.

@dongjoon-hyun changed the title from "[SPARK-27300][GRAPH][test-maven] Add Spark Graph modules and dependencies" to "[SPARK-27300][GRAPH][test-maven][test-hadoop3.2] Add Spark Graph modules and dependencies" on Jun 9, 2019
@dongjoon-hyun (Member) commented

Retest this please.

@dongjoon-hyun changed the title from "[SPARK-27300][GRAPH][test-maven][test-hadoop3.2] Add Spark Graph modules and dependencies" to "[SPARK-27300][GRAPH] Add Spark Graph modules and dependencies" on Jun 9, 2019
SparkQA commented Jun 9, 2019

Test build #106315 has finished for PR 24490 at commit 47bd5bc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) commented

Retest this please.

SparkQA commented Jun 9, 2019

Test build #106321 has finished for PR 24490 at commit 47bd5bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 9, 2019

Test build #106316 has finished for PR 24490 at commit 47bd5bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 9, 2019

Test build #106317 has finished for PR 24490 at commit 47bd5bc.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you so much for preparing this over such a long time, @s1ck.

Thank you for the reviews, @srowen, @mengxr, @zsxwing, @felixcheung, @wangyum.

I also built and tested the artifact generation locally, and Jenkins tested sbt/hadoop-2.7 and maven/hadoop-2.7. Additionally, maven/hadoop-3.2 was also triggered because it's a running profile on the Jenkins farm. Unfortunately, it stopped at the R tests due to the midnight reset; however, since this PR doesn't deliver the real code yet, the test results look sufficient for verification.

As the project structure for the new Graph component, this PR completes its scope. I'm going to merge this to the master branch for the next steps.

@gatorsmile (Member) commented

Checkstyle checks failed at the following occurrences:

```
[ERROR] Failed to execute goal on project spark-cypher_2.12: Could not resolve dependencies for project org.apache.spark:spark-cypher_2.12:jar:3.0.0-SNAPSHOT: Could not find artifact org.apache.spark:spark-graph-api_2.12:jar:3.0.0-SNAPSHOT in apache.snapshots (https://repository.apache.org/snapshots) -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :spark-cypher_2.12
Build step 'Execute shell' marked build as failure
Finished: FAILURE
```

This also breaks the style check. Could one of you investigate it ASAP? Otherwise we might need to revert this PR.

@rxin (Contributor) commented Jun 9, 2019

If you can't figure it out in a few minutes, the right way is to revert, fix, and then re-merge.

@s1ck (Contributor, Author) commented Jun 9, 2019

I can have a look, but I'm also surprised that CI didn't complain before the merge. Any ideas, @gatorsmile?

@s1ck (Contributor, Author) commented Jun 9, 2019

@gatorsmile Where did you find that log output? I can only see one failing test:

org.apache.spark.sql.kafka010.KafkaContinuousSourceSuite.subscribing topic by name from latest offsets (failOnDataLoss: true)

@dongjoon-hyun (Member) commented

We don't have a style issue.

```
$ dev/scalastyle
Scalastyle checks passed.

$ dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```

I'm going to monitor the ongoing Jenkins builds.

@dongjoon-hyun (Member) commented Jun 9, 2019

@s1ck. You can check the master lint result here.

It's now green. We are okay.

@dongjoon-hyun (Member) commented

The master branch shows green in the relevant Jenkins jobs, and Maven/Hadoop 2.7 also looks good so far. This commit is safe.

The root cause of the problem is not inside the Apache Spark repository: the AMPLab Jenkins script is still using the --force option, which was deprecated in Apache Spark 2.0.0. We should clean that up now; I can volunteer for it if I get access to the repo.

@srowen (Member) commented Jun 9, 2019

@dongjoon-hyun I have admin access. I think @shaneknapp can grant it, and interested committers should probably have it. Are there any more jobs to change? That one looks OK now.

@dongjoon-hyun (Member) commented

Oh, could you give me access too, please? I sent an email to Shane yesterday, but unfortunately he is off until the 13th.

We need to update the old Jenkins script and bring back #24824.

@srowen (Member) commented Jun 10, 2019

I just mean that I think I can modify the Jenkins job configs if that's what's needed. Do you know which ones need modifying, or are they not in the Jenkins job configs in the UI? I don't think I have the ability to grant admin access, or at least I don't know how to do that.

@dongjoon-hyun (Member) commented

Thanks. Then, at least for the three master-branch profiles, could you remove --force from the following MVN declaration? I'm not sure of the script's name, but the following is from the Jenkins output.

```
+ [[ master < branch-1.5 ]]
+ MVN='build/mvn --force -DzincPort=3428'
+ set +e
+ [[ hadoop-2.7 == hadoop-1 ]]
+ build/mvn --force -DzincPort=3428 -DskipTests -Phadoop-2.7 -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl -Pmesos clean package
```

FYI, you can remove --force from all active branches, because --force has been deprecated since 2.0.0.

@srowen (Member) commented Jun 10, 2019

OK, I see it now; I removed this from both of the mvn master builds.

@zsxwing (Member) commented Jun 10, 2019

IIUC, this looks like just the known issue with the mvn checkstyle:check command (called by the lint-java script): it requires that all dependencies of a module (even when those dependencies are just other sub-modules) be resolvable from the local repo (you can run mvn install to install the sub-modules) or from a remote repo.
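In other words, the local workaround is to install the new sub-modules into the local repo before linting, e.g.:

```
$ build/mvn -DskipTests install   # makes spark-graph-api etc. resolvable locally
$ dev/lint-java                   # checkstyle can now resolve the new modules
```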

It's green now because this Jenkins job (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-maven-snapshots/) pushed the artifacts to https://repository.apache.org/content/repositories/snapshots/, so Maven can now download all sub-module artifacts from that repo.

In general, the Java style check may fail when someone adds multiple Maven projects, until the artifacts are pushed to the Apache snapshot repo. But this should be fine, since that's rare and we push snapshots every day.

By the way, I also noticed that the Java style check is enabled in the PR build now:

`if not changed_files or any(f.endswith(".java")`

This issue may therefore happen in a PR build if the PR also adds Java files, which triggers the Java style check. The Java style check then needs to be disabled until the artifacts of this project are pushed to the Apache snapshot repo. I'm not sure where to document this. (I learned about this issue in one of my old PRs, #10744 (comment), and disabled the Java style check in 4bcea1b#diff-9657d18e98a7dc82ca214359cfd6bdc4.)

emanuelebardelli pushed a commit to emanuelebardelli/spark that referenced this pull request Jun 15, 2019

Closes apache#24490 from s1ck/SPARK-27300.

Lead-authored-by: Martin Junghanns <martin.junghanns@neotechnology.com>
Co-authored-by: Max Kießling <max@kopfueber.org>
Co-authored-by: Martin Junghanns <martin.junghanns@neo4j.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>