[SPARK-27300][GRAPH] Add Spark Graph modules and dependencies #24490
Conversation
@mengxr would be great if you could have a look

ok to test

Test build #105018 has finished for PR 24490 at commit

Test build #105019 has finished for PR 24490 at commit
dongjoon-hyun
left a comment
For now, I added

Retest this please.
dev/deps/spark-deps-hadoop-2.7
Those dependencies are part of the OKAPI stack which is responsible for translating Cypher to relational operations. I understand that adding a lot of dependencies is bad practice for a single PR. By excluding them, we basically require the user to add explicit dependencies? If yes, it doesn't sound like an elegant solution to me.
@s1ck . The alternative in my mind is making these into external modules like external/avro and external/kafka-0-10.
Hi, @rxin , @mengxr , @gatorsmile . Can we make new Graph modules as external modules instead? For now, this PR looks too intrusive to me.
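For context, an "external module" here means an artifact that ships separately from the main Spark distribution and is pulled in explicitly by the user, the way `spark-avro` is consumed. A hedged sketch of what that would look like for the Graph modules in an sbt build (the coordinates are illustrative assumptions, not published artifacts):

```scala
// Hypothetical user-side sbt dependencies if spark-cypher were an
// external module like external/avro. Versions and artifact names
// are assumptions for illustration only.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"    % "3.0.0" % Provided,
  "org.apache.spark" %% "spark-cypher" % "3.0.0"
)
```

The trade-off being discussed: this keeps the core distribution lean, but pushes the dependency-management burden onto users.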
@dongjoon-hyun This module adds a significant feature in a major Spark release. Is this not a reason to have a more relaxed qualification on dependent libraries?
KAFKA and AVRO also do.
@dongjoon-hyun What are the necessary steps to update the SBT build file?

For the Scala build, we need to update
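For reference, Spark's sbt build is defined in `project/SparkBuild.scala`, and new Maven modules need matching sbt project entries there so `build/sbt` sees them. A rough sketch of the kind of change involved (the module names come from this PR; the surrounding structure is approximated, not an exact diff):

```scala
// Sketch only: each new Maven module gets a matching sbt ProjectRef
// so the sbt build stays in sync with the Maven build.
// The exact variable names in SparkBuild.scala are assumptions here.
val graphProjects @ Seq(graphApi, cypher, graph) =
  Seq("graph-api", "cypher", "graph").map(ProjectRef(buildLocation, _))

// ...and the new references are appended to the aggregate project list,
// e.g. something like: allProjects ++= graphProjects
```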
Test build #105043 has finished for PR 24490 at commit
|
|
Test build #105047 has finished for PR 24490 at commit
|
|
@dongjoon-hyun I'm working with @s1ck to create a shared jar of |
|
That's nice. Thank you for updating, @mengxr ! |
|
Here are a flattened list of runtime transitive dependencies of All are licensed under either Apache or MIT. Some test packages got pulled into runtime. I submitted a clean-up PR here: opencypher/morpheus#907 |
|
Hi, @s1ck . Could you update this PR if the dependency is reduced? |
|
@dongjoon-hyun We are currently looking into solving an issue around shading Scala libs. @mengxr described the issue here: https://contributors.scala-lang.org/t/tools-for-shading-a-scala-library/3317 Do you have experience in shading Scala libs without running into downstream issues? |
felixcheung
left a comment
At least, the newly brought-in dependencies would need to be listed in the license files.
@dongjoon-hyun We have successfully reduced the dependencies using a shaded jar including relocated dependencies. Please have a second look.

I updated
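Relocation with the `maven-shade-plugin` typically looks like the fragment below; the package pattern here is illustrative, not the exact relocation used in this PR:

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <!-- Move the bundled third-party classes into a private namespace
           so they cannot clash with user-provided versions of the same
           library. The pattern below is a hypothetical example. -->
      <relocation>
        <pattern>org.opencypher</pattern>
        <shadedPattern>org.apache.spark.graph.shaded.org.opencypher</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
    </execution>
  </executions>
</plugin>
```

Note that bytecode relocation only rewrites class references; as the linked Scala contributors thread discusses, Scala libraries also carry metadata (e.g. for macros and implicits) that plain relocation does not rewrite, which is the downstream issue being investigated.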
Test build #105860 has finished for PR 24490 at commit
|
|
Test build #105862 has finished for PR 24490 at commit
|
|
Looks like the unit tests are failing in the |
|
@dongjoon-hyun Could you please have another look at the PR? Thanks! |
|
Retest this please. |
|
Retest this please. |
|
Test build #106237 has finished for PR 24490 at commit
|
|
@zsxwing Thanks! That fixed the build. @srowen @dongjoon-hyun Could you please have another look and finish your reviews? Thanks. |
|
Retest this please. |
|
Retest this please. |
|
Retest this please. |
|
Test build #106315 has finished for PR 24490 at commit
|
|
Retest this please. |
|
Test build #106321 has finished for PR 24490 at commit
|
|
Test build #106316 has finished for PR 24490 at commit
|
|
Test build #106317 has finished for PR 24490 at commit
|
dongjoon-hyun
left a comment
+1, LGTM. Thank you so much for preparing this for a long time, @s1ck .
Thank you for reviews, @srowen , @mengxr, @zsxwing , @felixcheung , @wangyum.
I also built and tested the artifact generation locally, and Jenkins tested sbt/hadoop-2.7 and maven/hadoop-2.7. Additionally, maven/hadoop-3.2 was also triggered because it's a running profile on the Jenkins farm. Unfortunately, it stopped at the R tests due to the midnight reset. However, since this PR doesn't deliver the real code, the test results look sufficient for verification.
This PR completes the scope of the project structure for the new Graph component. I'm going to merge this to the master branch for the next steps.
This also breaks the style check. Could any of you investigate it ASAP? Otherwise we might need to revert this PR.

If you can't figure it out in a few minutes, the right way is to revert, fix, and then remerge.

I can have a look, but I'm also surprised that CI didn't complain before the merge. Any ideas, @gatorsmile?

@gatorsmile Where did you find that log output? I can only see one failing test:
We don't have a style issue. I'm going to monitor the ongoing Jenkins builds.

@s1ck. You can check the master lint result here. It's now green. We are okay.

The root cause of the problem is not inside the Apache Spark repository. The AmpLab Jenkins script is still using

@dongjoon-hyun I have admin access. I think @shaneknapp can grant it, and interested committers should probably have it. Are there any more jobs to change, or does that look OK now?

Oh, could you give me the access, too? Thanks. I sent an email to Shane yesterday, but he is off until the 13th, unfortunately. We need to update the old Jenkins script and bring back #24824.

I just mean I think I can modify Jenkins job configs if that's what's needed. Do you know which ones need modifying, or are they not in the Jenkins job configs in the UI? I don't think I have the ability to grant admin access, or don't know how to do that.

Thanks. Then, at least for those three profiles, you can remove

OK, I see it now. I removed this from both of the mvn master builds.
IIUC, this looks like it is just because of the known issue in

It's green now because this Jenkins job (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-maven-snapshots/) pushed the artifacts to https://repository.apache.org/content/repositories/snapshots/. Hence Maven can now download all artifacts of the sub-modules from this repo. In general, the Java style check may fail if someone adds multiple Maven projects, until the artifacts are pushed to the Apache snapshot repo. But this should be fine since that's rare and we push snapshots every day.

By the way, I also noticed that the Java style check is enabled in the PR build now: Line 593 in 742f805
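The resolution path described above depends on the ASF snapshot repository being visible to Maven. A minimal `pom.xml` repository fragment using the standard ASF snapshot coordinates (shown for reference; Spark's parent POM already wires this up):

```xml
<repositories>
  <repository>
    <id>apache-snapshots</id>
    <url>https://repository.apache.org/content/repositories/snapshots/</url>
    <!-- SNAPSHOT artifacts (e.g. the new graph sub-modules) are only
         resolvable here after the nightly snapshot job has pushed them. -->
    <snapshots><enabled>true</enabled></snapshots>
    <releases><enabled>false</enabled></releases>
  </repository>
</repositories>
```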
## What changes were proposed in this pull request?

This PR introduces the necessary Maven modules for the new [Spark Graph](https://issues.apache.org/jira/browse/SPARK-25994) feature for Spark 3.0.

* `spark-graph` is a parent module that users depend on to get all graph functionalities (Cypher and Graph Algorithms)
* `spark-graph-api` defines the [Property Graph API](https://docs.google.com/document/d/1Wxzghj0PvpOVu7XD1iA8uonRYhexwn18utdcTxtkxlI) that is shared between Cypher and Algorithms
* `spark-cypher` contains a Cypher query engine implementation

Both `spark-graph-api` and `spark-cypher` depend on Spark SQL.

Note that the Maven module for Graph Algorithms is not part of this PR and will be introduced in https://issues.apache.org/jira/browse/SPARK-27302

A PoC for a running Cypher implementation can be found in the WIP PR apache#24297

## How was this patch tested?

Passed Jenkins with all profiles, and manually built and checked the following:

```
$ ls assembly/target/scala-2.12/jars/spark-cypher*
assembly/target/scala-2.12/jars/spark-cypher_2.12-3.0.0-SNAPSHOT.jar

$ ls assembly/target/scala-2.12/jars/spark-graph* | grep -v graphx
assembly/target/scala-2.12/jars/spark-graph-api_2.12-3.0.0-SNAPSHOT.jar
assembly/target/scala-2.12/jars/spark-graph_2.12-3.0.0-SNAPSHOT.jar
```

Closes apache#24490 from s1ck/SPARK-27300.

Lead-authored-by: Martin Junghanns <martin.junghanns@neotechnology.com>
Co-authored-by: Max Kießling <max@kopfueber.org>
Co-authored-by: Martin Junghanns <martin.junghanns@neo4j.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>