Spark 3.0 support #93

Closed
anujambekar opened this issue Jun 8, 2020 · 9 comments
@anujambekar commented Jun 8, 2020

The Spark 3.0 preview was released in December. Have you tested Spline with Spark 3.0? Are there any plans or a roadmap for Spark 3.0 support?

Btw, this tool is really awesome! Kudos for all the great work!

@wajda (Contributor) commented Jun 8, 2020

We haven't tried Spark 3.0 yet; it's not a priority for us. But adding support for it shouldn't be difficult.

@wajda transferred this issue from AbsaOSS/spline Jun 8, 2020
@wajda added the "feature" and "good first issue" labels Jun 8, 2020
@DaimonPl commented Jun 9, 2020

Yep, the tool is great, especially after the latest fixes :)

Spark 3 already has its 3rd release candidate, so the final version should arrive pretty soon: http://apache-spark-developers-list.1001551.n3.nabble.com/vote-Apache-Spark-3-0-RC3-td29499.html

@DaimonPl

Actually "final" 3.0.0 is already available on maven :)

https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.12/3.0.0

@wajda changed the title "Spark 3.0 (preview) support" → "Spark 3.0 support" Sep 21, 2020
@wajda (Contributor) commented Sep 30, 2020

I have tested Spark 3.0 compatibility after making changes to commons and the Spark agent. A few things we need to consider:

  1. Spark 3.0 uses json4s 3.6.6; I have created a pull request for that change to commons.

  2. The Spark agent should reference the appropriate commons version based on the Spark version; I handled that using Maven profiles (see the sketch below this quote). I will create a pull request, but the versioning is not right at the moment, as commons still needs to be published with the Spark 3.0 changes.

  3. sql.kafka010 at version 3.0.0 causes build failures, so I kept it at 2.4.4 for now. We will have to investigate further why ConsumerStrategy and AssignStrategy are not accessible when we use 3.0.

Originally posted by @uday1409 in AbsaOSS/commons#27 (comment)
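
A minimal sketch of what such profile-based selection might look like in the agent's pom.xml. The profile IDs, property names, and the versions paired with the Spark 2.4 line are illustrative assumptions, not the actual Spline build configuration:

```xml
<!-- Hypothetical sketch: selecting a Spark/json4s version pair via Maven profiles.
     Profile IDs, property names, and the 2.4-line versions are assumptions. -->
<profiles>
  <profile>
    <id>spark-2.4</id>
    <activation>
      <activeByDefault>true</activeByDefault> <!-- default build target -->
    </activation>
    <properties>
      <spark.version>2.4.4</spark.version>
      <json4s.version>3.5.3</json4s.version>
    </properties>
  </profile>
  <profile>
    <id>spark-3.0</id>
    <properties>
      <spark.version>3.0.0</spark.version>
      <json4s.version>3.6.6</json4s.version> <!-- the version Spark 3.0 ships, per point 1 -->
    </properties>
  </profile>
</profiles>
```

A build for a given Spark line would then be chosen with, e.g., `mvn package -P spark-3.0`.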

@wajda (Contributor) commented Sep 30, 2020

Regarding points No. 1 and 2: Commons 0.0.16 should support multiple json4s versions automatically, so no profiles are needed.

Re point No. 3: Kafka is an optional dependency, only used by the KafkaPlugin and tests, so it can be disabled to proceed with the remaining testing.
If the kafka-sql 2.4 and 3.0 APIs are incompatible, we could try to solve it by invoking that API reflectively (see the sketch below), or by using separate plugins (one per Kafka version) and bundling the proper version, matching the target Spark version, into the final assembly.
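
A rough illustration of the reflective approach; the object and method names here are hypothetical stand-ins, not the real spark-sql-kafka API:

```scala
// Hypothetical sketch: calling a version-specific method via Java reflection,
// so one agent jar can load cleanly against either Kafka connector version.
// The method name "topics" is an illustrative assumption.
object KafkaApiCompat {
  def extractTopics(relation: AnyRef): Option[Seq[String]] =
    try {
      // Resolve the method at runtime instead of at compile time,
      // avoiding a hard link-time dependency on either API variant.
      val m = relation.getClass.getMethod("topics")
      Option(m.invoke(relation)).map(_.asInstanceOf[Seq[String]])
    } catch {
      case _: NoSuchMethodException => None // method absent in this Spark version
    }
}
```

The plugin-per-version alternative avoids reflection entirely, at the cost of one extra module per supported Kafka connector.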

@shanebell commented Nov 19, 2020

Hi @wajda, I'd like to help progress this towards an official release of the Spline agent that supports Spark 3.

I've had a look through the "spark-3-support" branch. I made a few tweaks and managed to create a bundle jar that seems to work ok with a simple Spark 3 job (using PySpark; see the example below).

But I need some guidance on where to go from here. Looking at the "bundle-2.x" directories, there are a LOT of dependencies in each POM, and I'm not sure where to start to create one for Spark 3. For now I just copied the "bundle-2.4" directory into a new one called "bundle-3.0", and it seems to work ok, but I'm sure there are library versions that will need to be updated.

Any guidance you can give me would be greatly appreciated.
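
For context, a typical way to smoke-test such a bundle with PySpark is codeless initialization via spark-submit. A minimal sketch, where the bundle jar file name and the producer URL are assumptions for a local setup:

```sh
# Hypothetical smoke test: attach a locally built Spline bundle jar to a
# PySpark job. The jar file name and producer URL are assumptions.
spark-submit \
  --jars spark-3.0-spline-agent-bundle_2.12.jar \
  --conf "spark.sql.queryExecutionListeners=za.co.absa.spline.harvester.listener.SplineQueryExecutionListener" \
  --conf "spark.spline.producer.url=http://localhost:8080/producer" \
  my_job.py
```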

@wajda (Contributor) commented Nov 19, 2020

Thank you, Shane!
Those bundle POMs are created semi-automatically by running the jar-pommefizer tool against the respective Spark version's /jars directory.
The main idea behind the resulting fat-jar POMs is that we need to exclude all dependencies that are used by Spline core and are also available in the Spark runtime, so that we don't create classpath collisions when a Spline bundle jar is deployed to Spark.
For that, we look at Spark's /jars directory, resolve Maven coordinates for those jars where possible, and generate a huge pom.xml with all of the dependencies marked as provided (see the fragment below). Sometimes it works 100%; other times a few manual post-corrections are required.

As a suggestion, try to generate a POM for, e.g., Spark 2.4.4 using that method and compare it with bundle-2.4/pom.xml to get an idea of what kind of manual changes might be required.
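
Roughly the kind of entry such a generated POM is full of; the artifacts and versions below are examples, not the actual generated content:

```xml
<!-- Illustrative fragment of a generated bundle POM: every library already
     present in Spark's /jars directory is pinned and marked "provided",
     so it stays out of the Spline fat jar. Versions are examples only. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>3.0.0</version>
  <scope>provided</scope>
</dependency>
<dependency>
  <groupId>com.fasterxml.jackson.core</groupId>
  <artifactId>jackson-databind</artifactId>
  <version>2.10.0</version>
  <scope>provided</scope>
</dependency>
```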

@wajda added this to the 0.6.0 milestone Dec 8, 2020
@wajda pinned this issue Dec 8, 2020
@wajda unpinned this issue Dec 8, 2020
@wajda modified the milestones: 0.6.0 → 1.0.0 Dec 8, 2020
@lokm01 commented Feb 18, 2021

Hey all, has there been any progress from the community on testing with Spark 3 & Scala 2.12?

@wajda modified the milestones: 1.0.0 → 0.6.0 Feb 18, 2021
@wajda added the "dependency: Spark 3.0+" label and removed the "good first issue" label Feb 18, 2021
@wajda closed this as completed May 7, 2021