[Build] Add support for cross-Spark-version building and publishing #5418
base: master
Conversation
build.sbt (Outdated)

val sparkVersion = settingKey[String]("Spark version")

// Flag to control cross-Spark publishing: set -DcrossSparkPublish=true to enable
val crossSparkPublishEnabled = sys.props.getOrElse("crossSparkPublish", "false").toBoolean
This is awesome, nice work @tdas !
I am actually thinking of changing this slightly. If we release this as-is, Python support for Delta will break: pip-installed Delta dynamically injects Spark configuration to force Spark to download the Maven jar, and since the main jar name is changing, that resolution will break. My approach of not changing the Maven jar name by default avoids having to change the existing tests, but if we don't change the tests, the tests don't ensure that things will work post-release. So I am going to revise this.
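For context on the breakage described here: a minimal sketch, assuming the standard `spark.jars.packages` mechanism (this is not the exact code in `pip_utils.py`), of how pip-installed Delta points Spark at the Maven artifact. The coordinate string is what stops resolving if the published artifact name changes; the `3.4.0` version is an illustrative placeholder.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: pip-installed Delta effectively injects configuration like this so
// that Spark downloads the Delta jar from Maven at startup. If the published
// artifact were renamed (e.g. to delta-spark_4.0_2.13), this coordinate would no
// longer resolve.
val spark = SparkSession.builder()
  .appName("delta-pip-sketch")
  .config("spark.jars.packages", "io.delta:delta-spark_2.13:3.4.0") // placeholder version
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
    "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()
```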
Overall makes sense, just a few comments. I think it would be good if @scottsand-db takes a look when he gets back as well, since he did all the original cross-compilation work.
)

/** Latest released Spark version */
val LATEST_RELEASED = spark35
It's kind of weird that sometimes we determine isLatest by comparing with this and sometimes via the spec's isLatest argument; that seems like two sources of truth for the same thing. We can probably remove one.
yes. good catch. removed.
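A hedged sketch of the consolidation discussed above (field and object names follow the excerpts in this thread; the final API in the PR may differ): keep `isLatest` on the spec as the single source of truth and drop the string comparison.

```scala
// Sketch only: one source of truth for "is this the latest released Spark?".
case class SparkVersionSpec(fullVersion: String, isLatest: Boolean)

object SparkVersionSpec {
  val spark35: SparkVersionSpec = SparkVersionSpec("3.5.7", isLatest = true)
  val spark40: SparkVersionSpec = SparkVersionSpec("4.0.2-SNAPSHOT", isLatest = false)
  val LATEST_RELEASED: SparkVersionSpec = spark35
}

// Callers ask the spec directly instead of comparing version strings.
def isLatestReleasedVersion(spec: SparkVersionSpec): Boolean = spec.isLatest
```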
project/CrossSparkVersions.scala (Outdated)

 * Returns true if the current Spark version is the latest released version.
 */
def isLatestReleasedVersion(): Boolean = {
  getSparkVersion() == SparkVersionSpec.LATEST_RELEASED.fullVersion
Like, this could also be `getSparkVersionSpec().isLatest`.
project/CrossSparkVersions.scala (Outdated)

 * Returns the Spark binary version (major.minor) for the given full Spark version.
 * E.g., "3.5.7" -> "3.5", "4.0.2-SNAPSHOT" -> "4.0"
 */
def getSparkBinaryVersion(sparkVer: String): String = {
Same thing here: this is the exact same implementation as `SparkVersionSpec.shortVersion`. Why do we have both?
removed
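For reference, a minimal sketch of the "major.minor" derivation being deduplicated here (the real `SparkVersionSpec.shortVersion` may be implemented differently):

```scala
// Sketch only: derive the Spark binary version ("major.minor") from a full version.
def shortVersion(fullVersion: String): String =
  fullVersion.split('.').take(2).mkString(".")

assert(shortVersion("3.5.7") == "3.5")
assert(shortVersion("4.0.2-SNAPSHOT") == "4.0")
```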
 * Latest released Spark version: "module-name" (e.g., delta-spark)
 * Other Spark versions: "module-name_X.Y" (e.g., delta-spark_4.0)
Can you explain why we want to release like this? I worry it's confusing to sometimes have the suffix and sometimes not. Is there a main reason why we don't want to always have the suffix? (probably my preference)
Discussed offline. We are keeping the model where the default Delta artifact (without a Spark version suffix) always supports the latest Spark release. We can break that model in the next major Delta version release if we want to.
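A sketch of the naming model agreed on here (an illustrative helper, not the PR's exact `moduleName()` implementation): the plain artifact name always tracks the latest released Spark, and every other Spark line gets a `_X.Y` suffix.

```scala
// Sketch only: default module name for the latest released Spark,
// suffixed name for every other Spark line.
def moduleNameFor(base: String, sparkFullVersion: String, isLatest: Boolean): String =
  if (isLatest) base
  else s"${base}_${sparkFullVersion.split('.').take(2).mkString(".")}"

assert(moduleNameFor("delta-spark", "3.5.7", isLatest = true) == "delta-spark")
assert(moduleNameFor("delta-spark", "4.0.2-SNAPSHOT", isLatest = false) == "delta-spark_4.0")
```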
project/CrossSparkVersions.scala (Outdated)

 * @param allScalaVersions All Scala versions to cross-build
 * @param scala213 Scala 2.13 version string
 */
def crossSparkSettings(defaultScalaVersion: SettingKey[String], allScalaVersions: Seq[String], scala213: String): Seq[Setting[_]] = {
I find it kind of weird to pass these in here. Should we consider adding the scala version information to the SparkVersionSpec? This seems to make sense to me
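A sketch of this suggestion, with the Scala version information carried on the spec itself instead of being passed into `crossSparkSettings()`; the field names and Scala version strings are assumptions, not the PR's final API.

```scala
// Sketch only: each Spark line declares which Scala versions it supports.
case class SparkVersionSpec(
    fullVersion: String,
    defaultScalaVersion: String,      // Scala version used by default for this Spark line
    crossScalaVersions: Seq[String])  // all Scala versions to cross-build for this line

// Version strings are illustrative.
val spark35 = SparkVersionSpec("3.5.7", "2.12.18", Seq("2.12.18", "2.13.13"))
val spark40 = SparkVersionSpec("4.0.2-SNAPSHOT", "2.13.13", Seq("2.13.13"))

// crossSparkSettings() could then read spec.defaultScalaVersion and
// spec.crossScalaVersions instead of taking them as parameters.
```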
project/CrossSparkVersions.scala (Outdated)

scalaVersion := (if (spec.isLatest) defaultScalaVersion.value else scala213),
crossScalaVersions := (if (spec.isLatest) allScalaVersions else Seq(scala213)),
This logic is:
- if latest: use defaultScalaVersion.value
- if latest: cross compile all scala versions
- if master: only use scala213 for both
I'm wondering why we have this logic?
# Determine the artifact name based on Spark version
# For Spark 3.5.x (latest released): delta-spark_2.13
# For other versions (e.g., 4.0.x): delta-spark_<spark_binary>_2.13
#
# NOTE: When updating LATEST_RELEASED_SPARK_VERSION in project/CrossSparkVersions.scala,
# also update the version check here to match the new latest version.
If we always add the Spark version suffix, this logic becomes easier, right?
Which Delta project/connector is this regarding?
Description
This PR introduces cross-Spark version build support, enabling Delta Lake to publish artifacts for multiple Spark versions (Spark 3.5.7
and Spark 4.0.2-SNAPSHOT) with distinct module names.
Changes
1. New `project/CrossSparkVersions.scala` SBT plugin
   - `SparkVersionSpec` case class encapsulates all Spark version-specific configuration (JVM version, ANTLR version, source directories, Java options)
   - `spark35` (3.5.7), `spark40` (4.0.2-SNAPSHOT), plus `LATEST_RELEASED` and `MASTER` references
   - `getSparkVersion()`, `isLatestReleasedVersion()`, `isMasterVersion()`, `moduleName()`, `crossSparkSettings()`, and the `crossSparkRelease` command
2. Refactored `build.sbt` (simplified by 121 lines) to go through the `CrossSparkVersions` and `SparkVersionSpec` API calls (a wiring sketch follows after this list)
3. Updated `python/delta/pip_utils.py`: latest Spark → `delta-spark_2.13`, other versions → `delta-spark_{version}_2.13`
4. Added tests to verify locally published jar names
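As rough orientation, a hypothetical wiring sketch (not copied from the PR's `build.sbt`): `crossSparkSettings` follows the signature shown in the review excerpts above, while the project name, key definition, and Scala version strings are assumptions.

```scala
// Hypothetical sketch of how a module in build.sbt could consume the plugin.
import sbt._
import sbt.Keys._

val defaultScalaVersion = settingKey[String]("Default Scala version for the build")

lazy val spark = (project in file("spark"))
  .settings(defaultScalaVersion := "2.12.18")
  .settings(
    // Spark-version-dependent settings (scalaVersion, crossScalaVersions,
    // source directories, etc.) provided by the plugin.
    CrossSparkVersions.crossSparkSettings(
      defaultScalaVersion,
      allScalaVersions = Seq("2.12.18", "2.13.13"),
      scala213 = "2.13.13"
    ): _*
  )
```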
Why
This enables publishing multiple Spark version artifacts simultaneously. Previously, supporting both Spark 3.5.x stable and 4.0.x preview required manual build config switching. Now distinct artifacts (e.g., `delta-spark_4.0_2.13` vs `delta-spark_2.13`) allow parallel support.
Artifact Naming
- `delta-spark_2.13` (no Spark suffix) for the latest released Spark (3.5.x)
- `delta-spark_4.0_2.13` (includes the Spark binary version) for other Spark versions (e.g., 4.0.x)

How was this patch tested?
- Build verification
- Module naming verification
- Artifact packaging verification

Verified all 7 Spark-related modules produce correctly named artifacts for both versions.
Does this PR introduce any user-facing changes?
Yes - New artifact naming for non-latest Spark versions:
Python users: automatic - `configure_spark_with_delta_pip()` detects the Spark version and resolves the correct artifact.
Maven/SBT users: Must use new artifact names for non-latest Spark:
// Spark 3.5.x (unchanged)
"io.delta" %% "delta-spark" % "3.4.0"
// Spark 4.0.x (new)
"io.delta" %% "delta-spark_4.0" % "3.4.0"
Developers: New build commands:
- `build/sbt -DsparkVersion=master <task>` - Build for Spark master
- `build/sbt "crossSparkRelease publishM2"` - Run the task for all Spark versions
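For developers curious how such a command can work, a minimal sketch (this is not the PR's actual `crossSparkRelease` implementation, and it assumes the `sparkVersion` setting key shown earlier) of an sbt command that replays a task once per Spark version:

```scala
import sbt._

// Sketch only: replay the given task once per Spark version by prepending
// "set" commands and the task onto sbt's remaining command queue.
// Registered in build.sbt via: commands += crossSparkRelease
val sparkVersionsToRelease = Seq("3.5.7", "4.0.2-SNAPSHOT") // illustrative versions

val crossSparkRelease: Command = Command.single("crossSparkRelease") { (state, task) =>
  sparkVersionsToRelease.foldRight(state) { (version, s) =>
    s"""set every sparkVersion := "$version"""" :: task :: s
  }
}
```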