Conversation

@tdas (Contributor) commented Oct 30, 2025

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here) - Overall build

Description

This PR introduces cross-Spark version build support, enabling Delta Lake to publish artifacts for multiple Spark versions (Spark 3.5.7
and Spark 4.0.2-SNAPSHOT) with distinct module names.

Changes

1. New project/CrossSparkVersions.scala SBT plugin

  • SparkVersionSpec case class encapsulates all Spark version-specific configuration (JVM version, ANTLR version, source directories, Java options); a sketch follows this list
  • Companion object defines specs: spark35 (3.5.7), spark40 (4.0.2-SNAPSHOT), plus LATEST_RELEASED and MASTER references
  • Plugin provides: getSparkVersion(), isLatestReleasedVersion(), isMasterVersion(), moduleName(), crossSparkSettings(), and
    crossSparkRelease command
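
For illustration, here is a minimal sketch of what such a spec might look like. The field names and the JVM/ANTLR/source-directory values below are readability assumptions, not the plugin's exact API; only the full Spark versions, shortVersion, isLatest, LATEST_RELEASED, and MASTER come from this PR.

// Hypothetical sketch; the real definitions live in project/CrossSparkVersions.scala
// and may use different field names and values.
case class SparkVersionSpec(
    fullVersion: String,             // e.g. "3.5.7" or "4.0.2-SNAPSHOT"
    targetJvm: String,               // JVM version for this Spark line (illustrative)
    antlrVersion: String,            // ANTLR version matching Spark's parser (illustrative)
    extraSourceDirs: Seq[String],    // version-specific source directories (illustrative)
    extraJavaOptions: Seq[String],   // extra JVM options for this Spark line
    isLatest: Boolean                // whether this is the latest released Spark
) {
  // Spark binary (major.minor) version, e.g. "3.5.7" -> "3.5"
  def shortVersion: String = fullVersion.split('.').take(2).mkString(".")
}

object SparkVersionSpec {
  val spark35: SparkVersionSpec =
    SparkVersionSpec("3.5.7", "11", "4.9.3", Seq("scala-spark-3.5"), Seq.empty, isLatest = true)
  val spark40: SparkVersionSpec =
    SparkVersionSpec("4.0.2-SNAPSHOT", "17", "4.13.1", Seq("scala-spark-master"), Seq.empty, isLatest = false)

  /** Latest released Spark version */
  val LATEST_RELEASED: SparkVersionSpec = spark35
  /** Spark master */
  val MASTER: SparkVersionSpec = spark40
}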

2. Refactored build.sbt (simplified by 121 lines)

  • Moved Spark version logic to new plugin
  • Added version-aware module naming to all Spark modules
  • Replaced local constants with CrossSparkVersions and SparkVersionSpec API calls
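
As a rough illustration, a Spark module definition in build.sbt can then derive its name and version-specific settings from the plugin instead of local constants. This is a hedged sketch: the crossSparkSettings signature matches the one discussed in this PR, but the moduleName helper's exact signature and the setting list are assumptions.

// Hypothetical build.sbt fragment; the real module definitions may differ.
lazy val spark = (project in file("spark"))
  .settings(
    // The plugin appends the Spark binary suffix only for non-latest Spark:
    // "delta-spark" on Spark 3.5.7, "delta-spark_4.0" on Spark master.
    moduleName := CrossSparkVersions.moduleName("delta-spark"),
    // Version-aware scalaVersion, crossScalaVersions, extra source directories,
    // ANTLR dependency, and Java options come from the plugin.
    CrossSparkVersions.crossSparkSettings(defaultScalaVersion, allScalaVersions, scala213)
  )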

3. Updated python/delta/pip_utils.py

  • Auto-detects PySpark version and resolves correct Delta artifact
  • Spark 3.5.x → delta-spark_2.13; other Spark versions → delta-spark_<spark_binary>_2.13 (e.g., delta-spark_4.0_2.13)

4. Added tests to verify locally published jar names

  • New GitHub Action for testing the build
  • Test script: project/tests/test_cross_spark_publish.py

Why

This enables publishing multiple Spark version artifacts simultaneously. Previously, supporting both Spark 3.5.x stable and 4.0.x preview
required manual build config switching. Now distinct artifacts (e.g., delta-spark_4.0_2.13 vs delta-spark_2.13) allow parallel
support.

Artifact Naming

  • Latest Spark (3.5.7): delta-spark_2.13 (no Spark suffix)
  • Other Spark (4.0.x): delta-spark_4.0_2.13 (includes Spark binary version)
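
The suffix rule itself is small; a hedged sketch, reusing the hypothetical SparkVersionSpec fields from the sketch above:

// Hypothetical sketch of the naming rule in project/CrossSparkVersions.scala.
def sparkSuffixedModuleName(baseName: String, spec: SparkVersionSpec): String =
  if (spec.isLatest) baseName                      // latest release: "delta-spark"
  else s"${baseName}_${spec.shortVersion}"         // others: "delta-spark_4.0"

// sbt still appends the Scala binary suffix as usual, so the published
// artifacts are delta-spark_2.13 and delta-spark_4.0_2.13 respectively.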

How was this patch tested?

Build verification:

build/sbt compile                          # Spark 3.5.7 ✓
build/sbt -DsparkVersion=master compile    # Spark 4.0.2-SNAPSHOT ✓
build/sbt -DsparkVersion=3.5 compile       # Alias support ✓

Module naming verification:

build/sbt "show spark/moduleName"                    # → delta-spark
build/sbt -DsparkVersion=master "show spark/moduleName"  # → delta-spark_4.0

Artifact packaging verification:

  • ✅ delta-spark_2.13-3.4.0-SNAPSHOT.jar (Spark 3.5.7)
  • ✅ delta-spark_4.0_2.13-3.4.0-SNAPSHOT.jar (Spark master)
  • ✅ delta-connect-client_4.0_2.13-3.4.0-SNAPSHOT.jar (Spark master)

Verified all 7 Spark-related modules produce correctly named artifacts for both versions.

Does this PR introduce any user-facing changes?

Yes - New artifact naming for non-latest Spark versions:

Python users: Automatic; configure_spark_with_delta_pip() detects the Spark version and resolves the correct artifact.

Maven/SBT users: Must use new artifact names for non-latest Spark:
// Spark 3.5.x (unchanged)
"io.delta" %% "delta-spark" % "3.4.0"

// Spark 4.0.x (new)
"io.delta" %% "delta-spark_4.0" % "3.4.0"

Developers: New build commands:

  • build/sbt -DsparkVersion=master <task> - Build for Spark master
  • build/sbt "crossSparkRelease publishM2" - Run task for all Spark versions

@tdas changed the title from "enable spark cross publishing" to "[Build] Add support for cross-Spark-version building and publishing" on Oct 30, 2025
build.sbt (Outdated)
val sparkVersion = settingKey[String]("Spark version")

// Flag to control cross-Spark publishing: set -DcrossSparkPublish=true to enable
val crossSparkPublishEnabled = sys.props.getOrElse("crossSparkPublish", "false").toBoolean
Collaborator

This is awesome, nice work @tdas!

@tdas (Contributor, Author) commented Oct 30, 2025

I am actually thinking of changing this slightly. If we release this, Python support for Delta will break: pip-installed Delta dynamically injects Spark configuration to force Spark to download the Maven jar, and since the main jar name is changing, that will break.

So my approach of not changing the Maven jar name by default keeps the existing tests unchanged. But if we don't change the tests, the tests don't ensure that things will work post-release. So I am going to revise this.

@allisonport-db (Collaborator) left a comment

Overall this makes sense, just a few comments. I think it would be good if @scottsand-db takes a look when he gets back as well, since he did all the original cross-compilation work.

)

/** Latest released Spark version */
val LATEST_RELEASED = spark35
Collaborator

It's kind of weird that sometimes we determine isLatest by comparing with this and sometimes via the spec's isLatest arg; that seems like two sources of truth. We can probably remove one.

@tdas (Author)

yes. good catch. removed.

* Returns true if the current Spark version is the latest released version.
*/
def isLatestReleasedVersion(): Boolean = {
getSparkVersion() == SparkVersionSpec.LATEST_RELEASED.fullVersion
Collaborator

like this could also be getSparkVersionSpec().isLatest

* Returns the Spark binary version (major.minor) for the given full Spark version.
* E.g., "3.5.7" -> "3.5", "4.0.2-SNAPSHOT" -> "4.0"
*/
def getSparkBinaryVersion(sparkVer: String): String = {
Collaborator

Same thing here: this is the exact same impl as SparkVersionSpec.shortVersion. Why do we have both?

@tdas (Author)

removed

Comment on lines +167 to +168
* Latest released Spark version: "module-name" (e.g., delta-spark)
* Other Spark versions: "module-name_X.Y" (e.g., delta-spark_4.0)
Collaborator

Can you explain why we want to release like this? I worry it's confusing to sometimes have the suffix and sometimes not. Is there a main reason why we don't want to always have the suffix? (probably my preference)

@tdas (Author)

Discussed offline. We are keeping the model where the default Delta artifact (without a Spark version suffix) always supports the latest Spark release. We can break that model in the next major Delta version release if we want to.

* @param allScalaVersions All Scala versions to cross-build
* @param scala213 Scala 2.13 version string
*/
def crossSparkSettings(defaultScalaVersion: SettingKey[String], allScalaVersions: Seq[String], scala213: String): Seq[Setting[_]] = {
Collaborator

I find it kind of weird to pass these in here. Should we consider adding the scala version information to the SparkVersionSpec? This seems to make sense to me

Comment on lines 191 to 192
scalaVersion := (if (spec.isLatest) defaultScalaVersion.value else scala213),
crossScalaVersions := (if (spec.isLatest) allScalaVersions else Seq(scala213)),
Collaborator

This logic is:

  • if latest: use defaultScalaVersion.value
  • if latest: cross-compile all Scala versions
  • if master: only use scala213 for both

I'm wondering why we have this logic?

Comment on lines +83 to +88
# Determine the artifact name based on Spark version
# For Spark 3.5.x (latest released): delta-spark_2.13
# For other versions (e.g., 4.0.x): delta-spark_<spark_binary>_2.13
#
# NOTE: When updating LATEST_RELEASED_SPARK_VERSION in project/CrossSparkVersions.scala,
# also update the version check here to match the new latest version.
Collaborator

If we always add the Spark version suffix, this logic becomes easier, right?
