Support Spark 3.4 #1720

Closed

Conversation

@allisonport-db (Collaborator) commented Apr 25, 2023

Description

Makes changes to support Spark 3.4. These include the necessary compile changes, plus test and code changes due to changes in Spark behavior.

Some of the bigger changes include

  • A number of changes regarding error classes, including:
    • Spark 3.4 changed class ErrorInfo to private. This means the current approach in DeltaThrowableHelper can no longer work. We now use ErrorClassJsonReader instead (these are the changes to DeltaThrowableHelper and DeltaThrowableSuite).
    • Many error functions switched their first argument from message: String to errorClass: String. This does not cause a compile error, but instead causes a "SparkException-error not found" when called. Affected call sites include ParseException(...) and a.failAnalysis(...).
    • Adds support for error subclasses.
  • Spark 3.4 supports insert-into-by-name and no longer reorders such queries to be insert-into-by-ordinal. See [SPARK-41806][SQL] Use AppendData.byName for SQL INSERT INTO by name for DSV2 apache/spark#39334. In DeltaAnalysis.scala we need to perform schema validation checks and schema evolution for such queries; previously we only matched plans where !isByName.
  • SPARK-27561 added support for lateral column aliases. This broke our generation expression validation checks for generated columns. We now separately check for generated columns that reference other generated columns in GeneratedColumn.scala.
  • DelegatingCatalogExtension deprecates createTable(..., schema: StructType, ...) in favor of createTable(..., columns: Array[Column], ...).
  • _metadata.file_path is not always encoded. We update DeleteWithDeletionVectorsHelper.scala to accommodate this.
  • Support for SQL REPLACE WHERE (see the usage sketch after this list). [TESTS IN FOLLOW-UP PR]
  • Misc test changes due to minor changes in Spark behavior or error messages.
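For illustration, the REPLACE WHERE support enables a SQL shape roughly like the following. This is a hedged sketch: the table, column, and predicate names are made up, `spark` is assumed to be a SparkSession with Delta configured, and the authoritative grammar lands with the follow-up tests.

```scala
// Atomically replace only the rows matching the predicate with the query's result.
spark.sql("""
  INSERT INTO events
  REPLACE WHERE eventDate >= '2023-04-01'
  SELECT * FROM staged_events
""")
```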

Resolves #1696

How was this patch tested?

Existing tests should suffice, since there are no major Delta behavior changes besides support for REPLACE WHERE, for which we have added tests.

Does this PR introduce any user-facing changes?

Yes. Spark 3.4 will be supported. REPLACE WHERE is supported in SQL.

*/
private def needsSchemaAdjustmentByName(query: LogicalPlan, targetAttrs: Seq[Attribute],
deltaTable: DeltaTableV2): Boolean = {
// TODO: update this to allow columns with default expressions to not be
Collaborator Author:

This will come in a follow-up PR with tests

Contributor:

What does this mean? Are there any current tests covering this function... or is the entire thing TODO?

Contributor:

The last thing we want is some half-implemented, incorrect functionality.

Collaborator Author:

We add the code to support the new functionality (not specifying generated columns in insert-into-by-name statements), but we don't allow it in this PR. Basically, in Spark 3.3 insert into by name only worked when all the columns were specified and otherwise would throw an error. We maintain that behavior here (lines 809-812), so there is no change in behavior (and existing tests should suffice).

Once we remove the check on lines 809-812 we will have full support for insert into by name with generated columns. I'll remove it and add the tests in a follow-up PR.
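A hedged, usage-level sketch of the behavior being preserved (table and column names are made up; assume `g` has an ordinary column `id` and a generated column `gen` defined as `id + 1`, and `spark` is a Delta-enabled SparkSession):

```scala
// All columns named explicitly: worked in Spark 3.3 and still works after this PR.
spark.sql("INSERT INTO g (id, gen) SELECT id, id + 1 FROM src")

// Omitting the generated column: still rejected by the check on lines 809-812;
// allowing this is the follow-up that will come with tests.
spark.sql("INSERT INTO g (id) SELECT id FROM src")
```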

Collaborator Author:

Since we now have to do schema validation / evolution for by-name queries, it's easier to add all this code now but artificially block the new functionality, so that the tests can come in a follow-up PR and the amount of code in this PR stays smaller.

import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.catalyst.parser.{ParseException, ParserUtils}
import org.apache.spark.sql.catalyst.trees.Origin

class DeltaAnalysisException(
Collaborator Author:

We add these parameters so we don't lose information in our exceptions thrown in deltaMerge.scala that used to use a.failAnalysis(...)

// ParseException(errorClass, ...)
// Instead of passing just a message here, we could enforce creating an errorClass for each
// invocation and make this DeltaParseException(errorClass, ctx)
class DeltaParseException(
Collaborator Author:

In Spark 3.3 there is a constructor ParseException(message, ctx) but in 3.4 it is ParseException(errorClass, ctx) (and that error class would need to be a Spark error class).

Here we enable creating a ParseException from a message and context.

@@ -198,8 +198,7 @@ case class WriteIntoDelta(
   }
 }
 val rearrangeOnly = options.rearrangeOnly
-// TODO: use `SQLConf.READ_SIDE_CHAR_PADDING` after Spark 3.4 is released.
-val charPadding = sparkSession.conf.get("spark.sql.readSideCharPadding", "false") == "true"
+val charPadding = sparkSession.conf.get(SQLConf.READ_SIDE_CHAR_PADDING.key, "false") == "true"
Collaborator Author:

The default in Spark is "true" so we cannot use sparkSession.conf.get(SQLConf.READ_SIDE_CHAR_PADDING)
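To spell out the distinction, a minimal sketch (the "true" default is per the comment above; `sparkSession` is the session already in scope in WriteIntoDelta):

```scala
import org.apache.spark.sql.internal.SQLConf

// Typed-entry getter: falls back to the conf entry's own Spark 3.4 default, i.e. true.
val sparkDefault = sparkSession.conf.get(SQLConf.READ_SIDE_CHAR_PADDING)

// By-key getter with an explicit fallback: keeps Delta's previous default of false
// whenever the user has not set the conf.
val deltaDefault = sparkSession.conf.get(SQLConf.READ_SIDE_CHAR_PADDING.key, "false") == "true"
```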

@@ -424,7 +424,12 @@ object DeltaMergeInto {
 // Note: This will throw error only on unresolved attribute issues,
 // not other resolution errors like mismatched data types.
 val cols = "columns " + plan.children.flatMap(_.output).map(_.sql).mkString(", ")
-a.failAnalysis(msg = s"cannot resolve ${a.sql} in $mergeClauseType given $cols")
+// todo: added a new Delta error for this to avoid rewriting tests, but existing
Contributor:

Will there be a follow-up PR for this? If so, please clarify what needs to be done. This is a little vague for a TODO.

Suggested change:
-// todo: added a new Delta error for this to avoid rewriting tests, but existing
+// TODO: added a new Delta error for this to avoid rewriting tests, but existing

Collaborator Author:

Ah, just a note to myself / reviewers on the other option. I think I prefer using the new Delta error; I will remove this comment.

Collaborator Author:

It's like rewriting 50+ tests otherwise...

@@ -85,6 +86,18 @@ class DeltaAnalysis(session: SparkSession)
}


// INSERT INTO by name
Contributor:

Is this new functionality? If not, how did the corresponding command work before with this case statement? What changed with Spark 3.4?

Collaborator Author:

Spark 3.3:

  • insert into by name queries
    • if all columns are specified --> rearrange the order and convert to an insert-by-ordinal query
    • if not all columns are specified --> exception (missing columns)

In Spark 3.4, insert into by name queries for DSV2 are no longer converted to by-ordinal (apache/spark#39334).

So before, insert-by-name queries would be converted to insert-by-ordinal and would match the above AppendDelta pattern.
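To make the difference concrete, a hedged sketch (the table names are made up, `spark` is a Delta-enabled SparkSession, and the plan shapes follow the SPARK-41806 description above):

```scala
// Assume Delta table `t` has columns (a INT, b INT) and `src` has columns (b INT, a INT).
// A by-name insert: the column list, not the position, decides where each value goes.
spark.sql("INSERT INTO t (b, a) SELECT b, a FROM src")

// Spark 3.3: the analyzer reordered the projection and produced a by-ordinal AppendData,
// so the query matched Delta's existing (!isByName) resolution path.
// Spark 3.4: the plan stays a by-name AppendData (isByName = true), which is why
// DeltaAnalysis now needs its own schema validation / evolution handling for this shape.
```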

Collaborator Author:

This is mentioned in the PR description albeit a little less thoroughly :)

Contributor:

Understood. Thank you for explaining in detail.

@@ -48,3 +58,17 @@ class DeltaUnsupportedOperationException(
override def getErrorClass: String = errorClass
def getMessageParametersArray: Array[String] = messageParameters
}

// todo: we had to add this since in Spark 3.4 ParseException(message, ...) was replaced by
@tdas (Contributor) commented Apr 28, 2023:

nit: TODOs are generally capitalized. Please fix them all over this PR.

Collaborator Author:

Sorry, I also didn't plan to merge this TODO. It's mostly a note to myself or reviewers. I think it's easiest not to add an error class for every invocation in DeltaSqlParser, but I'm not super clear on how strictly we want to enforce using the Spark error framework. I'm happy to do it if we want to stick to the framework.

@@ -759,15 +830,16 @@ class DeltaAnalysis(session: SparkSession)
 Cast(input, dt, Option(timeZone), ansiEnabled = false)
 case SQLConf.StoreAssignmentPolicy.ANSI =>
 (input: Expression, dt: DataType, name: String) => {
-AnsiCast(input, dt, Option(timeZone))
+val cast = Cast(input, dt, Option(timeZone), ansiEnabled = true)
Contributor:

What is this change for?

Collaborator Author:

Compile error; AnsiCast was removed in 3.4 and consolidated into a single Cast class.
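For reference, a small hedged sketch of the consolidated API (the call shape mirrors the diff above; the evaluation results reflect my understanding of legacy vs. ANSI cast semantics):

```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.IntegerType

// Spark 3.3: AnsiCast(input, dt, Option(timeZone)) was a separate expression node.
// Spark 3.4: the same Cast node carries the ANSI flag instead.
val lenient = Cast(Literal("abc"), IntegerType, None, ansiEnabled = false)
val strict  = Cast(Literal("abc"), IntegerType, None, ansiEnabled = true)

lenient.eval() // null: legacy behavior silently returns null for an invalid cast
strict.eval()  // throws: ANSI behavior fails instead of returning null
```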

* - A generated column references itself
* - A generated column references another generated column
*/
def validateColumnReferences(
Contributor:

is there test coverage for this?

Collaborator Author:

Yes, in GeneratedColumnSuite; tests fail without this change.
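A hedged, usage-level sketch of the case the check rejects, written against the DeltaTable builder API (table and column names are made up, `spark` is assumed to be a Delta-enabled SparkSession, and the rejection point is as described above):

```scala
import io.delta.tables.DeltaTable

// `b` is generated from a regular column: fine.
// `c` is generated from another generated column (`b`): this is what
// validateColumnReferences now rejects explicitly, instead of the expression
// silently resolving via Spark 3.4's lateral column alias support.
DeltaTable.create(spark)
  .tableName("gen_demo")
  .addColumn("a", "INT")
  .addColumn(
    DeltaTable.columnBuilder("b").dataType("INT").generatedAlwaysAs("a + 1").build())
  .addColumn(
    DeltaTable.columnBuilder("c").dataType("INT").generatedAlwaysAs("b + 1").build())
  .execute()
```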

@@ -73,10 +73,12 @@ def tearDown(self) -> None:
 shutil.rmtree(self.tempPath)

 def test_maven_jar_loaded(self) -> None:
-packages: List[str] = self.spark.conf.get("spark.jars.packages").split(",")

+packagesConf: Optional[str] = self.spark.conf.get("spark.jars.packages")
Contributor:

What is this change? This seems like a weird piece of code; it could use more inline docs.

Collaborator Author:

Spark 3.4 changed the return type of conf.get from str to Optional[str]. The mypy checks fail without this specific assert statement.

-packages: List[str] = self.spark.conf.get("spark.jars.packages").split(",")

+packagesConf: Optional[str] = self.spark.conf.get("spark.jars.packages")
+assert packagesConf is not None # mypi needs this to assign type str
@allisonport-db (Collaborator Author) commented Apr 28, 2023:

Suggested change:
-assert packagesConf is not None # mypi needs this to assign type str
+assert packagesConf is not None # mypi needs this to assign type str from Optional[str]

@tdas does this help? Since the types are on the variable definition, I don't know whether docs mentioning the return type of conf.get will help.

I found all this Python typing a little confusing... i.e., type Optional[str] is either type str or None.

Collaborator Author:

Maybe there is a better way to do this; I had to google it.

 val matchedRowsDf = targetDf
-  .withColumn(FILE_NAME_COL, col(s"${METADATA_NAME}.${FILE_PATH}"))
+  .withColumn(FILE_NAME_COL, uriEncode(col(s"${METADATA_NAME}.${FILE_PATH}")))
Contributor:

what is this change for?

Contributor:

what does this fix? does it have test coverage?

Collaborator Author:

#1725

Tests fail without the fix. I'm adding a TODO in the code here to provide a more robust solution for 3.4.0 and 3.4.1.
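A small hedged illustration of the mismatch behind the fix (the file name is made up, and whether the helper encodes via java.net.URI or Hadoop's Path is an assumption; the point is the percent-encoding difference):

```scala
// _metadata.file_path can come back with raw characters (e.g. a literal space) in Spark 3.4,
// while paths recorded in the Delta log are URI-encoded, so an unencoded value would not
// match the corresponding AddFile path without re-encoding it first.
val raw = "part-00000 1234.snappy.parquet"
val encoded = new java.net.URI(null, null, raw, null).toString
// encoded == "part-00000%201234.snappy.parquet"
```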

@allisonport-db (Collaborator Author):

Closed by 5c3f4d3

vkorukanti pushed a commit that referenced this pull request Jul 24, 2023
## Description

(Cherry-pick of d9a5f9f to branch-2.4)

Reenable iceberg build that was previously disabled in #1720

## How was this patch tested?

N/A

Signed-off-by: Felipe Fujiy Pessoto <fepessot@microsoft.com>
Successfully merging this pull request may close these issues.

[Feature Request] Support Spark 3.4