
Conversation

@sadikovi
Contributor

@sadikovi sadikovi commented Oct 17, 2022

What changes were proposed in this pull request?

This PR is a follow-up for #31909. In the original PR, spark.hadoopRDD.ignoreEmptySplits was enabled because it appeared to have no side effects; however, the change breaks SymlinkTextInputFormat, so any table that uses this input format returns empty results.

This is due to a combination of problems:

  1. Incorrect implementation of SymlinkTextInputSplit. The input format does not set the start and length fields from the target split. SymlinkTextInputSplit is an abstraction over FileSplit, and all downstream systems treat it as such, so those fields should be extracted from the target split and passed through.
  2. With spark.hadoopRDD.ignoreEmptySplits enabled, HadoopRDD filters out all empty splits, which does not work for SymlinkTextInputFormat because of 1: since no length (or start) is set, those splits are considered empty and are removed from the final list of partitions even though the target splits themselves are non-empty.

Technically, this needs to be addressed in Hive but I figured it would be much faster to fix this in Spark.

The PR introduces DelegateSymlinkTextInputFormat, which wraps SymlinkTextInputFormat and provides splits with the correct start and length attributes. This is controlled by spark.sql.hive.useDelegateForSymlinkTextInputFormat, which is enabled by default. When disabled, the user-provided SymlinkTextInputFormat is used.
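
To illustrate the wrapping idea, a minimal sketch follows (the class name, constructor shape and comments below are assumptions of mine, not the actual patch; only the Hadoop FileSplit API is real). The delegate split forwards the target split's path, start, length and locations, so FileSplit-based consumers, including the empty-split filter in HadoopRDD, see the real extent of the data behind the symlink.

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.mapred.FileSplit

    // Sketch only: wrap the split that the symlink manifest points to and expose
    // its real path/start/length instead of the zero length that
    // SymlinkTextInputSplit currently reports. A real implementation also needs a
    // no-arg constructor plus write/readFields to carry the symlink path through
    // serialization.
    class DelegateSymlinkSplitSketch(val symlinkPath: Path, target: FileSplit)
      extends FileSplit(
        target.getPath,      // path of the target data file
        target.getStart,     // start offset copied from the target split
        target.getLength,    // length copied from the target split (now non-zero)
        target.getLocations) // preserve locality hints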

Why are the changes needed?

Fixes a correctness issue when using SymlinkTextInputSplit in Spark.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

I added a unit test that reproduces the issue and verified that it passes with the fix.
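
For reference, a minimal sketch of that kind of reproduction (assuming the usual Spark Hive test helpers withTempDir, withTable, sql and checkAnswer; the file and table names here are illustrative, the actual test lives in HiveSerDeReadWriteSuite):

    import java.io.File
    import java.nio.charset.StandardCharsets
    import java.nio.file.Files

    import org.apache.spark.sql.Row

    // Sketch: write a data file, point a symlink manifest at it, and scan a Hive
    // table declared with SymlinkTextInputFormat. Without the fix (and with
    // spark.hadoopRDD.ignoreEmptySplits enabled) the scan returns zero rows.
    withTempDir { dir =>
      val dataFile = new File(dir, "data/part-00000")
      dataFile.getParentFile.mkdirs()
      Files.write(dataFile.toPath,
        (0 until 10).mkString("\n").getBytes(StandardCharsets.UTF_8))

      // The "symlink" manifest file simply lists the paths of the target files.
      val symlinkDir = new File(dir, "symlink")
      symlinkDir.mkdirs()
      Files.write(new File(symlinkDir, "symlink.txt").toPath,
        dataFile.getAbsolutePath.getBytes(StandardCharsets.UTF_8))

      withTable("t") {
        sql(
          s"""CREATE TABLE t (id BIGINT)
             |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
             |STORED AS
             |  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
             |  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
             |LOCATION '${symlinkDir.toURI}'""".stripMargin)
        checkAnswer(sql("SELECT id FROM t ORDER BY id"), (0 until 10).map(i => Row(i.toLong)))
      }
    }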

@sadikovi
Contributor Author

sadikovi commented Oct 17, 2022

@dongjoon-hyun Would you be able to review this PR?

I have read the comments on the original PR and was ambivalent about disabling the flag again vs fixing it in the code. I suppose SymlinkTextInputFormat is one of the cases where this change is not fully safe as it silently causes incorrect results instead of throwing an error.

I also considered an alternative fix of substituting SymlinkTextInputFormat with a shim input format in HiveTableScanExec that correctly sets those fields. This would be transparent to users: they could still specify the original input format in a CREATE TABLE statement.

Another option is fixing it in Hive, but I don't know how long that would take.

Maybe it would be better to implement the shim input format in Spark instead of disabling the flag, so let me know. I am open to alternative approaches.

@dongjoon-hyun
Member

dongjoon-hyun commented Oct 17, 2022

Thank you for pinging me, @sadikovi . Will take a look at SymlinkTextInputSplit and the related part in a few days.

@dongjoon-hyun
Member

cc @sunchao too since he is an Apache Hive PMC member.

@sadikovi
Contributor Author

Thank you. I also thought about fixing it in Hive, but I don't know how long it would take to fix it there and make the fix available in Spark, so I am open to alternative approaches; let me know 👍.

@AmplabJenkins

Can one of the admins verify this patch?

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1 for the change from my side. Please add a migration guide like we did before, @sadikovi .

cc @sunchao once more.

@sadikovi
Contributor Author

Thank you for the review. Sure, I can add the migration guide notes.

@sadikovi
Contributor Author

@sunchao Could you review and comment on the alternative solutions?

  • Patching Spark code to handle SymlinkTextInputFormat separately.
  • Fixing SymlinkTextInputFormat in Hive and updating Hive version in Spark.

@mridulm
Contributor

mridulm commented Oct 19, 2022

Given that this went into 3.2 already and was a change in behavior, do we want to revert it?
Wouldn't it be better to document this and have the issue fixed upstream in Hive instead?

@sunchao
Member

sunchao commented Oct 19, 2022

Is it possible to treat SymlinkTextInputFormat specially in NewHadoopRDD and HadoopRDD, where spark.hadoopRDD.ignoreEmptySplits is used? For example, something like:

      val allRowSplits = inputFormat.getSplits(new JobContextImpl(_conf, jobId)).asScala
      val rawSplits = if (ignoreEmptySplits &&
          inputFormat.getClass.getName != "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat") {
        allRowSplits.filter(_.getLength > 0)
      } else {
        allRowSplits
      }

I think a fix in Spark itself will be a good short-term solution. A fix in Hive appears to be more involved: for one, it's hard to give a reasonable start and length for SymlinkTextInputSplit since it's just a link to a list of paths, and I'm not sure whether changing the class would affect other places within Hive (as this class has been there for a very long time).

@mridulm
Contributor

mridulm commented Oct 19, 2022

If it is very specific to this case, the approach you detailed sounds fine as a short-term measure, @sunchao.
But we should really get rid of it as soon as possible.

Thoughts @dongjoon-hyun ?

@dongjoon-hyun
Member

Ya, right. +1 for @sunchao 's suggestion, @mridulm .

@sadikovi
Contributor Author

sadikovi commented Oct 20, 2022

@sunchao I think it is possible to give start and length to SymlinkTextInputSplit - it is a one-to-one mapping to the parent file split.

Based on this, I decided to introduce DelegateSymlinkTextInputFormat; I thought it would be a tidier fix than hardcoding the class name in HadoopRDD, and it only exists in the hive module. Since we apply it during the table scan, it is not persisted in the metastore, so IMHO we should be all good here.
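
To make the scan-time substitution concrete, a rough sketch of the gating (the helper and class names below are hypothetical, not the actual code): the metastore keeps the user-specified input format, and only the class used while planning the Hive table scan is swapped.

    // Hypothetical sketch: substitution happens only while planning the scan,
    // gated by the config; nothing is written back to the metastore.
    def effectiveInputFormatClassName(declaredClassName: String, useDelegate: Boolean): String = {
      if (useDelegate &&
          declaredClassName == "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat") {
        // Illustrative fully-qualified name for the delegate class.
        "org.apache.spark.sql.hive.execution.DelegateSymlinkTextInputFormat"
      } else {
        declaredClassName
      }
    }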

@sunchao @mridulm @dongjoon-hyun Could you review the PR again?

@sadikovi sadikovi changed the title [SPARK-40815][SQL] Disable "spark.hadoopRDD.ignoreEmptySplits" in order to fix the correctness issue when using Hive SymlinkTextInputFormat [SPARK-40815][SQL] Introduce DelegateSymlinkTextInputFormat to handle empty splits when "spark.hadoopRDD.ignoreEmptySplits" is enabled in order to fix the correctness issue Hive SymlinkTextInputFormat Oct 20, 2022
Member

Yay!

@sadikovi sadikovi force-pushed the fix-symlink-input-format branch from eeb641c to 23c9f0f on October 20, 2022 06:17
.internal()
.doc("When true, SymlinkTextInputFormat is replaced with a similar delegate class during " +
"table scan in order to fix the issue of empty splits")
.version("3.4.0")
Member

Oh, got it. Initially, I thought this PR aimed to be backported in order to fix the correctness issues of SymlinkTextInputFormat.

split = new SymlinkTextInputSplit();
}

public DelegateSymlinkTextInputSplit(Path symlinkPath, SymlinkTextInputSplit split) throws IOException {
Member

nit: do we need symlinkPath? it can be replaced by split.getPath().

Contributor Author

@sadikovi sadikovi requested review from dongjoon-hyun and sunchao and removed request for sunchao October 27, 2022 21:35
@sadikovi
Contributor Author

@sunchao @dongjoon-hyun Could you take another look? Thanks. I have addressed your comments.

@sunchao
Member

sunchao commented Oct 27, 2022

@sadikovi sorry for the delay, will take a look.

Member

@sunchao sunchao left a comment

LGTM

* Delegate for SymlinkTextInputFormat, created to address SPARK-40815.
* Fixes an issue where SymlinkTextInputFormat returns empty splits which could result in
* the correctness issue when "spark.hadoopRDD.ignoreEmptySplits" is enabled.
*
Member

nit: add <p> for better formatting

Contributor Author

Updated.

* In this class, we update the split start and length to match the target file input thus fixing
* the issue.
*/
@SuppressWarnings("deprecation")
Member

nit: unnecessary

Contributor Author

Removed.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-40815][SQL] Introduce DelegateSymlinkTextInputFormat to handle empty splits when "spark.hadoopRDD.ignoreEmptySplits" is enabled in order to fix the correctness issue Hive SymlinkTextInputFormat [SPARK-40815][SQL] Add DelegateSymlinkTextInputFormat to workaround SymlinkTextInputSplit bug Oct 31, 2022
@dongjoon-hyun
Member

I revised the PR title to simplify it, @sadikovi. You can change it back if you want.

@sadikovi
Contributor Author

Thanks for updating the PR title 👍.

@dongjoon-hyun
Member

Thank you so much, @sadikovi , @sunchao , @mridulm .
Merged to master for Apache Spark 3.4.0.

@sadikovi
Contributor Author

Thank you @dongjoon-hyun for merging 👍.

)
}

test("SPARK-40815: Read SymlinkTextInputFormat") {
Member

This test fails in JDK 11 and 17 😢
https://github.com/apache/spark/actions/runs/3379157338/jobs/5610899432
https://github.com/apache/spark/actions/runs/3381461270/jobs/5615405153

[info] - SPARK-40815: Read SymlinkTextInputFormat *** FAILED *** (587 milliseconds)
[info]   Results do not match for query:
[info]   Timezone: sun.util.calendar.ZoneInfo[id="America/Los_Angeles",offset=-28800000,dstSavings=3600000,useDaylight=true,transitions=185,lastRule=java.util.SimpleTimeZone[id=America/Los_Angeles,offset=-28800000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=3,startMonth=2,startDay=8,startDayOfWeek=1,startTime=7200000,startTimeMode=0,endMode=3,endMonth=10,endDay=1,endDayOfWeek=1,endTime=7200000,endTimeMode=0]]
[info]   Timezone Env: 
[info]   
[info]   == Parsed Logical Plan ==
[info]   'Sort ['id ASC NULLS FIRST], true
[info]   +- 'Project ['id]
[info]      +- 'UnresolvedRelation [t], [], false
[info]   
[info]   == Analyzed Logical Plan ==
[info]   id: bigint
[info]   Sort [id#175602L ASC NULLS FIRST], true
[info]   +- Project [id#175602L]
[info]      +- SubqueryAlias spark_catalog.default.t
[info]         +- HiveTableRelation [`spark_catalog`.`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#175602L], Partition Cols: []]
[info]   
[info]   == Optimized Logical Plan ==
[info]   Sort [id#175602L ASC NULLS FIRST], true
[info]   +- HiveTableRelation [`spark_catalog`.`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#175602L], Partition Cols: []]
[info]   
[info]   == Physical Plan ==
[info]   AdaptiveSparkPlan isFinalPlan=true
[info]   +- == Final Plan ==
[info]      LocalTableScan <empty>, [id#175602L]
[info]   +- == Initial Plan ==
[info]      Sort [id#175602L ASC NULLS FIRST], true, 0
[info]      +- Exchange rangepartitioning(id#175602L ASC NULLS FIRST, 5), ENSURE_REQUIREMENTS, [plan_id=176717]
[info]         +- Scan hive spark_catalog.default.t [id#175602L], HiveTableRelation [`spark_catalog`.`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#175602L], Partition Cols: []]
[info]   
[info]   == Results ==
[info]   
[info]   == Results ==
[info]   !== Correct Answer - 10 ==   == Spark Answer - 0 ==
[info]    struct<>                    struct<>
[info]   ![0]                         
[info]   ![1]                         
[info]   ![2]                         
[info]   ![3]                         
[info]   ![4]                         
[info]   ![5]                         
[info]   ![6]                         
[info]   ![7]                         
[info]   ![8]                         
[info]   ![9] (QueryTest.scala:243)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.apache.spark.sql.QueryTest$.newAssertionFailedException(QueryTest.scala:233)
[info]   at org.scalatest.Assertions.fail(Assertions.scala:933)
[info]   at org.scalatest.Assertions.fail$(Assertions.scala:929)
[info]   at org.apache.spark.sql.QueryTest$.fail(QueryTest.scala:233)
[info]   at org.apache.spark.sql.QueryTest$.checkAnswer(QueryTest.scala:243)
[info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:150)
[info]   at org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite.$anonfun$new$9(HiveSerDeReadWriteSuite.scala:293)
[info]   at org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite.$anonfun$new$9$adapted(HiveSerDeReadWriteSuite.scala:266)
[info]   at org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1(SQLTestUtils.scala:79)
[info]   at org.apache.spark.sql.test.SQLTestUtils.$anonfun$withTempDir$1$adapted(SQLTestUtils.scala:78)
[info]   at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:225)
[info]   at org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite.org$apache$spark$sql$test$SQLTestUtils$$super$withTempDir(HiveSerDeReadWriteSuite.scala:37)
[info]   at org.apache.spark.sql.test.SQLTestUtils.withTempDir(SQLTestUtils.scala:78)
[info]   at org.apache.spark.sql.test.SQLTestUtils.withTempDir$(SQLTestUtils.scala:77)
[info]   at org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite.withTempDir(HiveSerDeReadWriteSuite.scala:37)
[info]   at org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite.$anonfun$new$8(HiveSerDeReadWriteSuite.scala:266)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1491)
[info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withTable(SQLTestUtils.scala:306)
[info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withTable$(SQLTestUtils.scala:304)
[info]   at org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite.withTable(HiveSerDeReadWriteSuite.scala:37)
[info]   at org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite.$anonfun$new$7(HiveSerDeReadWriteSuite.scala:266)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:207)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:66)

@sadikovi
Contributor Author

sadikovi commented Nov 3, 2022

@HyukjinKwon Strange, there is nothing JDK-specific in the test. Can you share how I can reproduce the issue locally with JDK 11 or JDK 17? I can open a follow-up PR to fix the test tomorrow.

@HyukjinKwon
Member

HyukjinKwon commented Nov 3, 2022

I haven't actually tested it locally; it's a GitHub Actions build. Just installing JDK 11 and setting JAVA_HOME to it should work. I will monitor the scheduled jobs for a couple of days to see whether this fails consistently or is flaky.

My gut feeling is that this comes down to Hive behaviour differing by JDK version.

@HyukjinKwon
Member

Actually, it does look like it fails consistently (I retriggered: https://github.com/apache/spark/actions/runs/3379157338/jobs/5620261594). Let me see if it fails locally.

@HyukjinKwon
Member

Yes, it's locally reproduced with JDK 11 (just set JAVA_HOME):

./build/sbt clean "testOnly org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite"

@dongjoon-hyun
Member

Thank you for spotting that, @HyukjinKwon .

@dongjoon-hyun
Member

I also confirmed it fails on Java 11 although the symlink is read by Spark.

10:56:59.862 Executor task launch worker for task 0.0 in stage 575.0 (TID 805) INFO HadoopRDD: Input split: file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/hive_execution_test_group/spark-8291c0da-bb82-4337-b3e2-14ae18722e60/symlink/symlink.txt:0+6
10:56:59.865 Executor task launch worker for task 0.0 in stage 575.0 (TID 805) INFO Executor: Finished task 0.0 in stage 575.0 (TID 805). 1421 bytes result sent to driver
10:56:59.865 dispatcher-event-loop-1 INFO TaskSetManager: Starting task 1.0 in stage 575.0 (TID 806) (localhost, executor driver, partition 1, PROCESS_LOCAL, 7833 bytes)
10:56:59.865 Executor task launch worker for task 1.0 in stage 575.0 (TID 806) INFO Executor: Running task 1.0 in stage 575.0 (TID 806)
10:56:59.865 task-result-getter-1 INFO TaskSetManager: Finished task 0.0 in stage 575.0 (TID 805) in 8 ms on localhost (executor driver) (1/4)
10:56:59.866 Executor task launch worker for task 1.0 in stage 575.0 (TID 806) INFO HadoopRDD: Input split: file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/hive_execution_test_group/spark-8291c0da-bb82-4337-b3e2-14ae18722e60/symlink/symlink.txt:0+4
10:56:59.867 Executor task launch worker for task 1.0 in stage 575.0 (TID 806) INFO Executor: Finished task 1.0 in stage 575.0 (TID 806). 1378 bytes result sent to driver
10:56:59.868 dispatcher-event-loop-1 INFO TaskSetManager: Starting task 2.0 in stage 575.0 (TID 807) (localhost, executor driver, partition 2, PROCESS_LOCAL, 7833 bytes)
10:56:59.868 task-result-getter-2 INFO TaskSetManager: Finished task 1.0 in stage 575.0 (TID 806) in 3 ms on localhost (executor driver) (2/4)
10:56:59.868 Executor task launch worker for task 2.0 in stage 575.0 (TID 807) INFO Executor: Running task 2.0 in stage 575.0 (TID 807)
10:56:59.869 Executor task launch worker for task 2.0 in stage 575.0 (TID 807) INFO HadoopRDD: Input split: file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/hive_execution_test_group/spark-8291c0da-bb82-4337-b3e2-14ae18722e60/symlink/symlink.txt:0+4
10:56:59.870 Executor task launch worker for task 2.0 in stage 575.0 (TID 807) INFO Executor: Finished task 2.0 in stage 575.0 (TID 807). 1378 bytes result sent to driver
10:56:59.870 dispatcher-event-loop-1 INFO TaskSetManager: Starting task 3.0 in stage 575.0 (TID 808) (localhost, executor driver, partition 3, PROCESS_LOCAL, 7833 bytes)
10:56:59.870 task-result-getter-3 INFO TaskSetManager: Finished task 2.0 in stage 575.0 (TID 807) in 2 ms on localhost (executor driver) (3/4)
10:56:59.870 Executor task launch worker for task 3.0 in stage 575.0 (TID 808) INFO Executor: Running task 3.0 in stage 575.0 (TID 808)
10:56:59.871 Executor task launch worker for task 3.0 in stage 575.0 (TID 808) INFO HadoopRDD: Input split: file:/Users/dongjoon/APACHE/spark-merge/sql/hive/target/tmp/hive_execution_test_group/spark-8291c0da-bb82-4337-b3e2-14ae18722e60/symlink/symlink.txt:0+6
10:56:59.872 Executor task launch worker for task 3.0 in stage 575.0 (TID 808) INFO Executor: Finished task 3.0 in stage 575.0 (TID 808). 1378 bytes result sent to driver
10:56:59.873 task-result-getter-0 INFO TaskSetManager: Finished task 3.0 in stage 575.0 (TID 808) in 3 ms on localhost (executor driver) (4/4)

@sadikovi
Contributor Author

sadikovi commented Nov 3, 2022

Let me open a PR to disable the test and I will open a fix as a follow-up.

HyukjinKwon pushed a commit that referenced this pull request Nov 4, 2022
…tests for JDK 9+

### What changes were proposed in this pull request?

This PR is a follow-up for #38277.

This change is required due to test failures in JDK 11 and JDK 17. The patch disables the unit test for JDK 9+.
This is a temporary measure while I am debugging and working on the fix for higher versions of JDK.

### Why are the changes needed?

Fixes the test failure in JDK 11.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A.

Closes #38504 from sadikovi/fix-symlink-test.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
@sadikovi
Contributor Author

sadikovi commented Nov 4, 2022

@dongjoon-hyun I am trying to repro with JDK 11 (11.0.16) and the test passes just fine. Did you have to do any special setup to trigger the problem?

@sadikovi
Contributor Author

sadikovi commented Nov 4, 2022

Actually, I can repro by running the entire test suite; it does not reproduce when running the test individually.

@sadikovi
Contributor Author

sadikovi commented Nov 4, 2022

The issue is reproducible even with the original SymlinkTextInputFormat, so it is not related to the delegate class. Also, the test runs just fine when "SPARK-34512: Disable validate default values when parsing Avro schemas" is not run before it or is swapped to run after my test. I will continue to debug, but it does not appear to be related to the patch itself.
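
For anyone narrowing this down: since the failure depends on which tests ran earlier in the same JVM, the whole suite and a single test behave differently. Assuming the standard sbt pass-through of ScalaTest arguments, the single test can be selected by name substring:

./build/sbt "testOnly org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite -- -z SPARK-40815"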

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
… `SymlinkTextInputSplit` bug

Closes apache#38277 from sadikovi/fix-symlink-input-format.

Authored-by: Ivan Sadikov <ivan.sadikov@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…tests for JDK 9+