[SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column #44598

cloud-fan · 2024-01-04T14:19:31Z

What changes were proposed in this pull request?

This PR fixes a long-standing bug: OrcColumnarBatchReader does not respect the memory mode when creating column vectors for missing columns.

Why are the changes needed?

To avoid violating the memory mode requirement.
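
As a rough illustration of the requirement, here is a toy model that does not use Spark's classes. It uses `ByteBuffer.allocate` and `ByteBuffer.allocateDirect` as stand-ins for on-heap and off-heap column storage. The names `MemoryModeSketch`, `allocateIntColumn`, and `buildBatch` are hypothetical and exist only for this sketch; the point is that the placeholder storage for a missing column is allocated with the same mode as every other column.

```java
import java.nio.ByteBuffer;

class MemoryModeSketch {
    enum MemoryMode { ON_HEAP, OFF_HEAP }

    // Hypothetical helper: backing storage for `capacity` ints in the requested mode.
    static ByteBuffer allocateIntColumn(int capacity, MemoryMode mode) {
        int bytes = capacity * Integer.BYTES;
        return mode == MemoryMode.OFF_HEAP
            ? ByteBuffer.allocateDirect(bytes)  // off-heap stand-in
            : ByteBuffer.allocate(bytes);       // on-heap stand-in
    }

    // Builds one buffer per requested column. missing[i] marks a column that is
    // absent from the file and therefore only holds placeholder values.
    static ByteBuffer[] buildBatch(boolean[] missing, int capacity, MemoryMode mode) {
        ByteBuffer[] columns = new ByteBuffer[missing.length];
        for (int i = 0; i < missing.length; i++) {
            // Both branches allocate with `mode`; a missing column must not
            // silently fall back to on-heap storage.
            ByteBuffer column = allocateIntColumn(capacity, mode);
            if (!missing[i]) {
                for (int row = 0; row < capacity; row++) {
                    column.putInt(row * Integer.BYTES, row); // fake file data
                }
            }
            columns[i] = column;
        }
        return columns;
    }

    public static void main(String[] args) {
        ByteBuffer[] batch = buildBatch(new boolean[] {false, true}, 4, MemoryMode.OFF_HEAP);
        for (ByteBuffer column : batch) {
            System.out.println("off-heap: " + column.isDirect());
        }
    }
}
```

In this model, a batch requested with `OFF_HEAP` contains only direct buffers, including the one for the missing column.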

Does this PR introduce any user-facing change?

No

How was this patch tested?

new test

Was this patch authored or co-authored using generative AI tooling?

no

cloud-fan · 2024-01-04T14:20:57Z

cc @viirya @yaooqinn

LuciferYang

LGTM

LuciferYang · 2024-01-04T16:10:17Z

...ore/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java

Is it possible to use ConstantColumnVector for the missingCol? This may be another story.

It seems simpler to use ConstantColumnVector here. I've updated the PR

LuciferYang · 2024-01-04T16:32:34Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala

[info] - SPARK-28156: self-join should not miss cached view *** FAILED *** (216 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1888.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1888.0 (TID 2251) (localhost executor driver): java.lang.NullPointerException: Cannot invoke "org.apache.spark.internal.config.ConfigReader.get(String)" because "reader" is null
[info] 	at org.apache.spark.internal.config.ConfigEntry.readString(ConfigEntry.scala:94)
[info] 	at org.apache.spark.internal.config.FallbackConfigEntry.readFrom(ConfigEntry.scala:270)
[info] 	at org.apache.spark.sql.internal.SQLConf.getConf(SQLConf.scala:5573)
[info] 	at org.apache.spark.sql.internal.SQLConf.offHeapColumnVectorEnabled(SQLConf.scala:5103)
[info] 	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:200)
[info] 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:218)

some tests failed

Ya, this part fails.

dongjoon-hyun

Thank you, @cloud-fan . Could you fix the test failures?

dongjoon-hyun

+1, LGTM. This looks much nicer. Thanks!

LuciferYang

+1, LGTM

LuciferYang · 2024-01-05T05:52:40Z

[info] - SPARK-39557 INSERT INTO statements with tables with array defaults *** FAILED *** (448 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 711.0 failed 1 times, most recent failure: Lost task 0.0 in stage 711.0 (TID 965) (localhost executor driver): java.lang.RuntimeException: DataType ARRAY<INT> is not supported in column vectorized reader.
[info] 	at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:96)
[info] 	at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:197)
[info] 	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:214)
[info] 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:218)

It seems that using ConstantColumnVector would require some refactoring of the ColumnVectorUtils.populate method.

cloud-fan · 2024-01-05T07:02:25Z

I changed it back. It seems non-trivial to make ConstantColumnVector support array/struct/map, and we need to create a real vector anyway to hold the data of an array, so we must pass the memory mode.
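
The reasoning above can be sketched with a toy model (again, not Spark's classes; `ConstantArraySketch`, `ConstantInt`, and `ConstantIntArray` are hypothetical names). A constant scalar can live in a single field, but a constant array, such as an array default value for a missing column, still needs element storage, and that storage has to be allocated in some memory mode.

```java
import java.nio.ByteBuffer;

class ConstantArraySketch {
    enum MemoryMode { ON_HEAP, OFF_HEAP }

    // Constant scalar: every row returns the same value, no buffer needed.
    static final class ConstantInt {
        private final int value;
        ConstantInt(int value) { this.value = value; }
        int getInt(int rowId) { return value; }
    }

    // Constant array: every row returns the same array, but the elements
    // need real backing storage, so the memory mode has to be passed in.
    static final class ConstantIntArray {
        private final ByteBuffer elements;
        private final int length;

        ConstantIntArray(int[] values, MemoryMode mode) {
            this.length = values.length;
            int bytes = values.length * Integer.BYTES;
            this.elements = mode == MemoryMode.OFF_HEAP
                ? ByteBuffer.allocateDirect(bytes)
                : ByteBuffer.allocate(bytes);
            for (int i = 0; i < values.length; i++) {
                elements.putInt(i * Integer.BYTES, values[i]);
            }
        }

        int[] getArray(int rowId) {
            int[] out = new int[length];
            for (int i = 0; i < length; i++) {
                out[i] = elements.getInt(i * Integer.BYTES);
            }
            return out;
        }

        boolean isOffHeap() { return elements.isDirect(); }
    }

    public static void main(String[] args) {
        ConstantIntArray defaults = new ConstantIntArray(new int[] {1, 2, 3}, MemoryMode.OFF_HEAP);
        System.out.println(java.util.Arrays.toString(defaults.getArray(42)));
        System.out.println("off-heap: " + defaults.isOffHeap());
    }
}
```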

cloud-fan · 2024-01-05T07:03:32Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala

      assert(supportBatch(sparkSession, resultSchema))
    }

+    val memoryMode = if (sqlConf.offHeapColumnVectorEnabled) {


I moved it outside of the lambda so that we don't hit an NPE by referencing sqlConf.
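
A minimal sketch of why moving the read out of the lambda helps. One plausible reading of the earlier stack trace (`"reader" is null`) is that the config object loses a transient helper when it is shipped to a task; that is an assumption, as the thread only states that referencing sqlConf inside the lambda hit an NPE. The sketch simulates shipping with plain Java serialization, and `ClosureCaptureSketch`, `Conf`, `roundTrip`, and the `"offheap"` key are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

class ClosureCaptureSketch {
    // Hypothetical stand-in for a config object with a transient lookup helper.
    static class Conf implements Serializable {
        private final HashMap<String, String> settings = new HashMap<>();
        // Not serialized: this field is null once the object has been shipped.
        private transient Map<String, String> reader = settings;

        Conf set(String key, String value) {
            settings.put(key, value);
            return this;
        }

        boolean offHeapEnabled() {
            return Boolean.parseBoolean(reader.getOrDefault("offheap", "false"));
        }
    }

    // Simulates sending an object to a task via Java serialization.
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Problem shape: the task receives the config object and reads from it.
    static boolean readInsideTask(Conf driverConf) {
        Conf shipped = roundTrip(driverConf);
        return shipped.offHeapEnabled(); // NullPointerException: reader is null
    }

    // Fix shape: read the flag first and ship only the resulting value.
    static boolean readBeforeTask(Conf driverConf) {
        boolean offHeap = driverConf.offHeapEnabled(); // evaluated up front
        return roundTrip(Boolean.valueOf(offHeap));
    }
}
```

With this setup, `readInsideTask` throws a `NullPointerException`, while `readBeforeTask` returns the configured value because only a plain `Boolean` crosses the serialization boundary.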

dongjoon-hyun

Could you re-trigger the failed pipelines?

…ode when creating column vectors for the missing column

This PR fixes a long-standing bug: `OrcColumnarBatchReader` does not respect the memory mode when creating column vectors for missing columns.

To not violate the memory mode requirement

No

new test

no

Closes #44598 from cloud-fan/orc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 0c1c5e9)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

dongjoon-hyun · 2024-01-06T20:48:35Z

Thank you, @cloud-fan and all.
Merged to master/3.5/3.4.

…ode when creating column vectors for the missing column

This PR fixes a long-standing bug: `OrcColumnarBatchReader` does not respect the memory mode when creating column vectors for missing columns.

To not violate the memory mode requirement

No

new test

no

Closes apache#44598 from cloud-fan/orc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 0c1c5e9)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 53683a8)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

github-actions bot added the SQL label Jan 4, 2024

cloud-fan changed the title ~~[SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating c…~~ [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column Jan 4, 2024

LuciferYang approved these changes Jan 4, 2024

LuciferYang reviewed Jan 4, 2024

viirya approved these changes Jan 4, 2024

dongjoon-hyun reviewed Jan 4, 2024

cloud-fan changed the title ~~[SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column~~ [SPARK-46598][SQL] OrcColumnarBatchReader should should use ConstantColumnVector for missing columns Jan 5, 2024

yaooqinn approved these changes Jan 5, 2024

dongjoon-hyun approved these changes Jan 5, 2024

LuciferYang approved these changes Jan 5, 2024

OrcColumnarBatchReader should respect the memory mode when creating c…

02c7a44

…olumn vectors for the missing column

cloud-fan force-pushed the orc branch from 84664c5 to 02c7a44 Compare January 5, 2024 07:00

cloud-fan changed the title ~~[SPARK-46598][SQL] OrcColumnarBatchReader should should use ConstantColumnVector for missing columns~~ [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column Jan 5, 2024

cloud-fan commented Jan 5, 2024

dongjoon-hyun approved these changes Jan 5, 2024

dongjoon-hyun closed this in 0c1c5e9 Jan 6, 2024
