Skip to content

Conversation

@cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Jan 4, 2024

What changes were proposed in this pull request?

This PR fixes a long-standing bug that OrcColumnarBatchReader does not respect the memory mode when creating column vectors for missing columbs. This PR fixes it.

Why are the changes needed?

To not violate the memory mode requirement

Does this PR introduce any user-facing change?

No

How was this patch tested?

new test

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions github-actions bot added the SQL label Jan 4, 2024
@cloud-fan cloud-fan changed the title [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating c… [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column Jan 4, 2024
@cloud-fan
Copy link
Contributor Author

cc @viirya @yaooqinn

Copy link
Contributor

@LuciferYang LuciferYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@LuciferYang LuciferYang Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to use ConstantColumnVector for the missingCol? This maybe another story.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems simpler to use ConstantColumnVector here. I've updated the PR

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[info] - SPARK-28156: self-join should not miss cached view *** FAILED *** (216 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1888.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1888.0 (TID 2251) (localhost executor driver): java.lang.NullPointerException: Cannot invoke "org.apache.spark.internal.config.ConfigReader.get(String)" because "reader" is null
[info] 	at org.apache.spark.internal.config.ConfigEntry.readString(ConfigEntry.scala:94)
[info] 	at org.apache.spark.internal.config.FallbackConfigEntry.readFrom(ConfigEntry.scala:270)
[info] 	at org.apache.spark.sql.internal.SQLConf.getConf(SQLConf.scala:5573)
[info] 	at org.apache.spark.sql.internal.SQLConf.offHeapColumnVectorEnabled(SQLConf.scala:5103)
[info] 	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:200)
[info] 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:218)

some tests failed

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya, this part fails.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @cloud-fan . Could you fix the test failures?

@cloud-fan cloud-fan changed the title [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column [SPARK-46598][SQL] OrcColumnarBatchReader should should use ConstantColumnVector for missing columns Jan 5, 2024
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. This looks much nicer. Thanks!

Copy link
Contributor

@LuciferYang LuciferYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM

@LuciferYang
Copy link
Contributor

[info] - SPARK-39557 INSERT INTO statements with tables with array defaults *** FAILED *** (448 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 711.0 failed 1 times, most recent failure: Lost task 0.0 in stage 711.0 (TID 965) (localhost executor driver): java.lang.RuntimeException: DataType ARRAY<INT> is not supported in column vectorized reader.
[info] 	at org.apache.spark.sql.execution.vectorized.ColumnVectorUtils.populate(ColumnVectorUtils.java:96)
[info] 	at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:197)
[info] 	at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$2(OrcFileFormat.scala:214)
[info] 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:218)

Seems if using ConstantColumnVector, some refactoring is needed for the ColumnVectorUtils.populate method.

@cloud-fan cloud-fan changed the title [SPARK-46598][SQL] OrcColumnarBatchReader should should use ConstantColumnVector for missing columns [SPARK-46598][SQL] OrcColumnarBatchReader should respect the memory mode when creating column vectors for the missing column Jan 5, 2024
@cloud-fan
Copy link
Contributor Author

I changed it back. It seems non-trivial to make ConstantColumnVector to support array/struct/map, and we need to create a real vector anyway to keep the data of an array, so we must pass the memory mode.

assert(supportBatch(sparkSession, resultSchema))
}

val memoryMode = if (sqlConf.offHeapColumnVectorEnabled) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved it outside of the lambda, so that we don't hit NPE by referencing sqlConf.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you re-trigger the failed pipelines?

dongjoon-hyun pushed a commit that referenced this pull request Jan 6, 2024
…ode when creating column vectors for the missing column

This PR fixes a long-standing bug that `OrcColumnarBatchReader` does not respect the memory mode when creating column vectors for missing columbs. This PR fixes it.

To not violate the memory mode requirement

No

new test

no

Closes #44598 from cloud-fan/orc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 0c1c5e9)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
dongjoon-hyun pushed a commit that referenced this pull request Jan 6, 2024
…ode when creating column vectors for the missing column

This PR fixes a long-standing bug that `OrcColumnarBatchReader` does not respect the memory mode when creating column vectors for missing columbs. This PR fixes it.

To not violate the memory mode requirement

No

new test

no

Closes #44598 from cloud-fan/orc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 0c1c5e9)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
@dongjoon-hyun
Copy link
Member

Thank you, @cloud-fan and all.
Merged to master/3.5/3.4.

szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Feb 7, 2024
…ode when creating column vectors for the missing column

This PR fixes a long-standing bug that `OrcColumnarBatchReader` does not respect the memory mode when creating column vectors for missing columbs. This PR fixes it.

To not violate the memory mode requirement

No

new test

no

Closes apache#44598 from cloud-fan/orc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 0c1c5e9)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 53683a8)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants