[SPARK-52651][SQL] Handle User Defined Type in Nested ColumnVector #51349

yaooqinn · 2025-07-02T11:14:04Z

What changes were proposed in this pull request?

When reading a map column that contains a nested UDT, I encountered:

Caused by: java.lang.IllegalArgumentException: Spark type: ... doesn't match the type: ... in column vector
	at org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.<init>(ParquetColumnVector.java:80)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetColumnVector.<init>(ParquetColumnVector.java:139)

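This PR recursively replaces each UDT in the read type with its underlying `sqlType`, so that UDTs nested inside other types are handled as well as top-level ones.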
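To illustrate the idea of replacing UDTs at every nesting level, here is a minimal sketch. It uses a toy type hierarchy rather than Spark's real `DataType` classes; every name in it (`DType`, `UdtT`, `stripUdt`, and so on) is hypothetical. The sketch also descends into the type that replaces a UDT, which is a choice made here for illustration and is not something this thread states about Spark's helper.

```scala
// Toy stand-ins for a type hierarchy; NOT Spark's org.apache.spark.sql.types classes.
sealed trait DType
case object StringT extends DType
case object DoubleT extends DType
final case class ArrayT(element: DType) extends DType
final case class MapT(key: DType, value: DType) extends DType
final case class StructT(fields: List[(String, DType)]) extends DType
// A user-defined type wraps an underlying storage type (compare UserDefinedType.sqlType).
final case class UdtT(name: String, sqlType: DType) extends DType

object StripUdt {
  // Replace every UDT with its storage type, walking into arrays, maps, and structs.
  def stripUdt(t: DType): DType = t match {
    case UdtT(_, underlying) => stripUdt(underlying) // assumption: also normalize the replacement
    case ArrayT(e)           => ArrayT(stripUdt(e))
    case MapT(k, v)          => MapT(stripUdt(k), stripUdt(v))
    case StructT(fs)         => StructT(fs.map { case (n, ft) => (n, stripUdt(ft)) })
    case leaf                => leaf // primitive types are returned unchanged
  }

  def main(args: Array[String]): Unit = {
    // A hypothetical "point" UDT stored as an array of doubles, nested in a map value.
    val point = UdtT("point", ArrayT(DoubleT))
    val readType = MapT(StringT, StructT(List("loc" -> point)))
    println(stripUdt(readType))
  }
}
```

A match that only inspects the top-level type, as the removed lines in the diff below did, would leave the `point` UDT inside the map untouched, which is the kind of mismatch the exception above reports.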

Why are the changes needed?

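To add missing UDT support: UDTs nested inside other types were not handled when building column vectors, so reads like the one above failed.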

Does this PR introduce any user-facing change?

No

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

no

dongjoon-hyun · 2025-07-02T13:56:42Z

...c/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala

    val targetType = sparkReadType.map {
-      case udt: UserDefinedType[_] => udt.sqlType
-      case otherType => otherType
+      _.transformRecursively { case t: UserDefinedType[_] => t.sqlType }


Just a question. What about ORC file format?

Hi @dongjoon-hyun, good question. I have an umbrella ticket for UDT improvements. Let me check the other formats and readers, and follow up if necessary.

dongjoon-hyun · 2025-07-02T14:04:18Z

cc @peter-toth

yaooqinn · 2025-07-03T01:49:20Z

Merged to master, thank you @dongjoon-hyun @peter-toth

…ith null DataType

### What changes were proposed in this pull request?
Check whether the parameter DataType is null in ColumnVector constructor before transforming it

### Why are the changes needed?
A subclass of ColumnVector, e.g. Iceberg's [ConstantColumnVector](https://github.com/apache/iceberg/blob/main/spark/v4.0/spark/src/main/java/org/apache/iceberg/spark/data/vectorized/ConstantColumnVector.java#L41), could be created with null `DataType`. It throws NPE after #51349, which can be verified by failed tests in [integrating Spark 4.1.0-preview1 in Iceberg](apache/iceberg#14155)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #52423 from manuzhang/SPARK-53678.

Authored-by: manuzhang <owenzhang1990@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
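The null check described in this commit message can be sketched roughly as follows. This is a toy model under stated assumptions, not Spark's `ColumnVector` or Iceberg's `ConstantColumnVector`; all names (`ToyType`, `ToyColumnVector`, `ToyConstantVector`) are hypothetical.

```scala
// Toy stand-ins; NOT Spark's real DataType or ColumnVector classes.
sealed trait ToyType
case object ToyDouble extends ToyType
final case class ToyArray(element: ToyType) extends ToyType
final case class ToyUdt(sqlType: ToyType) extends ToyType

object ToyType {
  // Replace UDTs with their storage type, descending into arrays.
  def stripUdt(t: ToyType): ToyType = t match {
    case ToyUdt(underlying) => stripUdt(underlying)
    case ToyArray(e)        => ToyArray(stripUdt(e))
    case other              => other
  }
}

// Hypothetical base class: normalizes the declared type in its constructor.
abstract class ToyColumnVector(declared: ToyType) {
  // Option(null) is None, so the transform is skipped and the field stays null
  // instead of the constructor throwing a NullPointerException.
  val dataType: ToyType = Option(declared).map(ToyType.stripUdt).orNull
}

// A subclass that constructs the base class with a null type.
final class ToyConstantVector extends ToyColumnVector(null)
```

Without the guard, calling the transform directly on `declared` would fail for `ToyConstantVector`, since the pattern match is applied to a null reference.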

[SPARK-52651][SQL] Handle User Defined Type in Nested Column Vectors

f026381

github-actions bot added the SQL label Jul 2, 2025

dongjoon-hyun reviewed Jul 2, 2025

dongjoon-hyun approved these changes Jul 2, 2025

peter-toth approved these changes Jul 2, 2025

yaooqinn closed this in 0c6d7fd Jul 3, 2025

yaooqinn deleted the SPARK-52651 branch July 3, 2025 01:49

manuzhang mentioned this pull request Sep 23, 2025

[SPARK-53678][SQL] Fix NPE when subclass of ColumnVector is created with null DataType #52423

Closed
