
Conversation

@yaooqinn (Member) commented Apr 11, 2024

What changes were proposed in this pull request?

SPARK-47754 introduced a new developer API called getArrayDimension.

This PR expands the scope of getArrayDimension and renames it to updateExtraColumnMeta. As the names suggest, getArrayDimension handles only one piece of column metadata, the dimension of an array, while updateExtraColumnMeta can retrieve any kind of metadata based on the given ResultSetMetaData and Connection. This is much more general and useful, and it reduces the number of potential new Developer APIs of the same shape. Also, the current parameters of getArrayDimension might not be sufficient for other dialects.
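As a rough sketch of what the broadened hook could look like on a dialect, based only on the description above (the parameter names and their order are assumptions, not copied from the diff):

```scala
import java.sql.{Connection, ResultSetMetaData}

import org.apache.spark.sql.types.MetadataBuilder

// Sketch only: a dialect-level hook that receives the connection, the
// ResultSetMetaData and a 1-based column index, and may record any
// dialect-specific properties in the MetadataBuilder.
object DialectSketch {
  def updateExtraColumnMeta(
      conn: Connection,
      rsmd: ResultSetMetaData,
      columnIdx: Int,
      metadata: MetadataBuilder): Unit = {
    // no-op by default; a dialect can override this to add keys such as "arrayDimension"
  }
}
```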

Why are the changes needed?

Refactoring an unreleased Developer API to make it more sustainable.

Does this PR introduce any user-facing change?

No, this only changes an unreleased API.

How was this patch tested?

Existing unit tests.

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions bot added the SQL label Apr 11, 2024
try {
  Using.resource(conn.createStatement()) { stmt =>
    Using.resource(stmt.executeQuery(query)) { rs =>
      // Record the array dimension returned by the dialect-specific query.
      if (rs.next()) metadata.putLong("arrayDimension", rs.getLong(1))
Member:
Should metadata properties like "arrayDimension" be pre-defined, like the table properties defined in org.apache.spark.sql.connector.catalog.TableCatalog?

@yaooqinn (Author) replied Apr 11, 2024:

It's a good point. We do have several required and reserved keys predefined. Things here might be slightly different, as developers are free to create new props and delete existing ones whenever they want to, since none of these keys are strictly required by the following steps. Given that, I'm OK to leave it as-is and also OK to define them somewhere to improve the code readability.
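For illustration only, predefining the key could look roughly like this (the object and constant names below are hypothetical, not part of this PR):

```scala
// Hypothetical constants object for JDBC column metadata keys, so call sites
// reference a named constant rather than the raw string "arrayDimension".
object JdbcColumnMetaKeys {
  val ARRAY_DIMENSION: String = "arrayDimension"
}

// Usage at the call site shown above (sketch):
//   metadata.putLong(JdbcColumnMetaKeys.ARRAY_DIMENSION, rs.getLong(1))
```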

Member:
Yeah, let's do that later, before the Apache Spark 4.0.0 release.

@dongjoon-hyun left a review comment:
+1, LGTM.

@yaooqinn yaooqinn closed this in ffc378c Apr 12, 2024
yaooqinn pushed a commit that referenced this pull request Nov 14, 2024
…ional arrays

### What changes were proposed in this pull request?
There is a bug introduced by PR #46006, which fixed the behaviour of the PostgreSQL connector for multidimensional arrays (previously, all arrays were mapped to 1D arrays).

That PR broke one case. The following scenario no longer works:

- A user has a table t1 on Postgres and runs a CTAS command to create a table t2 with the same data. PR #46006 resolves the dimensionality of a column by reading the attndims column of the pg_attribute metadata table.
- That query returns the correct dimensionality for table t1, but for table t2, created via CTAS, it always returns 0. This leads to every array being mapped to a 0-D array, which is the element type itself (for example, int). This is a bug on the Postgres side.
- As a solution, we can call the array_ndims function on the given column, which returns the column's dimension. It works for CTAS-created tables too. We can evaluate the function on the first row of the table.

This change issues an additional query against the Postgres table to find the dimension of the array column, instead of querying the metadata table as before. It might be more expensive, but the query uses LIMIT 1.
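A minimal sketch of this approach is shown below; the table and column names, as well as the helper method name, are placeholders for illustration, not the connector's actual code:

```scala
import java.sql.Connection
import scala.util.Using

import org.apache.spark.sql.types.MetadataBuilder

object ArrayDimensionSketch {
  // Sketch: determine the array column's dimensionality from the data itself
  // via array_ndims, instead of reading pg_attribute.attndims.
  // "my_table" and "arr_col" are placeholder identifiers.
  def fetchArrayDimension(conn: Connection, metadata: MetadataBuilder): Unit = {
    val query = """SELECT array_ndims("arr_col") FROM "my_table" LIMIT 1"""
    Using.resource(conn.createStatement()) { stmt =>
      Using.resource(stmt.executeQuery(query)) { rs =>
        if (rs.next()) metadata.putLong("arrayDimension", rs.getLong(1))
      }
    }
  }
}
```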

Also, there is one caveat. In PG, the declared dimensionality of an array column is not enforced (https://www.postgresql.org/docs/current/arrays.html#ARRAYS-DECLARATION). Therefore, if a table has a 2D array column, it is perfectly fine on the PG side to insert a 1D or a 3D array into that column. This makes the read path in Spark problematic: we get the dimension of the array, for example 1 if the first record is a 1D array, and then fail when we later try to read a 3D array; and vice versa, getting dimension 3 and later reading a 1D array. The change I propose is fine with this scenario, since it already doesn't work in Spark. What my change implies is that the user can get a different error message depending on the dimensionality of the first record in the table (namely, for one table they may get an error saying the expected type is ARRAY<ARRAY<INT>>, and for another that it is ARRAY<INT>).

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
No. It just fixes one case that doesn't work currently.

### How was this patch tested?
New tests.

### Was this patch authored or co-authored using generative AI tooling?

Closes #48625 from PetarVasiljevic-DB/fix_postgres_multidimensional_arrays.

Authored-by: Petar Vasiljevic <petar.vasiljevic@databricks.com>
Signed-off-by: Kent Yao <yao@apache.org>