
Conversation

@yaooqinn (Member) commented Apr 11, 2024

What changes were proposed in this pull request?

SPARK-47754 introduced a new developer API called getArrayDimension.

This PR expands the scope of getArrayDimension and renames it to updateExtraColumnMeta. As the names suggest, getArrayDimension handles only one piece of column metadata, the dimension of an array, while updateExtraColumnMeta can retrieve any kind of metadata based on the given ResultSetMetaData and Connection. This is much more general and useful, and it reduces the number of potential new Developer APIs of the same shape. Also, the current parameters of getArrayDimension might not be sufficient for other dialects.
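As a rough sketch of what the broadened hook could look like on a dialect, based only on the description above (the parameter names and their order are assumptions, not copied from the diff):

```scala
import java.sql.{Connection, ResultSetMetaData}

import org.apache.spark.sql.types.MetadataBuilder

// Sketch only: a dialect-level hook that receives the connection, the
// ResultSetMetaData and a 1-based column index, and may record any
// dialect-specific properties in the MetadataBuilder.
object DialectSketch {
  def updateExtraColumnMeta(
      conn: Connection,
      rsmd: ResultSetMetaData,
      columnIdx: Int,
      metadata: MetadataBuilder): Unit = {
    // no-op by default; a dialect can override this to add keys such as "arrayDimension"
  }
}
```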

Why are the changes needed?

Refactoring an unreleased Developer API to make it more sustainable.

Does this PR introduce any user-facing change?

No, this only changes an unreleased API.

How was this patch tested?

Existing unit tests.

Was this patch authored or co-authored using generative AI tooling?

no

@github-actions bot added the SQL label Apr 11, 2024
try {
  Using.resource(conn.createStatement()) { stmt =>
    Using.resource(stmt.executeQuery(query)) { rs =>
      // Record the array dimension returned by the dialect-specific query.
      if (rs.next()) metadata.putLong("arrayDimension", rs.getLong(1))
Member:
Should metadata properties like "arrayDimension" be pre-defined, like the table properties defined in org.apache.spark.sql.connector.catalog.TableCatalog?

@yaooqinn (Author) replied Apr 11, 2024:

It's a good point. We do have several required and reserved keys predefined. Things here might be slightly different, as developers are free to create new props and delete existing ones whenever they want to, since none of these keys are strictly required by the following steps. Given that, I'm OK to leave it as-is and also OK to define them somewhere to improve the code readability.
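For illustration only, predefining the key could look roughly like this (the object and constant names below are hypothetical, not part of this PR):

```scala
// Hypothetical constants object for JDBC column metadata keys, so call sites
// reference a named constant rather than the raw string "arrayDimension".
object JdbcColumnMetaKeys {
  val ARRAY_DIMENSION: String = "arrayDimension"
}

// Usage at the call site shown above (sketch):
//   metadata.putLong(JdbcColumnMetaKeys.ARRAY_DIMENSION, rs.getLong(1))
```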

Member:
Yeah, let's do that later, before the Apache Spark 4.0.0 release.

@dongjoon-hyun left a review comment:
+1, LGTM.

@yaooqinn yaooqinn closed this in ffc378c Apr 12, 2024
yaooqinn pushed a commit that referenced this pull request Nov 14, 2024
…ional arrays

### What changes were proposed in this pull request?
There is a bug introduced by PR #46006, which fixed the behaviour of the PostgreSQL connector for multidimensional arrays (previously, all arrays were mapped to 1D arrays).

That PR broke one case. The following scenario no longer works:

- A user has a table t1 on Postgres and runs a CTAS command to create a table t2 with the same data. PR #46006 resolves the dimensionality of a column by reading the attndims column of the pg_attribute metadata table.
- That query returns the correct dimensionality for table t1, but for table t2, created via CTAS, it always returns 0. This leads to every array being mapped to a 0-D array, which is the element type itself (for example, int). This is a bug on the Postgres side.
- As a solution, we can call the array_ndims function on the given column, which returns the column's dimension. It works for CTAS-created tables too. We can evaluate the function on the first row of the table.

This change issues an additional query against the Postgres table to find the dimension of the array column, instead of querying the metadata table as before. It might be more expensive, but the query uses LIMIT 1.
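A minimal sketch of this approach is shown below; the table and column names, as well as the helper method name, are placeholders for illustration, not the connector's actual code:

```scala
import java.sql.Connection
import scala.util.Using

import org.apache.spark.sql.types.MetadataBuilder

object ArrayDimensionSketch {
  // Sketch: determine the array column's dimensionality from the data itself
  // via array_ndims, instead of reading pg_attribute.attndims.
  // "my_table" and "arr_col" are placeholder identifiers.
  def fetchArrayDimension(conn: Connection, metadata: MetadataBuilder): Unit = {
    val query = """SELECT array_ndims("arr_col") FROM "my_table" LIMIT 1"""
    Using.resource(conn.createStatement()) { stmt =>
      Using.resource(stmt.executeQuery(query)) { rs =>
        if (rs.next()) metadata.putLong("arrayDimension", rs.getLong(1))
      }
    }
  }
}
```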

Also, there is one caveat. In PG, the declared dimensionality of an array column is not enforced (https://www.postgresql.org/docs/current/arrays.html#ARRAYS-DECLARATION). Therefore, if a table has a 2D array column, it is perfectly fine on the PG side to insert a 1D or a 3D array into that column. This makes the read path in Spark problematic: we get the dimension of the array, for example 1 if the first record is a 1D array, and then fail when we later try to read a 3D array; and vice versa, getting dimension 3 and later reading a 1D array. The change I propose is fine with this scenario, since it already doesn't work in Spark. What my change implies is that the user can get a different error message depending on the dimensionality of the first record in the table (namely, for one table they may get an error saying the expected type is ARRAY<ARRAY<INT>>, and for another that it is ARRAY<INT>).

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
No. It just fixes one case that doesn't work currently.

### How was this patch tested?
New tests.

### Was this patch authored or co-authored using generative AI tooling?

Closes #48625 from PetarVasiljevic-DB/fix_postgres_multidimensional_arrays.

Authored-by: Petar Vasiljevic <petar.vasiljevic@databricks.com>
Signed-off-by: Kent Yao <yao@apache.org>