[SPARK-47813][SQL] Replace getArrayDimension with updateExtraColumnMeta #46006
Conversation
```scala
try {
  Using.resource(conn.createStatement()) { stmt =>
    Using.resource(stmt.executeQuery(query)) { rs =>
      if (rs.next()) metadata.putLong("arrayDimension", rs.getLong(1))
```
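For context, here is a minimal, self-contained sketch of how a lookup like the one in the excerpt above might be written, assuming a hypothetical `putArrayDimension` helper and the `pg_attribute`/`attndims` catalog lookup described later in this thread; the actual code in the PR may differ.

```scala
import java.sql.Connection
import scala.util.Using
import org.apache.spark.sql.types.MetadataBuilder

// Hypothetical sketch (not the PR's exact code): look up the declared array
// dimensionality of a PostgreSQL column from the pg_attribute catalog and
// record it as extra column metadata under the "arrayDimension" key.
def putArrayDimension(
    conn: Connection,
    tableName: String,
    columnName: String,
    metadata: MetadataBuilder): Unit = {
  val query =
    s"""SELECT attndims FROM pg_attribute
       |WHERE attrelid = '$tableName'::regclass AND attname = '$columnName'""".stripMargin
  try {
    Using.resource(conn.createStatement()) { stmt =>
      Using.resource(stmt.executeQuery(query)) { rs =>
        if (rs.next()) metadata.putLong("arrayDimension", rs.getLong(1))
      }
    }
  } catch {
    case _: java.sql.SQLException => // leave the metadata untouched if the lookup fails
  }
}
```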
Should metadata properties like "arrayDimension" be pre-defined, like the table properties defined in org.apache.spark.sql.connector.catalog.TableCatalog?
It's a good point. We do have several required and reserved keys predefined. Things here might be slightly different, as developers are free to create new props and delete existing ones whenever they want to, since none of these keys are strictly required by the following steps. Given that, I'm OK with leaving it as-is and also OK with defining them somewhere to improve code readability.
Ya, let's do that later before Apache Spark 4.0.0 release.
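For illustration, the pre-definition suggested above could be as small as a constants object holding the known keys; the object and field names below are hypothetical, not part of this PR.

```scala
// Hypothetical sketch only: central constants for extra-column-metadata keys,
// analogous to the reserved property names in TableCatalog.
object JdbcColumnMetadataKeys {
  val ArrayDimension: String = "arrayDimension"
  // Future dialect-specific keys could be registered here as well.
}
```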
dongjoon-hyun left a comment
+1, LGTM.
…ional arrays

### What changes were proposed in this pull request?
#46006 fixed the PostgreSQL connector's behaviour for multidimensional arrays (previously all arrays were mapped to 1D arrays), but it introduced a bug for one case. The following scenario is broken:
- The user has a table t1 on Postgres and runs a CTAS command to create table t2 with the same data. #46006 resolves the dimensionality of a column by reading the attndims column of the pg_attribute metadata table.
- That query returns the correct dimensionality for table t1, but for table t2, created via CTAS, it always returns 0. All arrays are then mapped to a 0-D array, i.e. the element type itself (for example int). This is a bug on the Postgres side.
- As a solution, we can call the array_ndims function on the given column, which returns the column's dimension and also works for CTAS-created tables. We can evaluate it on the first row of the table.

This change runs an additional query against the PG table to find the dimension of the array column instead of querying the metadata table as before. It might be more expensive, but the query uses LIMIT 1.

There is also one caveat. In PG there is no declared array dimensionality, as all arrays are 1D arrays (https://www.postgresql.org/docs/current/arrays.html#ARRAYS-DECLARATION). Therefore, if a table has a 2D array column, it is perfectly fine on the PG side to insert a 1D or 3D array into that column. This makes the read path on Spark problematic: we might infer dimension 1 from a first record that is a 1D array and then fail when reading a 3D array later, or vice versa (infer dimension 3 and then read a 1D array). The proposed change is fine with this scenario, since it already doesn't work in Spark; it only means the user can get a different error message depending on the dimensionality of the first record in the table (for one table the reported expected type may be ARRAY<ARRAY<INT>> and for another ARRAY<INT>).

### Why are the changes needed?
Bug fix.

### Does this PR introduce _any_ user-facing change?
No. It just fixes one case that doesn't work currently.

### How was this patch tested?
New tests.

### Was this patch authored or co-authored using generative AI tooling?

Closes #48625 from PetarVasiljevic-DB/fix_postgres_multidimensional_arrays.

Authored-by: Petar Vasiljevic <petar.vasiljevic@databricks.com>
Signed-off-by: Kent Yao <yao@apache.org>
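As a rough illustration of the array_ndims probe described above, the follow-up fix could be sketched roughly like this, with hypothetical helper, table, and column names; the actual implementation in the referenced PR may be structured differently.

```scala
import java.sql.Connection
import scala.util.Using

// Hypothetical sketch: ask PostgreSQL for the dimensionality of an array column
// via array_ndims(), which also works for CTAS-created tables where
// pg_attribute.attndims reports 0. Returns None if the table is empty.
def probeArrayDimension(conn: Connection, table: String, column: String): Option[Long] = {
  val query = s"SELECT array_ndims($column) FROM $table LIMIT 1"
  Using.resource(conn.createStatement()) { stmt =>
    Using.resource(stmt.executeQuery(query)) { rs =>
      if (rs.next()) Some(rs.getLong(1)) else None
    }
  }
}
```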
What changes were proposed in this pull request?
SPARK-47754 introduced a new developer API called `getArrayDimension`. This PR expands the scope of `getArrayDimension` and renames it to `updateExtraColumnMeta`. As their names suggest, `getArrayDimension` handles only one piece of column metadata, the dimension of an array, while `updateExtraColumnMeta` can retrieve any kind of metadata based on the given ResultSetMetaData and Connection. This is much more general and useful, and it reduces the number of potential new Developer APIs of the same shape. Also, the current parameters of `getArrayDimension` might not be enough for other dialects.
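For illustration, a dialect-side override of the renamed hook might look roughly like the sketch below. The parameter list is an assumption inferred from the description above (a ResultSetMetaData plus a Connection and a MetadataBuilder); the exact signature merged in this PR may differ.

```scala
import java.sql.{Connection, ResultSetMetaData}
import org.apache.spark.sql.types.MetadataBuilder

// Hypothetical sketch of a dialect-side override: copy arbitrary extra column
// metadata (here, the JDBC type name) into the column's Spark metadata.
// The real JdbcDialect signature may differ from this assumed shape.
class MyDialectSketch {
  def updateExtraColumnMeta(
      conn: Connection,
      rsmd: ResultSetMetaData,
      columnIndex: Int,
      metadata: MetadataBuilder): Unit = {
    metadata.putString("jdbcTypeName", rsmd.getColumnTypeName(columnIndex))
  }
}
```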
Why are the changes needed?
Refactoring unreleased Developer APIs to make them more sustainable.
Does this PR introduce any user-facing change?
No, this is an unreleased API change.
How was this patch tested?
Existing unit tests.
Was this patch authored or co-authored using generative AI tooling?
No.