[SPARK-50092][SQL] Fix PostgreSQL connector behaviour for multidimensional arrays #48625
Conversation
MaxGekk left a comment:
@yaooqinn as the author of #46006, I would like to get your review here :). Also cc @dongjoon-hyun, as you reviewed the original PR.
Does this approach still work when the result set given by LIMIT 1 is empty?
This is a good point. Shall we keep the information schema query as a fallback when the table is empty?
This is actually fine, as we won't set arrayDimension in the metadata. Later on, when reading, we do something like the sketch below, so we always fall back to 1. This is also what happened so far: if the metadata from PG for some reason doesn't contain information about dimensionality, we fall back to 1. I can write tests, though, to make sure this works.
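A minimal sketch of that fallback, assuming an illustrative metadata key named "arrayDimension" (not necessarily the exact key the dialect uses):

```scala
import org.apache.spark.sql.types.Metadata

// If the dimension was never recorded in the column metadata, default to 1,
// which matches the pre-existing behaviour described above.
def arrayDimensionOrDefault(metadata: Metadata): Int =
  if (metadata.contains("arrayDimension")) metadata.getLong("arrayDimension").toInt
  else 1
```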
If we execute a CTAS on the Spark side against an empty Postgres table with multidimensional array columns, and the new approach defaults the dimension value to 1, then we have a malformed table in that Spark catalog, which might cause subsequent data pipelines to fail. While the information schema approach was initially impacted by an 'upstream' bug, it now appears that we have introduced a bug of our own.
@yaooqinn makes sense, thanks! I have updated the PR with a fallback to querying the metadata table.
What is your concern here?
@yaooqinn are you questioning the reasoning behind the metadata not being correct for CTAS-created tables, or have I misunderstood your question?
The experiment I conducted shows that executing array_ndims on the same column can result in different values across different rows.
@yaooqinn yes, this is purely dependent on the value in the row, so it is possible we get different values. I mentioned in the PR description that we can have a 2D array column and still be allowed to insert 1D or 3D elements. Therefore we can expect the returned dimension to be 1 or 3, depending on the first row fetched; sorry if that was not clear enough. I would say this is fine, since Spark does not allow arrays of varying dimensionality within one column, so if there are two rows with different dimensionality in the foreign table, the read will fail with or without this change. What I can propose as a better way of fixing this is to query the metadata first and, if it returns 0, fall back to querying array_ndims, as sketched below.
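A hedged sketch of that ordering; the two function parameters are hypothetical stand-ins for the metadata lookup and the data probe:

```scala
// Prefer pg_attribute.attndims; only probe the data when the metadata reports 0,
// and fall back to 1 if the table is empty.
def resolveDimension(
    fromMetadata: () => Int,          // e.g. SELECT attndims FROM pg_attribute ...
    fromFirstRow: () => Option[Int]   // e.g. SELECT array_ndims(col) FROM tbl LIMIT 1
): Int = {
  val metadataDims = fromMetadata()
  if (metadataDims > 0) metadataDims else fromFirstRow().getOrElse(1)
}
```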
The outcome of array_dims is data-dependent and volatile. I don't believe we can depend on it to create stable jobs. Instead of experiencing occasional failures, I would prefer to fail at the outset.
Maybe we're not on the same page here, but this is what currently happens in Spark:
With the proposed fix, the behaviour stays the same:
Currently, jobs that read from CTAS tables will always fail. Have I misunderstood the term "stability of job" here?
If a Postgres array column has rows with different array dimensionality (array_dims returning different values for different rows), then from the Spark perspective there is really no correct answer to what the dimensionality should be. However, there is one scenario that could be better - but I don't believe this PR makes it worse, it just doesn't solve it, and this PR might decrease the "stability" of a job (it sometimes fails, sometimes succeeds). Let's say a customer issues a query: ... Today, if we read from the metadata, let's say we get value ... Now let's say we have query ... Now let's imagine that ... However, if we switch to the new model, we will end up in a scenario where sometimes ... So in short, we shouldn't look at ...
I see, and it makes sense. Thanks @milastdbx. @yaooqinn may I suggest that we at least fall back to the value of 1 if we read 0 from the metadata? We expect to read an ArrayType, so the dimensionality of the array should be at least 1. So something like the sketch below. It still doesn't cover all the arrays, but I expect that most users use 1D arrays, so this would cover most of the cases. Also, it is stable. And I see it as a pure improvement, not over the current code of the PG dialect but over the previous one, where we read all PG arrays (non-CTAS and CTAS) as 1D. With this code, we would support non-CTAS arrays of higher dimensions, but for CTAS tables we would support only 1D arrays.
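A minimal sketch of that suggestion, with attndims as an illustrative name for the value read from pg_attribute:

```scala
// The dimensionality of an ArrayType column is at least 1, so clamp the 0 that
// pg_attribute.attndims reports for CTAS-created tables up to 1.
val attndims: Int = 0                       // hypothetical value read from metadata
val arrayDimension = math.max(attndims, 1)  // CTAS tables are treated as 1D arrays
```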
It makes sense to me with 1 as the default dimension value, or maybe we can fail directly when encountering 0. Also cc @cloud-fan.
Can we merge? The test failures don't seem related to this PR.
Can you rebase onto master and retry the GA?
Tests are passing now.
Merged to master, thank you @PetarVasiljevic-DB
What changes were proposed in this pull request?
A bug was introduced in PR #46006. That PR fixed the behaviour of the PostgreSQL connector for multidimensional arrays, since previously all arrays were mapped to 1D arrays. However, it broke one case: for a table created from Spark (e.g. via CTAS), pg_attribute reports attndims = 0 for its array columns, so reading such a table fails.
PR #46006 resolves the dimensionality of a column by reading metadata from the pg_attribute table and its attndims column.
This change issues an additional query against the PG table to find the dimension of the array column, instead of querying the metadata table as before. It might be more expensive, but we send LIMIT 1 in the query.
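A hedged sketch of such a probe over plain JDBC (not the actual dialect code); the URL, table, and column names are placeholders:

```scala
import java.sql.DriverManager

// Ask Postgres for the dimensionality of the first stored value; LIMIT 1 keeps
// the probe cheap, and an empty result set falls back to a dimension of 1.
val conn = DriverManager.getConnection("jdbc:postgresql://localhost:5432/db", "user", "pass")
try {
  val rs = conn.createStatement()
    .executeQuery("SELECT array_ndims(arr_col) FROM some_table LIMIT 1")
  val dimension = if (rs.next()) math.max(rs.getInt(1), 1) else 1
  println(s"resolved array dimension: $dimension")
} finally {
  conn.close()
}
```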
Also, there is one caveat. In PG, the declared dimensionality of an array column is not enforced: all arrays of a given element type are treated as the same type regardless of their dimensions (https://www.postgresql.org/docs/current/arrays.html#ARRAYS-DECLARATION). Therefore, if there is a table with a 2D array column, it is perfectly fine from the PG side to insert a 1D or 3D array into this column. This makes the read path on Spark problematic, since we will get the dimension of the array, for example 1 if the first record is a 1D array, and then we will try to read a 3D array later on, which will fail; and vice versa, getting dimension 3 and reading a 1D array later on. The change I propose is fine with this scenario, since it already doesn't work in Spark. What my change implies is that users can get a different error message depending on the dimensionality of the first record in the table (namely, for one table they may get an error message saying the expected type is ARRAY<ARRAY> and for another that it is ARRAY<INT>).
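For illustration, this caveat can be reproduced over plain JDBC; connection details and names are placeholders. Because the declared number of dimensions is not enforced, rows of different dimensionality coexist in the same int[][] column:

```scala
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:postgresql://localhost:5432/db", "user", "pass")
try {
  val stmt = conn.createStatement()
  stmt.executeUpdate("CREATE TABLE arr_tbl (a int[][])")                     // declared as 2D
  stmt.executeUpdate("INSERT INTO arr_tbl VALUES ('{1,2,3}')")               // 1D value accepted
  stmt.executeUpdate("INSERT INTO arr_tbl VALUES ('{{{1},{2}},{{3},{4}}}')") // 3D value accepted
  // array_ndims now returns 1 or 3 depending on which row is fetched first,
  // which is exactly why the LIMIT 1 probe is data-dependent.
  val rs = stmt.executeQuery("SELECT array_ndims(a) FROM arr_tbl LIMIT 1")
  while (rs.next()) println(rs.getInt(1))
} finally {
  conn.close()
}
```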
Why are the changes needed?
Bug fix.
Does this PR introduce any user-facing change?
No. It just fixes one case that doesn't work currently.
How was this patch tested?
New tests.
Was this patch authored or co-authored using generative AI tooling?