
Conversation

@PetarVasiljevic-DB
Contributor

@PetarVasiljevic-DB PetarVasiljevic-DB commented Oct 23, 2024

What changes were proposed in this pull request?

A bug was introduced in PR #46006. That PR fixed the behaviour of the PostgreSQL connector for multidimensional arrays, which had previously all been mapped to 1D arrays.

That PR introduced a bug for one case. The following scenario is broken:

  • The user has a table t1 on Postgres and runs a CTAS command to create table t2 with the same data.
    PR #46006 resolves the dimensionality of a column by reading the attndims column of the pg_attribute metadata table.
  • That metadata query returns the correct dimensionality for table t1, but for table t2, created via CTAS, it always returns 0. This leads to every array being mapped to a 0-D array, i.e. the element type itself (for example int). This is a bug on the Postgres side.
  • As a solution, we can call the array_ndims function on the given column, which returns the dimensionality of the value; this works for CTAS-created tables too. We can evaluate the function on the first row of the table.

This change issues an additional query against the PG table to find the dimensionality of the array column, instead of querying the metadata table as before. It might be more expensive, but the query is sent with LIMIT 1.
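For illustration, here is a minimal sketch of the two probes being compared; the method and variable names (resolveArrayDimension, conn, tableName, columnName) are hypothetical, and this is not the PR's actual implementation:

import java.sql.Connection
import org.apache.spark.sql.types.MetadataBuilder

// Hedged sketch, not the actual dialect code.
def resolveArrayDimension(
    conn: Connection,
    tableName: String,
    columnName: String,
    metadata: MetadataBuilder): Unit = {
  // Old approach (PR #46006): read the declared dimensionality from the catalog, e.g.
  //   SELECT attndims FROM pg_attribute
  //   WHERE attrelid = '<table>'::regclass AND attname = '<column>'
  // For CTAS-created tables this returns 0.
  //
  // New approach: ask Postgres for the dimensionality of an actual value,
  // limited to a single row so the extra round trip stays cheap.
  val stmt = conn.createStatement()
  try {
    val rs = stmt.executeQuery(s"SELECT array_ndims($columnName) FROM $tableName LIMIT 1")
    if (rs.next()) {
      val dims = rs.getLong(1)
      if (!rs.wasNull()) {
        // Record the observed dimensionality; the read path picks it up later.
        metadata.putLong("arrayDimension", dims)
      }
    }
  } finally {
    stmt.close()
  }
}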

Also, there is one caveat. In PG, the declared dimensionality of an array column is not enforced; all arrays of a given element type are treated as the same type regardless of the number of dimensions (https://www.postgresql.org/docs/current/arrays.html#ARRAYS-DECLARATION). Therefore, if a table has a 2D array column, it is totally fine from the PG side to insert a 1D or 3D array into that column. This makes the read path in Spark problematic: we will get the dimensionality of the array, for example 1 if the first record is a 1D array, and then fail when we try to read a 3D array later on. The reverse also applies: getting dimensionality 3 and then reading a 1D array. The change I propose is fine with this scenario since it already doesn't work in Spark. What my change implies is that the user can get a different error message depending on the dimensionality of the first record in the table (namely, for one table they may get an error message saying the expected type is ARRAY<ARRAY> and for another that it is ARRAY<INT>).
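To make the caveat concrete, here is a hypothetical snippet (the connection URL, credentials, and the arr_tab table name are invented) showing that Postgres accepts values of any dimensionality in a column declared as int[][]:

import java.sql.DriverManager

// Purely illustrative; the URL, credentials and table name are made up.
val conn = DriverManager.getConnection("jdbc:postgresql://localhost:5432/test", "user", "pass")
val stmt = conn.createStatement()
stmt.executeUpdate("CREATE TABLE arr_tab (a int[][])")                             // declared as 2D
stmt.executeUpdate("INSERT INTO arr_tab VALUES (ARRAY[1, 2])")                     // 1D value, accepted
stmt.executeUpdate("INSERT INTO arr_tab VALUES (ARRAY[[[1], [2]], [[3], [4]]])")   // 3D value, accepted
// The probe is data-dependent: depending on which row comes back first,
// it reports 1 or 3 here, never the declared 2.
val rs = stmt.executeQuery("SELECT array_ndims(a) FROM arr_tab LIMIT 1")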

Why are the changes needed?

Bug fix.

Does this PR introduce any user-facing change?

No. It just fixes one case that doesn't work currently.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

@github-actions github-actions bot added the SQL label Oct 23, 2024
@PetarVasiljevic-DB PetarVasiljevic-DB force-pushed the fix_postgres_multidimensional_arrays branch from bfa95c6 to 49be86a on October 23, 2024 17:58
@PetarVasiljevic-DB
Contributor Author

@yaooqinn as an author of #46006, I would like to get your review here :).

@PetarVasiljevic-DB PetarVasiljevic-DB force-pushed the fix_postgres_multidimensional_arrays branch from fd7e9fa to 7e9640a on October 23, 2024 21:17
@PetarVasiljevic-DB PetarVasiljevic-DB force-pushed the fix_postgres_multidimensional_arrays branch 2 times, most recently from b4b6f7f to 4fef609 on October 24, 2024 11:52
Member

@MaxGekk MaxGekk left a comment


@yaooqinn as an author of #46006, I would like to get your review here :).

also cc @dongjoon-hyun as you reviewed the original PR.

@PetarVasiljevic-DB PetarVasiljevic-DB force-pushed the fix_postgres_multidimensional_arrays branch 3 times, most recently from 7418a99 to 9a05bb7 on October 24, 2024 15:59
@PetarVasiljevic-DB PetarVasiljevic-DB force-pushed the fix_postgres_multidimensional_arrays branch 2 times, most recently from 5493a18 to 3330a27 on October 25, 2024 00:32
@PetarVasiljevic-DB PetarVasiljevic-DB force-pushed the fix_postgres_multidimensional_arrays branch 2 times, most recently from 0db8c40 to 3feff90 on October 26, 2024 14:07
@yaooqinn
Member

Does this approach still work when the resultset given by limit 1 is empty?

@cloud-fan
Contributor

Does this approach still work when the resultset given by limit 1 is empty?

This is a good point. Shall we keep the information schema query as a fallback when the table is empty?

@PetarVasiljevic-DB
Contributor Author

PetarVasiljevic-DB commented Oct 29, 2024

Does this approach still work when the resultset given by limit 1 is empty?

This is actually fine, as we won't set arrayDimension in the metadata at all. Later on, when reading, we do something like

val dim = if (metadata.contains("arrayDimension")) {
  metadata.getLong("arrayDimension").toInt
} else {
  // no dimensionality recorded, so default to a 1D array
  1
}

so we always fall back to 1. This is also what happened so far: if the metadata from PG for some reason doesn't contain dimensionality info, we fall back to 1. I can write tests, though, to make sure this works.

@yaooqinn
Member

yaooqinn commented Oct 29, 2024

If we execute a CTAS on the Spark side against an empty Postgres table with multidimensional array columns, the new approach now defaults the dimension value to 1, then we have a malformed table in that Spark catalog, which might cause subsequent data pipelines to fail. While the information schema approach was initially impacted by an 'upstream' bug, it now appears that we have introduced a bug of our own.
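To illustrate the concern with a hypothetical example (the pg catalog and table names are invented, and a SparkSession named spark with a JDBC catalog configured against Postgres is assumed):

// remote_tab is declared with an int[][] column but currently has no rows,
// so the LIMIT 1 probe finds nothing and the dimension silently defaults to 1.
spark.sql("CREATE TABLE local_copy AS SELECT * FROM pg.public.remote_tab")
// local_copy ends up with the column typed as array<int> instead of array<array<int>>,
// so downstream pipelines that later read or write the real 2D data would fail.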

@PetarVasiljevic-DB
Contributor Author

@yaooqinn makes sense, thanks! I have updated the PR to fall back to querying the metadata table.
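Roughly, the combined logic would look like the following sketch; the probe functions are passed in as parameters so the sketch stays self-contained, and all names here are hypothetical rather than the PR's:

import org.apache.spark.sql.types.MetadataBuilder

// Sketch of the combined approach (names are hypothetical).
def resolveDimension(
    queryArrayNdims: () => Option[Long], // SELECT array_ndims(col) FROM tab LIMIT 1; None if the table is empty
    queryAttndims: () => Long,           // attndims from pg_attribute; may be 0 for CTAS-created tables
    metadata: MetadataBuilder): Unit = {
  // Prefer the dimensionality observed in the data; fall back to catalog metadata for empty tables.
  val dims = queryArrayNdims().getOrElse(queryAttndims())
  if (dims > 0) {
    metadata.putLong("arrayDimension", dims)
  }
}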

@PetarVasiljevic-DB
Contributor Author

What is your concern here?

@PetarVasiljevic-DB
Contributor Author

@yaooqinn are you questioning the reasoning behind the metadata not being correct for CTAS-created tables, or have I misunderstood your question?

@yaooqinn
Member

yaooqinn commented Nov 5, 2024

Hi @PetarVasiljevic-DB,

The experiment I conducted shows that executing array_ndims on the same column can return different values for different rows.

@PetarVasiljevic-DB
Contributor Author

@yaooqinn yes, this is purely dependent on the value in the row, so it is possible to get different values. I mentioned in the PR description that we can have a 2D array column and still be allowed to insert 1D or 3D elements. Therefore, the returned dimensionality can be 1 or 3 depending on the first row fetched; sorry if that was not clear enough.

I would say this is fine since Spark doesn't allow arrays whose dimensionality varies across rows, so if there are two rows with different dimensionality in the foreign table, the read will fail with or without this change.

What I can propose as a better way of fixing this is to query the metadata first and, if it returns 0, fall back to querying array_ndims.

@yaooqinn
Member

yaooqinn commented Nov 5, 2024

The outcome of array_ndims is data-dependent and volatile. I don't believe we can depend on it to create stable jobs. Instead of experiencing occasional failures, I would prefer to fail at the outset.

@PetarVasiljevic-DB
Contributor Author

Maybe we're not on the same page here but this is what currently happens in Spark:

  • If the array column on Postgres has values whose dimensionality is the same across all rows, the job will be stable.
  • If the dimensionality of those values differs across rows, the job won't be stable.

With the proposed fix, behaviour stays the same:

  • If array_ndims is the same for all rows, the job will be stable.
  • If array_ndims returns different values across rows, the job won't be stable.

Currently, jobs that read from CTAS-created tables always fail. Have I misunderstood the term "job stability" here?

@milastdbx
Contributor

milastdbx commented Nov 5, 2024

If a Postgres array column has rows with different array dimensionality (array_ndims returning different values for different rows), then from Spark's perspective there is really no correct answer to which dimensionality we should use, given that Spark cannot process such rows anyway. Since there is no correct answer, all answers are wrong, so returning an effectively random value (which LIMIT 1 basically does) is the same as reading it from the metadata; for Spark these are all incorrect.

However, there is one scenario that could be handled better. I don't believe this PR makes it worse, it just doesn't solve it, though it might decrease the "stability" of a job (it sometimes fails, sometimes succeeds).

Let's say a customer issues the query:

select id, array_col from table1
where id = 3

Today, if we read from the metadata, let's say we get the value 2, and this value for id = 3 just accidentally happens to be correct, so the query predictably succeeds.

Now let's say we have the query

select id, array_col from table1
where id = 4

Now let's imagine that array_col for id = 4 has dimensionality 4, which will cause this query to predictably fail.

However, if we switch to the new model, we end up in a situation where the id = 3 query sometimes fails and sometimes succeeds.

So in short, we shouldn't look at array_col dimensionality at the table level but rather at the table + predicate level.

@PetarVasiljevic-DB
Contributor Author

I see, and it makes sense. Thanks @milastdbx. @yaooqinn, may I suggest that we at least fall back to the value of 1 if we read 0 from the metadata? We expect to read an ArrayType, so the dimensionality of the array should be at least 1. So something like this:

metadata.putLong("arrayDimension", Math.max(1, rs.getLong(1)))

It still doesn't cover all arrays, but I expect that most users are using 1D arrays, so this would cover most cases. Also, it is stable.

And I see it as a pure improvement, not over the current code of the PG dialect, but over the previous one, where we read all PG arrays (non-CTAS and CTAS) as 1D. With this code, we would support non-CTAS arrays of higher dimensions, while for CTAS-created tables we would support only 1D arrays.

@yaooqinn
Member

It makes sense to me with 1 as the default dimension value, or maybe we can fail directly when encountering 0. Also cc @cloud-fan

@PetarVasiljevic-DB
Contributor Author

Can we merge? The test failures don't seem related to this PR.

@yaooqinn
Member

Can you rebase master and retry the GA?

@PetarVasiljevic-DB PetarVasiljevic-DB force-pushed the fix_postgres_multidimensional_arrays branch 3 times, most recently from e60f57f to f468496 on November 13, 2024 12:05
@PetarVasiljevic-DB PetarVasiljevic-DB force-pushed the fix_postgres_multidimensional_arrays branch from f468496 to 5475b9f on November 13, 2024 12:06
@PetarVasiljevic-DB
Contributor Author

Tests are passing now.

@yaooqinn yaooqinn closed this in 0b1b676 Nov 14, 2024
@yaooqinn
Member

Merged to master, thank you @PetarVasiljevic-DB
