Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Postgres Source - Unable to handle jsonb[] (regression) #29172

Closed
1 task
haithem-souala opened this issue Aug 7, 2023 · 16 comments · Fixed by #30534
Closed
1 task

Postgres Source - Unable to handle jsonb[] (regression) #29172

haithem-souala opened this issue Aug 7, 2023 · 16 comments · Fixed by #30534
Labels
area/connectors Connector related issues team/db-dw-sources Backlog for Database and Data Warehouse Sources team team/destinations Destinations team's backlog type/bug Something isn't working

Comments

@haithem-souala
Copy link
Contributor

haithem-souala commented Aug 7, 2023

Connector Name

destination-bigquery

Connector Version

1.7.2

What step the error happened?

During the sync

Revelant information

Normalization: raw data (json)
Source: Postgres v3.0.2

In table A (at Postgres) we have a field tags (jsonb[]), so the data stored, looks like this:

{"{\"key\": \"xxx_field\", \"value\": \"DEFAULT\", \"yyy_field\": true}","{\"key\": \"zzz_field\", \"value\": \"obfuscated\", \"rrr_field\": false}"}

Or like this:

{"{\"key\": \"xxx_field\", \"value\": \"DEFAULT\", \"yyy_field\": true}"}

The BigQuery Destination v2 connector, write the data in the _airbyte_data like this:

{
  "tags": "{\"{\\\"key\\\": \\\"origin\\\", \\\"value\\\": \\\"obfuscated\\\"}\"}"
}

In the final destination table, airbyte is unable to handle the tags field content, so it will be ignored with this message in the _airbyte_meta :

{"errors":["Problem with tags"]}

Expected bahaviour:
Just set the field content in the final destination table.

Relevant log output

No response

Contribute

  • Yes, I want to contribute
@evantahler
Copy link
Contributor

Grooming:

  • we will research sources sending JSONB[]s and see what destinations V1 does. destinations v2 is different, we will fix it, if not, we will wait until our JSONB project in the next few months.

@evantahler
Copy link
Contributor

evantahler commented Aug 8, 2023

@haithem-souala - can you please share some examples of what this JSONb[] output looks like in BigQuery for both Destinations V1 and V2? Ideally, both a screenshot and CSV dump would be helpful, as we want to be very careful with escaping and quotes. Email would also work - I'm {firstname}@airbyte.io

@haithem-souala
Copy link
Contributor Author

haithem-souala commented Aug 9, 2023

Fyi, in destination v1, it's working fine.

Source (Postgres):

                                                                                             tags                                                                                              
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 {"{\"key\": \"pricing\", \"value\": \"DEFAULT\", \"applyToOrchestration\": true}"}
 {"{\"key\": \"pricing\", \"value\": \"DEFAULT\", \"applyToOrchestration\": true}"}
 {"{\"key\": \"selected_carrier\", \"value\": \"obfuscated_val\", \"applyToOrchestration\": false}"}
 {"{\"key\": \"pricing\", \"value\": \"DEFAULT\", \"applyToOrchestration\": true}"}
 {"{\"key\": \"selected_carrier\", \"value\": \"obfuscated_val\", \"applyToOrchestration\": false}"}
 {"{\"key\": \"pricing\", \"value\": \"DEFAULT\", \"applyToOrchestration\": true}"}
 {"{\"key\": \"pricing\", \"value\": \"obfuscated_val\", \"applyToOrchestration\": true}"}
 {"{\"key\": \"carrierMethod\", \"value\": \"obfuscated_val\", \"applyToOrchestration\": true}"}
 {"{\"key\": \"pricing\", \"value\": \"DEFAULT\", \"applyToOrchestration\": true}","{\"key\": \"selected_carrier\", \"value\": \"obfuscated_val\", \"applyToOrchestration\": false}"}
 {"{\"key\": \"pricing\", \"value\": \"DEFAULT\", \"applyToOrchestration\": true}"}

Destination V1

_airbyte_raw_table_a:

{
    "tags": [
        "{\"key\": \"installation_date\", \"value\": \"2019-09-20\", \"applyToOrchestration\": false}"
    ]
}

table_a:

[{
  "tags": ["\"{\\\"key\\\": \\\"pricing\\\", \\\"value\\\": \\\"obfuscated_val\\\", \\\"applyToOrchestration\\\": true}\""]
}, {
  "tags": ["\"{\\\"key\\\": \\\"pricing\\\", \\\"value\\\": \\\"DEFAULT\\\"}\""]
}, {
  "tags": ["\"{\\\"key\\\": \\\"pricing\\\", \\\"value\\\": \\\"DEFAULT\\\", \\\"applyToOrchestration\\\": true}\""]
}, {
  "tags": ["\"{\\\"key\\\": \\\"pricing\\\", \\\"value\\\": \\\"DEFAULT\\\", \\\"applyToOrchestration\\\": true}\""]
}, {
  "tags": ["\"{\\\"key\\\": \\\"pricing\\\", \\\"value\\\": \\\"obfuscated_val\\\", \\\"applyToOrchestration\\\": true}\""]
}, {
  "tags": ["\"{\\\"key\\\": \\\"pricing\\\", \\\"value\\\": \\\"obfuscated_val\\\", \\\"applyToOrchestration\\\": true}\""]
}, {
  "tags": ["\"{\\\"key\\\": \\\"pricing\\\", \\\"value\\\": \\\"obfuscated_val\\\", \\\"applyToOrchestration\\\": true}\""]
}, {
  "tags": ["\"{\\\"key\\\": \\\"pricing\\\", \\\"value\\\": \\\"DEFAULT\\\", \\\"applyToOrchestration\\\": true}\""]
}, {
  "tags": ["\"{\\\"key\\\": \\\"pricing\\\", \\\"value\\\": \\\"DEFAULT\\\", \\\"applyToOrchestration\\\": true}\""]
}, {
  "tags": ["\"{\\\"key\\\": \\\"pricing\\\", \\\"value\\\": \\\"DEFAULT\\\", \\\"applyToOrchestration\\\": true}\""]
}]

//CSV
bquxjob_37795b49_189d9620d09.csv

//Screenshot
Capture d’écran 2023-08-09 à 10 22 34

@haithem-souala
Copy link
Contributor Author

Any ETA? This issue prevent us from migrating the last DB.

@cynthiaxyin
Copy link
Contributor

We were able to repro this issue in BigQuery and Snowflake and are investigating what the fix should be.

It looks like Postgres arrays are defined with curly braces instead of brackets, and v2 is not recognizing this syntax? And maybe there is a discrepancy in when the data is deserialized?

  • source data (curly brace annotation + all single quotes) {"{\"key\": \"pricing\", \"value\": \"DEFAULT\", \"applyToOrchestration\": true}"}
  • v1 raw data (bracket annotation + still single quotes) {"tags": ["{\"key\": \"installation_date\", \"value\": \"2019-09-20\", \"applyToOrchestration\": false}"]}
  • v1 normalized data (bracket annotation + inner triple quotes) {"tags": ["\"{\\\"key\\\": \\\"pricing\\\", \\\"value\\\": \\\"obfuscated_val\\\", \\\"applyToOrchestration\\\": true}\""]}
  • v2 raw data (curly brace annotation + inner triple quotes) {"tags":"{\"{\\\"key\\\": \\\"pricing\\\", \\\"value\\\": \\\"DEFAULT\\\", \\\"applyToOrchestration\\\": true}\"}"}

Running the SQL that gets output in the V2 logs e.g. JSON_QUERY(`_airbyte_data`, '$.tags')), the result does seem to be better (actually an array) by switching out the curly braces for the brackets.
Screenshot 2023-08-16 at 9 49 54 AM

Normalization seemed to previously convert arrays to strings around here:

@edgao edgao self-assigned this Aug 17, 2023
@edgao
Copy link
Contributor

edgao commented Aug 17, 2023

@haithem-souala are there any differences (other than the destinations v2 toggle) between your v1 and v2 connections? Specifically: is the v1 connection using CDC and the v2 connection using standard replication?

I ran a v1 sync in standard replication and got the same behavior you're seeing in v2:
image

image

And ran a v2 sync with CDC, which seems to have worked:
image
image

I think this is a bug in source-postgres, trying to find the relevant issue.

@edgao
Copy link
Contributor

edgao commented Aug 17, 2023

couldn't find an existing issue; submitted #29551

@haithem-souala
Copy link
Contributor Author

haithem-souala commented Aug 17, 2023

The V1 and V2 are both standard replication.
V1: cursor based (table field)
V2: Xmin

The source-postgres version in V1 -> 1.0.30
The source-postgres version in V2 -> 3.0.2

I hope that will help debugging the bug.

@edgao
Copy link
Contributor

edgao commented Aug 17, 2023

interesting! This looks like a source-postgres regression between those versions; can you try downgrading the v2 connection to source-postgres:1.0.30? I ran a cursor-based sync with 1.0.30 + bigquery v2 and got some good results:
image
image

And I'll update the source-postgres issue to note that this is a regression.

(in the meantime, I'm bisecting source-postgres versions to find the newest version where jsonb[] works as expected)

@edgao
Copy link
Contributor

edgao commented Aug 17, 2023

looks like 1.0.42 is the highest version of source-postgres that supports jsonb[] correctly, if you want to give it a try

@haithem-souala
Copy link
Contributor Author

Unfortunately, i cannot downgrade source-postgres to 1.0.42, as it might affect other streams.

@edgao
Copy link
Contributor

edgao commented Aug 18, 2023

might affect other streams

can you explain some more on this? My understanding is that the two major releases between 1.x and 3.x were upgrading the underlying debezium library in 2.0.0 (which shouldn't affect non-CDC syncs) and changing how source-postgres's cursor works in CDC mode in 3.0.0 (which, again, only affects CDC syncs).

also tagging in @prateekmukhedkar for DB sources expertise, in case I missed something in the changelog.

@haithem-souala
Copy link
Contributor Author

might affect other streams

can you explain some more on this? My understanding is that the two major releases between 1.x and 3.x were upgrading the underlying debezium library in 2.0.0 (which shouldn't affect non-CDC syncs) and changing how source-postgres's cursor works in CDC mode in 3.0.0 (which, again, only affects CDC syncs).

also tagging in @prateekmukhedkar for DB sources expertise, in case I missed something in the changelog.

Yeah, i don't know what changed "really" between the 1.0.42 and the 3.X, if someone can confirm the changes are related only to CDC mode, ill be down for a downgrade of the connector.

@evantahler evantahler added the team/db-dw-sources Backlog for Database and Data Warehouse Sources team label Aug 28, 2023
@evantahler evantahler changed the title Bigquery Destination V2 - Unable to handle jsonb[] Postgres Source - Unable to handle jsonb[] (regression) Aug 28, 2023
@haithem-souala
Copy link
Contributor Author

might affect other streams

can you explain some more on this? My understanding is that the two major releases between 1.x and 3.x were upgrading the underlying debezium library in 2.0.0 (which shouldn't affect non-CDC syncs) and changing how source-postgres's cursor works in CDC mode in 3.0.0 (which, again, only affects CDC syncs).

also tagging in @prateekmukhedkar for DB sources expertise, in case I missed something in the changelog.

Hey @prateekmukhedkar, we're migration the syncs to V2 destination, and this issue blocking us to migrate the last two sources.

@edgao
Copy link
Contributor

edgao commented Sep 26, 2023

@haithem-souala source-postgres:3.1.9 has been released, which should work correctly for you - can you give it a try?

@haithem-souala
Copy link
Contributor Author

Yes, it's working fine, thank you @edgao!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues team/db-dw-sources Backlog for Database and Data Warehouse Sources team team/destinations Destinations team's backlog type/bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants