
🎉 New BigQuery destination with Structured/Repeated Records #4176

Merged · 14 commits · Jun 23, 2021

Conversation

@ChristopheDuong (Contributor) commented Jun 17, 2021

What

Closes #1927

How

A new destination that does not rely on base-normalization but implements its own native normalization with BigQuery by converting the JSON Schema into a Google Cloud BigQuery schema, and thus handles structured/repeated records and arrays.
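
For illustration, here is a minimal sketch of what that conversion can look like with the google-cloud-bigquery Java client. The class name, method name, and the simplified type mapping (which ignores nullable union types) are assumptions for this example, not the connector's actual code:

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.StandardSQLTypeName;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not the connector's actual class): map one JSON Schema property
// to a BigQuery Field, ignoring nullable union types for simplicity.
class JsonSchemaToBigQuery {

  static Field toField(String name, JsonNode definition) {
    final String jsonType = definition.get("type").asText();
    switch (jsonType) {
      case "object": {
        // Objects become STRUCT (RECORD) fields with one sub-field per property.
        final List<Field> subFields = new ArrayList<>();
        definition.get("properties").fields()
            .forEachRemaining(entry -> subFields.add(toField(entry.getKey(), entry.getValue())));
        return Field.of(name, StandardSQLTypeName.STRUCT, subFields.toArray(new Field[0]));
      }
      case "array":
        // Arrays become the item type marked with REPEATED mode.
        return toField(name, definition.get("items")).toBuilder().setMode(Field.Mode.REPEATED).build();
      case "integer":
        return Field.of(name, StandardSQLTypeName.INT64);
      case "number":
        return Field.of(name, StandardSQLTypeName.FLOAT64);
      case "boolean":
        return Field.of(name, StandardSQLTypeName.BOOL);
      default:
        return Field.of(name, StandardSQLTypeName.STRING);
    }
  }
}
```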

Implementation notes

This issue was not doable from the base-normalization Python/dbt codebase, as it seemed difficult (perhaps impossible) to implement, in BigQuery Standard SQL, logic that parses a JSON column string into separate columns while building or casting to a STRUCT field (the nested case makes this even harder to reason about, and it would certainly not be efficient).

According to the Google docs, the more standard approach is to provide the JSON Schema to BigQuery at load time, when the tables are created. The implementation therefore lives in the Java codebase.
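
As a hedged illustration of that approach (the dataset, table, and field names below are placeholders, and this is not the connector's actual code), the Java client lets you attach the converted schema when the table is created:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;

// Hypothetical example: create the destination table with typed STRUCT/REPEATED columns
// up front, so BigQuery enforces the schema at load time instead of storing a raw JSON string.
public class CreateTypedTableExample {

  public static void main(String[] args) {
    final BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Schema derived from the stream's JSON Schema.
    final Schema schema = Schema.of(
        Field.of("id", StandardSQLTypeName.INT64),
        Field.newBuilder("tags", StandardSQLTypeName.STRING).setMode(Field.Mode.REPEATED).build(),
        Field.of("address", StandardSQLTypeName.STRUCT,
            Field.of("city", StandardSQLTypeName.STRING),
            Field.of("zip", StandardSQLTypeName.STRING)));

    // Placeholder dataset and table names.
    final TableId tableId = TableId.of("my_dataset", "my_stream");
    bigquery.create(TableInfo.of(tableId, StandardTableDefinition.of(schema)));
  }
}
```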

I started implementing this following the same pattern as CopyDestination, which either first uploads to cloud storage and then loads into the warehouse, or writes directly to the warehouse, depending on config values.

However, we then run into challenges where conflicts can arise:

  1. Run the BigQuery destination with struct/repeated fields.
  2. The destination could additionally persist into _airbyte_raw tables with JSON blobs.
  3. Running normalization, which would overwrite what was produced in step 1 with what is produced in step 2, is counterproductive.

Normalization code should probably be tweaked to be compatible with the tables produced by this new denormalized destination instead, but that would add more development work to the scope of this issue.

As a result, it is more straightforward to keep them as two distinct connectors for the moment, with the "de-normalized" destination supporting neither normalization nor append-dedup. In those cases, users can fall back to the standard destination-bigquery or implement their own custom post-sync transformations for now.

When/if normalization is refactored to be compatible with de-normalized tables, we could merge the two BigQuery destinations back into a single connector.

Recommended reading order

  1. airbyte-integrations/connectors/destination-bigquery-denormalized/src/main/java/io/airbyte/integrations/destination/bigquery/BigQueryDenormalizedDestination.java
  2. airbyte-integrations/connectors/destination-bigquery-denormalized/src/main/java/io/airbyte/integrations/destination/bigquery/BigQueryDenormalizedRecordConsumer.java
  3. airbyte-integrations/connectors/destination-bigquery-denormalized/src/main/java/io/airbyte/integrations/destination/bigquery/JsonSchemaType.java
  4. airbyte-integrations/connectors/destination-bigquery/src/main/java/io/airbyte/integrations/destination/bigquery/BigQueryDestination.java
  5. the rest

Pre-merge Checklist

Expand the checklist which is relevant for this PR.

Connector checklist

  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • Secrets are annotated with airbyte_secret in output spec
  • Unit & integration tests added as appropriate (and are passing)
    • Community members: please provide proof of this succeeding locally, e.g. a screenshot or copy-pasted acceptance test output. To run acceptance tests for a Python connector, follow the instructions in the README. For Java connectors, run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • /test connector=connectors/<name> command as documented here is passing.
    • Community members can skip this, Airbyters will run this for you.
  • Code reviews completed
  • Credentials added to GitHub CI if needed and not already present (see instructions for injecting secrets into CI).
  • Documentation updated
    • README
    • CHANGELOG.md
    • Reference docs in the docs/integrations/ directory.
    • Build status added to build page
  • Build is successful
  • Connector version bumped as described here
  • New Connector version released on Dockerhub by running the /publish command described here
  • No major blockers
  • PR merged into master branch
  • Follow up tickets have been created
  • Associated tickets have been closed & stakeholders notified

@github-actions bot added the area/connectors (Connector related issues) label on Jun 17, 2021
@ChristopheDuong ChristopheDuong marked this pull request as draft June 17, 2021 12:57
@ChristopheDuong (Contributor, Author) commented Jun 17, 2021

/test connector=connectors/destination-bigquery

🕑 connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/946571871
✅ connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/946571871

@ChristopheDuong (Contributor, Author) commented Jun 17, 2021

/test connector=connectors/destination-bigquery-denormalized

🕑 connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/946679518
✅ connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/946679518

@ChristopheDuong ChristopheDuong marked this pull request as ready for review June 17, 2021 15:48
@tuliren (Contributor) left a comment

Looks good! Left a few minor comments.

Also this destination should have a doc.

# Changelog

## 0.1.0
Implementation of a destination for BigQuery with RECORD/REPEATED columns instead of raw JSON blobs.
Contributor:

According to the latest guide, we should track the changelog in the public documentation of the connector:
https://docs.airbyte.io/contributing-to-airbyte/updating-documentation#changelogs

.filter(key -> {
  final boolean validKey = fieldNames.contains(namingResolver.getIdentifier(key));
  if (!validKey) {
    LOGGER.warn("Ignoring field {} as it is not defined in catalog", key);
Contributor:

Should this be a debug level message? Otherwise, it can be quite noisy, since it can be emitted for every record.

Contributor (Author):

good catch, thanks

}
if (fieldList.stream().noneMatch(f -> f.getName().equals(JavaBaseConstants.COLUMN_NAME_EMITTED_AT))) {
  fieldList.add(Field.of(JavaBaseConstants.COLUMN_NAME_EMITTED_AT, StandardSQLTypeName.TIMESTAMP));
}
@tuliren (Contributor) commented Jun 18, 2021

Are the above two if checks always true? It seems that the original JSON Schema will never have the two Airbyte columns.

Contributor (Author):

The original JSON Schema can have the two Airbyte columns if the streams were produced by Airbyte and re-used as source streams.

Contributor:

Oh, I see. Good to know this.
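
To make the checks under discussion concrete, here is a hedged sketch of the guard for both Airbyte metadata columns (only the COLUMN_NAME_EMITTED_AT check appears in the excerpt above; the COLUMN_NAME_AB_ID line and the surrounding context are illustrative assumptions, not the exact PR code):

```java
// Append the Airbyte metadata columns only when the incoming JSON Schema does not already
// define them, e.g. when an Airbyte-produced table is re-used as a source stream.
if (fieldList.stream().noneMatch(f -> f.getName().equals(JavaBaseConstants.COLUMN_NAME_AB_ID))) {
  fieldList.add(Field.of(JavaBaseConstants.COLUMN_NAME_AB_ID, StandardSQLTypeName.STRING));
}
if (fieldList.stream().noneMatch(f -> f.getName().equals(JavaBaseConstants.COLUMN_NAME_EMITTED_AT))) {
  fieldList.add(Field.of(JavaBaseConstants.COLUMN_NAME_EMITTED_AT, StandardSQLTypeName.TIMESTAMP));
}
```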

@cgardens (Contributor) left a comment

Looks good. I feel like this should support copy destination functionality too. If that's too hard to do now then let's at least create an issue to do it.

@@ -0,0 +1,7 @@
{
"destinationDefinitionId": "079d5540-f236-4294-ba7c-ade8fd918496",
"name": "BigQuery de-normalized",
Contributor:

Suggested change:
- "name": "BigQuery de-normalized",
+ "name": "BigQuery (Typed Struct)",

I think this display name is a little bit clearer? Fine with me if you want to stick with denormalized, but if you do, it should be one word and I'd suggest putting it in parens, so BigQuery (Denormalized).

@@ -41,7 +41,7 @@ public String getRawTableName(String streamName) {

@Override
public String getTmpTableName(String streamName) {
return convertStreamName("_airbyte_" + Instant.now().toEpochMilli() + "_" + getRawTableName(streamName));
return convertStreamName(Strings.addRandomSuffix("_airbyte_tmp", "_", 3) + "_" + streamName);
Contributor:

why is this needed?

Contributor (Author):

It was originally needed when I was producing both a raw table (JSON blob) and a struct-typed table from the same destination, so I needed two different "tmp" names. But then I moved away from that approach...

Should I revert to the old naming with the timestamp?

@@ -389,9 +389,9 @@ def cast_property_type(self, property_name: str, column_name: str, jinja_column:
print(f"WARN: Unknown type for column {property_name} at {self.current_json_path()}")
return column_name
elif is_array(definition["type"]):
return self.cast_property_type_as_array(property_name, column_name)
Contributor:

why is this changing? isn't this affecting the original BQ destination?

Contributor (Author):

This is doing the exact same thing as before (minus the extra function call) so it's not changing anything.

The extra function call was a placeholder to implement the struct/repeated "casting" there, but it's actually not doable, so it's not useful anymore.

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class BigQueryDenormalizedDestination extends BigQueryDestination {
Contributor:

Could you explain some more why we wouldn't want to make this a CopyDestination? I understand why this is split from BigQueryDestination--that totally makes sense to me. But it seems like this should be able to support CopyDestination and normal insert?

@ChristopheDuong (Contributor, Author) commented Jun 21, 2021

I am not saying it shouldn't be a CopyDestination; I tried a similar approach to CopyDestination and faced issues that did not make it easy to pursue for this use case with two modes of writing for a destination. Thus, I chose to make two destinations.

BigQueryDestination isn't currently implemented as a CopyDestination, and that was not the goal of this PR either. But I guess we could indeed make both destination-bigquery and destination-bigquery-denormalized adopt a CopyDestination strategy too.

@marcosmarxm (Member) commented:

@ChristopheDuong I bumped a new version of the BigQuery destination last week; please merge master into your branch.

@github-actions bot added the area/documentation (Improvements or additions to documentation) label on Jun 21, 2021
@ChristopheDuong (Contributor, Author) commented Jun 23, 2021

/publish connector=connectors/destination-bigquery

🕑 connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/964401886
✅ connectors/destination-bigquery https://github.com/airbytehq/airbyte/actions/runs/964401886

@ChristopheDuong (Contributor, Author) commented Jun 23, 2021

/publish connector=connectors/destination-bigquery-denormalized

🕑 connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/964402184
✅ connectors/destination-bigquery-denormalized https://github.com/airbytehq/airbyte/actions/runs/964402184

Development

Successfully merging this pull request may close these issues.

Load JSON data into BigQuery as structured data (records with repeated and nested fields, using STRUCT types)
6 participants