Destination bigquery: airbyte_meta/sync_id/generation_id #38359

edgao · 2024-05-20T18:47:19Z

- Create new raw tables with the airbyte_meta and generation ID columns. The raw table is also partitioned on generation ID; this seems like a good idea, but I could be convinced to skip it for now.
- Write airbyte_meta and generation ID to raw tables in direct upload mode.
  - GCS upload mode is handled by instantiating StagingStreamOperations using V2_WITH_GENERATION (see BigqueryDestination).
- Pass generation ID through from the raw table to the final table.
- Concatenate airbyte_meta.changes from the raw table to the final table.
- A giant pile of test fixture updates, including a new test case to exercise the migration directly.
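
To make the "concat airbyte_meta.changes" item concrete, here is a minimal in-memory sketch of the idea: changes already recorded on the raw record are kept, and any changes produced while typing the record into the final table are appended after them. This is not the connector's code (the real work happens in generated SQL). The class names `MetaChange` and `MetaChangesConcat` are hypothetical, and the field names and sample values are assumptions used only for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical model of one entry in airbyte_meta.changes (field names assumed). */
final class MetaChange {
  final String field;
  final String change;
  final String reason;

  MetaChange(String field, String change, String reason) {
    this.field = field;
    this.change = change;
    this.reason = reason;
  }

  @Override
  public String toString() {
    return field + ":" + change + ":" + reason;
  }
}

final class MetaChangesConcat {
  /**
   * Returns the raw record's changes followed by the changes produced while
   * typing the record into the final table. Inputs are not modified; a null
   * input is treated as an empty list.
   */
  static List<MetaChange> concat(List<MetaChange> rawChanges, List<MetaChange> typingChanges) {
    List<MetaChange> merged = new ArrayList<>();
    if (rawChanges != null) {
      merged.addAll(rawChanges);
    }
    if (typingChanges != null) {
      merged.addAll(typingChanges);
    }
    return Collections.unmodifiableList(merged);
  }

  public static void main(String[] args) {
    // Sample values are illustrative only.
    List<MetaChange> raw = List.of(new MetaChange("name", "TRUNCATED", "SOURCE_FIELD_SIZE_LIMITATION"));
    List<MetaChange> typing = List.of(new MetaChange("age", "NULLED", "DESTINATION_TYPECAST_ERROR"));
    System.out.println(concat(raw, typing));
  }
}
```

The point of appending rather than overwriting is that a change recorded upstream (for example by the source) stays visible in the final table alongside anything the destination adds.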

edgao · 2024-05-20T18:47:32Z

Destination bigquery: airbyte_meta/sync_id/generation_id #38359 👈
Destinations CDK: generation_id/sync_id plumbing #38358
Destinations cdk: ThreadCreationInfo cast as nullable #38738 : 1 other dependent PR (#38658 )
master

gisripa · 2024-05-28T21:23:21Z

...ation-bigquery/src/main/java/io/airbyte/integrations/destination/bigquery/BigQueryUtils.java

@@ -178,26 +178,34 @@ public static Table createTable(final BigQuery bigquery, final String datasetNam
   */
  public static void createPartitionedTableIfNotExists(final BigQuery bigquery, final TableId tableId, final Schema schema) {
    try {
-      final var chunkingColumn = JavaBaseConstants.COLUMN_NAME_AB_EXTRACTED_AT;
-      final TimePartitioning partitioning = TimePartitioning.newBuilder(TimePartitioning.Type.DAY)
-          .setField(chunkingColumn)


Will there be any performance side effect from losing the time partition on ab_extracted_at?

Thinking out loud: should it be the reverse, i.e. clustering by generation ID and partitioning by extracted_at?

I could be convinced to leave this unchanged :P

If we ever want to do the delete ... where generation_id < ? thing, then partitioning on generation ID is how we support that cheaply.

And my theory is that partitioning on extracted_at is redundant, since BigQuery can just optimize within each partition anyway? But I'm not super familiar with this.
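
As a rough illustration of the "delete ... where generation_id < ?" idea above, the sketch below builds such a statement as a string. It is not taken from this PR: the class `GenerationCleanup`, the column name `_airbyte_generation_id`, the parameter name `@min_generation_id`, and the dataset/table names are all assumptions. My understanding is that BigQuery can handle a DELETE that covers whole partitions without scanning their rows, which is why partitioning on generation ID would make this cheap, but that is worth checking against the BigQuery documentation.

```java
/** Hypothetical helper; nothing here is the connector's actual API. */
final class GenerationCleanup {
  /** Assumed column name for the generation ID. */
  static final String GENERATION_COLUMN = "_airbyte_generation_id";

  /**
   * Builds a DELETE that removes every row from generations older than the
   * value bound to the named query parameter @min_generation_id.
   */
  static String deleteOlderGenerationsSql(String dataset, String table) {
    return String.format(
        "DELETE FROM `%s`.`%s` WHERE %s < @min_generation_id",
        dataset, table, GENERATION_COLUMN);
  }

  public static void main(String[] args) {
    // Dataset and table names are placeholders.
    System.out.println(deleteOlderGenerationsSql("airbyte_internal", "raw_users"));
  }
}
```

Binding the threshold as a query parameter rather than concatenating it keeps the statement text stable across syncs.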

...ain/java/io/airbyte/integrations/destination/bigquery/formatter/BigQueryRecordFormatter.java

gisripa · 2024-05-28T21:31:01Z

...e/integrations/destination/bigquery/migrators/BigqueryAirbyteMetaAndGenerationIdMigration.kt

+                .build()
+        newRawTable.update()
+
+        if (state.isFinalTablePresent) {


Interesting, so this is the first time we are altering a final table, I believe, right?

AFAIK yeah. Definitely a little weird. There are probably edge cases that it technically doesn't handle (e.g. if the raw table alter succeeds and then the final table alter fails, should we detect that in the next run?)

In Redshift, we let it happen using softReset, IIRC. Are we avoiding that here?

Did we need to change the final table at all in Redshift? I thought that was purely a raw table change to add airbyte_meta, since the final table already had airbyte_meta?

Yeah, we're explicitly setting needsSoftReset = false here:

airbyte/airbyte-integrations/connectors/destination-redshift/src/main/java/io/airbyte/integrations/destination/redshift/typing_deduping/RedshiftRawTableAirbyteMetaMigration.kt

Lines 74 to 77 in d31f0dd

    return Migration.MigrationResult(
        state.destinationState.copy(needsSoftReset = false, isAirbyteMetaPresentInRaw = true),
        false
    )

Oh, you are right: the final table already got airbyte_meta as part of the v1-to-v2 migration, which did a soft reset.
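
On the partial-failure edge case raised earlier in this thread (raw table alter succeeds, final table alter fails), one possible approach is to decide per table whether the column is still missing, so that a rerun only touches what the previous run did not finish. The sketch below is a hypothetical illustration, not what this PR implements: `MigrationPlanner`, the `"raw"`/`"final"` labels, and the column name `_airbyte_generation_id` are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical planner; not part of the connector. */
final class MigrationPlanner {
  /** Assumed column name for the generation ID. */
  static final String GENERATION_COLUMN = "_airbyte_generation_id";

  /**
   * Returns which tables still need the generation ID column added.
   *
   * @param rawColumns column names currently on the raw table
   * @param finalColumns column names currently on the final table, or null if
   *     the final table does not exist yet
   */
  static List<String> tablesNeedingAlter(Set<String> rawColumns, Set<String> finalColumns) {
    List<String> pending = new ArrayList<>();
    if (!rawColumns.contains(GENERATION_COLUMN)) {
      pending.add("raw");
    }
    if (finalColumns != null && !finalColumns.contains(GENERATION_COLUMN)) {
      pending.add("final");
    }
    return pending;
  }

  public static void main(String[] args) {
    // Simulates the run after a partial failure: raw was altered, final was not.
    Set<String> raw = Set.of("_airbyte_raw_id", GENERATION_COLUMN);
    Set<String> fin = Set.of("_airbyte_raw_id");
    System.out.println(tablesNeedingAlter(raw, fin));
  }
}
```

Checking each table independently, instead of relying on a single "migration done" flag, is what would let the next run notice and repair the half-applied state.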

edgao mentioned this pull request May 20, 2024

Destinations CDK: generation_id/sync_id plumbing #38358

Merged

octavia-squidington-iii added area/connectors Connector related issues connectors/destination/bigquery labels May 20, 2024

edgao force-pushed the edgao/generation_id_in_cdk branch from 1290443 to 7b49601 Compare May 20, 2024 20:25

edgao force-pushed the edgao/bigquery_new_columns branch from 0a4365a to 86a47e9 Compare May 20, 2024 20:25

edgao force-pushed the edgao/generation_id_in_cdk branch from 7b49601 to 0651902 Compare May 20, 2024 22:26

edgao force-pushed the edgao/bigquery_new_columns branch from 86a47e9 to d414927 Compare May 20, 2024 22:26

edgao force-pushed the edgao/generation_id_in_cdk branch from 0651902 to 8e7b892 Compare May 21, 2024 16:56

edgao force-pushed the edgao/bigquery_new_columns branch 3 times, most recently from 39e3b36 to d16d022 Compare May 21, 2024 17:02

edgao force-pushed the edgao/generation_id_in_cdk branch from 8e7b892 to 5593756 Compare May 21, 2024 21:56

edgao force-pushed the edgao/bigquery_new_columns branch from d16d022 to 7da596a Compare May 21, 2024 21:56

edgao force-pushed the edgao/generation_id_in_cdk branch from 5593756 to fded86d Compare May 21, 2024 22:02

edgao force-pushed the edgao/bigquery_new_columns branch from 7da596a to 81cdb92 Compare May 21, 2024 22:02

edgao force-pushed the edgao/generation_id_in_cdk branch from fded86d to 7e6d5c5 Compare May 21, 2024 22:14

edgao force-pushed the edgao/bigquery_new_columns branch from 81cdb92 to 5201ed5 Compare May 21, 2024 22:14

edgao force-pushed the edgao/generation_id_in_cdk branch from 7e6d5c5 to cd70353 Compare May 21, 2024 22:14

edgao force-pushed the edgao/bigquery_new_columns branch from 5201ed5 to 370e430 Compare May 21, 2024 22:14

edgao mentioned this pull request May 21, 2024

Destination bigquery: Bump cdk again #38331

Merged

2 tasks

edgao force-pushed the edgao/generation_id_in_cdk branch from cd70353 to d473021 Compare May 21, 2024 22:20

edgao force-pushed the edgao/bigquery_new_columns branch from 370e430 to bf2f023 Compare May 21, 2024 22:20

edgao force-pushed the edgao/generation_id_in_cdk branch from d473021 to d07690d Compare May 22, 2024 15:40

edgao force-pushed the edgao/bigquery_new_columns branch from bf2f023 to 44c385c Compare May 22, 2024 15:40

edgao force-pushed the edgao/generation_id_in_cdk branch from d07690d to 289318e Compare May 22, 2024 15:47

edgao force-pushed the edgao/bigquery_new_columns branch from 44c385c to 0a49e69 Compare May 22, 2024 15:47

edgao force-pushed the edgao/generation_id_in_cdk branch from 289318e to e1c03fa Compare May 22, 2024 15:54

edgao force-pushed the edgao/bigquery_new_columns branch from 0a49e69 to b18faab Compare May 22, 2024 15:54

edgao force-pushed the edgao/generation_id_in_cdk branch from 6daa73d to f797a95 Compare May 28, 2024 18:43

edgao force-pushed the edgao/bigquery_new_columns branch 2 times, most recently from 396d1e7 to 4d006f2 Compare May 28, 2024 19:57

octavia-squidington-iii added the area/documentation Improvements or additions to documentation label May 28, 2024

edgao force-pushed the edgao/generation_id_in_cdk branch from f797a95 to 942ef3a Compare May 28, 2024 21:18

edgao force-pushed the edgao/bigquery_new_columns branch from 4d006f2 to 2cbcbe6 Compare May 28, 2024 21:18

gisripa reviewed May 28, 2024

...ain/java/io/airbyte/integrations/destination/bigquery/formatter/BigQueryRecordFormatter.java (resolved)

gisripa reviewed May 28, 2024

edgao force-pushed the edgao/generation_id_in_cdk branch from 942ef3a to e8fea2e Compare May 28, 2024 22:31

edgao force-pushed the edgao/bigquery_new_columns branch from 2cbcbe6 to b79a502 Compare May 28, 2024 22:31

edgao force-pushed the edgao/generation_id_in_cdk branch from e8fea2e to 33d22aa Compare May 29, 2024 15:12

edgao force-pushed the edgao/bigquery_new_columns branch from b79a502 to 1b64d77 Compare May 29, 2024 15:12

edgao force-pushed the edgao/generation_id_in_cdk branch from 33d22aa to 8d2d17b Compare May 29, 2024 15:18

edgao force-pushed the edgao/bigquery_new_columns branch from 1b64d77 to 4590d29 Compare May 29, 2024 15:18

gisripa approved these changes May 29, 2024

edgao force-pushed the edgao/generation_id_in_cdk branch from 8d2d17b to 2936b14 Compare May 29, 2024 18:03

Base automatically changed from edgao/generation_id_in_cdk to master May 29, 2024 18:26

bigquery handles airbyte_meta/sync_id/generation_id

ae33cea

edgao force-pushed the edgao/bigquery_new_columns branch from 4590d29 to ae33cea Compare May 29, 2024 18:28

edgao enabled auto-merge (squash) May 29, 2024 18:29

edgao merged commit fc205e4 into master May 29, 2024
30 checks passed

edgao deleted the edgao/bigquery_new_columns branch May 29, 2024 18:43

xiaohansong pushed a commit that referenced this pull request May 29, 2024

Destination bigquery: airbyte_meta/sync_id/generation_id (#38359)

7a55b9c
