Destination bigquery: airbyte_meta/sync_id/generation_id #38359

Merged
merged 1 commit into from
May 29, 2024

Conversation

Contributor

@edgao edgao commented May 20, 2024

  • create new raw tables with meta+gen ID (also partition on gen ID - this seems like a good idea? but could be convinced to skip it for now)
  • write meta+gen ID to raw tables in direct upload mode
    • gcs upload mode is handled by instantiating StagingStreamOperations using V2_WITH_GENERATION (see BigqueryDestination)
  • pass generation ID through from raw to final table
  • concat airbyte_meta.changes from raw to final table
  • giant pile of test fixture updates, including a new test case to exercise the migration directly
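The raw-to-final pass-through described in the list above can be sketched roughly as follows. This is a minimal illustration, not the connector's code (which does this work in SQL): the `Change`, `Meta`, `RawRecord`, and `FinalRecord` types and the `toFinal` helper are hypothetical names invented here. It only shows the intended behavior: the generation ID and sync ID are copied unchanged, and changes found while typing the record are appended to the changes already recorded in the raw row.

```java
import java.util.ArrayList;
import java.util.List;

public class MetaPassThrough {
    // Hypothetical shapes for illustration only; not the connector's actual classes.
    record Change(String field, String change, String reason) {}
    record Meta(long syncId, List<Change> changes) {}
    record RawRecord(String data, Meta meta, long generationId) {}
    record FinalRecord(String data, Meta meta, long generationId) {}

    /** Builds the final-table row: raw changes first, then changes found during typing. */
    static FinalRecord toFinal(RawRecord raw, List<Change> typingChanges) {
        List<Change> merged = new ArrayList<>(raw.meta().changes());
        merged.addAll(typingChanges);
        Meta meta = new Meta(raw.meta().syncId(), List.copyOf(merged));
        // The generation ID is passed through untouched.
        return new FinalRecord(raw.data(), meta, raw.generationId());
    }

    public static void main(String[] args) {
        RawRecord raw = new RawRecord(
            "{\"id\": 1}",
            new Meta(42L, List.of(new Change("name", "TRUNCATED", "source size limit"))),
            7L);
        FinalRecord out = toFinal(raw, List.of(new Change("age", "NULLED", "typecast error")));
        System.out.println(out.meta().changes().size()); // 2
        System.out.println(out.generationId());          // 7
    }
}
```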

@edgao edgao force-pushed the edgao/generation_id_in_cdk branch from 1290443 to 7b49601 Compare May 20, 2024 20:25
@edgao edgao force-pushed the edgao/bigquery_new_columns branch from 0a4365a to 86a47e9 Compare May 20, 2024 20:25
@edgao edgao force-pushed the edgao/generation_id_in_cdk branch from 7b49601 to 0651902 Compare May 20, 2024 22:26
@edgao edgao force-pushed the edgao/bigquery_new_columns branch from 86a47e9 to d414927 Compare May 20, 2024 22:26
@edgao edgao force-pushed the edgao/generation_id_in_cdk branch from 0651902 to 8e7b892 Compare May 21, 2024 16:56
@edgao edgao force-pushed the edgao/bigquery_new_columns branch 3 times, most recently from 39e3b36 to d16d022 Compare May 21, 2024 17:02
@edgao edgao force-pushed the edgao/generation_id_in_cdk branch from 8e7b892 to 5593756 Compare May 21, 2024 21:56
@edgao edgao force-pushed the edgao/bigquery_new_columns branch from d16d022 to 7da596a Compare May 21, 2024 21:56
@edgao edgao force-pushed the edgao/generation_id_in_cdk branch from 5593756 to fded86d Compare May 21, 2024 22:02
@edgao edgao force-pushed the edgao/bigquery_new_columns branch from 7da596a to 81cdb92 Compare May 21, 2024 22:02
@edgao edgao force-pushed the edgao/generation_id_in_cdk branch from fded86d to 7e6d5c5 Compare May 21, 2024 22:14
@edgao edgao force-pushed the edgao/bigquery_new_columns branch from 81cdb92 to 5201ed5 Compare May 21, 2024 22:14
@edgao edgao force-pushed the edgao/generation_id_in_cdk branch from 7e6d5c5 to cd70353 Compare May 21, 2024 22:14
@edgao edgao force-pushed the edgao/bigquery_new_columns branch from 5201ed5 to 370e430 Compare May 21, 2024 22:14
@edgao edgao mentioned this pull request May 21, 2024
2 tasks
@edgao edgao force-pushed the edgao/generation_id_in_cdk branch from cd70353 to d473021 Compare May 21, 2024 22:20
@edgao edgao force-pushed the edgao/bigquery_new_columns branch from 370e430 to bf2f023 Compare May 21, 2024 22:20
@edgao edgao force-pushed the edgao/generation_id_in_cdk branch from d473021 to d07690d Compare May 22, 2024 15:40
@edgao edgao force-pushed the edgao/bigquery_new_columns branch from bf2f023 to 44c385c Compare May 22, 2024 15:40
@edgao edgao force-pushed the edgao/generation_id_in_cdk branch from d07690d to 289318e Compare May 22, 2024 15:47
@edgao edgao force-pushed the edgao/bigquery_new_columns branch from 44c385c to 0a49e69 Compare May 22, 2024 15:47
@edgao edgao force-pushed the edgao/generation_id_in_cdk branch from 289318e to e1c03fa Compare May 22, 2024 15:54
@edgao edgao force-pushed the edgao/bigquery_new_columns branch from 0a49e69 to b18faab Compare May 22, 2024 15:54
@edgao edgao force-pushed the edgao/generation_id_in_cdk branch from 6daa73d to f797a95 Compare May 28, 2024 18:43
@edgao edgao force-pushed the edgao/bigquery_new_columns branch 2 times, most recently from 396d1e7 to 4d006f2 Compare May 28, 2024 19:57
@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label May 28, 2024
@edgao edgao force-pushed the edgao/generation_id_in_cdk branch from f797a95 to 942ef3a Compare May 28, 2024 21:18
@edgao edgao force-pushed the edgao/bigquery_new_columns branch from 4d006f2 to 2cbcbe6 Compare May 28, 2024 21:18
@@ -178,26 +178,34 @@ public static Table createTable(final BigQuery bigquery, final String datasetNam
   */
  public static void createPartitionedTableIfNotExists(final BigQuery bigquery, final TableId tableId, final Schema schema) {
    try {
      final var chunkingColumn = JavaBaseConstants.COLUMN_NAME_AB_EXTRACTED_AT;
      final TimePartitioning partitioning = TimePartitioning.newBuilder(TimePartitioning.Type.DAY)
          .setField(chunkingColumn)

Contributor

Will losing the time partition on ab_extracted_at have any performance side effects?

Contributor

Thinking out loud: should it be the reverse, i.e. clustering by genId and partitioning by extracted_at?

Contributor Author

I could be convinced to leave this unchanged :P

if we ever want to do the delete ... where generation_id < ? thing, then partitioning on gen ID is how we support that cheaply

and my theory is that partitioning on extracted_at is redundant, since bigquery can just optimize within each partition anyway? but I'm not super familiar with this
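To make the trade-off concrete, here is a rough sketch of what partitioning on generation ID with clustering on extraction time could look like in BigQuery SQL, together with the `delete ... where generation_id < ?` statement it is meant to make cheap. The column names, the abbreviated column list, and the bucket bounds are assumptions for illustration, not necessarily what this PR creates. The statements are built as plain strings so the example runs without the BigQuery client.

```java
public class GenerationPartitioning {
    // Assumed column names for illustration; the connector reads them from JavaBaseConstants.
    static final String GEN_ID = "_airbyte_generation_id";
    static final String EXTRACTED_AT = "_airbyte_extracted_at";

    /** DDL for a raw table range-partitioned by generation ID and clustered by extraction time. */
    static String createRawTableSql(String tableId, long maxGenerationId, long interval) {
        // The column list is abbreviated; a real raw table has more columns.
        return """
            CREATE TABLE IF NOT EXISTS `%s` (
              _airbyte_data STRING,
              %s TIMESTAMP,
              %s INT64
            )
            PARTITION BY RANGE_BUCKET(%s, GENERATE_ARRAY(0, %d, %d))
            CLUSTER BY %s
            """.formatted(tableId, EXTRACTED_AT, GEN_ID, GEN_ID, maxGenerationId, interval, EXTRACTED_AT);
    }

    /** The cleanup statement: the filter is on the partitioning column, so partitions can be pruned. */
    static String deleteOldGenerationsSql(String tableId) {
        return "DELETE FROM `%s` WHERE %s < @min_generation_id".formatted(tableId, GEN_ID);
    }

    public static void main(String[] args) {
        // Hypothetical table name and bucket bounds.
        System.out.println(createRawTableSql("my_dataset.my_stream_raw", 10_000, 5));
        System.out.println(deleteOldGenerationsSql("my_dataset.my_stream_raw"));
    }
}
```

With this layout, a filter on the generation ID lets BigQuery skip partitions outside the requested range, while clustering on the extraction timestamp still helps queries that filter by time within a partition.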

.build()
newRawTable.update()

if (state.isFinalTablePresent) {

Contributor

Interesting, so I believe this is the first time we are altering a final table, right?

Contributor Author

afaik yeah. definitely a little weird. There are probably edge cases that it doesn't handle (e.g. if the raw table alter succeeds but the final table alter then fails, should we detect that in the next run?)
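One way to handle that edge case, sketched under assumptions, is to have each run check every table's columns independently and alter only the tables that still lack the new column, so a partially applied migration is picked up on the next run. The `tablesMissingColumn` helper, the in-memory schema map, and the column names are hypothetical; a real implementation would read the columns from BigQuery table metadata.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ResumableMigration {
    // Assumed column name for illustration.
    static final String GEN_ID = "_airbyte_generation_id";

    /** Returns the names of the tables that do not yet have the given column. */
    static List<String> tablesMissingColumn(Map<String, Set<String>> columnsByTable, String column) {
        List<String> missing = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : columnsByTable.entrySet()) {
            if (!entry.getValue().contains(column)) {
                missing.add(entry.getKey());
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Simulated state after a run where the raw alter succeeded and the final alter failed.
        Map<String, Set<String>> schema = new LinkedHashMap<>();
        schema.put("raw", Set.of("_airbyte_raw_id", "_airbyte_data", GEN_ID));
        schema.put("final", Set.of("_airbyte_raw_id", "id", "name"));
        // Only the final table still needs the ALTER on the next run.
        System.out.println(tablesMissingColumn(schema, GEN_ID)); // [final]
    }
}
```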

Contributor

In redshift, we let it happen using softReset iirc. Are we avoiding that here?

Contributor Author

did we need to change the final table at all in redshift? I thought that was purely a raw table change to add airbyte_meta, since the final table already had airbyte_meta?

Contributor Author

yeah, we're explicitly setting softReset=false here

return Migration.MigrationResult(
    state.destinationState.copy(needsSoftReset = false, isAirbyteMetaPresentInRaw = true),
    false
)

Contributor

Oh, you are right: the final table already had meta, since the v1v2 migration did a soft reset.

@edgao edgao force-pushed the edgao/generation_id_in_cdk branch from 8d2d17b to 2936b14 Compare May 29, 2024 18:03
Base automatically changed from edgao/generation_id_in_cdk to master May 29, 2024 18:26
@edgao edgao force-pushed the edgao/bigquery_new_columns branch from 4590d29 to ae33cea Compare May 29, 2024 18:28
@edgao edgao enabled auto-merge (squash) May 29, 2024 18:29
@edgao edgao merged commit fc205e4 into master May 29, 2024
30 checks passed
@edgao edgao deleted the edgao/bigquery_new_columns branch May 29, 2024 18:43
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/destination/bigquery