-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Destination bigquery: airbyte_meta/sync_id/generation_id #38359
Conversation
edgao
commented
May 20, 2024
•
edited
Loading
edited
- create new raw tables with meta+gen ID (also partition on gen ID - this seems like a good idea? but could be convinced to skip it for now)
- write meta+gen ID to raw tables in direct upload mode
- gcs upload mode is handled by instantiating StagingStreamOperations using V2_WITH_GENERATION (see BigqueryDestination)
- pass generation ID through from raw to final table
- concat airbyte_meta.changes from raw to final table
- giant pile of test fixture updates, including a new test case to exercise the migration directly
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
This stack of pull requests is managed by Graphite. Learn more about stacking. |
1290443
to
7b49601
Compare
0a4365a
to
86a47e9
Compare
7b49601
to
0651902
Compare
86a47e9
to
d414927
Compare
0651902
to
8e7b892
Compare
39e3b36
to
d16d022
Compare
8e7b892
to
5593756
Compare
d16d022
to
7da596a
Compare
5593756
to
fded86d
Compare
7da596a
to
81cdb92
Compare
fded86d
to
7e6d5c5
Compare
81cdb92
to
5201ed5
Compare
7e6d5c5
to
cd70353
Compare
5201ed5
to
370e430
Compare
cd70353
to
d473021
Compare
370e430
to
bf2f023
Compare
d473021
to
d07690d
Compare
bf2f023
to
44c385c
Compare
d07690d
to
289318e
Compare
44c385c
to
0a49e69
Compare
289318e
to
e1c03fa
Compare
0a49e69
to
b18faab
Compare
6daa73d
to
f797a95
Compare
396d1e7
to
4d006f2
Compare
f797a95
to
942ef3a
Compare
4d006f2
to
2cbcbe6
Compare
@@ -178,26 +178,34 @@ public static Table createTable(final BigQuery bigquery, final String datasetNam | |||
*/ | |||
public static void createPartitionedTableIfNotExists(final BigQuery bigquery, final TableId tableId, final Schema schema) { | |||
try { | |||
final var chunkingColumn = JavaBaseConstants.COLUMN_NAME_AB_EXTRACTED_AT; | |||
final TimePartitioning partitioning = TimePartitioning.newBuilder(TimePartitioning.Type.DAY) | |||
.setField(chunkingColumn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will there be any side effect in performance by losing the time partition on ab_extracted_at
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thinking loud Should it be reverse as-in clustering by genId and partitioning by extracted_at
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could be convinced to leave this unchanged :P
if we ever want to do the delete ... where generation_id < ?
thing, then partitioning on gen ID is how we support that cheaply
and my theory is that partitioning on extracted_at is redundant, since bigquery can just optimize within each partition anyway? but I'm not super familiar with this
...ain/java/io/airbyte/integrations/destination/bigquery/formatter/BigQueryRecordFormatter.java
Show resolved
Hide resolved
.build() | ||
newRawTable.update() | ||
|
||
if (state.isFinalTablePresent) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting, so this is the first time we are doing a final table alter I believe right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
afaik yeah. definitely a little weird. There's probably technically edge cases that it doesn't handle (e.g. if the raw table alter succeeds, then the final table alter fails - should we detect that in the next run?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in redshift, we let it happen using softReset iirc. we are avoiding that here ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did we need to change the final table at all in redshift? I thought that was purely a raw table change to add airbyte_meta, since the final table already had airbyte_meta?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, we're explicitly setting softReset=false here
Lines 74 to 77 in d31f0dd
return Migration.MigrationResult( | |
state.destinationState.copy(needsSoftReset = false, isAirbyteMetaPresentInRaw = true), | |
false | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh you are right, final table had meta as part of v1v2 migration doing soft reset.
942ef3a
to
e8fea2e
Compare
2cbcbe6
to
b79a502
Compare
e8fea2e
to
33d22aa
Compare
b79a502
to
1b64d77
Compare
33d22aa
to
8d2d17b
Compare
1b64d77
to
4590d29
Compare
8d2d17b
to
2936b14
Compare
4590d29
to
ae33cea
Compare