-
Notifications
You must be signed in to change notification settings - Fork 3k
BigQuery: Eliminate redundant table load by using ETag for conflict detection #14940
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BigQuery: Eliminate redundant table load by using ETag for conflict detection #14940
Conversation
… calls Cache the Table object loaded in doRefresh() for reuse in updateTable(), eliminating a redundant tables.get call per commit. Concurrent modification detection is preserved via ETag based optimistic locking in tables.patch.
| try { | ||
| metadataLocation = | ||
| loadMetadataLocationOrThrow(client.load(tableReference).getExternalCatalogTableOptions()); | ||
| Table table = client.load(tableReference); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this local variable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need this local variable?
Thank you for your review Manu. I used the local variable for readability, but happy to inline if you think it's a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that it's unnecessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
| ExternalCatalogTableOptions options = table.getExternalCatalogTableOptions(); | ||
| addConnectionIfProvided(table, metadata.properties()); | ||
|
|
||
| // If `metadataLocationFromMetastore` is different from metadata location of base, it means |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this check removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this check removed?
Thank you for your review Manu.
This check becomes redundant with caching.
Before:
- doRefresh() loads table -> metadata location = "v1"
- Someone else commits -> metadata location = "v2"
- updateTable() loads table again -> sees "v2"
- Check catches: "v1" != "v2" -> fail
With caching:
- doRefresh() loads table -> metadata location = "v1", cached
- Someone else commits -> metadata location = "v2"
- updateTable() uses cached table -> still sees "v1"
- Check passes: "v1" == "v1" (compares against itself)
- tables.patch fails with HTTP 412 (ETag mismatch) -> Iceberg retries
The ETag check in tables.patch catches the same conflict, so this check no longer adds value.
|
|
||
| @Test | ||
| public void failWhenMetadataLocationDiff() throws Exception { | ||
| public void failWhenConcurrentModificationDetected() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you verify table is only loaded once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the review Manu. Sorry about that, I have added verification to confirm table is loaded only once in this commit.
Verify table is loaded only once in test
|
Hello @talatuyarer, @rambleraptor, could I please request you to review this PR when you have some time? |
rambleraptor
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concerns are mostly around making sure that concurrent changes are respected. I agree we can use the ETag for this purpose, so this sounds good to me! Thanks for writing this
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont think we should couple table refresh with table update. Maybe i'm missing nuisance of this implementation. @talatuyarer wdyt?
| Table table = this.refreshedTable; | ||
| if (table == null) { | ||
| LOG.warn("Table not set from doRefresh() for {}, loading from BigQuery", tableName()); | ||
| table = client.load(tableReference); | ||
| } | ||
|
|
||
| this.refreshedTable = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this feels like an anti-pattern to me, and should not live in updateTable. i think we should separate concerns for table refresh and update.
Looking at JdbcTableOperations, updateTable should just be an atomic operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would agree with @kevinjqliu's comment here. I think it's safe to assume that the member reference is not null (or we can add a check, but I don't see the scenario where we would reload). We also don't need the local variable in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed, removed the defensive reload and replaced with Preconditions.checkState(). Let me know if you'd like to skip the check entirely.
danielcweeks
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments, but I think we either want to just track the metastoreTable explicitly or just use the ETag which appears to be equivalent in the current usage.
| private final TableReference tableReference; | ||
|
|
||
| /** Table loaded in doRefresh() for reuse in updateTable() to avoid redundant API call. */ | ||
| private volatile Table refreshedTable; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the full table here? It looks like what we're doing is replacing the location check with an ETag check, which is fine, but then we just need the ETag, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, if we want to preserve the table, I think we should change the name to metastoreTable since it technically just refers to the metastore's representation of the table and even though the method is called refresh, it's also used just for the initial load.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, renamed to metastoreTable.
We need the full Table in my understanding because updateTable() calls getExternalCatalogTableOptions(), addConnectionIfProvided(), and passes it to client.update().
| try { | ||
| metadataLocation = | ||
| loadMetadataLocationOrThrow(client.load(tableReference).getExternalCatalogTableOptions()); | ||
| Table table = client.load(tableReference); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree that it's unnecessary.
| String oldMetadataLocation, String newMetadataLocation, TableMetadata metadata) { | ||
| Table table = client.load(tableReference); | ||
| private void updateTable(String newMetadataLocation, TableMetadata metadata) { | ||
| Table table = this.refreshedTable; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we only use the this. member reference on assignment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
- Rename refreshedTable to metastoreTable - Remove unnecessary local variables in doRefresh() and updateTable() - Replace defensive reload with Preconditions.checkState() - Use this. prefix only on assignment
danielcweeks
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @joyhaldar !
The current commit path loads the BigQuery table twice:
ETagfor the update callThis change stores the table from the refresh step and reuses it during commit, eliminating the redundant load. Concurrent modification detection remains intact via ETag-based optimistic locking in the BigQuery API.
BigQuery API calls per commit:
doRefresh→ loads tabledoRefresh→ loads tableupdateTable→ loads table againThis improves commit latency and reduces tables.get quota consumption.
Changes:
metastoreTablefor reuse during commitPreconditions.checkState()to ensure table is loaded before commitETagcheck