Contributor

@joyhaldar joyhaldar commented Dec 28, 2025

The current commit path loads the BigQuery table twice:

  1. During table refresh to get metadata location
  2. During commit to get ETag for the update call

This change stores the table from the refresh step and reuses it during commit, eliminating the redundant load. Concurrent modification detection remains intact via ETag-based optimistic locking in the BigQuery API.

BigQuery API calls per commit:

| Before | After |
| --- | --- |
| doRefresh → loads table | doRefresh → loads table |
| updateTable → loads table again | reuses table from refresh |

This improves commit latency and reduces tables.get quota consumption.

Changes:

  • Store table loaded during refresh as metastoreTable for reuse during commit
  • Use Preconditions.checkState() to ensure table is loaded before commit
  • Remove metadata location comparison which is redundant with ETag check
  • Update test to verify ETag-based conflict detection

… calls

Cache the Table object loaded in doRefresh() for reuse in updateTable(),
eliminating a redundant tables.get call per commit. Concurrent modification
detection is preserved via ETag-based optimistic locking in tables.patch.
@github-actions github-actions bot added the GCP label Dec 28, 2025
try {
metadataLocation =
loadMetadataLocationOrThrow(client.load(tableReference).getExternalCatalogTableOptions());
Table table = client.load(tableReference);
Member

why do we need this local variable?

Contributor Author

@joyhaldar joyhaldar Dec 31, 2025

> why do we need this local variable?

Thank you for your review Manu. I used the local variable for readability, but happy to inline if you think it's a good idea.

Contributor

Agree that it's unnecessary.

Contributor Author

Removed.

ExternalCatalogTableOptions options = table.getExternalCatalogTableOptions();
addConnectionIfProvided(table, metadata.properties());

// If `metadataLocationFromMetastore` is different from metadata location of base, it means
Member

why is this check removed?

Contributor Author

> why is this check removed?

Thank you for your review Manu.

This check becomes redundant with caching.

Before:

  1. doRefresh() loads table -> metadata location = "v1"
  2. Someone else commits -> metadata location = "v2"
  3. updateTable() loads table again -> sees "v2"
  4. Check catches: "v1" != "v2" -> fail

With caching:

  1. doRefresh() loads table -> metadata location = "v1", cached
  2. Someone else commits -> metadata location = "v2"
  3. updateTable() uses cached table -> still sees "v1"
  4. Check passes: "v1" == "v1" (compares against itself)
  5. tables.patch fails with HTTP 412 (ETag mismatch) -> Iceberg retries

The ETag check in tables.patch catches the same conflict, so this check no longer adds value.
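
The "with caching" sequence above can be replayed against a small in-memory model. This is only a sketch of ETag-based optimistic locking in general: `InMemoryMetastore`, `TableSnapshot`, and `PreconditionFailedException` are hypothetical names made up for the illustration, and the exception merely stands in for the HTTP 412 response mentioned in step 5.

```java
/** Stands in for an HTTP 412 (precondition failed) response. */
final class PreconditionFailedException extends RuntimeException {
  PreconditionFailedException(String message) {
    super(message);
  }
}

/** Immutable snapshot of the table as returned by a load. */
final class TableSnapshot {
  final String metadataLocation;
  final String etag;

  TableSnapshot(String metadataLocation, String etag) {
    this.metadataLocation = metadataLocation;
    this.etag = etag;
  }
}

/** Hypothetical in-memory metastore whose updates are guarded by an ETag. */
final class InMemoryMetastore {
  private String metadataLocation = "v1";
  private long version = 1;

  synchronized TableSnapshot load() {
    return new TableSnapshot(metadataLocation, "etag-" + version);
  }

  /** Applies the update only if expectedEtag still matches the current ETag. */
  synchronized void patch(String expectedEtag, String newMetadataLocation) {
    String current = "etag-" + version;
    if (!current.equals(expectedEtag)) {
      throw new PreconditionFailedException(
          "ETag mismatch: expected " + expectedEtag + ", current " + current);
    }
    metadataLocation = newMetadataLocation;
    version++; // every successful write changes the ETag
  }
}

final class EtagConflictDemo {
  public static void main(String[] args) {
    InMemoryMetastore store = new InMemoryMetastore();

    // Step 1: refresh loads the table and caches it (location "v1").
    TableSnapshot cached = store.load();

    // Step 2: another writer commits "v2" using its own fresh ETag.
    store.patch(store.load().etag, "v2");

    // Steps 3-4: the cached snapshot still says "v1", so comparing its
    // location with the base location cannot detect the conflict.

    // Step 5: the stale ETag is rejected by the conditional update.
    try {
      store.patch(cached.etag, "v3");
      System.out.println("unexpected: commit succeeded");
    } catch (PreconditionFailedException e) {
      System.out.println("commit rejected: " + e.getMessage());
    }
  }
}
```

In this model the second `patch` is rejected because the ETag captured at refresh time no longer matches, which is the same conflict the removed location comparison used to catch.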


@Test
public void failWhenMetadataLocationDiff() throws Exception {
public void failWhenConcurrentModificationDetected() throws Exception {
Member

do you verify table is only loaded once?

Contributor Author

Thank you for the review Manu. Sorry about that, I have added verification to confirm table is loaded only once in this commit.

Verify table is loaded only once in test
@joyhaldar
Contributor Author

Hello @talatuyarer, @rambleraptor, could I please request you to review this PR when you have some time?

Contributor

@rambleraptor rambleraptor left a comment

My concerns are mostly around making sure that concurrent changes are respected. I agree we can use the ETag for this purpose, so this sounds good to me! Thanks for writing this

Contributor

@kevinjqliu kevinjqliu left a comment

I don't think we should couple table refresh with table update. Maybe I'm missing a nuance of this implementation. @talatuyarer wdyt?

Comment on lines 159 to 165
Table table = this.refreshedTable;
if (table == null) {
LOG.warn("Table not set from doRefresh() for {}, loading from BigQuery", tableName());
table = client.load(tableReference);
}

this.refreshedTable = null;
Contributor

This feels like an anti-pattern to me, and it should not live in updateTable. I think we should separate the concerns of table refresh and table update.
Looking at JdbcTableOperations, updateTable should just be an atomic operation.

Contributor

I would agree with @kevinjqliu's comment here. I think it's safe to assume that the member reference is not null (or we can add a check, but I don't see the scenario where we would reload). We also don't need the local variable in this case.

Contributor Author

Addressed, removed the defensive reload and replaced with Preconditions.checkState(). Let me know if you'd like to skip the check entirely.
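
For readers unfamiliar with the pattern, here is a minimal sketch of replacing a defensive reload with a fail-fast state check. The local `checkState` helper is a stand-in written for this example with the same contract as Guava's `Preconditions.checkState(boolean, Object)` (it throws `IllegalStateException` when the expression is false); `GuardedOperations`, the string used in place of the cached table, and the error message are all hypothetical and not taken from the PR.

```java
/** Sketch: fail fast when commit runs before refresh instead of silently reloading. */
final class GuardedOperations {
  // Stands in for the cached metastore Table; a String keeps the sketch small.
  private volatile String metastoreTable;

  // Local stand-in mirroring Guava's Preconditions.checkState(boolean, Object).
  private static void checkState(boolean expression, Object errorMessage) {
    if (!expression) {
      throw new IllegalStateException(String.valueOf(errorMessage));
    }
  }

  void refresh() {
    this.metastoreTable = "table@v1";
  }

  String commit() {
    // Hypothetical message; the invariant is that refresh always precedes commit.
    checkState(metastoreTable != null, "Table must be loaded by refresh before commit");
    return metastoreTable;
  }
}

final class GuardDemo {
  public static void main(String[] args) {
    GuardedOperations ops = new GuardedOperations();
    try {
      ops.commit(); // no refresh yet, so the guard trips
    } catch (IllegalStateException e) {
      System.out.println("rejected: " + e.getMessage());
    }
    ops.refresh();
    System.out.println("committed against " + ops.commit());
  }
}
```

Compared with the reload fallback, the check surfaces a broken call order as an error rather than hiding it behind an extra API call.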

Contributor

@danielcweeks danielcweeks left a comment

A few comments, but I think we either want to just track the metastoreTable explicitly or just use the ETag which appears to be equivalent in the current usage.

private final TableReference tableReference;

/** Table loaded in doRefresh() for reuse in updateTable() to avoid redundant API call. */
private volatile Table refreshedTable;
Contributor

Do we need the full table here? It looks like what we're doing is replacing the location check with an ETag check, which is fine, but then we just need the ETag, correct?

Contributor

Also, if we want to preserve the table, I think we should change the name to metastoreTable since it technically just refers to the metastore's representation of the table and even though the method is called refresh, it's also used just for the initial load.

Contributor Author

Done, renamed to metastoreTable.

We need the full Table in my understanding because updateTable() calls getExternalCatalogTableOptions(), addConnectionIfProvided(), and passes it to client.update().

String oldMetadataLocation, String newMetadataLocation, TableMetadata metadata) {
Table table = client.load(tableReference);
private void updateTable(String newMetadataLocation, TableMetadata metadata) {
Table table = this.refreshedTable;
Contributor

nit: we only use the this. member reference on assignment

Contributor Author

Fixed.

- Rename refreshedTable to metastoreTable
- Remove unnecessary local variables in doRefresh() and updateTable()
- Replace defensive reload with Preconditions.checkState()
- Use this. prefix only on assignment
@joyhaldar joyhaldar changed the title from "BigQuery: Reuse table from refresh during commit to reduce API calls" to "BigQuery: Eliminate redundant table load by using ETag for conflict detection" Jan 9, 2026
Contributor

@danielcweeks danielcweeks left a comment

Thanks @joyhaldar !

@danielcweeks danielcweeks merged commit 38cc881 into apache:main Jan 15, 2026
46 checks passed