Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: Remove deprecated method from BaseMetadataTable #9298

Merged
merged 2 commits into from
Jan 18, 2024

Conversation

ajantha-bhat
Copy link
Member

No description provided.

@@ -105,6 +105,8 @@ private String metadataFileLocation(Table table) {
if (table instanceof HasTableOperations) {
TableOperations ops = ((HasTableOperations) table).operations();
return ops.current().metadataFileLocation();
} else if (table instanceof BaseMetadataTable) {
return ((BaseMetadataTable) table).table().operations().current().metadataFileLocation();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is needed now since the metadata table won't enter above check of HasTableOperations

@ajantha-bhat ajantha-bhat marked this pull request as draft December 14, 2023 14:25
@ajantha-bhat
Copy link
Member Author

Looks like some tests are directly casting metadata tables with HasTableOperations.
So, some more work is needed for this PR. Let me work on it and ping when it is ready.

@github-actions github-actions bot added the spark label Dec 15, 2023
@ajantha-bhat ajantha-bhat force-pushed the dep_ops branch 2 times, most recently from 5d3825c to 005ecf8 Compare December 15, 2023 12:22
@ajantha-bhat ajantha-bhat marked this pull request as ready for review December 15, 2023 13:05
@ajantha-bhat
Copy link
Member Author

@nastra, @Fokko: PR is ready for review.

@@ -948,6 +950,17 @@ public static org.apache.spark.sql.catalyst.TableIdentifier toV1TableIdentifier(
return org.apache.spark.sql.catalyst.TableIdentifier.apply(table, database);
}

static String tableUUID(org.apache.iceberg.Table table) {
if (table instanceof HasTableOperations) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not just call table.uuid() in all of those places?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because BaseMetadataTable gives new UUID instead of base table's UUID.
Shall I fix that method to return base table's UUID ?

public UUID uuid() {
return UUID.randomUUID();
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think while adding UUID interface we concluded that we should not use base table's UUID
#8800 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the argument there is that the metadata table can be considered as a separate table and should therefore have it's own unique identifier compared to the base table.

But I think @nastra point still stands, even if it's different then the base table UUID, why does that matter here? I think we just want the table.uuid() right? or do we need the metadata table's underlying table's UUID?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried using table.uuid(), many testcase failed as the scan task of metadata table expects UUID of the base table not the metadata table.

java.lang.IllegalArgumentException: No scan tasks found for 2c44000a-aa24-479a-8666-292cee70b95f

Does it make sense to return base table's UUID for the metadata table? (That is change the behaviour from #8800?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. Rebased.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, looks like SparkStagedScan is expecting base table's uuid for cache for metadata tables.

Either we need to change that logic or return base table uuid. I will dig deeper next week.

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar Dec 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'd check out that logic further and we can see what the right behavior is here. I still think the change that was made in #9310 is definitely the right fix from an API perspective (even if we decide not to use that API here). The main issue that was solved there was semantically the metadata table UUID should be the same for the same reference.

In other words, imo I would not change the UUID API semantics to fit whatever the caching logic relies on.

If we need the base table UUID for the caching logic, then maybe MetadataTable specifically can expose another API for exposing the underlying base table's UUID. Or alternatively keep it as is, and just expose the underlying Table (but that seems to expose too much imo).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there is a tight correlation between metadata table scan tasks and main table UUID from multiple classes.
If we need to change it, it can be handled in a separate PR (Issue) as it is nothing to do with this deprecated method removal.

Hence, I went back to reverting using table.uuid()

So, this PR can go ahead.
cc: @nastra, @amogh-jahagirdar

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nastra: Thoughts?

@ajantha-bhat ajantha-bhat marked this pull request as draft December 16, 2023 12:25
@ajantha-bhat
Copy link
Member Author

Just rebased to resolve conflict.

@ajantha-bhat ajantha-bhat added this to the Iceberg 1.5.0 milestone Jan 3, 2024
@amogh-jahagirdar
Copy link
Contributor

Sorry for the delay in review on this @ajantha-bhat , I'll take a look at this tomorrow.

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it looks close @ajantha-bhat just a colmment on the tableUUID implementation returning null when we don't know what kind of table it is.

I think it's good to preserve the metadataTable UUID behavior and not return the base table. The rationale is that a UUID for a table should be unique per table and metadata tables are no different in this regard.

I think the way that it's implemented in this PR is fine.

} else if (table instanceof BaseMetadataTable) {
return ((BaseMetadataTable) table).table().operations().current().uuid();
} else {
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should probably throw an exception instead of returning null.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also instead of tableUUID maybe baseTableUUID sine that's what we're really getting.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -72,18 +70,12 @@ public void clearRewrite(Table table, String fileSetId) {

public Set<String> fetchSetIds(Table table) {
return resultMap.keySet().stream()
.filter(e -> e.first().equals(tableUUID(table)))
.filter(e -> e.first().equals(Spark3Util.tableUUID(table)))
Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar Jan 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think it would be a bit cleaner just to import Spark3Util.tableUUID and then just use tableUUID here (the diff on line 73 and other places would essentially go away in favor of just a new import statement). But that's nbd, if we do the method rename like I suggested we lose this benefit anyways.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I renamed to baseTableUUID and I am not sure about the guidelines on static import. Some place we use it and some place we don't. So, I left it as it is.

} else if (table instanceof BaseMetadataTable) {
return ((BaseMetadataTable) table).table().operations().current().uuid();
} else {
throw new UnsupportedOperationException("Cannot fetch table operations for " + table.name());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this be replicated across all Spark versions? Also I would probably update the error msg to Cannot retrieve UUID for table ...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@ajantha-bhat
Copy link
Member Author

Retriggering the build due to flaky test in Flink.

@ajantha-bhat ajantha-bhat reopened this Jan 7, 2024
TableOperations ops = ((HasTableOperations) table).operations();
return ops.current().uuid();
} else if (table instanceof BaseMetadataTable) {
return ((BaseMetadataTable) table).table().operations().current().uuid();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

call table on table looks strange. It would be better to have a method baseTable().

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function can be called for main table or metadata table. So, the varibale name is table.

Agree that BaseMetadataTable can have a public interface as baseTable(). But current interface table() is a public interface, we can't rename directly. It has to be deprecated first and new interface.

So, I think we can leave it as it is as of now (out of scope for this PR).

@ajantha-bhat
Copy link
Member Author

PR is ready.

@ajantha-bhat
Copy link
Member Author

ping.
Anything else needed for this PR?

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me now @ajantha-bhat thanks for the follow up. I'll wait for a bit in case others have any comments before merging.

@amogh-jahagirdar amogh-jahagirdar merged commit 6e7702d into apache:main Jan 18, 2024
42 checks passed
geruh pushed a commit to geruh/iceberg that referenced this pull request Jan 26, 2024
adnanhemani pushed a commit to adnanhemani/iceberg that referenced this pull request Jan 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants