Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove support for catalog_name in table identifier string #963

Merged
merged 6 commits into from
Dec 19, 2024

Conversation

sungwy
Copy link
Collaborator

@sungwy sungwy commented Jul 25, 2024

Currently, we optionally support catalog names being specified in the identifier string. In other words, this means that we currently support identifier names that look like catalog_name.table_namespace.table_name or table_namespace.table_name being passed into catalog.load_table() method.

This PR proposes to remove this optional support, and change the Table class's identifier to be (table_namespace, table_name) instead of (catalog_name, table_namespace, table_name). The reasons are as follows:

  1. The Identifier is specified as namespace + table_name according to the REST Catalog Open API Spec
  2. Table has the attribute catalog - a Table is always instantiated with a Catalog instance. Hence the catalog name is accessible as: tbl.catalog.name, and I feel that prepending the catalog name to the identifier is redundant. If there is value in keeping a long identifier including the catalog name, it may be better to leave the identifier field as namespace + name pair, and update the name attribute to be self.catalog.name + self.identifier instead.
  3. identifier_to_tuple_without_catalog is called in all catalog implementations, and is currently crucial in allowing Table to use the long identifier including the catalog_name. However, this function is imperfect, and will result in errors if the the table_namespace uses a hierarchical name that is repeats the catalog name. For example, namespace: default.lower and catalog_name: default. When we pass default.lower.table_name to refer to the table table_name in namespace default.lower on the catalog default, we will run into an unexpected exception as it will remove the first hierarchy of the namespace, and look for lower.table_name.
  4. Fixes: Rest Catalog: catalog.name should not be part of namespace #742

Thankfully, we have not explicitly documented that we support catalog_name in the identifier string in our API documentation, and other engine partners do not seem to be relying on this feature as well (checking one example of Daft)

It would be great to get this change in for 0.7.0 release as certain rest catalog implementations are having issues with the current approach: lakekeeper/lakekeeper#18

However this is backward incompatible change that takes away an optional feature, so I would like to put this PR up as a first step in facilitating the discussion.

@sungwy sungwy requested review from Fokko, kevinjqliu and HonahX July 25, 2024 01:51
@sungwy
Copy link
Collaborator Author

sungwy commented Jul 25, 2024

There is some more work to be done on this PR, including:

  • deprecating identifier_to_tuple_without_catalog public method on the catalog
  • removing its usage across the board, like in drop_table, etc.

I'd like to seek some feedback from the community before doing the work 🙂

@sungwy sungwy added this to the PyIceberg 0.7.0 release milestone Jul 25, 2024
@Fokko
Copy link
Contributor

Fokko commented Jul 25, 2024

Hey @sungwy thanks for raising this. Some historical context was added from the beginning to copy some of the concepts of how it works in Java/Spark, but to be frank, I don't think it works in PyIceberg since it (not yet) is SQL-based like Spark/Trino/etc:

INSERT INTO dev.schema.table SELECT * FROM prod.schema.table

The Identifier is specified as namespace + table_name according to the REST Catalog Open API Spec

The name is only prepended in PyIceberg and should not be sent to the REST api.

I would be open to removing this, but I'm concerned that it might break existing users. Does anyone have any ideas on how to mitigate this? We could put it behind a flag and make it the default behavior on the next version? But it is already rather messy.

@HonahX
Copy link
Contributor

HonahX commented Jul 25, 2024

Thanks for raising the PR! I am also +1 on removing this. This seems unnecessary and will inevitably lead to the issue as @sungwy mentioned in the third point:

identifier_to_tuple_without_catalog is called in all catalog implementations, and is currently crucial in allowing Table to use the long identifier including the catalog_name. However, this function is imperfect, and will result in errors if the the table_namespace uses a hierarchical...

if we want to support identifiers with/without catalog's name.

However, removing this means we will break the compatibility.

We could put it behind a flag and make it the default behavior on the next version? But it is already rather messy.

+1 A flag might help, but I also worry that it may still be disruptive if users are unaware of the flag and the decision to remove the feature.

How about sending an email to the dev-list for further discussion and feedback on this change? This could also help us understand the potential impact of this removal on users.

That said, I am inclined to postpone the actual removal to version 0.8.0. There are many updates and excellent new features in version 0.7.0, and I do not want this change to discourage people from upgrading.

I understand that it would be beneficial for version 0.7.0 to address #964, so I have also attempted #964 without removing the catalog name from the identifier. I hope this can further mitigate the issue and unblock version 0.7.0.

Would love to hear your thoughts on this!

@sungwy
Copy link
Collaborator Author

sungwy commented Jul 25, 2024

Hi @HonahX and @Fokko thank you for the detailed reviews! Given those concerns, I think it would make sense to:

  1. Introduce the fix @HonahX proposed to address Rest Catalog: catalog.name should not be part of namespace #742 without introducing a backward incompatible change
  2. Propose this change to the dev list after this release (0.7.0)

In the long run, I am in favor of this change as I think it makes logical sense and is less error prone. And as @Fokko mentioned above, the current approach is already messy, and removing this optional support would clean up the intersection of our catalog / table.

So I love the idea of aiming tactical mitigation in the short term, and advocating for a strategic fix soon after

@sungwy sungwy removed this from the PyIceberg 0.7.0 release milestone Jul 25, 2024
@kevinjqliu
Copy link
Contributor

+1 to remove the catalog name as part of the table identifier. #964 is a good workaround to get into the 0.7.0 release.

This PR is great as a long-term plan to deprecate the catalog name entirely.

@sungwy
Copy link
Collaborator Author

sungwy commented Aug 3, 2024

I opened up the mail list discussion: https://lists.apache.org/thread/9zr19hxnbt3hg7lt55t6dfg6otv7zjz2

Unfortunately, there hasn't been a lot of community engagement on the mail list thread.
One user responded on the slack channel to express support for the change.

Based on the feedback on this PR, I've opened up #994 to mark this usage pattern with a deprecation warning in 0.8.0. That'll give some time for users to test and prepare for the feature to be removed in 0.9.0.

@kevinjqliu
Copy link
Contributor

So #994 will be included in the 0.8.0 release, which will still allow catalog name but with a deprecation warning.
And this PR will be included in the 0.9.0 release and will fully remove the catalog name

@Fokko
Copy link
Contributor

Fokko commented Aug 28, 2024

While working on #1112, I fully support this.

@sungwy
Copy link
Collaborator Author

sungwy commented Dec 7, 2024

Now that the deprecation is out for 0.8.0 - I'll get working on this large PR again

@sungwy sungwy marked this pull request as ready for review December 7, 2024 04:57
@sungwy
Copy link
Collaborator Author

sungwy commented Dec 10, 2024

Hi @Fokko @kevinjqliu - could I ask for your review on this PR? I think we'll be able to introduce this in 0.9.0 now that the deprecation notice has been sent out :)

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for going through removing this.

I looked up the other removed_in="0.9.0" and found these two related catalog identifier functions that we'd want to remove

Neither are used in the codebase

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the print, it looks great! Thanks for driving this @sungwy

@Fokko
Copy link
Contributor

Fokko commented Dec 18, 2024

And there is something very off with the commits, we want to fix that as well, otherwise we'll get a commit with all the authors combined I'm afraid 😅

@sungwy
Copy link
Collaborator Author

sungwy commented Dec 18, 2024

@Fokko and @kevinjqliu - Thank you both! I'll take a look at fixing the commit history later today

@sungwy
Copy link
Collaborator Author

sungwy commented Dec 19, 2024

  • identifier

Thanks for the sharp eye @kevinjqliu - I've removed these as well in the newest commit

@sungwy
Copy link
Collaborator Author

sungwy commented Dec 19, 2024

And there is something very off with the commits, we want to fix that as well, otherwise we'll get a commit with all the authors combined I'm afraid 😅

Yeah I've tried rebasing it a few times, and didn't fare well since its been a long time since this PR had been up. Had to break out from current main and force push instead to get the commit history issue resolved 🚀

sungwy and others added 3 commits December 19, 2024 07:32
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Co-authored-by: Fokko Driesprong <fokko@apache.org>
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I think we got everything

@sungwy sungwy merged commit 1278e88 into apache:main Dec 19, 2024
7 checks passed
@sungwy sungwy deleted the rest-issue branch December 19, 2024 16:01
@sungwy
Copy link
Collaborator Author

sungwy commented Dec 19, 2024

@kevinjqliu and @Fokko - thank you for your reviews!

sungwy added a commit to sungwy/iceberg-python that referenced this pull request Dec 24, 2024
* remove catalog identifier

* remove deprecated identifier method

* lint

* nit

Co-authored-by: Fokko Driesprong <fokko@apache.org>

* nit

Co-authored-by: Fokko Driesprong <fokko@apache.org>

* Revert "nit"

---------

Co-authored-by: Fokko Driesprong <fokko@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Rest Catalog: catalog.name should not be part of namespace
4 participants