-
Notifications
You must be signed in to change notification settings - Fork 229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove support for catalog_name in table identifier string #963
Conversation
There is some more work to be done on this PR, including:
I'd like to seek some feedback from the community before doing the work 🙂 |
Hey @sungwy thanks for raising this. Some historical context was added from the beginning to copy some of the concepts of how it works in Java/Spark, but to be frank, I don't think it works in PyIceberg since it (not yet) is SQL-based like Spark/Trino/etc: INSERT INTO dev.schema.table SELECT * FROM prod.schema.table
The name is only prepended in PyIceberg and should not be sent to the REST api. I would be open to removing this, but I'm concerned that it might break existing users. Does anyone have any ideas on how to mitigate this? We could put it behind a flag and make it the default behavior on the next version? But it is already rather messy. |
Thanks for raising the PR! I am also +1 on removing this. This seems unnecessary and will inevitably lead to the issue as @sungwy mentioned in the third point:
if we want to support identifiers with/without catalog's name. However, removing this means we will break the compatibility.
+1 A flag might help, but I also worry that it may still be disruptive if users are unaware of the flag and the decision to remove the feature. How about sending an email to the dev-list for further discussion and feedback on this change? This could also help us understand the potential impact of this removal on users. That said, I am inclined to postpone the actual removal to version 0.8.0. There are many updates and excellent new features in version 0.7.0, and I do not want this change to discourage people from upgrading. I understand that it would be beneficial for version 0.7.0 to address #964, so I have also attempted #964 without removing the catalog name from the identifier. I hope this can further mitigate the issue and unblock version 0.7.0. Would love to hear your thoughts on this! |
Hi @HonahX and @Fokko thank you for the detailed reviews! Given those concerns, I think it would make sense to:
In the long run, I am in favor of this change as I think it makes logical sense and is less error prone. And as @Fokko mentioned above, the current approach is already messy, and removing this optional support would clean up the intersection of our catalog / table. So I love the idea of aiming tactical mitigation in the short term, and advocating for a strategic fix soon after |
+1 to remove the catalog name as part of the table identifier. #964 is a good workaround to get into the 0.7.0 release. This PR is great as a long-term plan to deprecate the catalog name entirely. |
I opened up the mail list discussion: https://lists.apache.org/thread/9zr19hxnbt3hg7lt55t6dfg6otv7zjz2 Unfortunately, there hasn't been a lot of community engagement on the mail list thread. Based on the feedback on this PR, I've opened up #994 to mark this usage pattern with a deprecation warning in 0.8.0. That'll give some time for users to test and prepare for the feature to be removed in 0.9.0. |
So #994 will be included in the 0.8.0 release, which will still allow catalog name but with a deprecation warning. |
While working on #1112, I fully support this. |
Now that the deprecation is out for 0.8.0 - I'll get working on this large PR again |
Hi @Fokko @kevinjqliu - could I ask for your review on this PR? I think we'll be able to introduce this in 0.9.0 now that the deprecation notice has been sent out :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for going through removing this.
I looked up the other removed_in="0.9.0"
and found these two related catalog identifier functions that we'd want to remove
Neither are used in the codebase
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the print, it looks great! Thanks for driving this @sungwy
And there is something very off with the commits, we want to fix that as well, otherwise we'll get a commit with all the authors combined I'm afraid 😅 |
@Fokko and @kevinjqliu - Thank you both! I'll take a look at fixing the commit history later today |
Thanks for the sharp eye @kevinjqliu - I've removed these as well in the newest commit |
Yeah I've tried rebasing it a few times, and didn't fare well since its been a long time since this PR had been up. Had to break out from current main and force push instead to get the commit history issue resolved 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I think we got everything
@kevinjqliu and @Fokko - thank you for your reviews! |
* remove catalog identifier * remove deprecated identifier method * lint * nit Co-authored-by: Fokko Driesprong <fokko@apache.org> * nit Co-authored-by: Fokko Driesprong <fokko@apache.org> * Revert "nit" --------- Co-authored-by: Fokko Driesprong <fokko@apache.org>
Currently, we optionally support catalog names being specified in the identifier string. In other words, this means that we currently support identifier names that look like
catalog_name.table_namespace.table_name
ortable_namespace.table_name
being passed intocatalog.load_table()
method.This PR proposes to remove this optional support, and change the
Table
class's identifier to be(table_namespace, table_name)
instead of(catalog_name, table_namespace, table_name)
. The reasons are as follows:Table
has the attributecatalog
- a Table is always instantiated with a Catalog instance. Hence the catalog name is accessible as:tbl.catalog.name
, and I feel that prepending the catalog name to the identifier is redundant. If there is value in keeping a long identifier including the catalog name, it may be better to leave theidentifier
field as namespace + name pair, and update thename
attribute to beself.catalog.name
+self.identifier
instead.identifier_to_tuple_without_catalog
is called in allcatalog
implementations, and is currently crucial in allowingTable
to use the long identifier including the catalog_name. However, this function is imperfect, and will result in errors if the the table_namespace uses a hierarchical name that is repeats the catalog name. For example, namespace:default.lower
and catalog_name:default
. When we passdefault.lower.table_name
to refer to the tabletable_name
in namespacedefault.lower
on the catalogdefault
, we will run into an unexpected exception as it will remove the first hierarchy of the namespace, and look forlower.table_name
.catalog.name
should not be part of namespace #742Thankfully, we have not explicitly documented that we support catalog_name in the identifier string in our API documentation, and other engine partners do not seem to be relying on this feature as well (checking one example of Daft)
It would be great to get this change in for 0.7.0 release as certain rest catalog implementations are having issues with the current approach: lakekeeper/lakekeeper#18
However this is backward incompatible change that takes away an optional feature, so I would like to put this PR up as a first step in facilitating the discussion.