-
Notifications
You must be signed in to change notification settings - Fork 229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: replace sort order #1500
base: main
Are you sure you want to change the base?
Feat: replace sort order #1500
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
Generally lgtm, added a few nit comments
pyiceberg/table/__init__.py
Outdated
@@ -404,6 +405,20 @@ def update_schema(self, allow_incompatible_changes: bool = False, case_sensitive | |||
name_mapping=self.table_metadata.name_mapping(), | |||
) | |||
|
|||
def replace_sort_order(self, case_sensitive: bool = True) -> ReplaceSortOrder: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: wdyt about update_sort_order
instead?
from the spec,
A data or delete file is associated with a sort order by the sort order's id within [a manifest](https://iceberg.apache.org/spec/#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.
All sort orders are kept in the manifest in the sort-orders
field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Updated class & method names
SortField(source_id=1, transform=IdentityTransform(), direction=SortDirection.ASC, null_order=NullOrder.NULLS_FIRST), | ||
SortField(source_id=2, transform=IdentityTransform(), direction=SortDirection.DESC, null_order=NullOrder.NULLS_LAST), | ||
order_id=1, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a test for modifying sort order of a table with existing sort order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With modifying do you mean updating it so that there's a new sort order or do you mean to modify an existing one without creating a new sort order. I've added a test doing the former.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modify an existing one. but i see you added both. thanks!
pyiceberg/table/update/sorting.py
Outdated
from pyiceberg.table import Transaction | ||
|
||
|
||
class SortOrderBuilder: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: do we need the builder here? Or just follow UpdateSchema's pattern
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need it but personally I like it because it's easier to read. I'll leave it up to you to decide if you want this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored this.
Thanks for the comments 🙌 @kevinjqliu . Will look at your comments this weekend. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great @JasperHG90 Sorry for the late review, it slipped through on my mailbox.
Could you also add a few docs: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md#partition-evolution
fix: set default Co-authored-by: Fokko Driesprong <fokko@apache.org>
chore: update signature Co-authored-by: Fokko Driesprong <fokko@apache.org>
…0/iceberg-python into feat/update-sort-order
No problem @Fokko . Thanks for your comments and your review. |
Added some docs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @JasperHG90 apologies for the delay, ive been tied up with some other stuff.
I've compared the current implementation with java's implementation and it seems like there are a few differences. I've documented them in the comments below.
SortField(source_id=1, transform=IdentityTransform(), direction=SortDirection.ASC, null_order=NullOrder.NULLS_FIRST), | ||
SortField(source_id=2, transform=IdentityTransform(), direction=SortDirection.DESC, null_order=NullOrder.NULLS_LAST), | ||
order_id=1, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modify an existing one. but i see you added both. thanks!
|
||
The API to use when updating a sort order is the `update_sort_order` API on the table. | ||
|
||
Sort orders can only be updated by adding a new sort order. They cannot be deleted or modified. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
im 90% sure this is right, but want to call it out and see if others agree. cc @Fokko
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is the case today. There are some efforts going on right now around removing old schemas and partition specs, but if you do this, you want to make sure that there is no reference anymore to the partition spec in the metadata tree. Often you don't see that many sort orders, since it is less likely than adding a new field, so it isn't a problem that you cannot remove them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only added above for clarity.
|
||
class UpdateSortOrder(UpdateTableMetadata["UpdateSortOrder"]): | ||
_transaction: Transaction | ||
_last_assigned_order_id: int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not used
_last_assigned_order_id: int |
"""Return the sort order.""" | ||
return SortOrder(*self._fields, order_id=self._last_sort_order_id + 1) | ||
|
||
def _commit(self) -> UpdatesAndRequirements: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this implementation defers from the java one. heres the code path for the java implementation 1, 2
In particular, the java implementation tries to retry sort order whenever possible.
i dont think self._last_sort_order_id: int = transaction.table_metadata.default_sort_order_id
and then using order_id=self._last_sort_order_id + 1
is correct since default_sort_order_id
might be always be the highest sort order.
_last_sort_order_id
should default to null and only changed when a new sort order is added.
No problem @kevinjqliu . Thanks for getting back to me on this. I'll have a closer look at the Java implementation and docs and address your comments this week. Have a good week! |
(pytest.lazy_fixture("session_catalog_hive"), "2"), | ||
], | ||
) | ||
def test_replace_sort_order(catalog: Catalog, format_version: str, table_schema_simple: Schema) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to self: rename the test to update
This PR adds functionality to replace a table's sort order. Closes #1245
Some basic tests are implemented but need to be expanded.
Currently, a new sort order ID is assigned. This adds a sort order to the list of available sort orders rather than replacing it.
@Fokko do you mind taking a look at this and telling me what you think? Thanks in advance!