Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: replace sort order #1500

Open
wants to merge 40 commits into
base: main
Choose a base branch
from

Conversation

JasperHG90
Copy link

This PR adds functionality to replace a table's sort order. Closes #1245

Some basic tests are implemented but need to be expanded.

Currently, a new sort order ID is assigned. This adds a sort order to the list of available sort orders rather than replacing it.

@Fokko do you mind taking a look at this and telling me what you think? Thanks in advance!

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!
Generally lgtm, added a few nit comments

@@ -404,6 +405,20 @@ def update_schema(self, allow_incompatible_changes: bool = False, case_sensitive
name_mapping=self.table_metadata.name_mapping(),
)

def replace_sort_order(self, case_sensitive: bool = True) -> ReplaceSortOrder:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: wdyt about update_sort_order instead?

from the spec,

A data or delete file is associated with a sort order by the sort order's id within [a manifest](https://iceberg.apache.org/spec/#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.

All sort orders are kept in the manifest in the sort-orders field.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Updated class & method names

SortField(source_id=1, transform=IdentityTransform(), direction=SortDirection.ASC, null_order=NullOrder.NULLS_FIRST),
SortField(source_id=2, transform=IdentityTransform(), direction=SortDirection.DESC, null_order=NullOrder.NULLS_LAST),
order_id=1,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a test for modifying sort order of a table with existing sort order?

Copy link
Author

@JasperHG90 JasperHG90 Jan 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With modifying do you mean updating it so that there's a new sort order or do you mean to modify an existing one without creating a new sort order. I've added a test doing the former.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modify an existing one. but i see you added both. thanks!

from pyiceberg.table import Transaction


class SortOrderBuilder:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do we need the builder here? Or just follow UpdateSchema's pattern

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need it but personally I like it because it's easier to read. I'll leave it up to you to decide if you want this.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored this.

@JasperHG90
Copy link
Author

Thanks for the comments 🙌 @kevinjqliu . Will look at your comments this weekend.

@JasperHG90 JasperHG90 requested a review from kevinjqliu January 24, 2025 10:08
@JasperHG90 JasperHG90 changed the title [WIP] Feat: replace sort order Feat: replace sort order Jan 28, 2025
Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @JasperHG90 Sorry for the late review, it slipped through on my mailbox.

Could you also add a few docs: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md#partition-evolution

@JasperHG90
Copy link
Author

No problem @Fokko . Thanks for your comments and your review.

@JasperHG90
Copy link
Author

This looks great @JasperHG90 Sorry for the late review, it slipped through on my mailbox.

Could you also add a few docs: https://github.com/apache/iceberg-python/blob/main/mkdocs/docs/api.md#partition-evolution

Added some docs.

@JasperHG90 JasperHG90 requested a review from Fokko February 16, 2025 14:50
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @JasperHG90 apologies for the delay, ive been tied up with some other stuff.

I've compared the current implementation with java's implementation and it seems like there are a few differences. I've documented them in the comments below.

SortField(source_id=1, transform=IdentityTransform(), direction=SortDirection.ASC, null_order=NullOrder.NULLS_FIRST),
SortField(source_id=2, transform=IdentityTransform(), direction=SortDirection.DESC, null_order=NullOrder.NULLS_LAST),
order_id=1,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modify an existing one. but i see you added both. thanks!


The API to use when updating a sort order is the `update_sort_order` API on the table.

Sort orders can only be updated by adding a new sort order. They cannot be deleted or modified.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im 90% sure this is right, but want to call it out and see if others agree. cc @Fokko

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is the case today. There are some efforts going on right now around removing old schemas and partition specs, but if you do this, you want to make sure that there is no reference anymore to the partition spec in the metadata tree. Often you don't see that many sort orders, since it is less likely than adding a new field, so it isn't a problem that you cannot remove them.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only added above for clarity.


class UpdateSortOrder(UpdateTableMetadata["UpdateSortOrder"]):
_transaction: Transaction
_last_assigned_order_id: int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not used

Suggested change
_last_assigned_order_id: int

"""Return the sort order."""
return SortOrder(*self._fields, order_id=self._last_sort_order_id + 1)

def _commit(self) -> UpdatesAndRequirements:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this implementation defers from the java one. heres the code path for the java implementation 1, 2

In particular, the java implementation tries to retry sort order whenever possible.
i dont think self._last_sort_order_id: int = transaction.table_metadata.default_sort_order_id and then using order_id=self._last_sort_order_id + 1 is correct since default_sort_order_id might be always be the highest sort order.
_last_sort_order_id should default to null and only changed when a new sort order is added.

@JasperHG90
Copy link
Author

Hey @JasperHG90 apologies for the delay, ive been tied up with some other stuff.

I've compared the current implementation with java's implementation and it seems like there are a few differences. I've documented them in the comments below.

No problem @kevinjqliu . Thanks for getting back to me on this. I'll have a closer look at the Java implementation and docs and address your comments this week.

Have a good week!

(pytest.lazy_fixture("session_catalog_hive"), "2"),
],
)
def test_replace_sort_order(catalog: Catalog, format_version: str, table_schema_simple: Schema) -> None:
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: rename the test to update

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[feat] Support update table's sort order
3 participants