Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(python)!: simplify marshalling of Fragment, DataFile, Operation, Transaction #3240

Merged
merged 9 commits into from
Dec 20, 2024

Conversation

wjones127
Copy link
Contributor

@wjones127 wjones127 commented Dec 12, 2024

BREAKING CHANGE: DataFile.deletion_file is now a property, not a method.

For Fragment and Operation, we had a sort of intermediate inner layer to handle translating between Rust struct and Python objects. This worked fine in isolation, but once you need to convert at the top of a hierarchy it became tedious. This was the case for Transaction. Transaction contained an Operation, which could contain many Fragments, which contains many DataFiles.

These structures are primarily data holders, so they've been made into dataclasses. A newtype wrapper struct PyLance<T> is used to provide implementations of FromPyObject and ToPyObject. This makes signatures more readable, and makes the wrappers thinner. For example, instead of a special Python FragmentMetadata struct, we just have a PyLance<Fragment>, where Fragment is from the Rust crate.

Comment on lines -2318 to -2330
merged = Transaction(**merged)
# This logic is specific to append, which is all that should
# be returned here.
# TODO: generalize this to all other transaction types.
merged.operation["fragments"] = [
FragmentMetadata.from_metadata(f) for f in merged.operation["fragments"]
]
merged.operation = LanceOperation.Append(**merged.operation)
if merged.blobs_op:
merged.blobs_op["fragments"] = [
FragmentMetadata.from_metadata(f) for f in merged.blobs_op["fragments"]
]
merged.blobs_op = LanceOperation.Append(**merged.blobs_op)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the messy marshaling logic that made me want this refactor. As I extend to other transaction types, this would have become completely unmanageable.

@wjones127 wjones127 changed the title refactor(python): simplify marshalling of Fragment, DataFile, Operation, Transaction refactor(python)!: simplify marshalling of Fragment, DataFile, Operation, Transaction Dec 12, 2024
@wjones127 wjones127 marked this pull request as ready for review December 13, 2024 17:55
Copy link
Contributor

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A good cleanup, thanks! Is the goal to eventually start moving more things into the PyLance<T> pattern?

Comment on lines +122 to +129
Operation::Append { ref fragments } => {
let fragments = export_vec(py, fragments.as_slice());
let cls = namespace
.getattr("Append")
.expect("Failed to get Append class");
cls.call1((fragments,)).unwrap().to_object(py)
}
_ => todo!(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we actually use ToPyObject for Append or is this more just an example of how you might do this in case we need to later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dataset::commit_batch returns a PyLance<Transaction>, which will contain an operation. Currently just Append.

Comment on lines 263 to 271
pub fn extract_vec<'a, T>(ob: &Bound<'a, PyAny>) -> PyResult<Vec<T>>
where
PyLance<T>: FromPyObject<'a>,
{
ob.extract::<Vec<PyLance<T>>>()
.map(|v| v.into_iter().map(|t| t.0).collect())
}

pub fn export_vec<'a, T>(py: Python<'a>, vec: &'a [T]) -> Vec<PyObject>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we document these methods? extract_vec I guess is pretty obvious since extract is well-used in pyo3 but export_vec is maybe a touch confusing.

Comment on lines +110 to +112
row_id_meta = json_data.get("row_id_meta")
if row_id_meta is not None:
row_id_meta = RowIdMeta(**row_id_meta)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I've fallen behind the times again. What is row_id_meta? Is this where we put a row id mapping file for stable row id stuff?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, each fragment has it's own row id index stored here. Either an inline index (if it's small), or a reference to a separate file.

Comment on lines 132 to 135
fields : List[int]
The field ids of the columns in this file.
column_indices : List[int]
The column indices in the original schema.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fields : List[int]
The field ids of the columns in this file.
column_indices : List[int]
The column indices in the original schema.
fields : List[int]
The ids of the fields in this file.
column_indices : List[int]
The column indices where the fields are stored in the file. Will have the same length
as `fields`.

Minor terminology nit (pic for pedantic reference):

image

file_major_version: int = 0,
file_minor_version: int = 0,
):
# TODO: only we eliminate the path method, we can remove this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# TODO: only we eliminate the path method, we can remove this
# TODO: once we eliminate the path method, we can remove this

Maybe?

What does it mean to eliminate the path method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously path was a method DataFile.path(). But now, it's a property: DataFile.path. I go through a lot of hoops to support both, but eventually I'd like to remove the method support, so it's only accessible as a property. Then all of this code will be much more straightforward.

Comment on lines +136 to +139
file_major_version : int
The major version of the data storage format.
file_minor_version : int
The minor version of the data storage format.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we store the file version on the dataset these properties are kind of redundant and I guess we can deprecate them at some point. Not sure that affects this PR at all but figured I'd mention it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I forgot we are deprecating them.

@wjones127
Copy link
Contributor Author

Is the goal to eventually start moving more things into the PyLance<T> pattern?

I think it's a reasonable pattern for any class that is a data holder. For objects that have a lot of methods (such as Dataset), I think this isn't a great pattern.

I don't have anything additional in mind to migrate immediately, but if there are others where it would be useful to transition, we should.

@wjones127 wjones127 merged commit 10e6454 into lancedb:main Dec 20, 2024
11 of 12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants