Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor PyArrow DataFiles Projection functions #1043

Merged
merged 7 commits into from
Aug 20, 2024

Conversation

sungwy
Copy link
Collaborator

@sungwy sungwy commented Aug 12, 2024

No description provided.

@sungwy sungwy changed the title Draft: Refactoring PyArrow DataFiles Projection functions Draft: Refactor PyArrow DataFiles Projection functions Aug 12, 2024
@sungwy sungwy changed the title Draft: Refactor PyArrow DataFiles Projection functions Refactor PyArrow DataFiles Projection functions Aug 12, 2024
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this! I like the overall approach. Added a few comments.

I'm curious what the difference between project_table and project_batches from an API standpoint (and if we should document that)

@sungwy sungwy marked this pull request as ready for review August 14, 2024 01:02
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I added a few comments.
@Fokko @HonahX do you mind taking a look when you get a chance?

Comment on lines +1312 to +1328
def _fs_from_file_path(file_path: str, io: FileIO) -> FileSystem:
scheme, netloc, _ = _parse_location(file_path)
if isinstance(io, PyArrowFileIO):
return io.fs_by_scheme(scheme, netloc)
else:
try:
from pyiceberg.io.fsspec import FsspecFileIO

if isinstance(io, FsspecFileIO):
from pyarrow.fs import PyFileSystem

return PyFileSystem(FSSpecHandler(io.get_fs(scheme)))
else:
raise ValueError(f"Expected PyArrowFileIO or FsspecFileIO, got: {io}")
except ModuleNotFoundError as e:
# When FsSpec is not installed
raise ValueError(f"Expected PyArrowFileIO or FsspecFileIO, got: {io}") from e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copied over from project_table

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, copied it over since it'll be deprecated!

Returns an Iterator of pa.RecordBatch with data from the Iceberg table
by resolving the right columns that match the current table schema.
Only data that matches the provided row_filter expression is returned.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add a snippet about why use this instead of project_table

) -> None:
self._table_metadata = table_metadata
self._io = io
self._fs = _fs_from_file_path(table_metadata.location, io) # TODO: use different FileSystem per file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!! a few minor comments.

Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@sungwy
Copy link
Collaborator Author

sungwy commented Aug 20, 2024

LGTM! I added a few comments. @Fokko @HonahX do you mind taking a look when you get a chance?

Yes, would love to get your blessings on this refactoring @Fokko and @HonahX - please let us know what you think

_limit: Optional[int]
"""Scan the Iceberg Table and create an Arrow construct.

Attributes:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: More of a style thing. I think it is more valuable to have a docstring at the __init__ method since that's wat probably will show up people's IDE.

Copy link
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, some much needed refactoring, thanks @sungwy for picking this up, and thanks @corleyma, @kevinjqliu and @ndrluis for the review!

@Fokko Fokko merged commit eec13a6 into apache:main Aug 20, 2024
7 checks passed
sungwy added a commit to sungwy/iceberg-python that referenced this pull request Dec 7, 2024
* refactoring

* refactor more

* docstring

* apache#1042

* adopt review feedback

* thanks Kevin!

Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>

---------

Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
sungwy added a commit to sungwy/iceberg-python that referenced this pull request Dec 7, 2024
* refactoring

* refactor more

* docstring

* apache#1042

* adopt review feedback

* thanks Kevin!

Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>

---------

Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants