-
Notifications
You must be signed in to change notification settings - Fork 229
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor PyArrow DataFiles Projection functions #1043
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! I like the overall approach. Added a few comments.
I'm curious what the difference between project_table
and project_batches
from an API standpoint (and if we should document that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _fs_from_file_path(file_path: str, io: FileIO) -> FileSystem: | ||
scheme, netloc, _ = _parse_location(file_path) | ||
if isinstance(io, PyArrowFileIO): | ||
return io.fs_by_scheme(scheme, netloc) | ||
else: | ||
try: | ||
from pyiceberg.io.fsspec import FsspecFileIO | ||
|
||
if isinstance(io, FsspecFileIO): | ||
from pyarrow.fs import PyFileSystem | ||
|
||
return PyFileSystem(FSSpecHandler(io.get_fs(scheme))) | ||
else: | ||
raise ValueError(f"Expected PyArrowFileIO or FsspecFileIO, got: {io}") | ||
except ModuleNotFoundError as e: | ||
# When FsSpec is not installed | ||
raise ValueError(f"Expected PyArrowFileIO or FsspecFileIO, got: {io}") from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copied over from project_table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, copied it over since it'll be deprecated!
Returns an Iterator of pa.RecordBatch with data from the Iceberg table | ||
by resolving the right columns that match the current table schema. | ||
Only data that matches the provided row_filter expression is returned. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add a snippet about why use this instead of project_table
) -> None: | ||
self._table_metadata = table_metadata | ||
self._io = io | ||
self._fs = _fs_from_file_path(table_metadata.location, io) # TODO: use different FileSystem per file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!! a few minor comments.
Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
_limit: Optional[int] | ||
"""Scan the Iceberg Table and create an Arrow construct. | ||
|
||
Attributes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: More of a style thing. I think it is more valuable to have a docstring at the __init__
method since that's wat probably will show up people's IDE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good, some much needed refactoring, thanks @sungwy for picking this up, and thanks @corleyma, @kevinjqliu and @ndrluis for the review!
* refactoring * refactor more * docstring * apache#1042 * adopt review feedback * thanks Kevin! Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com> --------- Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
* refactoring * refactor more * docstring * apache#1042 * adopt review feedback * thanks Kevin! Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com> --------- Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
No description provided.