feat: support read_parquet for backend with no native support #9744
base: main
Conversation
@jitingxu1 This PR has a lot of failures. Can you take a look so we can decide how to move forward?
ibis/backends/tests/test_register.py
Outdated
```python
table = con.read_parquet(tmp_path / f"*.{ext}")
if con.name == "clickhouse":
    # clickhouse does not support reading a directory
    table = con.read_parquet(tmp_path / f"*.{ext}")
```
This doesn't seem like the right approach. You're changing what's being tested. Why can't you leave this code unchanged here?
The pyarrow `read_table` cannot read things like `tmp_path/*.parquet`, and clickhouse cannot read the directory.
We have three kinds of `read_parquet`:

- backends that use pyarrow `read_table` can read a single file or a directory, but not a glob pattern without specifying the filesystem
- duckdb and some other backends accept all three of the above formats
- clickhouse does not accept a directory

Maybe we could add something before `read_table` to convert `path/*.parquet` --> `path`; then it accepts all of them.
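The conversion suggested above could be a small helper. A minimal sketch (the helper name is made up, not part of Ibis):

```python
from pathlib import Path


def glob_to_dir(path: str) -> str:
    """If ``path`` ends in a glob like ``*.parquet``, fall back to its
    parent directory so readers that accept directories (but not glob
    patterns) can still consume it.  Hypothetical helper, not Ibis API.
    """
    p = Path(path)
    if "*" in p.name:
        return str(p.parent)
    return path
```

For example, `glob_to_dir("/data/*.parquet")` yields `/data`, while a concrete file path passes through unchanged.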
The test is whether the backend can read a glob of parquet files. The answer to that seems to be "no", so it should be marked as `notyet`.
Rewrote it to support URLs as the input.
@cpcloud for review again, thanks!
Increased the test coverage. @cpcloud
ibis/backends/__init__.py
Outdated
```python
        self.create_table(table_name, table)
        return self.table(table_name)

    def _get_pyarrow_table_from_path(self, path: str | Path, **kwargs) -> pa.Table:
```
Why can't the implementation of this just be `return pq.read_table(path, **kwargs)`? Did you try that already?
I tried that in my first commit; it cannot handle all the cases, such as glob patterns and Parquet files hosted on some URI, e.g. HTTPS or SFTP.
PyArrow natively implements the following filesystem subclasses:
@cpcloud does this make sense to you?
Hi @cpcloud, I got several timeout errors in the CI for this PR. Is there something we need to fix in another PR? Is it related to the Trino setup in the CI?
Hi @cpcloud, in this PR the Trino/Impala tests are also affected. Should I skip Trino and Impala in this test too? Or do you have a better way to handle this?
Force-pushed from 006cfa7 to 0d55190
I'm going to try something here to see if I can isolate which test is leaving us in a (sometimes) broken state only on the nix osx runs.
Ok, that's one pass for the nix mac osx job. I'm going to cycle it a few times to make sure.
Most of my comments are inline -- I think we should start off offering this with a simplified implementation. It's nice to offer users options, but I think we should be measured in how much we enable directly.
ibis/backends/__init__.py
Outdated
```python
        When reading data from cloud storage (such as Amazon S3 or Google Cloud Storage),
        credentials can be provided via the `filesystem` argument by creating an appropriate
        filesystem object (e.g., `pyarrow.fs.S3FileSystem`).
```
This is true, and additionally, `pq.read_table` also supports the standard AWS auth patterns (environment variables, AWS SSO credentials, or instance credentials).
ibis/backends/__init__.py
Outdated
```python
import pyarrow.parquet as pq

path = str(path)
# handle URL
if util.is_url(path):
    import fsspec

    credentials = kwargs.pop("credentials", {})
    with fsspec.open(path, **credentials) as f:
        with BytesIO(f.read()) as reader:
            return pq.read_table(reader)

# handle fsspec-compatible URL
if util.is_fsspec_url(path):
    return pq.read_table(path, **kwargs)

# handle local file paths or patterns
paths = glob.glob(path)
if not paths:
    raise ValueError(f"No files found matching pattern: {path!r}")
elif len(paths) == 1:
    paths = paths[0]

return pq.read_table(paths, **kwargs)
```
I think we should reconsider handling all of these cases -- this sort of branching logic means that when a user reports an error, we'll have any number of possible culprits to consider, and it makes it harder to debug for everyone.
I think (and I could be wrong) that nearly all of these cases are covered by `pq.read_table` by itself, and that's much easier to document and debug. `read_table` also has support for being passed an `fsspec` object, so if someone needs to read from a hypertext URL, they can use `fsspec` as a shim for that. (This is something we can add a note about in the docstring.)
`pq.read_table` can handle most of the cases. I will simplify the logic and see how many cases can be covered. Thanks for your suggestion.
ibis/backends/__init__.py
Outdated
```python
path = str(path)
# handle URL
if util.is_url(path):
    import fsspec
```
`fsspec` is not a dependency of Ibis (it's in our test suite), so this would need extra import handling if we leave it in (but see my other comments).
Ok, skipping the mocked URL test on DuckDB seems to have resolved the nested transaction failures on the nix osx CI job.
Thank you so much.
```python
@util.experimental
def read_parquet(
    self, path: str | Path | BytesIO, table_name: str | None = None, **kwargs: Any
) -> ir.Table:
```
Instead of BytesIO, I could pass the fsspec object; it could be an HTTPFile if we pass an HTTP URL. Not sure what the best way is to handle the type of `path`.
@gforsyth any suggestion?
I think `fsspec` is a good option.
```python
"""Register a parquet file as a table in the current backend.

This function reads a Parquet file and registers it as a table in the current
backend. Note that for Impala and Trino backends, the performance
```
```diff
- backend. Note that for Impala and Trino backends, the performance
+ backend. Note that for the Impala and Trino backends, the performance
```
```python
table_name = table_name or util.gen_name("read_parquet")
paths = list(glob.glob(str(path)))
if paths:
```
I would add a comment here indicating that this is to help with reading from remote file locations
```python
else:
    table = pq.read_table(path, **kwargs)

self.create_table(table_name, table)
```
Similar to the `read_csv` PR, this should probably be a `memtable` so we don't create a persistent table by default.
Description of changes
Support `read_parquet` for backends that do not have native support. This implementation leverages the PyArrow `read_table` function: if a backend does not have its own version (as DuckDB does), it falls back on this PyArrow implementation.
Issues closed
This addresses part of issue #9448. Additional tasks related to this issue will be completed and submitted individually.