feat: support read_parquet for backend with no native support #9744
base: main
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -4,6 +4,7 @@ | |||||
import collections.abc | ||||||
import contextlib | ||||||
import functools | ||||||
import glob | ||||||
import importlib.metadata | ||||||
import keyword | ||||||
import re | ||||||
|
@@ -22,6 +23,7 @@ | |||||
|
||||||
if TYPE_CHECKING: | ||||||
from collections.abc import Iterable, Iterator, Mapping, MutableMapping | ||||||
from io import BytesIO | ||||||
from urllib.parse import ParseResult | ||||||
|
||||||
import pandas as pd | ||||||
|
@@ -1269,6 +1271,100 @@ def has_operation(cls, operation: type[ops.Value]) -> bool: | |||||
f"{cls.name} backend has not implemented `has_operation` API" | ||||||
) | ||||||
|
||||||
@util.experimental | ||||||
def read_parquet( | ||||||
self, path: str | Path | BytesIO, table_name: str | None = None, **kwargs: Any | ||||||
) -> ir.Table: | ||||||
"""Register a parquet file as a table in the current backend. | ||||||
|
||||||
This function reads a Parquet file and registers it as a table in the current | ||||||
backend. Note that for Impala and Trino backends, the performance | ||||||
may be suboptimal. | ||||||
|
||||||
Parameters | ||||||
---------- | ||||||
path | ||||||
The data source. May be a path to a file, glob pattern to match Parquet files, | ||||||
directory of Parquet files, or a BytesIO object. | ||||||
table_name | ||||||
An optional name to use for the created table. This defaults to | ||||||
a sequentially generated name. | ||||||
**kwargs | ||||||
Additional keyword arguments passed to the pyarrow loading function. | ||||||
See https://arrow.apache.org/docs/python/generated/pyarrow.parquet.read_table.html | ||||||
for more information. | ||||||
|
||||||
Returns | ||||||
------- | ||||||
ir.Table | ||||||
The just-registered table | ||||||
|
||||||
Examples | ||||||
-------- | ||||||
Connect to a SQLite database: | ||||||
|
||||||
>>> con = ibis.sqlite.connect() | ||||||
|
||||||
Read a single parquet file: | ||||||
|
||||||
>>> table = con.read_parquet("path/to/file.parquet") | ||||||
|
||||||
Read all parquet files in a directory: | ||||||
|
||||||
>>> table = con.read_parquet("path/to/parquet_directory/") | ||||||
|
||||||
Read parquet files with a glob pattern | ||||||
|
||||||
>>> table = con.read_parquet("path/to/parquet_directory/data_*.parquet") | ||||||
|
||||||
Read from Amazon S3 | ||||||
|
||||||
>>> table = con.read_parquet("s3://bucket-name/path/to/file.parquet") | ||||||
|
||||||
Read from Google Cloud Storage | ||||||
|
||||||
>>> table = con.read_parquet("gs://bucket-name/path/to/file.parquet") | ||||||
|
||||||
Read with a custom table name | ||||||
|
||||||
>>> table = con.read_parquet("s3://bucket/data.parquet", table_name="my_table") | ||||||
|
||||||
Read with additional pyarrow options | ||||||
|
||||||
>>> table = con.read_parquet("gs://bucket/data.parquet", columns=["col1", "col2"]) | ||||||
|
||||||
Read from Amazon S3 with secret info | ||||||
|
||||||
>>> from pyarrow import fs | ||||||
>>> s3_fs = fs.S3FileSystem( | ||||||
... access_key="YOUR_ACCESS_KEY", secret_key="YOUR_SECRET_KEY", region="YOUR_AWS_REGION" | ||||||
... ) | ||||||
>>> table = con.read_parquet("s3://bucket/data.parquet", filesystem=s3_fs) | ||||||
|
||||||
Read from HTTPS URL | ||||||
|
||||||
>>> import fsspec | ||||||
>>> from io import BytesIO | ||||||
>>> url = "https://example.com/data/file.parquet" | ||||||
>>> credentials = {} | ||||||
>>> f = fsspec.open(url, **credentials).open() | ||||||
>>> reader = BytesIO(f.read()) | ||||||
>>> table = con.read_parquet(reader) | ||||||
>>> reader.close() | ||||||
>>> f.close() | ||||||
""" | ||||||
import pyarrow.parquet as pq | ||||||
|
||||||
table_name = table_name or util.gen_name("read_parquet") | ||||||
paths = list(glob.glob(str(path))) | ||||||
if paths: | ||||||
I would add a comment here indicating that this is to help with reading from remote file locations |
||||||
table = pq.read_table(paths, **kwargs) | ||||||
else: | ||||||
table = pq.read_table(path, **kwargs) | ||||||
|
||||||
self.create_table(table_name, table) | ||||||
Similar to the |
||||||
return self.table(table_name) | ||||||
|
||||||
def _transpile_sql(self, query: str, *, dialect: str | None = None) -> str: | ||||||
# only transpile if dialect was passed | ||||||
if dialect is None: | ||||||
|
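The path handling in the new `read_parquet` can be sketched in isolation. This is a minimal sketch, not the PR's code: `expand_local_paths` and `read_parquet_table` are hypothetical names, and sorting the matches is an addition made here for deterministic ordering. It assumes that a remote URI or a file-like object normally matches nothing on the local filesystem, so the original object is handed to `pyarrow.parquet.read_table` unchanged.

```python
import glob
from pathlib import Path
from typing import Any, BinaryIO, List, Union

Source = Union[str, Path, BinaryIO]


def expand_local_paths(path: Source) -> List[str]:
    """Return local files matching `path`, or [] if nothing matches."""
    # str() of a BytesIO or of a remote URI is a pattern that normally
    # matches nothing on the local filesystem, so the result is [].
    return sorted(glob.glob(str(path)))


def read_parquet_table(path: Source, **kwargs: Any):
    """Hypothetical helper: load `path` into a pyarrow Table."""
    import pyarrow.parquet as pq  # imported lazily, as in the diff

    paths = expand_local_paths(path)
    # No local match: pass the original object through so that remote
    # URIs and file-like objects are still handled by pyarrow itself.
    source = paths if paths else path
    return pq.read_table(source, **kwargs)
```

The empty-list fallback is what keeps remote locations working: a glob pattern is only expanded locally, so something like `s3://bucket/data_*.parquet` would not be expanded by this approach.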
Instead of BytesIO, I could pass the fsspec object; it would be an HTTPFile if we pass an HTTP URL. I'm not sure of the best way to handle the type of path. @gforsyth any suggestions?

I think fsspec is a good option.
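One way to handle the type of `path` discussed above is to classify the source first and only reach for fsspec when it looks like a remote URL. The sketch below is an illustration under assumptions, not what the PR implements: `classify_source` and `open_source` are hypothetical helpers, and treating every multi-letter scheme other than `file` as remote is a simplification.

```python
from pathlib import Path
from typing import Any
from urllib.parse import urlparse


def classify_source(path: Any) -> str:
    """Label `path` as 'file-like', 'remote', or 'local'."""
    if hasattr(path, "read"):
        return "file-like"  # e.g. BytesIO or an already-open fsspec file
    if isinstance(path, Path):
        return "local"
    scheme = urlparse(str(path)).scheme
    # Single-letter schemes are Windows drive letters, not URL schemes.
    if len(scheme) > 1 and scheme != "file":
        return "remote"
    return "local"


def open_source(path: Any, **storage_options: Any) -> Any:
    """Hypothetical helper: return an object pq.read_table can consume."""
    if classify_source(path) == "remote":
        import fsspec  # optional dependency, imported only when needed

        # Returns an open fsspec file object; the caller must close it.
        return fsspec.open(str(path), mode="rb", **storage_options).open()
    return path
```

With this shape the signature could keep accepting `str | Path` plus any object with a `read` method, rather than naming `BytesIO` or a specific fsspec class.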