GH-33986: [Python] Add a minimal protocol for datasets #35568
@@ -0,0 +1,124 @@
.. Licensed to the Apache Software Foundation (ASF) under one
.. or more contributor license agreements.  See the NOTICE file
.. distributed with this work for additional information
.. regarding copyright ownership.  The ASF licenses this file
.. to you under the Apache License, Version 2.0 (the
.. "License"); you may not use this file except in compliance
.. with the License.  You may obtain a copy of the License at

..   http://www.apache.org/licenses/LICENSE-2.0

.. Unless required by applicable law or agreed to in writing,
.. software distributed under the License is distributed on an
.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
.. KIND, either express or implied.  See the License for the
.. specific language governing permissions and limitations
.. under the License.

Extending PyArrow Datasets
==========================

.. warning::

   This protocol is currently experimental.

PyArrow provides a core protocol for datasets, so third-party libraries can both
produce and consume classes that conform to a useful subset of the PyArrow dataset
API. This subset provides enough functionality to support projection
pushdown. The subset of the API is contained in ``pyarrow.dataset.protocol``.
Producers are scanner implementations. For example, table formats like Delta
Lake and Iceberg might provide their own dataset implementations. Consumers
are typically query engines, such as DuckDB, DataFusion, Polars, and Dask.
Providing a common API avoids a situation where supporting ``N`` dataset
formats in ``M`` query engines requires ``N * M`` integrations.
This is "ok", but the definitions of producer and consumer here are reversed from what they are in Substrait, which confused me for a while. Maybe we can go with "Data producer" and "Data consumer"?
Ha it finally clicked why I myself find these terms confusing 🤣. It feels backwards! I'll think of new names.
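
For illustration, a consumer that targets the protocol only needs to be written once. This is a hypothetical sketch (the `register_table` function and the engine hookup are made up, not part of this PR):

```python
from pyarrow.dataset.protocol import Scannable


def register_table(name: str, data: Scannable) -> None:
    # Hypothetical consumer entry point: any object whose scanner()
    # conforms to the protocol works here, whether it comes from
    # PyArrow, Delta Lake, or Iceberg -- one integration covers them all.
    reader = data.scanner().to_reader()
    for batch in reader:
        ...  # hand each record batch to the engine
```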
@@ -0,0 +1,158 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
"""Protocol definitions for pyarrow.dataset

These provide the abstract interface for a dataset. Other libraries may implement
this interface to expose their data, without having to extend PyArrow's classes.

Applications and libraries that want to consume datasets should accept datasets
that implement these protocols, rather than requiring the specific
PyArrow classes.

The pyarrow.dataset.Dataset class itself implements this protocol.

See Extending PyArrow Datasets for more information:

https://arrow.apache.org/docs/python/integration/dataset.html
"""
import sys
from abc import abstractmethod
from typing import Iterator, List, Optional

# TODO: remove once we drop support for Python 3.7
if sys.version_info >= (3, 8):
    from typing import Protocol, runtime_checkable
else:
    from typing_extensions import Protocol, runtime_checkable

from pyarrow.dataset import Expression
from pyarrow import Table, RecordBatchReader, Schema

@runtime_checkable
class Scanner(Protocol):
    """
    A scanner implementation for a dataset.

    This may be a scan of a whole dataset, or a scan of a single fragment.
    """
    @abstractmethod
    def count_rows(self) -> int:
        """
        Count the number of rows in this dataset or fragment.

        Implementors may provide optimized code paths that compute this from
        metadata.

        Returns
        -------
        int
            The number of rows in the dataset or fragment.
        """
        ...

    @abstractmethod
    def head(self, num_rows: int) -> Table:
        """
        Get the first ``num_rows`` rows of the dataset or fragment.

        Parameters
        ----------
        num_rows : int
            The number of rows to return.

        Returns
        -------
        Table
            A table containing the first ``num_rows`` rows of the dataset or
            fragment.
        """
        ...

    @abstractmethod
    def to_reader(self) -> RecordBatchReader:
        """
        Create a Record Batch Reader for this scan.

        This is used to read the data in chunks.

        Returns
        -------
        RecordBatchReader
        """
        ...
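
To make the protocol concrete, a minimal producer-side implementation could wrap an in-memory table. `TableScanner` is a hypothetical example, not part of this PR:

```python
import pyarrow as pa


class TableScanner:
    """Hypothetical Scanner over an in-memory pyarrow.Table."""

    def __init__(self, table: pa.Table):
        self._table = table

    def count_rows(self) -> int:
        # Computed from metadata; no scan required.
        return self._table.num_rows

    def head(self, num_rows: int) -> pa.Table:
        # Zero-copy slice of the first num_rows rows.
        return self._table.slice(0, num_rows)

    def to_reader(self) -> pa.RecordBatchReader:
        # Stream the table's batches through a RecordBatchReader.
        return pa.RecordBatchReader.from_batches(
            self._table.schema, self._table.to_batches())
```

Because the protocol is `runtime_checkable`, `isinstance(TableScanner(pa.table({"a": [1]})), Scanner)` would pass based on method presence alone.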
@runtime_checkable
class Scannable(Protocol):
    @abstractmethod
    def scanner(self, columns: Optional[List[str]] = None,

Suggested change:
-    def scanner(self, columns: Optional[List[str]] = None,
+    def scanner(self, columns: Optional[Tuple[str, ...]] = None,
Nit, I prefer to use tuples over lists because:
- They are immutable
- And therefore also hashable:
>>> hash((1,2,3))
529344067295497451
>>> hash([1,2,3])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
I don't disagree in principle, but I'm also trying to keep this somewhat compatible. Though maybe we can loosen it to Sequence[str]?
Actually, it currently supports List[str] | Dict[str, Expression]. Do we want to support the dictionary generally, or keep the protocol more narrow than that?
arrow/python/pyarrow/_dataset.pyx
Lines 3182 to 3201 in af38263
        if columns is not None:
            if isinstance(columns, dict):
                for expr in columns.values():
                    if not isinstance(expr, Expression):
                        raise TypeError(
                            "Expected an Expression for a 'column' dictionary "
                            "value, got {} instead".format(type(expr))
                        )
                    c_exprs.push_back((<Expression> expr).unwrap())
                check_status(
                    builder.Project(c_exprs, [tobytes(c) for c in columns.keys()])
                )
            elif isinstance(columns, list):
                check_status(builder.ProjectColumns([tobytes(c) for c in columns]))
            else:
                raise ValueError(
                    "Expected a list or a dict for 'columns', "
                    "got {} instead.".format(type(columns))
                )
The difference between Sequence[str] and Dict[str, Expression] is significant. The former only allows you to pick which columns to load. The latter introduces the concept of projection expressions, which is big.
Related to #35568 (comment), we might want to only allow Sequence[str] here so we only support column selection and reordering. That way we don't need to require consumers to perform projections. I don't think any existing consumer relies on this.
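
To make the distinction concrete, here is roughly what the two shapes look like with today's scanner API (the dataset path is hypothetical):

```python
import pyarrow.dataset as ds

dataset = ds.dataset("path/to/data")  # hypothetical dataset location

# Sequence[str]: column selection and reordering only.
scanner = dataset.scanner(columns=["b", "a"])

# Dict[str, Expression]: the dataset implementation must evaluate
# projection expressions such as `a * 2`.
scanner = dataset.scanner(columns={"a_doubled": ds.field("a") * 2})
```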
Should batch_size be part of to_reader instead of scanner?
I'm open to that. But I don't want to add that to the protocol without also implementing it in PyArrow Datasets. So if we think this is important, I'll remove this for now.
Maybe expand that the default value is not only implementation specific but might not even be consistent between batches?
Also, is this a min/max, or is it a maximum only? In other words, if batch_size is 1_000_000 and the source is a Parquet file with 10 row groups of 100_000 rows, does the scanner need to accumulate the rows, or is it acceptable to return smaller batches?
This is a good detail to think about. I don't think we should require it to be exact; for example, if the reader can't read in exactly that batch size, I don't think it should error. But I do think readers should make their best effort to be as close to the batch size as possible, even if that means splitting row groups into chunks, for example.
Though you might have more informed opinions here; what do you think is reasonable to expect?
This parameter has lost importance in Arrow C++ datasets. It used to be an important tuning parameter that affected the size of the batches used internally by the C++ implementation. However, it didn't make sense for the user to pick the correct value (there are multiple batch sizes in the C++, and the right value might even depend on the schema and be quite difficult to calculate).
I think it still has value, especially as a "max batch size". The user needs some way to say "don't give me 20GB of data all at once".
So I think it needs to be a hard upper limit, but it can be a soft lower limit. We could either call it max_batch_size (and ignore it as a lower limit entirely) or preferred_batch_size (and explain that only the upper limit is strictly enforced). I don't think using this as an upper limit is overly burdensome, as slicing tables/batches should be pretty easy and lightweight. The reverse (concatenating batches) is more complicated and expensive.
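
As a sketch of those "hard upper, soft lower" semantics (the helper is hypothetical, not part of this PR), splitting oversized batches is cheap because slicing is zero-copy:

```python
import pyarrow as pa


def cap_batch_size(reader: pa.RecordBatchReader, max_batch_size: int):
    """Yield batches with at most max_batch_size rows each.

    Oversized batches are split by zero-copy slicing; undersized
    batches pass through unchanged (the lower limit stays soft).
    """
    for batch in reader:
        offset = 0
        while offset < batch.num_rows:
            yield batch.slice(offset, max_batch_size)
            offset += max_batch_size
```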
There are some things I would like to have here, as a user, but I understand we are just getting started and trying to be minimal. So take these as suggestions:
__repr__ <-- converting a fragment to a string is very useful for debugging
estimated_cost <-- I get why this one isn't there, but a fragment might be 5 rows or it might be 5 million rows, and that could be valuable for figuring out how to distribute a dataset workload. Still, there is no universal way of estimating cost, so perhaps we can leave this for an extension.
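
For example (a hypothetical sketch; the path and row-count attributes are assumed, not part of the protocol):

```python
class MyFragment:
    """Hypothetical fragment backed by a single file."""

    def __init__(self, path: str, num_rows: int):
        self._path = path
        self._num_rows = num_rows

    def __repr__(self) -> str:
        # Just enough context to identify the fragment while debugging.
        return f"<MyFragment path={self._path!r} rows={self._num_rows}>"
```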
@@ -0,0 +1,29 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.
"""Test that PyArrow datasets conform to the protocol."""
import pyarrow.dataset.protocol as protocol
import pyarrow.dataset as ds


def test_dataset_protocol():
    assert isinstance(ds.Dataset, protocol.Dataset)
    assert isinstance(ds.Fragment, protocol.Fragment)

    assert isinstance(ds.Dataset, protocol.Scannable)
    assert isinstance(ds.Fragment, protocol.Scannable)

    assert isinstance(ds.Scanner, protocol.Scanner)
Should we say that this is currently experimental, and list the things that we know are on the roadmap?