Extending PyArrow Datasets
==========================

.. warning::

   This protocol is currently experimental.

PyArrow provides a core protocol for datasets, so third-party libraries can both
produce and consume classes that conform to a useful subset of the PyArrow dataset
API. This subset provides enough functionality to enable projection
pushdown. The subset of the API is contained in ``pyarrow.dataset.protocol``.

.. image:: pyarrow_dataset_protocol.svg

Consumers are responsible for calling methods on the protocol to get the data
out of the dataset. The protocol supports getting data as a single stream or
as a series of tasks which may be distributed.

As an example, from the perspective of the user, this is what the code looks like
to retrieve a Delta Lake table as a dataset and use it in DuckDB:

.. code-block:: python
   :emphasize-lines: 2,6

   from deltalake import DeltaTable
   table = DeltaTable("path/to/table")
   dataset = table.to_pyarrow_dataset()

   import duckdb
   df = duckdb.arrow(dataset)
   df.project("y")

Here, DuckDB would pass the projection of ``y`` down to the producer
through the dataset protocol. The deltalake scanner would then only read the
column ``y``. Thus, the user gets to enjoy the performance benefits of pushing
down projections while being able to specify them in their preferred query engine.


Dataset Producers
-----------------

If your library produces data, you may want to expose it by producing a
PyArrow-compatible dataset. Your dataset could be backed by the classes
implemented in PyArrow or you could implement your own classes. Either way, you
should implement the protocol below.

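For example, a producer can often avoid writing scanner and fragment classes of
its own by backing the dataset with PyArrow's implementation. The sketch below
is illustrative only: the ``MyTableFormat`` class and its Parquet-based layout
are hypothetical, not part of any existing library.

.. code-block:: python

   import pyarrow.dataset as ds

   class MyTableFormat:
       """Hypothetical producer that stores a table as a set of Parquet files."""

       def __init__(self, files, schema):
           self._files = files    # list of Parquet file paths
           self._schema = schema  # pyarrow.Schema describing the table

       def to_pyarrow_dataset(self):
           # Delegate to PyArrow's own dataset classes, which already conform
           # to the protocol, instead of implementing scanner and fragment
           # classes by hand.
           return ds.dataset(self._files, schema=self._schema, format="parquet")

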
Dataset Consumers
-----------------

There are two general patterns for consuming PyArrow datasets: reading a single
stream or creating a scan task per fragment.

If you have a streaming execution model, you can receive a single stream
of data by calling ``dataset.scanner(columns=...).to_reader()``.
This will return a RecordBatchReader, which can be exported over the
:ref:`C Stream Interface <c-stream-interface>`. The record batches yielded
from the stream can then be passed to worker threads for parallelism.
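
For instance, a minimal sketch of this pattern might look like the following,
where the column name and the ``process`` function are placeholders for
whatever your engine does with each batch:

.. code-block:: python

   # Get a single stream of record batches, pushing the column selection
   # down to the producer.
   reader = dataset.scanner(columns=["y"]).to_reader()

   for batch in reader:
       process(batch)  # hand each pyarrow.RecordBatch to a worker thread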

If your engine instead creates a scan task per fragment, you can split the
dataset into its fragments and convert each one into its own scanner and
reader. In this case, the code looks more like:

.. code-block:: python

   fragments = list(dataset.get_fragments(columns=...))

   def scan_partition(i):
       fragment = fragments[i]
       # e.g. return a stream of record batches for just this fragment
       return fragment.scanner(columns=...).to_reader()

Fragments are pickleable, so they can be passed to remote workers in a
distributed system.

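Continuing the sketch above, one fragment could be shipped to a worker with the
standard library's ``pickle`` module; the transport itself and the column name
below are placeholders:

.. code-block:: python

   import pickle

   # On the driver: serialize one fragment per scan task.
   payload = pickle.dumps(fragments[0])

   # On the worker: reconstruct the fragment and scan only what is needed.
   fragment = pickle.loads(payload)
   reader = fragment.scanner(columns=["y"]).to_reader()
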
If your engine supports projection (column) pushdown, you can pass the column
selection down to the dataset by passing it to the ``scanner``. Column pushdown
is limited to selecting a subset of columns from the schema. Some
implementations, including PyArrow, may also support projecting and renaming
columns, but this is not part of the protocol.
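
To illustrate the distinction, the sketch below uses made-up column names: the
first call is the plain column selection covered by the protocol, while the
second uses a PyArrow-specific expression projection that other implementations
are not required to support.

.. code-block:: python

   import pyarrow.dataset as ds

   # Covered by the protocol: select a subset of columns by name.
   scanner = dataset.scanner(columns=["x", "y"])

   # PyArrow-specific: renaming a column via an expression projection.
   scanner = dataset.scanner(columns={"y_renamed": ds.field("y")})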


The protocol