Commit 829cf60

PDEP-10: Add pyarrow as a required dependency (#52711)
1 parent 564b7d2 commit 829cf60

1 file changed: +215 -0 lines
# PDEP-10: PyArrow as a required dependency for default string inference implementation

- Created: 17 April 2023
- Status: Accepted
- Discussion: [#52711](https://github.com/pandas-dev/pandas/pull/52711) [#52509](https://github.com/pandas-dev/pandas/issues/52509)
- Author: [Matthew Roeschke](https://github.com/mroeschke) [Patrick Hoefler](https://github.com/phofl)
- Revision: 1
## Abstract

This PDEP proposes that:

- PyArrow becomes a required runtime dependency starting with pandas 3.0
- The minimum version of PyArrow supported starting with pandas 3.0 is version 7
- When the minimum version of PyArrow is bumped, it will be raised to the highest version that has been released for at least 2 years
- The pandas 2.1 release notes will carry a prominent warning that PyArrow will become a required dependency starting with pandas 3.0. We will pin a feedback issue on the pandas issue tracker, and the note in the release notes will point to that issue
- Starting in pandas 2.2, pandas raises a ``FutureWarning`` on import when PyArrow is not installed in the user's environment. This ensures that only one warning is raised and that users can easily silence it if necessary. This warning will point to the feedback issue
- Starting in pandas 3.0, the default type inferred for string data will be `ArrowDtype` with `pyarrow.string` instead of `object`. Additionally, all dtypes listed below will be inferred as their PyArrow equivalents instead of being stored as `object`

This will bring **immediate benefits to users**, as well as opening the door for significant further benefits in the future.
## Background

PyArrow is an optional dependency of pandas that provides a wide range of supplemental features to pandas:

- Since pandas 0.21.0, PyArrow has provided I/O reading functionality for Parquet
- Since pandas 1.2.0, pandas has integrated PyArrow into the `ExtensionArray` interface to provide an optional string data type backed by PyArrow
- Since pandas 1.4.0, PyArrow has provided I/O reading functionality for CSV
- Since pandas 1.5.0, pandas has provided an `ArrowExtensionArray` and `ArrowDtype` to support all PyArrow data types within the `ExtensionArray` interface
- Since pandas 2.0.0, all I/O readers have the option to return PyArrow-backed data types, and many methods now utilize PyArrow compute functions to accelerate PyArrow-backed data in pandas, notably string and datetime types

As of pandas 2.0, one can feasibly utilize PyArrow as an alternative data representation to NumPy, with advantages such as:

1. Consistent `NA` support for all data types;
2. Broader support of data types such as `decimal`, `date` and nested types;
3. Better interoperability with other dataframe libraries based on Arrow.
## Motivation

While all the functionality described in the previous section is currently optional, PyArrow is already deeply integrated into many areas of pandas. With our roadmap noting that pandas strives for better Apache Arrow interoperability [^1], and with many projects [^2] within or beyond the Python ecosystem adopting or interacting with the Arrow format, making PyArrow a required dependency provides an additional signal of confidence in the Arrow ecosystem (as well as improving interoperability with it).
### Immediate User Benefit 1: pyarrow strings

Currently, when users pass string data into pandas constructors without specifying a data type, the resulting data type is `object`, which has significantly worse memory usage and performance compared to pyarrow strings. With pyarrow string support available since pandas 1.2.0, requiring pyarrow for 3.0 will allow pandas to default the inferred type to the more efficient pyarrow string type.

```python
In [1]: import pandas as pd

In [2]: pd.Series(["a"]).dtype
# Current behavior
Out[2]: dtype('O')

# Future behavior in 3.0
Out[2]: string[pyarrow]
```

Dask developers investigated the performance and memory usage of pyarrow strings [here](https://www.coiled.io/blog/pyarrow-strings-in-dask-dataframes), and found them to be a significant improvement over the current `object` dtype.
A little demo:

```python
import random
import string

import pandas as pd


def random_string() -> str:
    return "".join(random.choices(string.printable, k=random.randint(10, 100)))


ser_object = pd.Series([random_string() for _ in range(1_000_000)])
ser_string = ser_object.astype("string[pyarrow]")
```
PyArrow-backed strings are significantly faster than NumPy object strings:

*str.len*

```python
In [1]: %timeit ser_object.str.len()
118 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [2]: %timeit ser_string.str.len()
24.2 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

*str.startswith*

```python
In [3]: %timeit ser_object.str.startswith("a")
136 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit ser_string.str.startswith("a")
11 ms ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
### Immediate User Benefit 2: Nested Datatypes

Currently, if you try storing `dict`s in a pandas `Series`, you will again get the horrendous `object` dtype:

```python
In [6]: pd.Series([{'a': 1, 'b': 2}, {'a': 2, 'b': 99}])
Out[6]:
0     {'a': 1, 'b': 2}
1    {'a': 2, 'b': 99}
dtype: object
```

If `pyarrow` were required, this could be auto-inferred as `pyarrow.struct`, which again would come with memory and performance improvements.
### Immediate User Benefit 3: Interoperability

Other Arrow-backed dataframe libraries are growing in popularity. Sharing the same memory representation would improve interoperability with them, as operations such as:

```python
import pandas as pd
import polars as pl

df = pd.DataFrame(
    {
        'a': ['one', 'two'],
        'b': [{'name': 'Billy', 'age': 3}, {'name': 'Bob', 'age': 4}],
    }
)
pl.from_pandas(df)
```

could be zero-copy. Users working with multiple dataframe libraries would be able to switch between them more easily.
### Future User Benefits

Requiring PyArrow would simplify related development within pandas and replace functionality currently built on NumPy that would be better served by PyArrow, including:

- Avoiding runtime checks for whether PyArrow is available to perform PyArrow object inference during constructor or indexing operations
- Avoiding the NumPy `object` dtype as much as possible, so that every dtype with a PyArrow equivalent is inferred automatically as such. This includes:
  - decimal
  - binary
  - nested types (list or dict data)
  - strings
  - time
  - date
#### Developer benefits

First, this would simplify development of pyarrow-backed datatypes, as it would avoid optional dependency checks.

Second, it could potentially remove redundant functionality:

- the fastparquet engine in `read_parquet`;
- potentially simplifying the `read_csv` logic (needs more investigation);
- factorization;
- datetime/timezone ops.
## Drawbacks

Including PyArrow would naturally increase the installation size of pandas. For example, when installing pandas and PyArrow using pip from wheels, numpy and pandas require about `70MB`, and including PyArrow requires an additional `120MB`. An increased installation size has negative implications for using pandas in space-constrained development or deployment environments such as AWS Lambda.

Additionally, if a user is installing pandas in an environment where wheels are not available through `pip install` or `conda install`, the user will also need to build Arrow C++ and related dependencies when installing from source. These environments include:

- Alpine Linux (commonly used as a base for Docker containers)
- WASM (pyodide and pyscript)
- Python development versions

Lastly, pandas development and releases will need to be mindful of PyArrow's development and release cadence. For example, when supporting a newly released Python version, pandas will need to check PyArrow's wheel support for that Python version before releasing a new pandas version.
## F.A.Q.

**Q: Why can't pandas just use NumPy string and NumPy void datatypes instead of pyarrow string and pyarrow struct?**

**A**: NumPy strings aren't yet available, whereas pyarrow strings are. The NumPy void datatype would differ from pyarrow struct and would not bring the same interoperability benefit with other Arrow-based dataframe libraries.

**Q: Are all pyarrow dtypes ready? Isn't it too soon to make them the default?**

**A**: They will likely be ready by 3.0 - however, we're not making them the default (yet). For example, `pd.Series([1, 2, 3])` will continue to be auto-inferred as `np.int64`. We will only change the default for dtypes which currently have no `numpy`-backed equivalent and which are stored as `object` dtype, such as strings and nested datatypes.
### PDEP-10 History

- 17 April 2023: Initial version
- 8 May 2023: Changed proposal to make pyarrow required in pandas 3.0 instead of 2.1

[^1]: <https://pandas.pydata.org/docs/development/roadmap.html#apache-arrow-interoperability>
[^2]: <https://arrow.apache.org/powered_by/>
