# PDEP-10: PyArrow as a required dependency for default string inference implementation

- Created: 17 April 2023
- Status: Accepted
- Discussion: [#52711](https://github.com/pandas-dev/pandas/pull/52711)
  [#52509](https://github.com/pandas-dev/pandas/issues/52509)
- Author: [Matthew Roeschke](https://github.com/mroeschke)
  [Patrick Hoefler](https://github.com/phofl)
- Revision: 1
## Abstract

This PDEP proposes that:

- PyArrow becomes a required runtime dependency starting with pandas 3.0
- Starting with pandas 3.0, the minimum supported version of PyArrow is 7.0.
- When the minimum version of PyArrow is bumped, it will be raised to the highest version that has
  been released for at least 2 years.
- The pandas 2.1 release notes will include a prominent warning that PyArrow will become a required dependency starting
  with pandas 3.0. We will pin a feedback issue on the pandas issue tracker, and the release note will point
  to that issue.
- Starting in pandas 2.2, pandas raises a ``FutureWarning`` on import when PyArrow is not installed in the user's
  environment. This ensures that only one warning is raised and that users can
  easily silence it if necessary. This warning will point to the feedback issue.
- Starting in pandas 3.0, the default type inferred for string data will be `ArrowDtype` with `pyarrow.string`
  instead of `object`. Additionally, all dtypes listed below will be inferred as their PyArrow equivalents instead of being stored as `object`.

This will bring **immediate benefits to users**, as well as opening the door for significant further
benefits in the future.
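As a sketch of the escape hatch mentioned above, the import-time warning could be silenced like any other `FutureWarning`. The message filter below is hypothetical, since the exact warning text is not specified in this proposal:

```python
import warnings

# Hypothetical filter: matches any FutureWarning mentioning PyArrow.
# The actual warning text is not specified by this PDEP.
warnings.filterwarnings("ignore", message=".*[Pp]y[Aa]rrow.*", category=FutureWarning)

import pandas as pd  # the PyArrow warning, if raised, no longer reaches the user
```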

## Background

PyArrow is an optional dependency of pandas that provides a wide range of supplemental features:

- Since pandas 0.21.0, PyArrow has provided I/O reading functionality for Parquet
- Since pandas 1.2.0, pandas has integrated PyArrow into the `ExtensionArray` interface to provide an
  optional string data type backed by PyArrow
- Since pandas 1.4.0, PyArrow has provided I/O reading functionality for CSV
- Since pandas 1.5.0, pandas has provided an `ArrowExtensionArray` and `ArrowDtype` to support all PyArrow
  data types within the `ExtensionArray` interface
- Since pandas 2.0.0, all I/O readers have the option to return PyArrow-backed data types, and many methods
  now utilize PyArrow compute functions to accelerate operations on PyArrow-backed data, notably for string and datetime types.

As of pandas 2.0, one can feasibly utilize PyArrow as an alternative data representation to NumPy, with advantages such as:

1. Consistent `NA` support for all data types;
2. Broader support of data types such as `decimal`, `date` and nested types;
3. Better interoperability with other dataframe libraries based on Arrow.

## Motivation

While all the functionality described in the previous section is currently optional, PyArrow is already deeply
integrated into many areas of pandas. With our roadmap noting that pandas strives for better Apache Arrow
interoperability [^1], and with many projects [^2] within and beyond the Python ecosystem adopting or interacting with
the Arrow format, making PyArrow a required dependency provides an additional signal of confidence in the Arrow
ecosystem (as well as improving interoperability with it).
### Immediate User Benefit 1: pyarrow strings

Currently, when users pass string data into pandas constructors without specifying a data type, the resulting data type
is `object`, which has significantly worse memory usage and performance than pyarrow strings.
With pyarrow string support available since pandas 1.2.0, requiring pyarrow for 3.0 will allow pandas to default
the inferred type to the more efficient pyarrow string type.

```python
In [1]: import pandas as pd

In [2]: pd.Series(["a"]).dtype
# Current behavior
Out[2]: dtype('O')

# Future behavior in 3.0
Out[2]: string[pyarrow]
```

Dask developers investigated the performance and memory usage of pyarrow strings [here](https://www.coiled.io/blog/pyarrow-strings-in-dask-dataframes)
and found them to be a significant improvement over the current `object` dtype.

A small demo:

```python
import string
import random

import pandas as pd


def random_string() -> str:
    return "".join(random.choices(string.printable, k=random.randint(10, 100)))


ser_object = pd.Series([random_string() for _ in range(1_000_000)])
ser_string = ser_object.astype("string[pyarrow]")
```

PyArrow-backed strings are significantly faster than NumPy object strings:

*str.len*

```python
In [1]: %timeit ser_object.str.len()
118 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [2]: %timeit ser_string.str.len()
24.2 ms ± 187 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

*str.startswith*

```python
In [3]: %timeit ser_object.str.startswith("a")
136 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [4]: %timeit ser_string.str.startswith("a")
11 ms ± 19.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

### Immediate User Benefit 2: Nested Datatypes

Currently, if you try storing `dict`s in a pandas `Series`, you will again get the horrendous `object` dtype:

```python
In [6]: pd.Series([{'a': 1, 'b': 2}, {'a': 2, 'b': 99}])
Out[6]:
0     {'a': 1, 'b': 2}
1    {'a': 2, 'b': 99}
dtype: object
```

If `pyarrow` were required, this could be auto-inferred as `pyarrow.struct`, which again
would come with memory and performance improvements.

### Immediate User Benefit 3: Interoperability

Other Arrow-backed dataframe libraries are growing in popularity. Having the same memory representation
would improve interoperability with them, as operations such as:

```python
import pandas as pd
import polars as pl

df = pd.DataFrame(
    {
        'a': ['one', 'two'],
        'b': [{'name': 'Billy', 'age': 3}, {'name': 'Bob', 'age': 4}],
    }
)
pl.from_pandas(df)
```

could be zero-copy. Users making use of multiple dataframe libraries would more easily be able to
switch between them.

### Future User Benefits

Requiring PyArrow would simplify related development within pandas and allow replacing NumPy-based
functionality that is better suited to PyArrow, including:

- Avoiding runtime checks for whether PyArrow is available to perform PyArrow object inference during constructor or indexing operations

- Avoiding the NumPy object dtype as much as possible. This means that every dtype that has a PyArrow equivalent is inferred automatically as such. This includes:
  - decimal
  - binary
  - nested types (list or dict data)
  - strings
  - time
  - date

#### Developer benefits

First, this would simplify development of pyarrow-backed datatypes, as it would avoid
optional dependency checks.

Second, it could potentially remove redundant functionality:

- the fastparquet engine in `read_parquet`;
- potentially simplified `read_csv` logic (needs more investigation);
- factorization;
- datetime/timezone ops.

## Drawbacks

Including PyArrow would naturally increase the installation size of pandas. For example, when installing pandas and PyArrow
using pip from wheels, numpy and pandas require about `70MB`, and PyArrow requires an additional `120MB`.
This increase in installation size has negative implications for using pandas in space-constrained development or deployment environments
such as AWS Lambda.

Additionally, if a user is installing pandas in an environment where wheels are not available through `pip install` or `conda install`,
the user will also need to build Arrow C++ and related dependencies when installing from source. These environments include:

- Alpine Linux (commonly used as a base for Docker containers)
- WASM (pyodide and pyscript)
- Python development versions

Lastly, pandas development and releases will need to be mindful of PyArrow's development and release cadence. For example, when
supporting a newly released Python version, pandas will need to wait for PyArrow wheel support for that Python version
before releasing a new pandas version.

## F.A.Q.

**Q: Why can't pandas just use NumPy string and NumPy void datatypes instead of pyarrow string and pyarrow struct?**

**A**: NumPy strings aren't available yet, whereas pyarrow strings are. The NumPy void datatype would differ from pyarrow struct
   and would not bring the same interoperability benefit with other Arrow-based dataframe libraries.

**Q: Are all pyarrow dtypes ready? Isn't it too soon to make them the default?**

**A**: They will likely be ready by 3.0 - however, we're not making them the default (yet).
   For example, `pd.Series([1, 2, 3])` will continue to be auto-inferred as
   `np.int64`. We will only change the default for dtypes which currently have no `numpy`-backed equivalent and which are
   stored as `object` dtype, such as strings and nested datatypes.

### PDEP-10 History

- 17 April 2023: Initial version
- 8 May 2023: Changed proposal to make pyarrow required in pandas 3.0 instead of 2.1

[^1]: <https://pandas.pydata.org/docs/development/roadmap.html#apache-arrow-interoperability>
[^2]: <https://arrow.apache.org/powered_by/>