DOC/ENH: Add documentation and whatsnew entry for ArrowDtype and ArrowExtensionArray #47854

Merged
merged 18 commits into from
Aug 18, 2022
65 changes: 52 additions & 13 deletions doc/source/reference/arrays.rst
@@ -19,19 +19,20 @@ objects contained with a :class:`Index`, :class:`Series`, or
For some data types, pandas extends NumPy's type system. String aliases for these types
can be found at :ref:`basics.dtypes`.

=================== ========================= ================== =============================
Kind of Data        pandas Data Type          Scalar             Array
=================== ========================= ================== =============================
TZ-aware datetime   :class:`DatetimeTZDtype`  :class:`Timestamp` :ref:`api.arrays.datetime`
Timedeltas          (none)                    :class:`Timedelta` :ref:`api.arrays.timedelta`
Period (time spans) :class:`PeriodDtype`      :class:`Period`    :ref:`api.arrays.period`
Intervals           :class:`IntervalDtype`    :class:`Interval`  :ref:`api.arrays.interval`
Nullable Integer    :class:`Int64Dtype`, ...  (none)             :ref:`api.arrays.integer_na`
Categorical         :class:`CategoricalDtype` (none)             :ref:`api.arrays.categorical`
Sparse              :class:`SparseDtype`      (none)             :ref:`api.arrays.sparse`
Strings             :class:`StringDtype`      :class:`str`       :ref:`api.arrays.string`
Boolean (with NA)   :class:`BooleanDtype`     :class:`bool`      :ref:`api.arrays.bool`
=================== ========================= ================== =============================
=================== ========================= ============================= =============================
Kind of Data        pandas Data Type          Scalar                        Array
=================== ========================= ============================= =============================
TZ-aware datetime   :class:`DatetimeTZDtype`  :class:`Timestamp`            :ref:`api.arrays.datetime`
Timedeltas          (none)                    :class:`Timedelta`            :ref:`api.arrays.timedelta`
Period (time spans) :class:`PeriodDtype`      :class:`Period`               :ref:`api.arrays.period`
Intervals           :class:`IntervalDtype`    :class:`Interval`             :ref:`api.arrays.interval`
Nullable Integer    :class:`Int64Dtype`, ...  (none)                        :ref:`api.arrays.integer_na`
Categorical         :class:`CategoricalDtype` (none)                        :ref:`api.arrays.categorical`
Sparse              :class:`SparseDtype`      (none)                        :ref:`api.arrays.sparse`
Strings             :class:`StringDtype`      :class:`str`                  :ref:`api.arrays.string`
Boolean (with NA)   :class:`BooleanDtype`     :class:`bool`                 :ref:`api.arrays.bool`
PyArrow             :class:`ArrowDtype`       Python Scalars or :class:`NA` :ref:`api.arrays.arrow`
=================== ========================= ============================= =============================

pandas and third-party libraries can extend NumPy's type system (see :ref:`extending.extension-types`).
The top-level :meth:`array` method can be used to create a new array, which may be
@@ -42,6 +43,44 @@ stored in a :class:`Series`, :class:`Index`, or as a column in a :class:`DataFra

   array

.. _api.arrays.arrow:

PyArrow
-------

.. warning::

   This feature is experimental, and the API can change in a future release without warning.

The :class:`arrays.ArrowExtensionArray` is backed by a :external+pyarrow:py:class:`pyarrow.ChunkedArray` with a
:external+pyarrow:py:class:`pyarrow.DataType` instead of a NumPy array and data type. The ``.dtype`` of a :class:`arrays.ArrowExtensionArray`
is an :class:`ArrowDtype`.

`PyArrow <https://arrow.apache.org/docs/python/index.html>`__ provides array and `data type <https://arrow.apache.org/docs/python/api/datatypes.html>`__
support similar to NumPy's, including first-class nullability support for all data types, immutability, and more.

.. note::

   For string types (``pyarrow.string()``, ``string[pyarrow]``), PyArrow support is still facilitated
   by :class:`arrays.ArrowStringArray` and ``StringDtype("pyarrow")``. See the :ref:`string section <api.arrays.string>`
   below.

While individual values in an :class:`arrays.ArrowExtensionArray` are stored as PyArrow objects, scalars are **returned**
as Python scalars corresponding to the data type, e.g. a PyArrow int64 will be returned as Python int, or :class:`NA` for missing
values.

.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   arrays.ArrowExtensionArray
Member

Should this be core.arrays.arrow.ArrowExtensionArray? Seems like we don't have ArrowExtensionArray imported in arrays/__init__.py.

Member Author

Yeah, this PR depends on #47818 where these will be defined.


.. autosummary::
   :toctree: api/
   :template: autosummary/class_without_autosummary.rst

   ArrowDtype
Member

Should this be core.arrays.arrow.ArrowDtype?

CI is red, I guess this is the cause: https://github.com/pandas-dev/pandas/runs/7616698322?check_suite_focus=true#step:6:153

Member Author

Yeah, this PR depends on #47818 where these will be defined.


.. _api.arrays.datetime:

Datetimes
34 changes: 34 additions & 0 deletions doc/source/whatsnew/v1.5.0.rst
@@ -24,6 +24,40 @@ https://github.com/pandas-dev/pandas-stubs for more information.

We thank VirtusLab and Microsoft for their initial, significant contributions to ``pandas-stubs``

.. _whatsnew_150.enhancements.arrow:

Native PyArrow-backed ExtensionArray
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

With `PyArrow <https://arrow.apache.org/docs/python/index.html>`__ installed, users can now create pandas objects
that are backed by a ``pyarrow.ChunkedArray`` and ``pyarrow.DataType``.

The ``dtype`` argument can accept a string of a `pyarrow data type <https://arrow.apache.org/docs/python/api/datatypes.html>`__
with ``pyarrow`` in brackets, e.g. ``"int64[pyarrow]"``, or, for pyarrow data types that take parameters, an :class:`ArrowDtype`
initialized with a ``pyarrow.DataType``.

.. ipython:: python

    import pyarrow as pa
    ser_float = pd.Series([1.0, 2.0, None], dtype="float32[pyarrow]")
    ser_float

    list_of_int_type = pd.ArrowDtype(pa.list_(pa.int64()))
    ser_list = pd.Series([[1, 2], [3, None]], dtype=list_of_int_type)
    ser_list

    ser_list.take([1, 0])
Member

Looks like variables are cleaned from one block to the next, and sphinx is not happy here for ser_list not being defined. Maybe we can merge both blocks and move the comment in between after the block?

Member

> Looks like variables are cleaned from one block to the next

FYI, that's not the case: all variables are kept for (at least) a full file, so you can define a variable in one ipython code block and use it in a later one.
(The reason it was failing was that ArrowDtype wasn't actually exposed in the main namespace yet.)

Member Author

I ended up just combining both blocks anyways.

Member

Yep, and that's perfect here, just wanted to clarify ;)

    ser_float * 5
    ser_float.mean()
    ser_float.dropna()

Most operations are supported and have been implemented using `pyarrow compute <https://arrow.apache.org/docs/python/api/compute.html>`__ functions.
We recommend installing the latest version of PyArrow to access the most recently implemented compute functions.

.. warning::

    This feature is experimental, and the API can change in a future release without warning.

.. _whatsnew_150.enhancements.dataframe_interchange:

DataFrame interchange protocol implementation
2 changes: 2 additions & 0 deletions pandas/arrays/__init__.py
@@ -4,6 +4,7 @@
See :ref:`extending.extension-types` for more.
"""
from pandas.core.arrays import (
    ArrowExtensionArray,
    ArrowStringArray,
    BooleanArray,
    Categorical,
@@ -19,6 +20,7 @@
)

__all__ = [
    "ArrowExtensionArray",
    "ArrowStringArray",
    "BooleanArray",
    "Categorical",
43 changes: 41 additions & 2 deletions pandas/core/arrays/arrow/array.py
@@ -159,8 +159,47 @@ def to_pyarrow_type(

class ArrowExtensionArray(OpsMixin, ExtensionArray):
    """
    Base class for ExtensionArray backed by Arrow ChunkedArray.
    """
    Pandas ExtensionArray backed by a PyArrow ChunkedArray.

    .. warning::

       ArrowExtensionArray is considered experimental. The implementation and
       parts of the API may change without warning.

    Parameters
    ----------
    values : pyarrow.Array or pyarrow.ChunkedArray

    Attributes
    ----------
    None

    Methods
    -------
    None

    Returns
    -------
    ArrowExtensionArray

    Notes
    -----
    Most methods are implemented using `pyarrow compute functions <https://arrow.apache.org/docs/python/api/compute.html>`__.
    Some methods may either raise an exception or emit a ``PerformanceWarning`` if an
    associated compute function is not available based on the installed version of PyArrow.

    Please install the latest version of PyArrow to enable the best functionality and avoid
    potential bugs in prior versions of PyArrow.

    Examples
    --------
    Create an ArrowExtensionArray with :func:`pandas.array`:

    >>> pd.array([1, 1, None], dtype="int64[pyarrow]")
    <ArrowExtensionArray>
    [1, 1, <NA>]
    Length: 3, dtype: int64[pyarrow]
    """  # noqa: E501 (http link too long)

    _data: pa.ChunkedArray
    _dtype: ArrowDtype
47 changes: 44 additions & 3 deletions pandas/core/arrays/arrow/dtype.py
@@ -20,9 +20,47 @@
@register_extension_dtype
class ArrowDtype(StorageExtensionDtype):
    """
    Base class for dtypes for ArrowExtensionArray.
    Modeled after BaseMaskedDtype
    """
    An ExtensionDtype for PyArrow data types.

    .. warning::

       ArrowDtype is considered experimental. The implementation and
       parts of the API may change without warning.

    While most ``dtype`` arguments can accept the "string"
    constructor, e.g. ``"int64[pyarrow]"``, ArrowDtype is useful
    if the data type contains parameters like ``pyarrow.timestamp``.

    Parameters
    ----------
    pyarrow_dtype : pa.DataType
        An instance of a `pyarrow.DataType <https://arrow.apache.org/docs/python/api/datatypes.html#factory-functions>`__.

    Attributes
    ----------
    pyarrow_dtype

    Methods
    -------
    None

    Returns
    -------
    ArrowDtype

    Examples
    --------
    >>> import pyarrow as pa
    >>> pd.ArrowDtype(pa.int64())
    int64[pyarrow]

    Types with parameters must be constructed with ArrowDtype.

    >>> pd.ArrowDtype(pa.timestamp("s", tz="America/New_York"))
    timestamp[s, tz=America/New_York][pyarrow]
    >>> pd.ArrowDtype(pa.list_(pa.int64()))
    list<item: int64>[pyarrow]
    """  # noqa: E501

    _metadata = ("storage", "pyarrow_dtype")  # type: ignore[assignment]

@@ -37,6 +75,9 @@ def __init__(self, pyarrow_dtype: pa.DataType) -> None:
            )
        self.pyarrow_dtype = pyarrow_dtype

    def __repr__(self) -> str:
        return self.name

    @property
    def type(self):
        """
1 change: 1 addition & 0 deletions scripts/validate_rst_title_capitalization.py
@@ -150,6 +150,7 @@
    "LZMA",
    "Numba",
    "Timestamp",
    "PyArrow",
}

CAP_EXCEPTIONS_DICT = {word.lower(): word for word in CAPITALIZATION_EXCEPTIONS}