EA: support basic 2D operations #27142

Closed
wants to merge 36 commits into from
Changes from 29 commits
d673008
un-xfail tests, xfail instead of skip, minor cleanup
jbrockmendel Jun 28, 2019
b2a837b
Merge branch 'master' of https://github.com/pandas-dev/pandas into xf…
jbrockmendel Jun 28, 2019
c2fd1b1
Merge branch 'master' of https://github.com/pandas-dev/pandas into xf…
jbrockmendel Jun 29, 2019
bed5563
REF: derive __len__ from shape instead of vice-versa
jbrockmendel Jun 30, 2019
29d9dc2
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Jun 30, 2019
f3ce13c
update docstring
jbrockmendel Jun 30, 2019
4fd24c1
remove duplicated methods
jbrockmendel Jun 30, 2019
c0505ee
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Jul 1, 2019
c81daeb
implement shape in terms of size, with implement_2d decorator
jbrockmendel Jul 1, 2019
4d77dbe
move implement_2d, implement view
jbrockmendel Jul 1, 2019
d43ef30
port tests
jbrockmendel Jul 2, 2019
91c979b
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Jul 2, 2019
bc220c3
shape patching, tests
jbrockmendel Jul 2, 2019
421b5a3
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Jul 2, 2019
7c6df89
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Jul 5, 2019
203504c
blackify
jbrockmendel Jul 5, 2019
bc12f01
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Jul 8, 2019
2f18dae
shape fixups
jbrockmendel Jul 8, 2019
eb0645d
blackify+isort
jbrockmendel Jul 8, 2019
2540933
property read-write
jbrockmendel Jul 8, 2019
5b60fb5
add docstring
jbrockmendel Jul 9, 2019
96f3ae2
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Jul 22, 2019
92a0a56
implement base class view
jbrockmendel Jul 22, 2019
81191bf
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Jul 22, 2019
91639dd
use base view
jbrockmendel Jul 22, 2019
34cc9e9
patch take, getitem
jbrockmendel Jul 22, 2019
444f9f7
blackify
jbrockmendel Jul 22, 2019
7c15b74
isort fixup
jbrockmendel Jul 22, 2019
768d75d
patch iter
jbrockmendel Jul 22, 2019
3b7b2b2
slice handling cleanup
jbrockmendel Jul 22, 2019
177bfb0
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Jul 29, 2019
f5cba22
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Aug 2, 2019
174a1da
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Aug 3, 2019
1d78fbe
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Aug 5, 2019
41b49d9
Merge branch 'master' of https://github.com/pandas-dev/pandas into ar…
jbrockmendel Aug 9, 2019
fc331b8
dummy to force CI
jbrockmendel Aug 9, 2019
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.25.0.rst
@@ -888,6 +888,7 @@ Other API changes
 - :meth:`ExtensionArray.argsort` places NA values at the end of the sorted array. (:issue:`21801`)
 - :meth:`DataFrame.to_hdf` and :meth:`Series.to_hdf` will now raise a ``NotImplementedError`` when saving a :class:`MultiIndex` with extention data types for a ``fixed`` format. (:issue:`7775`)
 - Passing duplicate ``names`` in :meth:`read_csv` will now raise a ``ValueError`` (:issue:`17346`)
+- :meth:`Categorical.ravel` will now return a :class:`Categorical` instead of a NumPy array. (:issue:`27153`)
TomAugspurger marked this conversation as resolved.
Contributor

Is this change in 0.25? (IIRC yes?) If not, can you break it out?

Member Author

I think we only recently deprecated the old behavior. Will double-check and see what's appropriate.


.. _whatsnew_0250.deprecations:

1 change: 1 addition & 0 deletions pandas/core/arrays/__init__.py
@@ -1,3 +1,4 @@
+from ._reshaping import implement_2d  # noqa:F401
 from .array_ import array  # noqa: F401
 from .base import (  # noqa: F401
     ExtensionArray,
236 changes: 236 additions & 0 deletions pandas/core/arrays/_reshaping.py
@@ -0,0 +1,236 @@
+"""
+Utilities for implementing 2D compatibility for 1D ExtensionArrays.
+"""
+from functools import wraps
+from typing import Tuple
+
+import numpy as np
+
+from pandas._libs.lib import is_integer
+
+
+def implement_2d(cls):
Contributor

what is the reasoning behind making this a decorator as opposed to just having base class method? it seems much simpler and more obvious.

Member Author

The main pushback on this idea was the onus put on EA authors who only support 1D. ATM this decorator mainly just renames __len__ to size so that it can re-define shape in a 2D-compatible way. So asking authors to use the decorator instead of doing that re-definition themselves is kind of a wash.

But the step after this is to patch __getitem__, __setitem__, take, __iter__, all of which are easier to do with the decorator than by asking authors to do it themselves.

Contributor

Is the plan still to have an "opt me out of this, I natively support 2D arrays"? If we go down the route of putting every Block.values inside an EA (with PandasArray for the current NumPy-backed Blocks), then we'll want that, right?

Contributor

Oh, I see you're applying this to subclasses, rather than ExtensionArray itself. Motivation for that?

Member Author

> Motivation for that?

Less magic than a metaclass (and my first attempt at using a metaclass failed).

> Is the plan still to have an "opt me out of this, I natively support 2D arrays"?

Defined EA._allows_2d = False. Authors set that to True if they implement this themselves. This decorator should be updated to check that and be a no-op in that case.
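
The opt-out described in this reply could look roughly like the sketch below. It is an illustration only: `BaseArray`, `implement_2d_sketch`, `OneDimOnly` and `Native2D` are hypothetical stand-ins, not pandas classes, and the patched `__getitem__` is far simpler than the one in this PR.

```python
from functools import wraps


class BaseArray:
    # Hypothetical stand-in for ExtensionArray; not the pandas class.
    _allows_2d = False

    def __init__(self, values):
        self._values = list(values)

    def __getitem__(self, key):
        return self._values[key]


def implement_2d_sketch(cls):
    """Patch ``__getitem__`` unless the class opts out via ``_allows_2d``."""
    if cls._allows_2d:
        # The author handles 2D natively, so leave the class untouched.
        return cls

    orig_getitem = cls.__getitem__

    @wraps(orig_getitem)
    def __getitem__(self, key):
        # Treat (0, k) or (-1, k) as k, mimicking a (1, N) layout.
        if isinstance(key, tuple):
            row, col = key
            if row not in (0, -1):
                raise IndexError(key)
            key = col
        return orig_getitem(self, key)

    cls.__getitem__ = __getitem__
    return cls


@implement_2d_sketch
class OneDimOnly(BaseArray):
    pass


@implement_2d_sketch
class Native2D(BaseArray):
    _allows_2d = True

    def __getitem__(self, key):
        return ("native", key)
```

With this shape, `OneDimOnly` gains tuple indexing while `Native2D` keeps its own `__getitem__`, and the base class itself is never modified.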

Contributor

Oh, right, because simply decorating

@implement_2d
class ExtensionArray:

wouldn't work for subclasses, right? It would have to be a metaclass. I'll poke around at the metaclass approach to see what's going on. May be too magical.
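
A small sketch of why decorating only the base class does not reach subclasses. The names here (`mark_patched`, `Base`, `HookedBase`, and so on) are hypothetical. The `__init_subclass__` hook is shown as one lighter-weight alternative to a metaclass; it is not something proposed in this thread and requires Python 3.6+.

```python
def mark_patched(cls):
    # Hypothetical decorator: records which class it was applied to.
    cls._patched_for = cls.__name__
    return cls


@mark_patched
class Base:
    pass


class Child(Base):
    # The decorator ran once, on Base; Child only inherits the attribute.
    pass


class HookedBase:
    # __init_subclass__ runs once per subclass, much like a metaclass would.
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        mark_patched(cls)


class HookedChild(HookedBase):
    pass
```

`Child._patched_for` still reads `"Base"` because the decorator never saw `Child`, whereas `HookedChild` is patched under its own name.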

But if we do go with a

+    """
+    A decorator to take a 1-dimension-only ExtensionArray subclass and make
+    it support limited 2-dimensional operations.
+    """
+    from pandas.core.arrays import ExtensionArray
+
+    # For backwards-compatibility, if an EA author implemented __len__
+    # but not size, we use that __len__ method to get an array's size.
+    has_size = cls.size is not ExtensionArray.size
+    has_shape = cls.shape is not ExtensionArray.shape
+    has_len = cls.__len__ is not ExtensionArray.__len__
+
+    if not has_size and has_len:
+        cls.size = property(cls.__len__)
+        cls.__len__ = ExtensionArray.__len__
+
+    elif not has_size and has_shape:
+
+        @property
+        def size(self) -> int:
+            return np.prod(self.shape)
+
+        cls.size = size
+
+    orig_copy = cls.copy
TomAugspurger marked this conversation as resolved.

+    @wraps(orig_copy)
+    def copy(self):
+        result = orig_copy(self)
+        result._shape = self._shape
+        return result
+
+    cls.copy = copy
+
+    orig_getitem = cls.__getitem__
+
+    def __getitem__(self, key):
+        if self.ndim == 1:
+            return orig_getitem(self, key)
+
+        key = expand_key(key, self.shape)
+        if is_integer(key[0]):
+            assert key[0] in [0, -1]
Contributor

I don't understand this at a glance. Why does it have to be in [0, -1]? (I get that we're only allowing N x 1 and 1 x N arrays, but I don't get why key[0] is the one being checked.)

Just from looking at it, I would expect this to break __getitem__ on something like

arr = pd.array([1, 2, 3], dtype="Int64")
# arr = arr.reshape(...)
arr[1]

Member Author

I think I wrote this with only (1, N) in mind
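
To make the reviewer's concern concrete: restricting `key[0]` to `[0, -1]` only fits a (1, N) layout, because in an (N, 1) layout the first index ranges over all N rows. The sketch below uses a hypothetical helper, `flat_index`, that maps a 2D integer key onto the underlying 1D position for either layout. It is not part of this PR.

```python
def flat_index(shape, row, col):
    """Map a 2D integer key onto the 1D position for (1, N) or (N, 1)."""
    nrows, ncols = shape
    if nrows != 1 and ncols != 1:
        raise NotImplementedError(shape)
    # Bounds-check each axis against its own length.
    for idx, length in ((row, nrows), (col, ncols)):
        if not -length <= idx < length:
            raise IndexError((row, col))
    # Normalise negative indices, then apply row-major arithmetic.
    row %= nrows
    col %= ncols
    return row * ncols + col
```

For shape `(1, 3)` only rows `0` and `-1` are accepted, which matches the assertion in the diff; for shape `(3, 1)` row `1` is valid and maps to position `1`.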

+            result = orig_getitem(self, key[1])
+            return result
+
+        if isinstance(key[0], slice):
+            if slice_contains_zero(key[0]):
+                result = orig_getitem(self, key[1])
+                result._shape = (1, result.size)
+                return result
+
+            raise NotImplementedError(key)
+        # TODO: ellipses?
+        raise NotImplementedError(key)
+
+    cls.__getitem__ = __getitem__
+
+    orig_take = cls.take
+
+    # kwargs for compat with Interval
+    # allow_fill=None instead of False is for compat with Categorical
+    def take(self, indices, allow_fill=None, fill_value=None, axis=0, **kwargs):
+        if self.ndim == 1 and axis == 0:
+            return orig_take(
+                self, indices, allow_fill=allow_fill, fill_value=fill_value, **kwargs
+            )
+
+        if self.ndim != 2 or self.shape[0] != 1:
+            raise NotImplementedError
+        if axis not in [0, 1]:
+            raise ValueError(axis)
+        if kwargs:
+            raise ValueError(
+                "kwargs should not be passed in the 2D case, "
+                "are only included for compat with Interval"
+            )
+
+        if axis == 1:
+            result = orig_take(
+                self, indices, allow_fill=allow_fill, fill_value=fill_value
+            )
+            result._shape = (1, result.size)
+            return result
+
+        # For axis == 0, because we only support shape (1, N)
+        # there are only limited indices we can accept
+        if len(indices) != 1:
+            # TODO: we could probably support zero-len here
+            raise NotImplementedError
+
+        def take_item(n):
+            if n == -1:
+                seq = [fill_value] * self.shape[1]
+                return type(self)._from_sequence(seq)
+            else:
+                return self[n, :]
+
+        arrs = [take_item(n) for n in indices]
+        result = type(self)._concat_same_type(arrs)
+        result.shape = (len(indices), self.shape[1])
+        return result
+
+    cls.take = take
+
+    orig_iter = cls.__iter__
+
+    def __iter__(self):
+        if self.ndim == 1:
+            for obj in orig_iter(self):
Contributor

Does just return orig_iter(self) work?

Member Author

IIRC from one of the proofs-of-concept this actually mattered, but I don't remember which direction it mattered in.

+                yield obj
+        else:
+            for n in range(self.shape[0]):
Contributor

Can this case be written as something like

reduced = self[:, 0]
return orig_iter(reduced)

? I worry a bit about the cost of calling __getitem__ on each iteration.

Member Author

Definitely don't want to call orig_iter here, since we want to yield ndim-1 arrays.
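
The distinction drawn in this reply, sketched with a toy container. `RowArray` is hypothetical and stores a plain list. Iterating a 2D object walks axis 0, so each item should be a whole 1D row rather than a scalar.

```python
class RowArray:
    """Toy (1, N) container; hypothetical, not an ExtensionArray."""

    def __init__(self, values):
        self._values = list(values)
        self.shape = (1, len(self._values))

    def __getitem__(self, n):
        # Only one row exists, so only 0 and -1 are valid row indices.
        if n not in (0, -1):
            raise IndexError(n)
        return list(self._values)

    def iter_rows(self):
        # What the patched __iter__ aims for: one ndim-1 item per row.
        for n in range(self.shape[0]):
            yield self[n]

    def iter_scalars(self):
        # What delegating to the original 1D iterator would produce.
        return iter(self._values)
```

Since `shape[0]` is 1 in the supported layout, the per-iteration `__getitem__` cost raised by the reviewer amounts to a single call here.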

+                yield self[n]
+
+    cls.__iter__ = __iter__
+
+    return cls
Contributor

Not sure if we are still doing this approach, but if we are:

I would break out each of the implementations into a well-named function that takes cls. Then implement_2d is pretty straightforward to read without having to understand the details: you can immediately see what is being changed, and the details live in the functions.

Member Author

Agreed.

ATM the relevant discussion about how to proceed is in Tom's PR here



+def slice_contains_zero(slc: slice) -> bool:
+    if slc == slice(None):
+        return True
+    if slc == slice(0, None):
+        return True
+    if slc == slice(0, 1):
+        return True
+    raise NotImplementedError(slc)
+
+
+def expand_key(key, shape):
+    ndim = len(shape)
+    if ndim != 2 or shape[0] != 1:
+        raise NotImplementedError
+    if not isinstance(key, tuple):
+        key = (key, slice(None))
+    if len(key) != 2:
+        raise ValueError(key)
+
+    if is_integer(key[0]) and key[0] not in [0, -1]:
+        raise ValueError(key)
+
+    return key
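
A standalone sketch of how `expand_key` normalises keys, for readers who want to try it without pandas. It assumes `isinstance(x, int)` as a stand-in for `pandas._libs.lib.is_integer`, so NumPy integer scalars are not recognised here; the name `expand_key_sketch` is hypothetical.

```python
def expand_key_sketch(key, shape):
    """Normalise an indexing key to a 2-tuple for a (1, N) array."""
    if len(shape) != 2 or shape[0] != 1:
        raise NotImplementedError(shape)
    if not isinstance(key, tuple):
        # A bare key indexes axis 0; take everything along axis 1.
        key = (key, slice(None))
    if len(key) != 2:
        raise ValueError(key)
    # Assumption: plain int check instead of pandas' is_integer.
    if isinstance(key[0], int) and key[0] not in (0, -1):
        raise ValueError(key)
    return key
```

Note that a bare integer such as `2` is read as a row index, so it is rejected on a (1, N) array even when `N > 2`; selecting the third element requires the key `(0, 2)`.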


+def can_safe_ravel(shape: Tuple[int, ...]) -> bool:
+    """
+    Check if an array with the given shape can be ravelled unambiguously
+    regardless of column/row order.
+
+    Parameters
+    ----------
+    shape : tuple[int]
+
+    Returns
+    -------
+    bool
+    """
+    if len(shape) == 1:
+        return True
+    if len(shape) > 2:
Contributor

Does this come up in practice? I'm having a hard time judging whether NotImplementedError or False is the right behavior in this case.

Member Author

I think NotImplementedError is more accurate, but for the current implementation this should never be reached

+        raise NotImplementedError(shape)
+    if shape[0] == 1 or shape[1] == 1:
+        # column-like or row-like
+        return True
+    return False
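
Why row-like and column-like shapes are "safe": flattening them gives the same sequence in row-major (C) and column-major (F) order, so no ordering convention has to be chosen. A pure-Python sketch on nested lists, with hypothetical helper names `ravel_c` and `ravel_f`:

```python
def ravel_c(rows):
    # Row-major: walk each row in turn.
    return [x for row in rows for x in row]


def ravel_f(rows):
    # Column-major: walk each column in turn.
    ncols = len(rows[0])
    return [row[j] for j in range(ncols) for row in rows]


row_like = [[1, 2, 3]]       # shape (1, 3)
col_like = [[1], [2], [3]]   # shape (3, 1)
general = [[1, 2], [3, 4]]   # shape (2, 2)
```

For `row_like` and `col_like` both functions return `[1, 2, 3]`; for `general` they disagree (`[1, 2, 3, 4]` versus `[1, 3, 2, 4]`), which is the ambiguous case `can_safe_ravel` reports as `False`.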


+def tuplify_shape(size: int, shape, restrict=True) -> Tuple[int, ...]:
+    """
+    Convert a passed shape into a valid tuple.
+    Following ndarray.reshape, we accept either `reshape(a, b)` or
Contributor

I don't suppose NumPy has a function we can borrow here? I'm not aware of one.

Member Author

numpy/numpy#13768 is the closest thing I found, but I agree it seems like the kind of thing that should exist.

Contributor

Here's a somewhat dumb approach:

def tuplify_shape(size, shape):
    return np.broadcast_to(None, size).reshape(*shape).shape

Tested as:

In [20]: tuplify_shape(100, (2, -1, 5))
Out[20]: (2, 10, 5)

In [21]: tuplify_shape(100, ((2, -1, 5)))
Out[21]: (2, 10, 5)

In [22]: tuplify_shape(100, ((2, 11, 5)))
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-22-f71a2f488ea8> in <module>()
----> 1 tuplify_shape(100, ((2, 11, 5)))

<ipython-input-19-6b4139eccdff> in tuplify_shape(size, shape)
      1 def tuplify_shape(size, shape):
----> 2     return np.broadcast_to(None, size).reshape(*shape).shape

ValueError: cannot reshape array of size 100 into shape (2,11,5)

+    `reshape((a, b))`, the latter being canonical.
+
+    Parameters
+    ----------
+    size : int
+    shape : tuple
+    restrict : bool, default True
+        Whether to restrict to shapes (N), (1,N), and (N,1)
+
+    Returns
+    -------
+    tuple[int, ...]
+    """
+    if len(shape) == 0:
+        raise ValueError("shape must be a non-empty tuple of integers", shape)
+
+    if len(shape) == 1:
+        if is_integer(shape[0]):
+            pass
+        else:
+            shape = shape[0]
+            if not isinstance(shape, tuple):
+                raise ValueError("shape must be a non-empty tuple of integers", shape)
+
+    if not all(is_integer(x) for x in shape):
+        raise ValueError("shape must be a non-empty tuple of integers", shape)
+
+    if any(x < -1 for x in shape):
+        raise ValueError("Invalid shape {shape}".format(shape=shape))
+
+    if -1 in shape:
+        if shape.count(-1) != 1:
+            raise ValueError("Invalid shape {shape}".format(shape=shape))
+        idx = shape.index(-1)
+        others = [n for n in shape if n != -1]
+        prod = np.prod(others)
+        dim = size // prod
+        shape = shape[:idx] + (dim,) + shape[idx + 1 :]
+
+    if np.prod(shape) != size:
+        raise ValueError(
+            "Product of shape ({shape}) must match "
+            "size ({size})".format(shape=shape, size=size)
+        )
+
+    num_gt1 = len([x for x in shape if x > 1])
+    if num_gt1 > 1 and restrict:
+        raise ValueError(
+            "The default `reshape` implementation is limited to "
+            "shapes (N,), (N,1), and (1,N), not {shape}".format(shape=shape)
+        )
+    return shape
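
The `-1` inference step in `tuplify_shape` can be tried in isolation with the sketch below. It is a simplified variant, not the PR's function: the name `infer_shape` is hypothetical, it uses `math.prod` (Python 3.8+) instead of `np.prod`, it rejects non-divisible sizes up front, and it omits the `restrict` check.

```python
from math import prod


def infer_shape(size, shape):
    """Fill in a single -1 entry so that the product of shape equals size."""
    if shape.count(-1) > 1:
        raise ValueError("Invalid shape {shape}".format(shape=shape))
    if -1 in shape:
        # Product of the known dimensions; an empty product is 1.
        known = prod(n for n in shape if n != -1)
        if known == 0 or size % known != 0:
            raise ValueError("Invalid shape {shape}".format(shape=shape))
        idx = shape.index(-1)
        shape = shape[:idx] + (size // known,) + shape[idx + 1:]
    if prod(shape) != size:
        raise ValueError("Product of shape must match size")
    return shape
```

For example, `infer_shape(100, (2, -1, 5))` resolves the missing dimension as `100 // (2 * 5) = 10`.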
79 changes: 73 additions & 6 deletions pandas/core/arrays/base.py
@@ -22,6 +22,7 @@

 from pandas._typing import ArrayLike
 from pandas.core import ops
+from pandas.core.arrays._reshaping import can_safe_ravel, tuplify_shape
 from pandas.core.sorting import nargsort

 _not_implemented_message = "{} does not implement {}."
@@ -80,7 +81,7 @@ class ExtensionArray:
     * _from_sequence
     * _from_factorized
     * __getitem__
-    * __len__
+    * __len__ *or* size
     * dtype
     * nbytes
     * isna
@@ -157,6 +158,12 @@ class ExtensionArray:
     # Don't override this.
     _typ = "extension"

+    # Whether this class supports 2D arrays natively. If so, set _allows_2d
+    # to True and override reshape, ravel, and T. Otherwise, apply the
+    # `implement_2d` decorator to use default implementations of limited
+    # 2D functionality.
+    _allows_2d = False
+
     # ------------------------------------------------------------------------
     # Constructors
     # ------------------------------------------------------------------------
@@ -316,7 +323,7 @@ def __len__(self) -> int:
         -------
         length : int
         """
-        raise AbstractMethodError(self)
+        return self.shape[0]

     def __iter__(self):
         """
@@ -331,6 +338,7 @@ def __iter__(self):
     # ------------------------------------------------------------------------
     # Required attributes
     # ------------------------------------------------------------------------
+    _shape = None

     @property
     def dtype(self) -> ExtensionDtype:
@@ -344,14 +352,33 @@ def shape(self) -> Tuple[int, ...]:
         """
         Return a tuple of the array dimensions.
         """
-        return (len(self),)
+        if self._shape is not None:
+            return self._shape
+
+        # Default to 1D
+        length = self.size
+        return (length,)
+
+    @shape.setter
+    def shape(self, value):
+        size = np.prod(value)
+        if size != self.size:
+            raise ValueError("Implied size must match actual size.")
+        self._shape = value

     @property
     def ndim(self) -> int:
         """
         Extension Arrays are only allowed to be 1-dimensional.
         """
-        return 1
+        return len(self.shape)
+
+    @property
+    def size(self) -> int:
+        """
+        The number of elements in this array.
+        """
+        raise AbstractMethodError(self)

     @property
     def nbytes(self) -> int:
@@ -867,6 +894,24 @@ def copy(self) -> ABCExtensionArray:
         """
         raise AbstractMethodError(self)

+    def view(self, dtype=None) -> ABCExtensionArray:
+        """
+        Return a view on the array.
+
+        Returns
+        -------
+        ExtensionArray
+
+        Notes
+        -----
+        - This must return a *new* object, not self.
+        - The only case that *must* be implemented is with dtype=None,
+          giving a view with the same dtype as self.
Contributor

Can / should we make any requirements on this being no-copy? i.e.

a = my_array(...)
b = a.view(dtype=int)
b[0] = 10
assert a[0] == b[0]

Member Author

definitely needs to be no-copy; i'll add that to the docstring

Member Author

there is a test for the no-copy bit
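
The no-copy requirement discussed here can be illustrated with a toy class. `ListBacked` is hypothetical and wraps a plain Python list. A view is a new wrapper object over the same buffer, so a write through one is visible through the other, whereas a copy is independent.

```python
class ListBacked:
    """Hypothetical array whose views share one underlying list."""

    def __init__(self, data):
        self._data = data  # stored by reference, never copied

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def view(self):
        # New wrapper object, same buffer.
        return type(self)(self._data)

    def copy(self):
        # New wrapper object, new buffer.
        return type(self)(list(self._data))


a = ListBacked([1, 2, 3])
b = a.view()
b[0] = 10      # visible through a as well
c = a.copy()
c[1] = 99      # not visible through a
```

One caveat worth keeping in mind: the default `return self[:]` in the diff is only no-copy if the subclass's `__getitem__` returns a view for a full slice. Slicing a Python list, for instance, copies.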

+        """
+        if dtype is not None:
+            raise NotImplementedError(dtype)
+        return self[:]
+
     # ------------------------------------------------------------------------
     # Printing
     # ------------------------------------------------------------------------
@@ -932,6 +977,26 @@ def _formatting_values(self) -> np.ndarray:
     # Reshaping
     # ------------------------------------------------------------------------

+    def reshape(self, *shape):
+        """
+        Return a view on this array with the given shape.
Contributor

Would be good to document that only (1, N) and (N, 1) are allowed here.

Member Author

Good thinking. ATM I wrote most of this with only (1, N) in mind

+        """
+        # numpy accepts either a single tuple or an expanded tuple
+        shape = tuplify_shape(self.size, shape)
+        result = self.view()
+        result._shape = shape
+        return result
+
+    @property
+    def T(self) -> ABCExtensionArray:
+        """
+        Return a transposed view on self.
+        """
+        if not can_safe_ravel(self.shape):
+            raise NotImplementedError
+        shape = self.shape[::-1]
+        return self.reshape(shape)
+
     def ravel(self, order="C") -> ABCExtensionArray:
         """
         Return a flattened view on this array.
@@ -946,10 +1011,12 @@ def ravel(self, order="C") -> ABCExtensionArray:

         Notes
         -----
-        - Because ExtensionArrays are 1D-only, this is a no-op.
         - The "order" argument is ignored, is for compatibility with NumPy.
         """
-        return self
+        if not can_safe_ravel(self.shape):
+            raise NotImplementedError
+        shape = (self.size,)
+        return self.reshape(shape)

     @classmethod
     def _concat_same_type(
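
Taken together, the `shape`, `reshape`, `T` and `ravel` changes in this file treat 2D support as metadata layered over an unchanged 1D buffer. The toy class below sketches that idea under simplifying assumptions: `ShapedList` is hypothetical, stores a plain list, and does not support `-1` in shapes.

```python
class ShapedList:
    """Toy 1D container with shape metadata; hypothetical, not pandas."""

    def __init__(self, values, shape=None):
        self._values = values
        self._shape = shape

    @property
    def size(self):
        return len(self._values)

    @property
    def shape(self):
        # Default to 1D when no shape has been set.
        return self._shape if self._shape is not None else (self.size,)

    def reshape(self, *shape):
        # Accept reshape(a, b) or reshape((a, b)).
        if len(shape) == 1 and isinstance(shape[0], tuple):
            shape = shape[0]
        n = 1
        for dim in shape:
            n *= dim
        if n != self.size:
            raise ValueError("Implied size must match actual size.")
        if sum(dim > 1 for dim in shape) > 1:
            raise ValueError("only (N,), (1, N) and (N, 1) are supported")
        # Same underlying values, new shape metadata.
        return type(self)(self._values, shape)

    @property
    def T(self):
        return self.reshape(self.shape[::-1])

    def ravel(self):
        return self.reshape((self.size,))
```

Because at most one dimension exceeds 1, transposing and ravelling never need to reorder elements, so each operation only swaps the shape tuple.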