EA: implement+test EA.view #27633

Merged 17 commits on Aug 9, 2019

Conversation

jbrockmendel
Member

Broken off from #27142, plus some type annotations

-------
ExtensionArray

Notes

Contributor

I've been inconsistent about this, but in general I've tried to keep docstrings user-facing. For implementation notes I've just used regular comments.

What are the consequences of .view returning self?

Do we have any restrictions on this being zero-copy?

Member Author

Do we have any restrictions on this being zero-copy?

The test added in this PR will fail if we don't get an actual view.

What are the consequences of .view returning self?

view is going to be used by the default implementation of reshape, so returning self would cause all kinds of trouble.

The default implementation should Just Work as long as self[:] returns a view, which should be the case anyway (JSONArray ATM returns a copy)
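
To illustrate the point about self[:] returning a view, here is a minimal sketch using only the standard library. ToyArray is a hypothetical container, not the pandas ExtensionArray API, and the assumption that the default view() is simply self[:] is taken from the comment above rather than from the merged code.

```python
from array import array


class ToyArray:
    """Hypothetical 1-D container; not the pandas ExtensionArray API."""

    def __init__(self, data):
        # Keep a memoryview so that slicing shares the buffer instead of copying.
        self._data = data if isinstance(data, memoryview) else memoryview(data)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Slicing a memoryview is zero-copy: the result shares memory.
            return type(self)(self._data[key])
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def view(self):
        # Assumed default: a full slice gives a new object over the same buffer.
        return self[:]


arr = ToyArray(array("q", [1, 2, 3]))
v = arr.view()
v[0] = 99  # a write through the view is visible in the original
```

If __getitem__ copied on slicing (as JSONArray is said to do at the moment), the write through v would not reach arr, and a test for view semantics would fail.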

Member

Convert this to normal comments as Tom mentioned?

@TomAugspurger
Contributor

TomAugspurger commented Jul 29, 2019 via email

@jorisvandenbossche
Member

view is going to be used by the default implementation of reshape, so returning self would cause all kinds of trouble.

Can you be a bit more specific about this "all kinds of trouble"?

@@ -1773,9 +1766,10 @@ def view(self):
Returns
-------
view : Categorical
Returns `self`!

Contributor

can you update the doc-string (or just inherit it)

Contributor

see my comment above, dtype needs to be added (but best to just remove the doc-string completely and inherit it)


def test_view(self, data):
    # view with no dtype should return a shallow copy, *not* the same
    # object

Contributor

do you need to also test with a dtype != None? (e.g. that this raises NIE)

Member Author

I guess to make sure the kwarg is accepted, sure
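
A rough sketch of what such a check could look like, written against a hypothetical ToyArray rather than the real extension test suite. The behaviour for a non-None dtype (raising NotImplementedError) is an assumption based on the reviewer's suggestion; the actual test in the PR may only verify that the keyword is accepted.

```python
class ToyArray:
    """Hypothetical array used only to illustrate the test; not a pandas class."""

    def __init__(self, values):
        self._values = values  # mutable backing list, shared between views

    def __getitem__(self, i):
        return self._values[i]

    def __setitem__(self, i, value):
        self._values[i] = value

    def view(self, dtype=None):
        if dtype is not None:
            # Assumption: reinterpreting as another dtype is not supported.
            raise NotImplementedError(dtype)
        # New wrapper object around the *same* backing list.
        return type(self)(self._values)


def check_view(data):
    result = data.view()
    assert result is not data             # a new object, not self
    assert type(result) is type(data)     # same array type
    result[0] = "changed"
    assert data[0] == "changed"           # memory is shared, so not a copy

    # The dtype keyword is accepted; here a non-None value is rejected.
    try:
        data.view(dtype="i8")
    except NotImplementedError:
        return True
    return False


raised = check_view(ToyArray(["a", "b"]))
```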

@jreback jreback added ExtensionArray Extending pandas with custom dtypes or arrays. API Design labels Jul 30, 2019
@jbrockmendel
Member Author

Can you be a bit more specific about this "all kinds of trouble" ?

ser = pd.Series(some_ea)
df = ser.to_frame()
df.iloc[0, 0] = whatever
assert df.iloc[0] == ser.iloc[0]

would fail

@jorisvandenbossche
Member

Why would that fail if view would return self instead of a new object with a view on the same data?

@jbrockmendel
Member Author

Why would that fail if view would return self instead of a new object with a view on the same data?

I could have been clearer: the default reshape is going to look like:

_shape = None

@property
def shape(self):
    if self._shape is not None:
        return self._shape
    return (self.size,)

def reshape(self, shape):
    out = self.view()
    out._shape = shape
    return out

So returning self instead of a new object would end up changing the shape of self. That's a different failure mode from what would happen if view returned a copy, which I could have been clearer on.
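
The failure mode described above can be reproduced with a small self-contained sketch. ToyArray and SelfViewArray are hypothetical classes that follow the reshape outline in this comment; they are not pandas code.

```python
class ToyArray:
    """Hypothetical array following the reshape sketch above."""

    _shape = None

    def __init__(self, values):
        self._values = values

    @property
    def size(self):
        return len(self._values)

    @property
    def shape(self):
        if self._shape is not None:
            return self._shape
        return (self.size,)

    def view(self):
        # New object sharing the same underlying values.
        return type(self)(self._values)

    def reshape(self, shape):
        out = self.view()
        out._shape = shape
        return out


class SelfViewArray(ToyArray):
    def view(self):
        # The problematic variant: no new object is created.
        return self


good = ToyArray([1, 2, 3])
good_2d = good.reshape((1, 3))   # good keeps its original shape

bad = SelfViewArray([1, 2, 3])
bad_2d = bad.reshape((1, 3))     # bad itself is mutated, since bad_2d is bad
```

With a proper view, good.shape stays (3,) while good_2d.shape is (1, 3). With view returning self, the reshape silently changes the shape of the original object.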

@jreback jreback added this to the 1.0 milestone Jul 31, 2019

Contributor

@jreback jreback left a comment

lgtm ex the comment on the test. ping on green.

@jbrockmendel
Member Author

test simplified, green

@@ -354,7 +355,7 @@ def ndim(self) -> int:
         """
         Extension Arrays are only allowed to be 1-dimensional.
         """
-        return 1
+        return len(self.shape)

Member

I am not sure this change will help. There can be EAs out there that define this themselves and override this with a hardcoded 1, so we will still need to define a wrapper I think?
(so therefore I would maybe rather leave it as is, to ensure we cover that use case)

Member Author

I'm not sure I understand the problem here. Is there a case in which this won't be correct?

Member

@jorisvandenbossche jorisvandenbossche Aug 2, 2019

I suppose you made this change because in the other PR you are patching self.shape to return (N, 1) or (1, N), so with this change ndim automatically follows that.
But in general you can't rely on self.ndim already correctly following self.shape, so you will always have to patch ndim as well.

(The term "patching" might not fully reflect what the other PR on 2D EAs does; I need to catch up on that, but I hope this gives the idea.)
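
A small sketch of the concern, using hypothetical classes (BaseArray, ThirdPartyArray and the two patched subclasses are illustrative names, not pandas code). It assumes the "patching" is done by overriding shape in a subclass, which may not match how the 2D PR actually does it.

```python
class BaseArray:
    """Hypothetical base class whose ndim follows shape."""

    def __init__(self, values):
        self._values = list(values)

    @property
    def shape(self):
        return (len(self._values),)

    @property
    def ndim(self):
        return len(self.shape)


class ThirdPartyArray(BaseArray):
    """Hypothetical downstream array that hardcodes ndim."""

    @property
    def ndim(self):
        return 1


class PatchedBase(BaseArray):
    # Patching shape alone is enough when ndim is derived from shape.
    @property
    def shape(self):
        return (len(self._values), 1)


class PatchedThirdParty(ThirdPartyArray):
    # Same patch, but the hardcoded ndim in the parent is left untouched.
    @property
    def shape(self):
        return (len(self._values), 1)


ok = PatchedBase([1, 2, 3])
stale = PatchedThirdParty([1, 2, 3])
```

Here ok.ndim is 2, while stale.ndim is still 1 even though stale.shape is (3, 1), which is why a wrapper would need to patch ndim explicitly as well.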

Member Author

will revert so we'll handle it in the next pass

def view(self, dtype=None):
    if dtype is not None:
        raise NotImplementedError(dtype)
    return self._constructor(values=self._codes, dtype=self.dtype, fastpath=True)

Member

the default implementation does not work here? (or is this more efficient?)

Member Author

This is more efficient, yes (note the fastpath kwarg)

Member

Can you add a note for that? (eg "override base implementation to use fastpath")

            With the specified `dtype`.
        """
        if dtype is None or dtype is self.dtype:
            return type(self)(self._data, dtype=self.dtype)
        return self._data.view(dtype=dtype)

Member

This is not returning an EA?

Member Author

correct, the current implementation is only used to return an ndarray.

Member

But so that is "violating" the spec? (it should return a new EA (not self), but not an ndarray)

Member Author

Yes, but it's already in place and used extensively. I guess we could alter the spec to allow returning ndarray

Member

I guess we could alter the spec to allow returning ndarray

Sorry to further bother on this PR, but I would not alter the spec. For the interface, it should just be a new EA of the same type, no?

Shouldn't we (ideally, at some point) change our own implementation to return an EA as well for consistency?

@jreback
Contributor

jreback commented Aug 2, 2019

lgtm. ping on resolution of @jorisvandenbossche comments.

@jbrockmendel
Member Author

@jorisvandenbossche I think I've addressed your comments. Let me know if I missed anything.


Returns
-------
ExtensionArray or np.ndarray

Member

I would not document this. For implementors, it should simply be an EA. I think our own arrays not doing so can be seen as a historical artifact that ideally will be fixed?

Member Author

reverted

@jorisvandenbossche
Member

@jbrockmendel can you also respond to my comments/questions? (and not just follow them; I would like to have a discussion about this)

@jbrockmendel
Member Author

can you also respond to my comments/questions? (and not just follow them; I would like to have a discussion about this)

The remaining question/comment I see is about DTA/TDA/PA view sometimes returning ndarray and the possibility of making it conform by always returning EA. I like that idea eventually, but we're a ways away from that in terms of having PandasArray support throughout the codebase. In a number of places where we currently do DTA.view, we expect to get an ndarray back.

@@ -862,6 +863,27 @@ def copy(self) -> ABCExtensionArray:
"""
raise AbstractMethodError(self)

def view(self, dtype=None) -> Union[ABCExtensionArray, np.ndarray]:

Contributor

Why would this return an np.ndarray OR an EA for an EA? (Is this @jorisvandenbossche's question?)

when / why would this be the case? this is pretty confusing.

Member Author

Yes, this is the same as Joris's question. The answer is that it is probably better to return EA-only, but ATM DTA/TDA/PA have existing implementations that return ndarray

@jorisvandenbossche
Member

about DTA/TDA/PA view sometimes returning ndarray and the possibility of making it conform by always returning EA. I like that idea eventually, but we're a ways away from that in terms of having PandasArray support throughout the codebase. In a number of places where we currently do DTA.view, we expect to get an ndarray back.

But I don't think we need to wait on all-EA to have EA.view consistently return an EA? For those cases where you expect an array, you can explicitly access the numpy array and take a view of that (or the other way around)?

Eg one case where it is called is

result = op(self.view("i8"), other.view("i8"))

We can use self._data.view('i8') or in this case self.asi8 or other ways to ensure that the output is an ndarray
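
As a sketch of that suggestion: ToyDatetimeArray below is a hypothetical stand-in for a datetime-like extension array (not the real DatetimeArray), with _data and asi8 named after the attributes mentioned above. Its view() always returns the same array type, and callers that need integers reach for the raw NumPy data explicitly.

```python
import numpy as np


class ToyDatetimeArray:
    """Hypothetical stand-in for a datetime-like extension array."""

    def __init__(self, data):
        self._data = data  # assumed to be a datetime64[ns] ndarray

    def view(self, dtype=None):
        if dtype is not None:
            raise NotImplementedError(dtype)
        # Always the same array type, sharing the underlying ndarray.
        return type(self)(self._data)

    @property
    def asi8(self):
        # Explicit, zero-copy reinterpretation as int64 nanoseconds.
        return self._data.view("i8")


left = ToyDatetimeArray(np.array(["2019-08-09"], dtype="datetime64[ns]"))
right = ToyDatetimeArray(np.array(["2019-08-08"], dtype="datetime64[ns]"))

# The arithmetic runs on plain ndarrays; view() is not involved at all.
result = left.asi8 - right.asi8
```

The call site states that it wants an ndarray, so view() is free to keep a single, consistent return type.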

@jbrockmendel
Member Author

@jorisvandenbossche I'll try this locally and see what it would take to make it EA-only.

@jbrockmendel
Member Author

Yah this causes 230 test failures. I'm on board with the idea of fixing these, but would like to do so in a separate PR(s) so as to keep momentum here.

@jorisvandenbossche
Member

Yah this causes 230 test failures.

It might be that they are all coming from a rather limited number of call sites (the number of failures does not always indicate the number of lines to change to fix it ;))

I'm on board with the idea of fixing these, but would like to do so in a separate PR(s) so as to keep momentum here.

That's fine for me (although I would prefer to see it done before the next release, as this is kind-of public API). Can you open an issue for it?

@jbrockmendel
Member Author

It might be that they are all coming from a rather limited number of call sites

Skimming through the test output, all I can confidently state is that the number of relevant call sites is less than 230.

@jbrockmendel
Member Author

@jorisvandenbossche opened #27831 to change DTA.view signature. Anything else?

@jorisvandenbossche
Member

Nope, all good!

@jorisvandenbossche jorisvandenbossche merged commit 0227e69 into pandas-dev:master Aug 9, 2019
@jbrockmendel jbrockmendel deleted the ac1 branch August 9, 2019 14:34
quintusdias pushed a commit to quintusdias/pandas_dev that referenced this pull request Aug 16, 2019
Labels
API Design ExtensionArray Extending pandas with custom dtypes or arrays.
4 participants