Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datetimelike Array Refactor #23185

Closed
TomAugspurger opened this issue Oct 16, 2018 · 72 comments · Fixed by #24024
Closed

Datetimelike Array Refactor #23185

TomAugspurger opened this issue Oct 16, 2018 · 72 comments · Fixed by #24024
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Master Tracker High level tracker for similar issues

Comments

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Oct 16, 2018

A master issue, to help keep track of things.

High-level, I think we have two things to sort out

  1. Design our datetimelike Arrays (DatetimeArray, TimedeltaArray, PeriodArray)
  2. A plan for getting from master to an implementation of that design.

1. Design

We have a few things to sort out

a. Composition vs. inheritance of Index / Series and array classes
b. ...

2. Implementation Plan

A few options

a. Big PR implementing one or more arrays, followed by smaller cleanup PRs
b. Incremental PRs (implementing most of the interface on the *ArrayMixin classes), followed by a final PR making the switch
c. ...

Project board with the relevant discussions / PRs: https://github.com/pandas-dev/pandas/projects/4

@TomAugspurger TomAugspurger added Master Tracker High level tracker for similar issues ExtensionArray Extending pandas with custom dtypes or arrays. labels Oct 16, 2018
@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Oct 16, 2018

cc @jreback @jorisvandenbossche @jbrockmendel

OK I wanted to open this issue because I've had trouble keeping up with all the
goings-on around this (and I was getting tired of rebasing the WIP PeriodArray
PR). So I'm hoping we can come to a rough consensus on these outstanding
discussion, and a concrete plan of how to get from master to there.

If necessary, we can move to a google doc or something to collaboratively edit
the design doc. I don't have a preference. Feel free to add discussion points
to the "outstanding discussions" list.

I'll try to write up my current thoughts later today.

@jbrockmendel
Copy link
Member

You're right (and Joris has expressed this elsewhere) that this conversation has splintered across a lot of places and centralization will help.

I'm a big fan of breaking up big problems into smaller more manageable problems (as evidenced by #23159, small-step tslibs refactor, doomed improvement efforts in statsmodels, ...). Are there any logically independent parts of the problem that can be split off?

FWIW:
#23173 I opened specifically to split discussion-requiring pieces off from the other PRs
#22535 does touch the mentioned files, but should be orthogonal to the topics under discussion.
#23113 I hope we can push through relatively easily, since it is needed for DatetimeArray + DateOffset addition to work.

@jorisvandenbossche
Copy link
Member

I mentioned it on one of the PRs, but is it OK for you guys to agree on not merging any of the open PRs, before we have some agreement on the way forward?

@jbrockmendel I understand you want to keep working on those PRs you opened (and it's also great that you do so much for pandas!), but let's maybe take a short pause doing new PRs until we agree on how we want to get to the finish-line here (in that sense: can you answer my mail regarding thursday?)
Of course, doing some PRs might help to explore certain options, show how things can be done, etc. So that is certainly fine, just not finalizing / merging.

@jbrockmendel
Copy link
Member

Sounds good.

@TomAugspurger
Copy link
Contributor Author

Composition vs. inheritance.

The case for composition:

  1. I see EAs as being at the same level as ndararys. Series & Index box an array
    (ndarray or EA). I think it'd be strange for some indexes be instances
    of an EA, depending on the dtype.
  2. I think that the calls have only been for specific Index classes to inherit
    from EA. We'll limit the rest of this discussion to just Index, but I think
    that fact is an argument for composition.
  3. Indexes have names, arrays don't. That means we'll have similar, but
    different, function signatures and calls to pass names where they need to go.
  4. Some methods common to both Index and EA have different semantics (e.g.
    shift).

The case against:

  1. Additional boilerplate code for dispatching an operation from the Series
    or Index to the EA.
  2. ... (make your case here)

Since I'm pro-composition, I'll point out that I think we're required to have
that dispatching code anyway. If we're a Series / Index backed by an ndarray,
then we dispatch to numpy. If we're backed by an EA, then we dispatch to it. Or,
to put it another way, Index / Series ops simply dispatch to the ops of the
underlying array.

@jbrockmendel
Copy link
Member

If there is a nice solution to the immutability/caching issue, then I can get on board with just-composition. Until then, I think both is the way to go for now. i.e. PeriodIndex subclasses PeriodArray and PeriodIndex.values returns a PeriodArray.

Your point 1) is the one I find most compelling. It would be really nice if .values had a consistent definition as always-lossless.

@TomAugspurger
Copy link
Contributor Author

I'm not sure how doing both would work in practice. I don't have a good sense for what complications that's likely to create.

But, I think that won't be necessary. I think we can manage to cache attributes like hasnans or _isnan on the Array, and invalidate that cache when necessary (__setitem__).

It would be really nice if .values had a consistent definition as always-lossless.

Agreed. At this point, I think that's our only hope of having a consistent definition for what .values is.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Oct 16, 2018

On the caching point, this seems to work

diff --git a/pandas/core/arrays/period.py b/pandas/core/arrays/period.py
index 24d4b6e55..23a5845f0 100644
--- a/pandas/core/arrays/period.py
+++ b/pandas/core/arrays/period.py
@@ -465,6 +465,14 @@ class PeriodArray(dtl.DatetimeLikeArrayMixin, ExtensionArray):
                    "Got '{}' instead.".format(type(value).__name__))
             raise TypeError(msg)
         self._data[key] = value
+        self._invalidate_cache()
+
+    def _invalidate_cache(self):
+        self._cache = {}
+
+    @cache_readonly
+    def hasnans(self):
+        return self.isna().any()
 
     def take(self, indices, allow_fill=False, fill_value=None):
         from pandas.core.algorithms import take

There may be edge cases, or cases were we could skip invalidating the cache, but it's at least feasible.

It didn't come across in the diff, but the call to _invalidate_cache comes from PeriodArray.__setitem__.

"tests":

without the invalidation

In [1]: import pandas as pd
ar
In [2]: arr = pd.core.arrays.period_array(['2000', None], 'D')

In [3]: arr.hasnans
Out[3]: True

In [4]: arr[1] = pd.Period(2000, 'D')

In [5]: arr.hasnans
Out[5]: True

with invalidation

In [1]: import pandas as pd

In [2]: arr = pd.core.arrays.period_array(['2000', None], 'D')

In [3]: arr.hasnans
Out[3]: True

In [4]: arr[1] = pd.Period(2000, 'D')

In [5]: arr.hasnans
Out[5]: False

@jorisvandenbossche
Copy link
Member

It overlaps with the first point of Tom, but an additional case for composition / disadvantage of inheritance:

  • A clear data model that you can explain and reason about (Series and Index are containers of array-likes), while the current inheritance is overly complicated and very opaque IMO.

I personally also don't see any compelling case for inheritance (also the dispatching is not a reason IMO, as we need to it anyway for Series)

So personally, unless someone now actually makes an extensive and detailed argument for inheritance, I would propose to leave this discussion behind us and focus on how to solve possible remaining issues with the composition structure (the constructors, the caching).

@jorisvandenbossche
Copy link
Member

For the inheritance vs composition, we actually have examples to look at in practice: interval and categorical already do composition, while the datetimelikes are currently a kind of inheritance. I think the current datetimelike implementation shows the complexity of this.

@jorisvandenbossche
Copy link
Member

Related to the caching issue, I think there are several options:

  1. Assess if it is actually significant (eg to decide if it is a blocker on the PeriodArray PR). From a quick timing, calculating the _isnan property the first time takes about 5% of the time to do DatetimeIndex() + Timedelta() arithmetic operation (whether it has a similar impact when we actually would no longer cache is not necessarily the case of course, should be checked further)

  2. Implement caching on the Array objects. I think this is certainly feasible, as Tom also explores above.

  3. Factor the operations that make use of caching out into functions instead of methods (like the current _add_delta methods), that is shared between Index and Array, and where eg the (cached or not) mask can be passed as an argument.

So I think there is certainly a solution possible, and my first point about "is this important" is then more to know if this is essential to already have in an initial "big split" PR like the PeriodArray PR, or if this can be left for a follow-up PR.

@jorisvandenbossche
Copy link
Member

Warning, long post coming.

My proposal for a possible way forward, let's call it the "minimally big" PR with follow-up PRs proposal:

  • For each Array class, we do a "minimally big-split" PR:
    • a PR that does the actual split of the current inheritance structure into separate Index and Array objects
    • but as minimal as possible:
      • as a minimum get the tests passing
      • and with agreement on the bigger design questions (different class structures, the constructors, some minimal design questions we can identify we want to agree on at this stage)
    • I think this should try to mainly be "moving around code" (and adding some needed dispatching), to the extent possible of course to satisfy the above points, but meaning here that the goal of those PRs should not be to clean-up implementations, or to focus on deduplication, smart decorators, etc
    • We do this for each Array class, and I think we should try to keep those as independent as possible for each Array class (Period, Datetime, maybe Timedelta?). This will of course mean some duplication between each Array, but at this stage this is OK. Doing this will keep the picture as clear as possible in each PR, and this can be cleaned-up later on.
    • I think the current PeriodArray PR of Tom is a good start for such a PR.
  • During reviewing those PRs, we identify and track follow-up issues we agree on we want, but are OK to keep for a next stage. Things like: consolidating common patterns that can be shared between the EA based Index classes, or more specifically patterns that can be shared between the datetimelike Array classes; clean-up of the Index constructors (the simple_new, shallow_copy(_with_infer) story); restructuring of the tests (how to share tests between Index/Array?); other clean-up ideas, etc
  • After those "minimal big splits" are merged, we can start doing follow-up PRs based on the ideas that have formed during reviewing the previous PRs. This can now be smaller, more independent and easier to review PRs.
    Those follow-up PRs can be easier to review, having clear goals identified before, and fitting already in the final structure giving better context to review the PR.

My reasoning to go for the above way forward compared to a "first smaller clean-up PRs, then split", are the following:

  • The current smaller PRs are very difficult to judge and review because the big picture (the final code structure) is still missing
  • Even after doing PRs preparing the split, the actual split PR will still be big
  • Many of the PRs that are now open which Tom listed in the initial issue, are things like deduplicating, moving common code to helper functions, ... All good things! But those can also perfectly be done after the initial split PR.

This proposal would mean that master will be in a temporary "messy" state (but still green of course). If we find this a problem, we can always first merge those PRs to a refactor branch, and only after some of the follow-up PRs have been done to that branch as well, merge it into master.
(personally, I don't think that is needed as long as we keep master green, as it will only give additional complexity to keep master and that branch in sync).

Doing the above, would in practice mean: first focus on some of the design discussions (eg the design of the constructors, and other issues mentioned in the top post), focus on doing and reviewing the actual splits, and for now wait with the other smaller PRs.
And also in this proposal I think there is enough work to share for multiple people working at the same time (without trying to get things out of the big PRs to independent PRs). Doing those discussions, reviewing PRs, doing the Datetime or Timedelta Array, etc are also all things that need to be done and take time.


So please give your thoughts about this proposal, feedback, alternative proposals, ...

@jbrockmendel
Copy link
Member

If patching __setitem__ works robustly I think that would be a great solution. It would probably be useful elsewhere, too. The fact that it would be so useful makes me think it must be harder than it looks, otherwise it would have been done long ago. But I'd be happy to be proven wrong.

A clear data model that you can explain and reason about

If we were to go whole-hog on inheritance, the data model would be "An Index is an Array with some additional fancy indexing/join/set methods (and a Block is an Array with mgr_locs attribute)". That said, conditional on the __setitem__ thing working out, I'm happy to end this part of the conversation.

Factor the operations that make use of caching out into functions instead of methods

That's an interesting idea. I'd like to give it some thought before forming an opinion.

The current smaller PRs are very difficult to judge and review because the big picture (the final code structure) is still missing

Are they though? DatetimeArray does need to_period and to_perioddelta for arithmetic operations to work. Is there any scenario in which we dont want all three arrays to have take implemented? If anything, isolating them makes it clear what depends on what, avoids the phenomenon in big PRs where I have to ask "is this related to everything else?"


My main objection to the Big PR is that it precludes working in parallel. If there was something about the implementation that required it be done All At Once that would be another matter, but as it is there is a lot of non-difficult stuff we can get out of the way before making final decisions about caching and constructors. i.e. the "Minimal Big Split" can be made more minimal.

Suppose hypothetically that two things both turn out to be more difficult than expected: the __setitem__ patching and the arithmetic methods. In the Big/Minimally Big model, everything is on hold until a single PR gets all of it right. In a parallel model, I can figure out the arithmetic while Tom figures out the caching. (For the sake of the example I'm assuming that the ways in which they are unexpectedly difficult are logically independent)


The status quo is that non-#22862 PRs are on hold. While not my first choice, I'd rather see that move forward than go in circles here. Let's see what jreback has to say and reconnoiter.

@jreback
Copy link
Contributor

jreback commented Oct 17, 2018

Some general comments / points.

  • I am all for the composition pattern; we are already using this and it makes Index just a light wrapping of EA
  • It would be nice to share between EA implementations for DateArray / PeriodArray (which I think is accomplished by the DatetimelikeMixin), the concern is that this is also a mixin with Index? I don't have a problem with mixin's being shared between EA / Index, though I suspect as we move to a more dispatch oriented mechanism this will be less of an issue.
  • I would like to decouple the discussions of Blocks from this conversation for now. They will eventually be a composed set of Arrays with Indexes (kind of like they are now). Trying to shoe-horn shared functionaility between EA, Index and Block is bit too much to try to hold mentally now. We can easily decide later to design a proper Block / Container to hold the internals later.
  • I am ok with a big PR to move things, I agree it can be hard to envision things when smaller things are actually moved.
  • I really would like to avoid completely changing idioms at this stage, mainly adding constructors that are non-idiomatic to pandas (e.g. this happened with RangeIndex and this completely breaks all other patterns). Like it or not we are stuck with this pattern, breaking it is all downside for anyone reading the code because you have more than 1 pattern.
  • EA's need to be come a more of a drop-in replacement for ndarrays. I am not advocating adding all methods as numpy has too many, but certain basic things should just work / exist (talking about .repeat() here).
  • We must not drop caching in either the EA's or the Indexes. This can also be easily accomplished thru mix-in classes (or thru dispatch, which is almost the same thing). _isnan is SO important and used extensiviely. Since Index is immutable this offers a tremendous advantage (and @jorisvandenbossche your example is just a trivial case). This is a must do.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Oct 17, 2018

It would be nice to share between EA implementations for DateArray / PeriodArray (which I think is accomplished by the DatetimelikeMixin), the concern is that this is also a mixin with Index?

I'm not sure, but I share this concern. I expect that well have a mixin or base class for DatetimeLikeArray with these common ops, and a base class for DatetimelikeIndex that just does the dispatching. I'm hopeful we won't need a mixin for DatetimelikeIndex.

EA's need to be come a more of a drop-in replacement for ndarrays.

We must not drop caching in either the EA's or the Indexes.

These two are slightly in tension. AFAIK, right now the only way to update an ExtensionArray inplace is with __setitem__. If we add any more, we would need to remember to manually invalidate the cache.

edit: inplace ops is another, though nans usually (always for inplace?) propagate.

@TomAugspurger
Copy link
Contributor Author

The fact that it would be so useful makes me think it must be harder than it looks, otherwise it would have been done long ago.

I suspect this is because it's not on ndarrays, and we didn't have an intermediate array-like that could track these. ndarrays can be manipulated inplace in so many ways that a cache sounds infeasible.

@TomAugspurger
Copy link
Contributor Author

I feel like I still lack the information to make a judgement call on how to proceed here. So, how about I spend a chunk of time getting #22862 in to a reasonable state. I'll try to make it as minimal as possible while still passing, and in the process I'll identify pieces that can be reasonably split off.

I think the biggest outstanding discussion / PR is around constructors and whether #23140 should go first. I'll try to form some thoughts around that quickly.

@jorisvandenbossche
Copy link
Member

Thanks for the answers!

(and sorry again for my long answers :))

My main objection to the Big PR is that it precludes working in parallel.

Because this is an important point, I had a paragraph above trying to explain why I think this does not need to be the case, as I can also argue for the opposite:

  • The small PRs touching periods you were doing now made that Tom had a hard time continuing on the PeriodArray PR. In this sense, it also made it difficult to work in parallel.
  • I argued above for trying to keep the different array PRs as independent as possible. And if you look at the current PeriodArray PR of Tom, that hardly touches the DatetimeArray code. So I think a parallel PR on DatetimeArray or TimedeltaArray is feasible (of course, there will be merge conflicts once one is merged and the other needs to be updated, and discussions on the one may also be relevant for the other. But all those things we now also already have with the several PRs).
    Yes, that might mean some duplication initially between both Arrays, but let that be a perfect topic for a smaller, targetted follow-up PR.
    And even if the above turns out to be really difficult, if we all focus on the PeriodArray PR, I think this can be merged rather quickly.
  • What needs to be done is not only the coding. Thoroughly reviewing (going through in detail through the diff, looking at the full code instead of the diff to see the full picture, trying out the PR interactively with some toy examples, comparing to a previous release, testing some changes to the PR to see if a comment makes sense, ....) takes a lot of time, and also needs to be happen (as a sidenote, last week I spend two full time days reviewing the Sparse PR and the Period PR, although sparse is maybe a bit special as it has quite a bit specific behaviour).
    Further discussing and making proposals for the bigger design questions, also takes time. Making a detailed description or doing a proof of concept for eg the array constructors, or the caching mechanism also takes time (maybe we could do those with one of the already existing arrays to not get conflicts).
    And typically it are those things that are the bottleneck to get a PR merged, not the actual coding.

In a parallel model, I can figure out the arithmetic

What do you mean exactly here? I think @TomAugspurger already figured this out in his PeriodArray PR (at least a minimal working solution that gets the job done). Tom, correct me if I am wrong.
And also, I think it is still possible to explore the caching in parallel (try it on a different array, or do it as a patch on top of the period array PR's branch)

I would like to decouple the discussions of Blocks from this conversation for now.

Fully agree here. Apart from that we have an ExtensionBlock instead of the custom ones, there should not be much changes related to blocks.

It would be nice to share between EA implementations for DateArray / PeriodArray (which I think is accomplished by the DatetimelikeMixin), the concern is that this is also a mixin with Index?

I think we will typically end up with a base class / mixin for the Arrays to share functionality there, and a mixin / base class for the datetimelike indexes to share things. I think those two mixins can be completely separate (the current Datetimelike Mixin is indeed shared between Arrays and Index, but that is just the temporary confusing state where Index/Array is not yet splitted properly)

I really would like to avoid completely changing idioms at this stage, mainly adding constructors that are non-idiomatic to pandas (e.g. this happened with RangeIndex and this completely breaks all other patterns). Like it or not we are stuck with this pattern

We are stuck with it for Index, but IMO that should not mean we should follow the exact same pattern for the Arrays (in all the different PRs related to this we have several times tried to discuss / understand what the different __new__/_simple_new/_shallow_copy/_shallow_copy_with_infer do, I really think we can make it simpler at the Array level). But will open a separate issue about that to keep that discussion separeate, as it is quite orthogonal to the rest of the workflow discussion.

EA's need to be come a more of a drop-in replacement for ndarrays. I am not advocating adding all methods as numpy has too many, but certain basic things should just work / exist (talking about .repeat() here).

We can certainly discuss, but in light of keeping the initial big-split PRs as minimal as possible / not have more discussion than needed on it, I would personally keep this discussion for a follow-up (of course, as long as the method is not essential to have the EA interface working).

@jorisvandenbossche
Copy link
Member

I feel like I still lack the information to make a judgement call on how to proceed here. So, how about I spend a chunk of time getting #22862 in to a reasonable state. I'll try to make it as minimal as possible while still passing, and in the process I'll identify pieces that can be reasonably split off.

+1

I think the biggest outstanding discussion / PR is around constructors and whether #23140 should go first. I'll try to form some thoughts around that quickly.

If we focus first on the PeriodArray PR, I don't think that PR should be merged. But, it is the discussion that we had on one of the review comments (#23140 (comment)) about the constructors that ideally indeed should be resolved.
I will try to look at that now as well, and open a separate issue about it (unless you beat me to it).

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Oct 17, 2018

What do you mean exactly here? I think @TomAugspurger already figured this out in his PeriodArray PR (at least a minimal working solution that gets the job done). Tom, correct me if I am wrong.

Ops work on the PeriodArray PR by dispatch. Though maybe @jbrockmendel meant ops with caching.

        # PeriodIndex.__add__
        def __add__(self, other):
            # dispatch to ExtensionArray implementation
            result = self._data.__add__(other)
            return wrap_arithmetic_op(self, other, result)

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Oct 17, 2018

small FYI: I've updated the original post with a list of PeriodArray blockers that aren't touching any of the datetimelike files (e.g. #23155). Those are bugs in master that hopefully have a clear fix which won't lead away from our end goal.

@jorisvandenbossche
Copy link
Member

Though maybe @jbrockmendel meant ops with caching.

Since he mentioned both in the same sentence as two different things, I assumed this was not the case. So hence the question for clarification :-)

@jbrockmendel
Copy link
Member

It would be nice to share between EA implementations for DateArray / PeriodArray (which I think is accomplished by the DatetimelikeMixin), the concern is that this is also a mixin with Index?

The plan as I understand it is that DatetimeLikeArrayMixin will be mixed into the EA subclasses but will cease to be mixed in to DatetimeLikeIndexMixin.

Though maybe @jbrockmendel meant ops with caching.

I meant extending the tests in tests/arithmetic to include the EA subclasses. Since this was just a hypothetical example of "two things going wrong at the same time", let's not spend too much time on it.

I feel like I still lack the information to make a judgement call on how to proceed here. So, how about I spend a chunk of time getting #22862 in to a reasonable state. I'll try to make it as minimal as possible while still passing, and in the process I'll identify pieces that can be reasonably split off.

Sounds good.

In the interim, I'd like to get exceptions to the datetimelike PR moratorium for the following, which I think should have minimal overlap:

I'll hold off pending an explicit OK.

@TomAugspurger
Copy link
Contributor Author

I pushed an update to https://github.com/pandas-dev/pandas/pull/22862/files that reduced the scope somewhat. Outside of indexes/period.py and ararys/period.py, there shouldn't be any extraneous changes.

I can work to reduce the changes to indexes/period.py and ararys/period.py a bit, but I'd like to get the constructors nailed down first.

@TomAugspurger
Copy link
Contributor Author

Arithmetic methods in all four core.arrays files

Do you have a plan / WIP for this? My WIP PeriodArray PR doesn't do too much there I think.

@jbrockmendel
Copy link
Member

I do have a branch about ready for the arithmetic fixes, will open a PR later today.

@jorisvandenbossche
Copy link
Member

Do you still have segfaults in #23415 ?

There were never segfaults in that branch

Ah yes, sorry, was confusing them. What is the main difference between both branches? #23415 is not yet doing the inheritance/composition switch? (but the plan is to do it in that PR at some point?)

If this is about having the base EA tests

I am far more interested in the parametrized tests in tests/arithmetic.

Yes, I know :-) But we still want those EA base tests once they are actual EAs. Maybe this can be done separately in advance, but not sure (can maybe take a look at that). In any case, IMO we should not do it after making them actual EAs.

I was thinking it would be nice to have a GH Tag for segfaults (or other non-recoverable errors). There aren't that many of them, but they merit extra attention. Thoughs?

Are there currently open issues with reported segfaults? In case so, yes that sounds good.

@TomAugspurger
Copy link
Contributor Author

@jbrockmendel any open PRs from https://github.com/pandas-dev/pandas/pulls/jbrockmendel that could use review? (#23415 isn't quite ready, right?)

Any relevant pieces issues that can be offloaded?

I'm going to mess around with DatetimeBlock shortly, to see what pieces can be simplified there.

@jbrockmendel
Copy link
Member

any open PRs [...] that could use review?

Any relevant pieces issues that can be offloaded?

Not at the moment. Two coming up in the next hour or so: one extending timedelta64 arithmetic tests to TimedeltaArray, another implementing most of the rest of the EA interface for DTA/TDA. The latter doesn't yet have EA tests in place, which can absolutely be offloaded.

The other one coming up a little later is implementing DatetimeArray._from_sequence (or more specifically, the datetime64 analogue of #23539)

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Nov 14, 2018 via email

@jbrockmendel
Copy link
Member

thoughts on how where we should proceed for the next few steps?

Am I right in thinking that after #23643 the remaining pieces of the EA interface are just _from_sequence and optionally _values_for_argsort, argsort, and _reduce? _from_sequence I'm working on (#23675, #23702), and those other methods are orthogonal so could be worked on in parallel.

For the transition to composition, if someone wants to work on it before the EA interface is complete, the approach you're taking in the disown branch looks reasonable. I'm also seeing a lot of things in that branch that could be implemented immediately before the Arrays are disowned.

If somehow we find ourselves with excess labor to throw at related tasks, some orthogonal-ish topics:

  • TST: Test cleanup, parametrization for datetime64 arithmetic tests #23681 and TST: Extend timedelta64 arithmetic tests to TimedeltaArray #23642 make some progress towards extending the arithmetic tests to the array classes, but there is a lot more to be done to replace usages of box fixture with box_with_period, box_with_datetime, and box_with_timedelta
  • related: once pd.array (or a temporary kludge in e.g pd.util.testing) is in place those three separate fixtures can be boiled down to just box_with_array
  • reductions, move min/max/argmin/argmax from DatetimeIndexOpsMixin to DatetimeLikeArrayMixin (and wrap where appropriate). mean, std, and presumably others can be implemented in terms of the i8 values.
  • tangential: Series/DataFrame interpolation is broken for datetimelike dtypes. Implementing it on the arrays would close several issues
  • A bunch of problems are caused by a bug in DataFrame.transpose:
>>> dti = pd.date_range('2016-01-01', periods=3, tz='US/Pacific')
>>> pd.DataFrame(dti).T
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/frame.py", line 2571, in transpose
    return super(DataFrame, self).transpose(1, 0, **kwargs)
  File "pandas/core/generic.py", line 686, in transpose
    new_values = self.values.transpose(axes_numbers)
  File "pandas/core/base.py", line 672, in transpose
    nv.validate_transpose(args, kwargs)
  File "pandas/compat/numpy/function.py", line 56, in __call__
    self.defaults)
  File "pandas/util/_validators.py", line 218, in validate_args_and_kwargs
    validate_kwargs(fname, kwargs, compat_args)
  File "pandas/util/_validators.py", line 157, in validate_kwargs
    _check_for_default_values(fname, kwds, compat_args)
  File "pandas/util/_validators.py", line 69, in _check_for_default_values
    format(fname=fname, arg=key)))
ValueError: the 'axes' parameter is not supported in the pandas implementation of transpose()

I'm not sure off the top of my head whether this bug affects the composition switchover, just that I've seen it a bunch recently.

@TomAugspurger is this helpful or too much of a grab-bag?

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Nov 14, 2018

the remaining pieces of the EA interface are just _from_sequence and optionally _values_for_argsort, argsort, and _reduce? _from_sequence I'm working on (#23675, #23702), and those other methods are orthogonal so could be worked on in parallel.

My main concern there is that until we inherit from ExtensionArray, those will be untested (or we'll have to duplicate tests). Unless... I suppose we could start inheriting the base tests, without actually inheriting from them yet? That may be worth exploring. Some things like pd.isna will fail hard until we inherit, but others should work. We could also put temporary checks in things like is_extension_array_dtype to also recognize DatetimeArray and TimedeltaArray.

I suspect the transpose bug will be fixed by inheriting from ExtensionArray.

@jorisvandenbossche
Copy link
Member

@TomAugspurger If you have currently the time to further work on that branch to try to switch to composition, I would say: let's push for it now and first (and thus temporarily hold on for other changes).
I should have time on Friday to thoroughly review it (and also a bit in the weekend), today/tomorrow I am at PyParis.

I already said that before, but the sooner we can actually switch to composition, the clearer the follow-up PRs will be (eg now the _concat_same_type was already wrong in the PR, possibly because it was simply not exercised in all our concat related tests, because as long it is not an actual extension array, it is not used.

Also all the datetime-arithmetic related issues, they can in principle be done after the split. As long as artithmetic works for Index/Series (for which it is already tested), it's fine (of course, before releasing we should also fix + test all arithmetic on the arrays as well, but just to say it is not necessary to do that first)

@jbrockmendel
Copy link
Member

let's push for it now and first (and thus temporarily hold on for other changes).

@jorisvandenbossche I would really appreciate it if you didn't advocate shutting down all progress on things that I'm putting a lot of time and effort into.

@TomAugspurger
Copy link
Contributor Author

That's not how I read Joris' comment. I read "temporary hold" as... just that. A pause, not a shutting down or throwing away. I think all the effort in open PRs (and possibly some unpushed work you've done) is still vital.

There are many paths from master (plus the open PRs) to DatetimeArray. To me, a path that frontloads the switch to composition makes sense, but it's hard to say ahead of time. I've had trouble thinking through all the ramifications of a diff, partly because the current class hierarchy "feels weird" to me, and partly because I'm not familiar with this section of the code base.


Anyway, I think that #23675
and #23642 are the next to go in. @jbrockmendel do you have other WIP branches changing the indexes or arrays datetimelike files that you'd like to push?

On my own availability: I'm going to be ramping up on a largish project for dask in the few days / weeks. I'll still have time for pandas, but not as much over the next month or so. So I have a slight window to dump all my time into pandas, that I'd like to take advantage of if possible.

@jorisvandenbossche
Copy link
Member

@jbrockmendel yes, sorry if that is the way it came over. I certainly think we can do parallel work, and not all the items you listed in #23185 (comment) would go in such a split PR, it was just about what to merge first as it will probably be easier to rebase the smaller PRs than the other way around. And also, if we do that push, there will also be a lot of reviewing work :)

@jorisvandenbossche
Copy link
Member

And also, it will depend on the PR of course. If it is something that doesn't overlap a lot (like certain test changes), for sure it doesn't need to wait with being merged.

@TomAugspurger
Copy link
Contributor Author

Have we had a design discussion on DatetimeDtype? A short proposal:

Rename to DatetimeDtype. We can keep the both the unit and tz arguments (in case we ever support resolutions other than ns in the future), but we continue to raise on non-ns unit.

class DatetimeDtype(ExtensionDtype):
    def __new__(self, unit='ns', tz: Optional[Union[str, tzinfo]]=None):
        ...

We remove the "magic" creation from string.

In [3]: pd.core.dtypes.dtypes.DatetimeTZDtype('datetime64[ns, utc]')
Out[3]: datetime64[ns, utc]

that would throw an error, since 'datetime64[ns, utc]' isn't a valid unit.


One question though... I'm worried about changing the .dtype exposed to users via Series.dtype and DatetimeIndex.dtype

In [3]: pd.Series(pd.date_range('2000', periods=4)).dtype
Out[3]: dtype('<M8[ns]')

I'm not sure what the ramifications of changing that to always be DatetimeDtype would be. Of course, when a timezone is involved we'll need to the dtype to be a DatetimeDtype, but should we use the numpy type when it suffices?

@jorisvandenbossche
Copy link
Member

One question though... I'm worried about changing the .dtype exposed to users via Series.dtype and DatetimeIndex.dtype

Yes. What I thought before about this is that we would need to have an if/else logic there, to still return the numpy dtype if there is no tz. I would put this logic on the Series/Index, and have the array always return the extension dtype.

However, there might be places where we check the dtype of values which can be coming from Series/Index to be an ExtensionDtype? (eg to take another code path for extension arrays)
So the above might be problematic for such code paths?

@TomAugspurger
Copy link
Contributor Author

I would put this logic on the Series/Index, and have the array always return the extension dtype.

Agreed that's the right place to do it. I'll see if any of your concerns come up (I suspect something will).

@jorisvandenbossche
Copy link
Member

I'll see if any of your concerns come up (I suspect something will).

I would expect many of the places where we use is_extension_array_dtype and pass it the actual dtype and not the array or container ..

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Nov 15, 2018 via email

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Nov 19, 2018 via email

@jbrockmendel
Copy link
Member

@TomAugspurger thanks for the update, and for handling the tricky dtype stuff. Aside from review, is there anything the rest of us can do to be helpful?

On my end, I'm preparing to push a branch that fixes the last of the arithmetic tests (mainly with DateOffset) for DTA/TDA (without this, these arithmetic ops would fail on the Index classes after the switch to composition). #23675 needs some edits+rebase, is otherwise close to the finish line, will put DatetimeArray._from_sequence within reach.

Added a bunch of Issues to the "DatetimeArray Refactor" Project, most of them non-blockers, e.g. reduction methods we can get around to eventually.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Nov 21, 2018 via email

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Nov 27, 2018 via email

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Nov 28, 2018 via email

@jbrockmendel
Copy link
Member

Excellent, looking forward to taking a look. Are the skips segfault-free?

With a little luck many of the xfails will be fixed by implementing the remaining methods on DTA/TDA, most of which are (hopefully) near merging.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Nov 28, 2018 via email

@jbrockmendel
Copy link
Member

A thought on a way forward, seeing as how Tom has earned some down-time.

With _index_data implemented, I'm finding the approach used in the previously-segfaulting branch (tentatively) working. Define on DTI/TDI/PI respectively:

@property
def _eadata(self):
    return DatetimeArray._simple_new(self._data, freq=self.freq, tz=self.tz)

@property
def _eadata(self):
    return TimedeltaArray._simple_new(self._data, freq=self.freq)

@property
def _eadata(self):
    return self._data

Then do the entire inheritance/composition switchover, but dispatching to self._eadata instead of self._data. (Several steps later we'll remove _eadata and dispatch to _data).

This limits the diff to the index classes without changing their outward-facing behavior, making for a much more manageable scope.

Thoughts?

@TomAugspurger
Copy link
Contributor Author

My vote is for getting #24024 in sooner rather than later, but I'm the most familiar with the diff so it's easier for me to go through the entire thing at once. It's blocking several changes I'd like to wrap up, and my time for pandas is limited.

@jbrockmendel
Copy link
Member

My vote is for getting #24024 in sooner rather than later

AFAICT the sticking points are:

I have no strong opinion on which approach to take for the astype question. For the rest, I think the best route to "sooner rather than later" is to merge #24394

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. Master Tracker High level tracker for similar issues
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants