ENH: support of pandas.DataFrame.hist for datetime data #36287

onshek · 2020-09-11T10:04:07Z

closes Histogram or kde from datetime column #32590
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Update

import numpy as np
from pandas import DataFrame, to_datetime
from datetime import timedelta

df = DataFrame(
    {
        "a": np.random.rand(10),
        "b": [timedelta(np.random.randn()) for _ in range(10)],
        "c": to_datetime(
            np.random.randint(
                1582800000000000000, 1583500000000000000, 10, dtype=np.int64
            )
        ),
        "d": to_datetime(
            np.random.randint(
                1582800000000000000, 1583500000000000000, 10, dtype=np.int64
            ),
            utc=True
        ),
    }
)

df.dtypes
a                float64
b        timedelta64[ns]
c         datetime64[ns]
d    datetime64[ns, UTC]
dtype: object

df.hist(xrot=90, figsize=(9, 6))
array([[<AxesSubplot:title={'center':'a'}>,
        <AxesSubplot:title={'center':'c'}>],
       [<AxesSubplot:title={'center':'d'}>, <AxesSubplot:>]], dtype=object)

WIP

Since there has been no response for several days in #32590, I decided to open this PR for further discussion.
There are two feasible options:

Plan A:
As is shown in the files changed, data._get_numeric_data() is modified to data._get_numeric_or_datetime_data(), and I will work on related tests as the next step.

Plan B, this may be a hotfix according to #32590 (comment):

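# Column labels of the datetime columns (select_dtypes itself returns a DataFrame)
column_dt = data.select_dtypes(include='datetime64[ns]').columns
if len(column_dt) > 0:
    # Temporarily cast to int64 so the numeric filter keeps these columns
    data[column_dt] = data[column_dt].astype(np.int64)
data = data._get_numeric_data()
if len(column_dt) > 0:
    # Restore the original datetime dtype
    data[column_dt] = data[column_dt].astype('datetime64[ns]')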

Also, related tests will be modified.
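For context, here is a minimal sketch of why the integer round-trip in Plan B is enough for binning. It is not the code in this PR: it uses plain NumPy, and the sample timestamps and variable names are hypothetical. Datetimes are cast to int64 nanoseconds, binned, and the bin edges are mapped back to timestamps.

```python
import numpy as np

# Hypothetical sample: four timestamps at nanosecond resolution.
stamps = np.array(
    ["2020-02-27", "2020-02-28", "2020-02-28", "2020-03-05"],
    dtype="datetime64[ns]",
)

# Cast to int64 nanoseconds since the epoch so np.histogram can bin them.
as_int = stamps.astype(np.int64)
counts, edges = np.histogram(as_int, bins=2)

# Map the float bin edges back to timestamps, e.g. for axis labels.
edge_stamps = edges.astype(np.int64).astype("datetime64[ns]")

print(counts)
print(edge_stamps)
```

The edges pass through float64, so they may be rounded by a few hundred nanoseconds, which should not matter for a histogram.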

Any comments or thoughts are welcome.

pep8speaks · 2020-09-11T10:04:14Z

Hello @onshek! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-04 12:59:16 UTC

jreback

pls always add a test

jreback · 2020-09-12T20:48:25Z

pandas/core/generic.py

        return self._constructor(self._mgr.get_bool_data()).__finalize__(self)

+    def _get_numeric_or_datetime_data(self):
+        return self._constructor(self._mgr.get_numeric_or_datetime_data()).__finalize__(self)


Instead of adding to the internal routines, let's use .select_dtypes() here. I think that's more idiomatic (we should also change the others).

Refactored, I think the updated version is clearer.

This reverts commit e8989e6.

onshek · 2020-09-13T12:38:18Z

Also need to update whatsnew note here?

jreback

will also need a whatsnew note, this is an api change so put in other api changes in 1.2

pandas/plotting/_matplotlib/hist.py

…column

recover info deleted by mistake

charlesdong1991 · 2020-09-15T08:34:31Z

pandas/plotting/_matplotlib/hist.py

+    # GH32590
+    columns_copy = data.columns
+    numeric_data = data._get_numeric_data()
+    datetime_data = data.select_dtypes(include="datetime64[ns]")


should only datetime type be included? what about timedelta? could you pls check if timedelta should also be supported?

if so you could do something like:

include_type = ['datetime', 'timedelta']
datetime_data = data.select_dtypes(include=include_type)

or you could even do

include_type = ['datetime', 'timedelta', np.number]
data = data.select_dtypes(include=include_type)

then probably you could avoid using data._get_numeric_data() and concat.

Agree with the first point.
Would it be ok to use np.number instead of _get_numeric_data()? I'm not sure about this.
I'll have a check on both.

@charlesdong1991 @jreback timedelta is not supported even in pd.Series.hist, but we can convert both timedelta and datetime64[ns] into np.int64 and then keep the original data = data._get_numeric_data().
As to the second point, this minimal example shows that df.select_dtypes(include=np.number) differs from df._get_numeric_data():

import numpy as np
from pandas import DataFrame, to_datetime
from datetime import timedelta

df = DataFrame(
    {
        "a": np.random.randn(10),
        "b": [timedelta(np.random.randn()) for _ in range(10)],
        "c": to_datetime(
            np.random.randint(
                1582800000000000000, 1583500000000000000, 10, dtype=np.int64
            )
        ),
    }
)

df.select_dtypes(include=np.number)
          a                         b
0  0.848384 -1 days +07:39:32.366134
1 -0.184510 -1 days +14:42:46.719730
2 -0.835072   1 days 06:49:21.386804
3 -0.026554   0 days 22:18:48.433275
4 -2.365708 -1 days +08:52:09.934553
5  0.974325   1 days 05:30:18.644021
6  0.135194   1 days 05:06:18.969120
7  1.802466   1 days 07:00:26.538467
8 -1.509265   0 days 00:19:41.677979
9 -0.752476 -1 days +09:55:45.548658

df._get_numeric_data()
          a
0  0.848384
1 -0.184510
2 -0.835072
3 -0.026554
4 -2.365708
5  0.974325
6  0.135194
7  1.802466
8 -1.509265
9 -0.752476

Any thoughts?

Yeah, timedelta is treated as a number since it represents an interval; _get_numeric_data probably does some other processing to filter it out. In that case I think you could keep _get_numeric_data and concat in this PR!
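The difference discussed above can be traced to NumPy's scalar type hierarchy, where timedelta64 sits under np.number while datetime64 does not. A small sketch, with a hypothetical three-column frame, that checks both the hierarchy and the resulting selection (note that _get_numeric_data is a private pandas method and may change):

```python
import numpy as np
import pandas as pd

# NumPy's scalar hierarchy: timedelta64 derives from signedinteger,
# datetime64 derives directly from generic.
timedelta_is_number = issubclass(np.timedelta64, np.number)
datetime_is_number = issubclass(np.datetime64, np.number)

# Hypothetical frame with one float, one timedelta and one datetime column.
df = pd.DataFrame(
    {
        "a": [0.5, 1.5],
        "b": pd.to_timedelta([1, 2], unit="D"),
        "c": pd.to_datetime(["2020-02-27", "2020-03-05"]),
    }
)

by_select = list(df.select_dtypes(include=np.number).columns)
by_private = list(df._get_numeric_data().columns)  # private API

print(timedelta_is_number, datetime_is_number)
print(by_select, by_private)
```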

onshek · 2020-09-15T17:36:01Z

@charlesdong1991 @jreback Two tests unexpectedly fail in 35cccd6 (I think 7569a31 has the same problem even though it passed all checks); it's a bit more complicated than initially projected.
As mentioned in #36287 (comment), converting datetimetz, datetime64 and timedelta into np.int64 (or using map/dict, just like what #32590 (comment) has discussed before) may be a simpler choice, which can leave all tests unchanged.

charlesdong1991 · 2020-09-15T19:12:08Z

Two tests unexpectedly fail in 35cccd6 (I think 7569a31 has the same problem even though it passed all checks); it's a bit more complicated than initially projected.

no worries, this flaky failure occurred before in other PRs, probably if you rebase and commit, it will be gone.

As mentioned in #36287 (comment), converting datetimetz, datetime64 and timedelta into np.int64 (or using map/dict, just like what #32590 (comment) has discussed before) may be a simpler choice,

yeah, it's okay to go for this approach, one thing to note is that we'd prefer to see a plot example after this change, you could add the screenshot on top (e.g. in your PR description below your plan a/b).

Furthermore, I think it's nice to see that you already have a test which shows datetime input won't raise the error, and it will be nicer if you could add a test to show the output of the plot with a datetime input is correct! There are many examples in tests/plotting where you could use them as references to add a test!

onshek · 2020-09-16T02:59:34Z

I'll go on to refactor the code and complement the related tests. This may take some time. Thanks for your advice @charlesdong1991 !

…column

onshek · 2020-09-17T15:59:45Z

[WIP] todo: add new tests and pic

jreback · 2020-09-19T20:51:41Z

pandas/plotting/_matplotlib/hist.py

+    # GH32590
+    columns_copy = data.columns
+    numeric_data = data._get_numeric_data()
+    datetime_data = data.select_dtypes(include=["datetime64", "datetimetz"])


you can just add on integer, float (and add exclude=['timedelta'])

I think include=[int, float] is not enough to replace _get_numeric_data(), because numeric types such as uint32 won't be selected by int (but they currently work well with pd.DataFrame.hist). Also, include=[np.number] is not suitable either, see #36287 (comment).
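A small sketch of the uint32 point, assuming a hypothetical frame with one unsigned, one signed and one float column. The builtin int and float only match specific NumPy types, while np.number matches the whole numeric hierarchy:

```python
import numpy as np
import pandas as pd

# Hypothetical frame; column names are for illustration only.
df = pd.DataFrame(
    {
        "u": np.array([1, 2, 3], dtype=np.uint32),
        "i": np.array([1, 2, 3], dtype=np.int64),
        "f": [0.1, 0.2, 0.3],
    }
)

# Builtin int/float: the unsigned column is not selected.
by_builtin = list(df.select_dtypes(include=[int, float]).columns)

# np.number: all three numeric columns are selected.
by_number = list(df.select_dtypes(include=[np.number]).columns)

print(by_builtin)
print(by_number)
```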

As to timedelta, experimentally timedelta is not supported yet, but we can convert it into float to solve this problem, see

As mentioned in #36287 (comment), converting datetimetz, datetime64 and timedelta into np.int64 (or using map/dict, just like what #32590 (comment) has discussed before) may be a simpler choice, which can leave all tests unchanged.

BTW, which test file (test_hist_method.py/test_frame.py/test_datetimelike.py) is better to put new tests required by #36287 (comment) in ? @jreback @charlesdong1991

include=[np.number] is not suitable either

what if include np.number and exclude timedelta? will the behaviour remain unchanged and all tests can pass?

which test file (test_hist_method.py/test_frame.py/test_datetimelike.py) is better to put new tests

should be test_hist_method

what if include np.number and exclude timedelta? will the behaviour remain unchanged and all tests can pass?

It works, I'll change to it
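A minimal sketch of the selection agreed on here, assuming a hypothetical frame that mirrors the dtypes in the PR description. It includes np.number plus the datetime dtypes and excludes timedelta; the exact call in hist.py may differ.

```python
import numpy as np
import pandas as pd

dates = ["2020-02-27", "2020-03-01", "2020-03-05"]

# Hypothetical frame: float, timedelta, naive datetime, tz-aware datetime.
df = pd.DataFrame(
    {
        "a": [0.1, 0.2, 0.3],
        "b": pd.to_timedelta([1, 2, 3], unit="D"),
        "c": pd.to_datetime(dates),
        "d": pd.to_datetime(dates, utc=True),
    }
)

# Keep numeric and datetime columns, drop timedelta (which np.number
# would otherwise pull in).
plottable = df.select_dtypes(
    include=[np.number, "datetime64", "datetimetz"],
    exclude=["timedelta"],
)

print(list(plottable.columns))
```

This matches the plot in the PR description, where columns a, c and d get a subplot and b does not.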

onshek · 2020-09-20T14:19:13Z

The pic and copy-pastable example are updated in ENH: support of pandas.DataFrame.hist for datetime data #36287 (comment)
Finally, ENH: support of pandas.DataFrame.hist for datetime data #36287 (comment) is the best choice I think
Related tests are modified to add a datetime column in original df data

onshek · 2020-09-28T04:03:23Z

@charlesdong1991 @jreback Do you have time to review this? Thanks!

charlesdong1991

thanks @onshek

Sorry I wasn't very active recently due to personal issues.

I just took a look at your changes and left a couple of minor suggestions, but overall this PR looks very good to me!

BTW pls rebase and resolve conflicts

pandas/tests/plotting/common.py

pandas/tests/plotting/test_hist_method.py

…column

onshek · 2020-10-04T13:02:11Z

@charlesdong1991 thanks for your suggestions, let me know if there's still anything to improve :)

charlesdong1991

very nice

Not sure if this is a bit too much: I see you create the Series in exactly the same way (to_datetime(np.random.randint(blabla))) several times in the tests. If all of those are necessary (one or two tests proving it works for datetime and datetimetz would be enough for me, but it's okay if you want to keep them all), maybe use a fixture/method/constant to build it (e.g. to_datetime(np.random.randint(blabla))) and assign it in the different tests (df[col] = constant/method/fixture), so that we avoid repeating the same thing?
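One way this suggestion could look, as a sketch only: make_random_datetimes is a hypothetical helper name, not something in the pandas test suite, and it could equally be wrapped in a pytest fixture. The integer bounds are the ones used in the PR description.

```python
import numpy as np
from pandas import DataFrame, to_datetime


def make_random_datetimes(n, seed=0, utc=False):
    """Hypothetical helper: n random timestamps, optionally tz-aware (UTC)."""
    rng = np.random.default_rng(seed)
    ns = rng.integers(
        1582800000000000000, 1583500000000000000, size=n, dtype=np.int64
    )
    # Integers are interpreted as nanoseconds since the epoch.
    return to_datetime(ns, utc=utc)


# Reuse the helper instead of repeating the construction in every test.
df = DataFrame({"a": np.random.default_rng(1).random(10)})
df["c"] = make_random_datetimes(10)
df["d"] = make_random_datetimes(10, seed=2, utc=True)

print(df.dtypes)
```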

defer to @jreback @WillAyd for reviews

onshek · 2020-10-08T11:26:39Z

@jreback Can you take some time to review this? Thanks!

jreback · 2020-10-10T23:07:48Z

thanks @onshek very nice!

…6287)

ENH: support of pandas.DataFrame.hist for datetime data

e8989e6

jreback requested changes Sep 12, 2020


jreback added the Visualization plotting label Sep 12, 2020

onshek added 2 commits September 13, 2020 13:08

Revert "ENH: support of pandas.DataFrame.hist for datetime data"

ed86887

This reverts commit e8989e6.

ENH: support of pandas.DataFrame.hist for datetime data

9dc96cf

onshek marked this pull request as ready for review September 13, 2020 10:02

REF: reformat imports

602c59b

jreback requested changes Sep 13, 2020


pandas/plotting/_matplotlib/hist.py Outdated

pandas/plotting/_matplotlib/hist.py Outdated

onshek and others added 4 commits September 14, 2020 18:01

UPD: changed as requested

17c182f

Merge branch 'master' into issue32590-Histogram-or-kde-from-datetime-…

aad3433

…column

FIX: recover whatsnew note v1.2.0

2cfc979

recover info deleted by mistake

UPD: delete an extra line

7569a31

charlesdong1991 suggested changes Sep 15, 2020


UPD: explicit is better than implicit

35cccd6

onshek and others added 2 commits September 17, 2020 23:55

UPD: hist support for type of datetime64 and datetimetz

bd7f450

Merge branch 'master' into issue32590-Histogram-or-kde-from-datetime-…

731f097

…column

jreback requested changes Sep 19, 2020


UPD: modify and reformat related tests

3a99250

charlesdong1991 suggested changes Oct 3, 2020


pandas/tests/plotting/common.py Outdated

pandas/tests/plotting/test_hist_method.py Outdated

pandas/tests/plotting/test_hist_method.py Outdated

onshek added 3 commits October 4, 2020 20:41

UPD: reformat code

e9fa7ec

Merge branch 'master' into issue32590-Histogram-or-kde-from-datetime-…

1c098c8

…column

UPD: focus on main issues

b24df4d

onshek added 3 commits October 4, 2020 20:56

UPD: recover # GH 9351

b02c017

UPD: recover a blank line

315d77e

UPD: reformat the blank line

c8c7460

onshek requested review from charlesdong1991 and jreback October 4, 2020 13:11

charlesdong1991 approved these changes Oct 4, 2020


jreback added this to the 1.2 milestone Oct 10, 2020

jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Oct 10, 2020

jreback approved these changes Oct 10, 2020


jreback merged commit 03709d4 into pandas-dev:master Oct 10, 2020

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

ENH: support of pandas.DataFrame.hist for datetime data (pandas-dev#3…

b4aedda

…6287)
