ENH: support of pandas.DataFrame.hist for datetime data #36287

Conversation

onshek
Contributor

@onshek onshek commented Sep 11, 2020

Update

import numpy as np
from pandas import DataFrame, to_datetime
from datetime import timedelta

df = DataFrame(
    {
        "a": np.random.rand(10),
        "b": [timedelta(np.random.randn()) for _ in range(10)],
        "c": to_datetime(
            np.random.randint(
                1582800000000000000, 1583500000000000000, 10, dtype=np.int64
            )
        ),
        "d": to_datetime(
            np.random.randint(
                1582800000000000000, 1583500000000000000, 10, dtype=np.int64
            ),
            utc=True
        ),
    }
)

df.dtypes
a                float64
b        timedelta64[ns]
c         datetime64[ns]
d    datetime64[ns, UTC]
dtype: object

df.hist(xrot=90, figsize=(9, 6))
array([[<AxesSubplot:title={'center':'a'}>,
        <AxesSubplot:title={'center':'c'}>],
       [<AxesSubplot:title={'center':'d'}>, <AxesSubplot:>]], dtype=object)

[Image: histograms of columns a, c and d produced by the df.hist call above]

WIP

Since there has been no response in #32590 for several days, I decided to open this PR for further discussion.
There are two feasible options:

Plan A:
As shown in the files changed, data._get_numeric_data() is replaced with data._get_numeric_or_datetime_data(). I will work on the related tests as the next step.

Plan B, which may be a hotfix according to #32590 (comment):

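# Take the names of the datetime columns (not the sub-frame itself)
column_dt = data.select_dtypes(include='datetime64[ns]').columns
if len(column_dt) > 0:
    # Temporarily store datetimes as int64 so they pass the numeric filter
    data[column_dt] = data[column_dt].astype(np.int64)
data = data._get_numeric_data()
if len(column_dt) > 0:
    # Restore the original datetime dtype
    data[column_dt] = data[column_dt].astype('datetime64[ns]')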

Also, related tests will be modified.

Any comments or thoughts are welcome.

@pep8speaks

pep8speaks commented Sep 11, 2020

Hello @onshek! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-10-04 12:59:16 UTC

Contributor

@jreback jreback left a comment

pls always add a test

@@ -5438,6 +5438,9 @@ def _get_numeric_data(self):
    def _get_bool_data(self):
        return self._constructor(self._mgr.get_bool_data()).__finalize__(self)

    def _get_numeric_or_datetime_data(self):
        return self._constructor(self._mgr.get_numeric_or_datetime_data()).__finalize__(self)
Contributor

Instead of adding to the internal routines, let's use .select_dtypes() here. I think it's more idiomatic (we should also change the others).

Contributor Author

Refactored, I think the updated version is clearer.

@jreback jreback added the Visualization plotting label Sep 12, 2020
@onshek onshek marked this pull request as ready for review September 13, 2020 10:02
@onshek
Contributor Author

onshek commented Sep 13, 2020

Do I also need to add a whatsnew note here?

Contributor

@jreback jreback left a comment

This will also need a whatsnew note. It is an API change, so put it under other API changes in 1.2.

# GH32590
columns_copy = data.columns
numeric_data = data._get_numeric_data()
datetime_data = data.select_dtypes(include="datetime64[ns]")
Member

should only datetime type be included? what about timedelta? could you pls check if timedelta should also be supported?

if so you could do something like:

include_type = ['datetime', 'timedelta']
datetime_data = data.select_dtypes(include=include_type)

or you could even do

include_type = ['datetime', 'timedelta', np.number]
data = data.select_dtypes(include=include_type)

then probably you could avoid using data._get_numeric_data() and concat.

Contributor Author

Agree with the first point.
Would it be ok to use np.number instead of _get_numeric_data()? I'm not sure about this.
I'll have a check on both.

Contributor Author

@onshek onshek Sep 15, 2020

@charlesdong1991 @jreback timedelta is not supported even in pd.Series.hist, but we can convert both timedelta and datetime64[ns] into np.int64 and then keep the original data = data._get_numeric_data().
As to the second point, this minimal example shows that df.select_dtypes(include=np.number) differs from df._get_numeric_data():

import numpy as np
from pandas import DataFrame, to_datetime
from datetime import timedelta

df = DataFrame({
    "a": np.random.randn(10),
    "b": [timedelta(np.random.randn()) for _ in range(10)],
    "c": to_datetime(np.random.randint(1582800000000000000, 1583500000000000000, 10, dtype=np.int64)),
})

df.select_dtypes(include=np.number)
a	b
0	0.848384	-1 days +07:39:32.366134
1	-0.184510	-1 days +14:42:46.719730
2	-0.835072	1 days 06:49:21.386804
3	-0.026554	0 days 22:18:48.433275
4	-2.365708	-1 days +08:52:09.934553
5	0.974325	1 days 05:30:18.644021
6	0.135194	1 days 05:06:18.969120
7	1.802466	1 days 07:00:26.538467
8	-1.509265	0 days 00:19:41.677979
9	-0.752476	-1 days +09:55:45.548658

df._get_numeric_data()
a
0	0.848384
1	-0.184510
2	-0.835072
3	-0.026554
4	-2.365708
5	0.974325
6	0.135194
7	1.802466
8	-1.509265
9	-0.752476

Any thoughts?

Member

Yeah, timedelta is treated as a number since it represents an interval. Probably _get_numeric_data does some other processing to filter it out, so I think you could keep _get_numeric_data and concat in this PR!
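
The difference discussed above can be traced to NumPy's scalar type hierarchy. The following is a minimal sketch, assuming a small illustrative frame (the column contents are made up, not the data from this thread): timedelta64 is a subclass of np.number in NumPy, while datetime64 is not, which would explain why select_dtypes(include=np.number) picks up the timedelta column.

```python
import numpy as np
import pandas as pd

# NumPy places timedelta64 under signedinteger, so it counts as np.number;
# datetime64 derives directly from np.generic and does not.
timedelta_is_number = issubclass(np.timedelta64, np.number)
datetime_is_number = issubclass(np.datetime64, np.number)

# Illustrative frame (values are assumptions for this sketch).
df = pd.DataFrame(
    {
        "a": [0.1, 0.2, 0.3],
        "b": pd.timedelta_range("1 day", periods=3, freq="D"),
        "c": pd.date_range("2020-02-27", periods=3, freq="D"),
    }
)

# Columns that a plain np.number selection keeps.
selected = list(df.select_dtypes(include=np.number).columns)
print(timedelta_is_number, datetime_is_number, selected)
```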

@onshek
Contributor Author

onshek commented Sep 15, 2020

@charlesdong1991 @jreback Two tests unexpectedly fail in 35cccd6 (I think 7569a31 has the same problem even though it passed all checks), so this is a bit more complicated than initially projected.
As mentioned in #36287 (comment), converting datetimetz, datetime64 and timedelta into np.int64 (or using map/dict, just like what #32590 (comment) has discussed before) may be a simpler choice, which can leave all tests unchanged.
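
A minimal sketch of the int64 round trip proposed here, assuming nanosecond-resolution datetimes and using three made-up timestamps within the range used earlier in this thread. It only illustrates the conversion idea, not the code that ended up in the PR.

```python
import numpy as np
import pandas as pd

# Nanoseconds since the epoch (illustrative values).
ns = np.array(
    [1582800000000000000, 1583000000000000000, 1583500000000000000],
    dtype=np.int64,
)
stamps = pd.Series(pd.to_datetime(ns))

# View the datetimes as plain integers, which numeric-only code paths accept.
as_int = stamps.astype(np.int64)

# Convert back to datetimes once the numeric filtering is done.
restored = pd.to_datetime(as_int)
print(as_int.dtype, restored.dtype)
```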

@charlesdong1991
Member

charlesdong1991 commented Sep 15, 2020

Two tests unexpectedly fail in 35cccd6 (I think 7569a31 has the same problem even though it passed all checks), so this is a bit more complicated than initially projected.

no worries, this flaky failure occurred before in other PRs, probably if you rebase and commit, it will be gone.

As mentioned in #36287 (comment), converting datetimetz, datetime64 and timedelta into np.int64 (or using map/dict, just like what #32590 (comment) has discussed before) may be a simpler choice,

yeah, it's okay to go for this approach, one thing to note is that we'd prefer to see a plot example after this change, you could add the screenshot on top (e.g. in your PR description below your plan a/b).

Furthermore, I think it's nice to see that you already have a test which shows datetime input won't raise the error, and it will be nicer if you could add a test to show the output of the plot with a datetime input is correct! There are many examples in tests/plotting where you could use them as references to add a test!

@onshek
Contributor Author

onshek commented Sep 16, 2020

I'll go on to refactor the code and complement the related tests. This may take some time. Thanks for your advice @charlesdong1991!

@onshek
Contributor Author

onshek commented Sep 17, 2020

[WIP] todo: add new tests and pic

# GH32590
columns_copy = data.columns
numeric_data = data._get_numeric_data()
datetime_data = data.select_dtypes(include=["datetime64", "datetimetz"])
Contributor

you can just add on integer, float (and add exclude=['timedelta'])

Contributor Author

@onshek onshek Sep 20, 2020

I think include=[int, float] is not enough to replace _get_numeric_data(), because numeric types such as uint32 won't be selected by int (but they currently work well with pd.DataFrame.hist). Also, include=[np.number] is not qualified either, see #36287 (comment).
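
The uint32 point can be checked with a small frame. This is a sketch under the assumption of three illustrative columns (names and values are made up): selecting by the Python types int and float skips the unsigned column, whereas np.number keeps it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "u": np.array([1, 2, 3], dtype=np.uint32),
        "i": np.array([1, 2, 3], dtype=np.int64),
        "f": np.array([0.5, 1.5, 2.5], dtype=np.float64),
    }
)

# Python's int/float only cover signed integers and floats.
by_python_types = list(df.select_dtypes(include=[int, float]).columns)

# np.number covers every NumPy numeric scalar type, including unsigned ones.
by_np_number = list(df.select_dtypes(include=[np.number]).columns)
print(by_python_types, by_np_number)
```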

As to timedelta, experiments show it is not supported yet, but we can convert it into float to solve this problem, see:

As mentioned in #36287 (comment), converting datetimetz, datetime64 and timedelta into np.int64 (or using map/dict, just like what #32590 (comment) has discussed before) may be a simpler choice, which can leave all tests unchanged.

BTW, which test file (test_hist_method.py/test_frame.py/test_datetimelike.py) is better to put new tests required by #36287 (comment) in ? @jreback @charlesdong1991

Member

include=[np.number] is not qualified too

what if include np.number and exclude timedelta? will the behaviour remain unchanged and all tests can pass?

which test file (test_hist_method.py/test_frame.py/test_datetimelike.py) is better to put new tests

should be test_hist_method

Contributor Author

what if include np.number and exclude timedelta? will the behaviour remain unchanged and all tests can pass?

It works, I'll switch to it.
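
A minimal sketch of the selection agreed on here, assuming it takes roughly this form (the frame and its column names are illustrative, and this is not claimed to be the exact code in the PR): numeric, naive datetime and tz-aware datetime columns are kept, while timedelta is excluded explicitly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [0.1, 0.2, 0.3],
        "b": pd.timedelta_range("1 day", periods=3, freq="D"),
        "c": pd.date_range("2020-02-27", periods=3, freq="D"),
        "d": pd.date_range("2020-02-27", periods=3, freq="D", tz="UTC"),
        "e": np.array([1, 2, 3], dtype=np.uint32),
        "f": ["x", "y", "z"],
    }
)

# np.number alone would also match timedelta, so it is excluded by name.
plottable = df.select_dtypes(
    include=[np.number, "datetime64", "datetimetz"],
    exclude=["timedelta"],
)
print(list(plottable.columns))
```

Listing "datetimetz" separately is needed in this sketch because "datetime64" only matches timezone-naive columns.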

@onshek
Contributor Author

onshek commented Sep 20, 2020

@onshek
Contributor Author

onshek commented Sep 28, 2020

@charlesdong1991 @jreback Do you have time to review this? Thanks!

Member

@charlesdong1991 charlesdong1991 left a comment

thanks @onshek

Sorry I wasn't very active recently due to personal issues.

I just took a look at your changes and left a couple of minor suggestions, but overall this PR looks very good to me!

BTW pls rebase and resolve conflicts

@onshek
Contributor Author

onshek commented Oct 4, 2020

@charlesdong1991 thanks for your suggestions, let me know if there's still anything to improve :)

Member

@charlesdong1991 charlesdong1991 left a comment

very nice

Not sure if it's a bit too much: I see you create the Series in exactly the same way (to_datetime(np.random.randint(blabla))) several times in the tests. One or two cases proving it works for datetime and datetimetz would be enough for me, though it's okay if you want to keep them all. If they are all necessary, maybe use a fixture/method/constant for this and assign it in the different tests (df[col] = constant/method/fixture), so that we avoid repeating the same thing?
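
One way the suggestion could look is sketched below. The helper name make_random_datetimes and its parameters are hypothetical, introduced only for illustration; the bounds are copied from the examples earlier in this thread, and a plain function is used instead of a pytest fixture to keep the sketch self-contained.

```python
import numpy as np
import pandas as pd

# Bounds copied from the examples in this thread (nanoseconds since the epoch).
LOW_NS = 1582800000000000000
HIGH_NS = 1583500000000000000


def make_random_datetimes(n=10, utc=False, seed=0):
    """Hypothetical helper: n random timestamps, optionally tz-aware (UTC)."""
    rng = np.random.default_rng(seed)
    values = rng.integers(LOW_NS, HIGH_NS, size=n, dtype=np.int64)
    return pd.to_datetime(values, utc=utc)


# Each test can then assign the shared helper's output to its own column.
df = pd.DataFrame(
    {
        "naive": make_random_datetimes(),
        "aware": make_random_datetimes(utc=True),
    }
)
print(df.dtypes)
```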

defer to @jreback @WillAyd for reviews

@onshek
Contributor Author

onshek commented Oct 8, 2020

@jreback Could you take some time to review this? Thanks!

@jreback jreback added this to the 1.2 milestone Oct 10, 2020
@jreback jreback added the Dtype Conversions Unexpected or buggy dtype conversions label Oct 10, 2020
@jreback jreback merged commit 03709d4 into pandas-dev:master Oct 10, 2020
@jreback
Contributor

jreback commented Oct 10, 2020

thanks @onshek very nice!

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
Labels
Dtype Conversions Unexpected or buggy dtype conversions Visualization plotting
Development

Successfully merging this pull request may close these issues.

Histogram or kde from datetime column
4 participants