ENH: support of pandas.DataFrame.hist for datetime data #36287
pls always add a test
pandas/core/generic.py
Outdated
@@ -5438,6 +5438,9 @@ def _get_numeric_data(self):
    def _get_bool_data(self):
        return self._constructor(self._mgr.get_bool_data()).__finalize__(self)

    def _get_numeric_or_datetime_data(self):
        return self._constructor(self._mgr.get_numeric_or_datetime_data()).__finalize__(self)
Instead of adding to the internal routines, let's use .select_dtypes() here. I think it's more idiomatic (we should also change the others).
Refactored, I think the updated version is clearer.
Do I also need to update the whatsnew note here?
This will also need a whatsnew note. It is an API change, so put it under "Other API changes" for 1.2.
recover info deleted by mistake
pandas/plotting/_matplotlib/hist.py
Outdated
# GH32590
columns_copy = data.columns
numeric_data = data._get_numeric_data()
datetime_data = data.select_dtypes(include="datetime64[ns]")
Should only the datetime type be included? What about timedelta? Could you please check whether timedelta should also be supported?
If so, you could do something like:
    include_type = ['datetime', 'timedelta']
    datetime_data = data.select_dtypes(include=include_type)
or you could even do
    include_type = ['datetime', 'timedelta', np.number]
    data = data.select_dtypes(include=include_type)
Then you could probably avoid using data._get_numeric_data() and concat.
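A minimal sketch of the two suggestions above, using a small hypothetical frame (the column names `a`, `b` and `c` are made up for illustration). It only shows which columns `select_dtypes` keeps when the datetime-like types are listed explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one float, one timedelta and one naive datetime column.
df = pd.DataFrame(
    {
        "a": [0.1, 0.2, 0.3],
        "b": pd.to_timedelta([1, 2, 3], unit="D"),
        "c": pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"]),
    }
)

# First suggestion: pick only the datetime-like columns.
datetime_like = df.select_dtypes(include=["datetime", "timedelta"])

# Second suggestion: pick numeric and datetime-like columns in one call,
# which would make a separate _get_numeric_data() + concat unnecessary.
everything = df.select_dtypes(include=["datetime", "timedelta", np.number])

print(list(datetime_like.columns))
print(list(everything.columns))
```

Column order follows the original frame, so no reordering step is needed after the selection.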
Agree with the first point.
Would it be OK to use np.number instead of _get_numeric_data()? I'm not sure about this.
I'll check both.
@charlesdong1991 @jreback timedelta is not supported even in pd.Series.hist, but we can convert both timedelta and datetime64[ns] into np.int64 and then keep the original data = data._get_numeric_data().
As to the second point, this minimal example shows that df.select_dtypes(include=np.number) differs from df._get_numeric_data():
import numpy as np
from pandas import DataFrame, to_datetime
from datetime import timedelta

df = DataFrame({
    "a": np.random.randn(10),
    "b": [timedelta(np.random.randn()) for _ in range(10)],
    "c": to_datetime(np.random.randint(1582800000000000000, 1583500000000000000, 10, dtype=np.int64)),
})
df.select_dtypes(include=np.number)
          a                        b
0  0.848384 -1 days +07:39:32.366134
1 -0.184510 -1 days +14:42:46.719730
2 -0.835072   1 days 06:49:21.386804
3 -0.026554   0 days 22:18:48.433275
4 -2.365708 -1 days +08:52:09.934553
5  0.974325   1 days 05:30:18.644021
6  0.135194   1 days 05:06:18.969120
7  1.802466   1 days 07:00:26.538467
8 -1.509265   0 days 00:19:41.677979
9 -0.752476 -1 days +09:55:45.548658
df._get_numeric_data()
a
0 0.848384
1 -0.184510
2 -0.835072
3 -0.026554
4 -2.365708
5 0.974325
6 0.135194
7 1.802466
8 -1.509265
9 -0.752476
Any thought?
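The int64 conversion mentioned in the first point can be sketched as below. This is only an illustration with made-up values, not the code in this PR. It assumes nanosecond-resolution data and uses `astype("int64")`, which gives nanoseconds since the epoch for datetimes and a plain nanosecond count for timedeltas.

```python
import numpy as np
import pandas as pd

# Made-up values with explicit nanosecond resolution.
stamps = pd.Series(np.array(["1970-01-01T00:00:01"], dtype="datetime64[ns]"))
deltas = pd.Series(np.array([2_000_000_000], dtype="timedelta64[ns]"))

# Both become ordinary int64 columns that numeric-only code paths accept.
stamps_as_int = stamps.astype("int64")
deltas_as_int = deltas.astype("int64")

print(stamps_as_int.iloc[0], deltas_as_int.iloc[0])  # 1000000000 2000000000
```

The trade-off of this route is that axis labels would show raw integers unless they are converted back for display.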
Yeah, timedelta is treated as a number since it represents an interval; _get_numeric_data probably does some other processing to filter it out. So I think you could keep _get_numeric_data and concat in this PR!
@charlesdong1991 @jreback Two tests unexpectedly fail in 35cccd6 (I think 7569a31 has the same problem even though it passed all checks). It's a bit more complicated than initially expected.
no worries, this flaky failure occurred before in other PRs, probably if you rebase and commit, it will be gone.
yeah, it's okay to go for this approach. One thing to note is that we'd prefer to see a plot example after this change; you could add the screenshot on top (e.g. in your PR description below your plan a/b). Furthermore, I think it's nice to see that you already have a test which shows
I'll go on to refactor the code and complement the related tests. This may take some time. Thanks for your advice @charlesdong1991!
[WIP] todo: add new tests and pic
pandas/plotting/_matplotlib/hist.py
Outdated
# GH32590
columns_copy = data.columns
numeric_data = data._get_numeric_data()
datetime_data = data.select_dtypes(include=["datetime64", "datetimetz"])
you can just add on integer, float (and add exclude=['timedelta'])
I think include=[int, float] is not enough to replace _get_numeric_data(), because numeric types such as uint32 won't be selected by int (but they currently work well with pd.DataFrame.hist). include=[np.number] does not qualify either, see #36287 (comment).
As to timedelta, experiments show that it is not supported yet, but we can convert it into float to solve this problem, see
As mentioned in #36287 (comment), converting datetimetz, datetime64 and timedelta into np.int64 (or using map/dict, as discussed in #32590 (comment)) may be a simpler choice that leaves all tests unchanged.
BTW, which test file (test_hist_method.py/test_frame.py/test_datetimelike.py) is the better place for the new tests required by #36287 (comment)? @jreback @charlesdong1991
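The point about `uint32` can be checked with a small sketch. The frame and its column names are hypothetical; the only claim illustrated is that the Python types `int` and `float` do not match unsigned columns, while `np.number` does.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with an unsigned, a signed and a float column.
df = pd.DataFrame(
    {
        "u": np.array([1, 2, 3], dtype="uint32"),
        "i": np.array([1, 2, 3], dtype="int64"),
        "f": np.array([0.5, 1.5, 2.5], dtype="float64"),
    }
)

# Python's int/float only match signed integer and float columns.
by_python_types = df.select_dtypes(include=[int, float])

# np.number matches every NumPy numeric dtype, including unsigned ones.
by_np_number = df.select_dtypes(include=[np.number])

print(list(by_python_types.columns))  # ['i', 'f']
print(list(by_np_number.columns))     # ['u', 'i', 'f']
```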
> include=[np.number] is not qualified too
What if you include np.number and exclude timedelta? Will the behaviour remain unchanged, and will all tests pass?
> which test file (test_hist_method.py/test_frame.py/test_datetimelike.py) is better to put new tests
It should be test_hist_method.
> what if include np.number and exclude timedelta? will the behaviour remain unchanged and all tests can pass?
It works, I'll change to that.
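A sketch of the selection agreed on here: include `np.number` plus the two datetime flavours, and exclude timedelta. The frame and its column names are hypothetical, chosen so that each dtype discussed in this thread appears once.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2020-03-01", "2020-03-02", "2020-03-03"])

# Hypothetical frame covering the dtypes discussed in this thread.
df = pd.DataFrame(
    {
        "a": [0.1, 0.2, 0.3],                       # float
        "u": np.array([1, 2, 3], dtype="uint32"),   # unsigned integer
        "b": pd.to_timedelta([1, 2, 3], unit="D"),  # timedelta
        "c": dates,                                 # naive datetime
        "d": dates.tz_localize("UTC"),              # tz-aware datetime
        "s": ["x", "y", "z"],                       # strings
    }
)

# Numeric and datetime columns are kept; timedelta is dropped explicitly.
selected = df.select_dtypes(
    include=[np.number, "datetime64", "datetimetz"], exclude=["timedelta"]
)

print(list(selected.columns))  # ['a', 'u', 'c', 'd']
```

Doing everything in one `select_dtypes` call keeps the original column order, so no concat or reordering is needed afterwards.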
@charlesdong1991 @jreback Do you have time to review? Thanks!
thanks @onshek
Sorry I wasn't very active recently due to personal issues.
I just took a look at your change and left a couple of minor suggestions, but overall this PR looks very good to me!
BTW please rebase and resolve conflicts
@charlesdong1991 thanks for your suggestions, let me know if there's still anything to improve :)
very nice
Not sure if this is a bit too much: I see you use exactly the same approach (to_datetime(np.random.randint(blabla))) to create Series several times in the tests. If those are all necessary (one or two to prove it works for datetime and datetimetz would be enough for me, but it's okay if you want to keep them all), maybe use a fixture/method/constant for this (e.g. to_datetime(np.random.randint(blabla))) and assign it in the different tests (df[col] = constant/method/fixture), so that we avoid repeating the same thing?
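One way to read this suggestion is a small shared helper, sketched below. The name `make_random_datetimes` and the seed handling are hypothetical and not taken from the PR; the bounds are the nanosecond values used in the example earlier in this thread. In the test suite the helper could equally be wrapped in a pytest fixture.

```python
import numpy as np
import pandas as pd

# Bounds in nanoseconds since the epoch, reused from the earlier example.
LOW_NS = 1582800000000000000
HIGH_NS = 1583500000000000000


def make_random_datetimes(n=10, seed=0):
    """Return a Series of n random naive datetimes within the bounds."""
    rng = np.random.default_rng(seed)
    ints = rng.integers(LOW_NS, HIGH_NS, size=n, dtype=np.int64)
    return pd.Series(pd.to_datetime(ints))


# Each test can then build its columns from the same helper.
df = pd.DataFrame({"a": np.arange(10, dtype="float64")})
df["b"] = make_random_datetimes()
df["c"] = make_random_datetimes(seed=1).dt.tz_localize("UTC")
```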
@jreback Can you take some time to review? Thanks!
thanks @onshek very nice!
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Update
WIP
Since there has been no response for several days in #32590, I decided to open this PR for further discussion.
There are two feasible options:
Plan A:
As shown in the files changed, data._get_numeric_data() is replaced with data._get_numeric_or_datetime_data(), and I will work on the related tests as the next step.
Plan B:
This may be a hotfix, following #32590 (comment):
Also, related tests will be modified.
Any comments or thoughts are welcome.