Refactor approach for making final bin right-edge inclusive #44

dougiesquire · 2021-04-10T07:44:17Z

Description

Our current approach for making the last bin right-edge inclusive (to align with np.histogram definitions) is a hack stolen from scikit-learn - we simply add a very small increment to the last bin edge prior to digitizing.

Among other problems, this means that xhistogram currently fails for non-float bins (for example, binning over datetime objects, see #25).

This PR implements an alternative, hopefully better, approach.

Type of change

Bug fix (non-breaking change which fixes an issue)

Testing

Added test_histogram_results_datetime() to test_core.py to check that xhistogram matches numpy.histogram with np.datetime64 data.

dougiesquire · 2021-04-10T08:12:35Z

You'll see that the test I added is failing. This is because np.digitize can't cast data from dtype('<M8[D]') to dtype('float64') (things are actually failing when checking the monoticity of the data within np.digitize).

There seem to be ways to get around this, but I'm looking for advice on best way to do this robustly. Should we only deal with np.datetime64 or try to include other datetime objects also? @WardBrian perhaps you have some suggestions?

I should point out, np.histogram can deal with datetime64 bins, but it uses searchsorted (which assumes sorted data) rather than digitize and sorts the data as a separate step (using np.sort). So an alternative is to refactor xhistogram to use searchsorted as well (though note there is a dask version of digitize but not searchsorted)...

codecov · 2021-05-02T04:06:03Z

Codecov Report

Merging #44 (4fdd716) into master (8a6765a) will increase coverage by 11.90%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           master      #44       +/-   ##
===========================================
+ Coverage   84.08%   95.98%   +11.90%     
===========================================
  Files           2        2               
  Lines         245      249        +4     
  Branches       74       71        -3     
===========================================
+ Hits          206      239       +33     
+ Misses         34        7       -27     
+ Partials        5        3        -2

Impacted Files	Coverage Δ
xhistogram/core.py	`97.40% <100.00%> (+15.68%)`	⬆️
xhistogram/xarray.py	`91.07% <0.00%> (-0.46%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8a6765a...4fdd716. Read the comment docs.

dougiesquire · 2021-05-02T04:13:40Z

I think that because of our reliance on digitize, it's going to be difficult for us to generally support a range of bin data types. Instead, we might have to expect the user to cast their data to a type that digitize can in turn cast to float64.

Despite this, I think the approach in this PR for including the right edge of the right bin is still better than the old approach and is maybe still worth merging. Thoughts @rabernat?

rabernat · 2021-06-15T02:17:07Z

Hi Dougie, sorry it has taken so long for me to review this. I am a bit under water right now. Maybe @TomNicholas could provide a review?

rabernat

This is simple and looks fine to me. But we should have a test to cover the change.

dougiesquire · 2021-06-15T03:21:06Z

Thanks @rabernat, and all good - feeling pretty under water here too.

This PR was motivated by #25, so the test I originally wrote was to compute the histogram of datetime objects. I've added this test back in, but you'll see it's failing. This is because np.digitize internally checks for monotonicity and this function tries to cast everything to float64. I see two options:

Someone who understands casting dtypes better than me suggests a robust way for us to handle datetime objects,
Refactor _bincount_2d_vectorized to use np.sort + np.searchsorted instead of np.digitize. This is what is done in np.histogram, which can handle datetime objects.

cc @TomNicholas

TomNicholas · 2021-06-15T15:54:36Z

Someone who understands casting dtypes better than me suggests a robust way for us to handle datetime objects,

I have no idea if that's possible or not, but @spencerkclark has done lots with datetimes and might be able to tell us before we put it in xarray?

Refactor _bincount_2d_vectorized to use np.sort + np.searchsorted instead of np.digitize

I think this would be preferable really, and maybe not too difficult if we can copy np.histogram? I've added it to the to-do list on my xarray PR anyway.

(though note there is a dask version of digitize but not searchsorted)

With the blockwise method of dask-ifying we don't need there to be a dask.array.searchsorted - bincount_2d_vectorized will only ever see a single chunk.

Illviljan · 2021-06-15T19:10:53Z

FYI, dask has added searchsorted recently dask/dask#7696.

dougiesquire · 2021-06-15T20:54:04Z

(though note there is a dask version of digitize but not searchsorted)

With the blockwise method of dask-ifying we don't need there to be a dask.array.searchsorted - bincount_2d_vectorized will only ever see a single chunk.

Yup, this comment was from before the blockwise refactor. I agree using searchsorted should be pretty straightforward now.

spencerkclark · 2021-06-15T21:36:06Z

I have no idea if that's possible or not, but @spencerkclark has done lots with datetimes and might be able to tell us before we put it in xarray?

We have fairly well-tested ways of converting datetimes to integers or floats in xarray, which you could use to work around this issue; however, it's a little unfortunate that it is required in this case. Checking for monotonicity -- and I think the digitize operation in general -- should not need to depend on converting to floats. You could consider posting an issue in the NumPy repo about this, but if you can work around it by using searchsorted instead, that would be much cleaner!

dougiesquire · 2021-06-15T21:40:24Z

You could consider posting an issue in the NumPy repo about this, but if you can work around it by using searchsorted instead, that would be much cleaner!

Great, thanks for the advice @spencerkclark !

xhistogram/core.py

dougiesquire · 2021-06-20T23:17:30Z

I think this is ready for review. @TomNicholas, would you mind taking a quick look?

xhistogram/core.py

refactor approach for last bin RH inclusive

c877978

dougiesquire marked this pull request as draft April 10, 2021 07:44

remove test with datetime objects

9cc290d

dougiesquire marked this pull request as ready for review May 2, 2021 04:13

dougiesquire requested a review from rabernat May 2, 2021 04:13

dougiesquire added 2 commits June 15, 2021 11:40

merge with master

8f81aa3

pre-commit

3b9ca32

rabernat reviewed Jun 15, 2021

View reviewed changes

add back test with datetime objects

b4ecbd3

TomNicholas mentioned this pull request Jun 15, 2021

Multidimensional histogram pydata/xarray#5400

Open

18 tasks

switch to searchsorted

67d32f4

dougiesquire commented Jun 16, 2021

View reviewed changes

xhistogram/core.py Show resolved Hide resolved

rabernat mentioned this pull request Jun 21, 2021

Release v0.3? #64

Closed

TomNicholas approved these changes Jun 21, 2021

View reviewed changes

xhistogram/core.py Show resolved Hide resolved

add note about sorted bins

4fdd716

dougiesquire merged commit 5c12e17 into xgcm:master Jun 22, 2021

dougiesquire deleted the refactor_bin_alignment_with_np branch June 22, 2021 00:25

dougiesquire mentioned this pull request Jun 22, 2021

Support binning over datetime64 objects #25

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor approach for making final bin right-edge inclusive #44

Refactor approach for making final bin right-edge inclusive #44

dougiesquire commented Apr 10, 2021

dougiesquire commented Apr 10, 2021 •

edited

Loading

codecov bot commented May 2, 2021 •

edited

Loading

dougiesquire commented May 2, 2021

rabernat commented Jun 15, 2021

rabernat left a comment

dougiesquire commented Jun 15, 2021

TomNicholas commented Jun 15, 2021 •

edited

Loading

Illviljan commented Jun 15, 2021

dougiesquire commented Jun 15, 2021 •

edited

Loading

spencerkclark commented Jun 15, 2021

dougiesquire commented Jun 15, 2021

dougiesquire commented Jun 20, 2021

Refactor approach for making final bin right-edge inclusive #44

Refactor approach for making final bin right-edge inclusive #44

Conversation

dougiesquire commented Apr 10, 2021

Description

Type of change

Testing

dougiesquire commented Apr 10, 2021 • edited Loading

codecov bot commented May 2, 2021 • edited Loading

Codecov Report

dougiesquire commented May 2, 2021

rabernat commented Jun 15, 2021

rabernat left a comment

Choose a reason for hiding this comment

dougiesquire commented Jun 15, 2021

TomNicholas commented Jun 15, 2021 • edited Loading

Illviljan commented Jun 15, 2021

dougiesquire commented Jun 15, 2021 • edited Loading

spencerkclark commented Jun 15, 2021

dougiesquire commented Jun 15, 2021

dougiesquire commented Jun 20, 2021

dougiesquire commented Apr 10, 2021 •

edited

Loading

codecov bot commented May 2, 2021 •

edited

Loading

TomNicholas commented Jun 15, 2021 •

edited

Loading

dougiesquire commented Jun 15, 2021 •

edited

Loading