Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor approach for making final bin right-edge inclusive #44

Merged
merged 7 commits into from
Jun 22, 2021

Conversation

dougiesquire
Copy link
Contributor

Description

Our current approach for making the last bin right-edge inclusive (to align with np.histogram definitions) is a hack stolen from scikit-learn - we simply add a very small increment to the last bin edge prior to digitizing.

Among other problems, this means that xhistogram currently fails for non-float bins (for example, binning over datetime objects, see #25).

This PR implements an alternative, hopefully better, approach.

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Testing

  • Added test_histogram_results_datetime() to test_core.py to check that xhistogram matches numpy.histogram with np.datetime64 data.

@dougiesquire dougiesquire marked this pull request as draft April 10, 2021 07:44
@dougiesquire
Copy link
Contributor Author

dougiesquire commented Apr 10, 2021

You'll see that the test I added is failing. This is because np.digitize can't cast data from dtype('<M8[D]') to dtype('float64') (things are actually failing when checking the monoticity of the data within np.digitize).

There seem to be ways to get around this, but I'm looking for advice on best way to do this robustly. Should we only deal with np.datetime64 or try to include other datetime objects also? @WardBrian perhaps you have some suggestions?

I should point out, np.histogram can deal with datetime64 bins, but it uses searchsorted (which assumes sorted data) rather than digitize and sorts the data as a separate step (using np.sort). So an alternative is to refactor xhistogram to use searchsorted as well (though note there is a dask version of digitize but not searchsorted)...

@codecov
Copy link

codecov bot commented May 2, 2021

Codecov Report

Merging #44 (4fdd716) into master (8a6765a) will increase coverage by 11.90%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master      #44       +/-   ##
===========================================
+ Coverage   84.08%   95.98%   +11.90%     
===========================================
  Files           2        2               
  Lines         245      249        +4     
  Branches       74       71        -3     
===========================================
+ Hits          206      239       +33     
+ Misses         34        7       -27     
+ Partials        5        3        -2     
Impacted Files Coverage Δ
xhistogram/core.py 97.40% <100.00%> (+15.68%) ⬆️
xhistogram/xarray.py 91.07% <0.00%> (-0.46%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8a6765a...4fdd716. Read the comment docs.

@dougiesquire
Copy link
Contributor Author

I think that because of our reliance on digitize, it's going to be difficult for us to generally support a range of bin data types. Instead, we might have to expect the user to cast their data to a type that digitize can in turn cast to float64.

Despite this, I think the approach in this PR for including the right edge of the right bin is still better than the old approach and is maybe still worth merging. Thoughts @rabernat?

@dougiesquire dougiesquire marked this pull request as ready for review May 2, 2021 04:13
@dougiesquire dougiesquire requested a review from rabernat May 2, 2021 04:13
@rabernat
Copy link
Contributor

Hi Dougie, sorry it has taken so long for me to review this. I am a bit under water right now. Maybe @TomNicholas could provide a review?

Copy link
Contributor

@rabernat rabernat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is simple and looks fine to me. But we should have a test to cover the change.

@dougiesquire
Copy link
Contributor Author

Thanks @rabernat, and all good - feeling pretty under water here too.

This PR was motivated by #25, so the test I originally wrote was to compute the histogram of datetime objects. I've added this test back in, but you'll see it's failing. This is because np.digitize internally checks for monotonicity and this function tries to cast everything to float64. I see two options:

  • Someone who understands casting dtypes better than me suggests a robust way for us to handle datetime objects,
  • Refactor _bincount_2d_vectorized to use np.sort + np.searchsorted instead of np.digitize. This is what is done in np.histogram, which can handle datetime objects.

cc @TomNicholas

@TomNicholas
Copy link
Member

TomNicholas commented Jun 15, 2021

Someone who understands casting dtypes better than me suggests a robust way for us to handle datetime objects,

I have no idea if that's possible or not, but @spencerkclark has done lots with datetimes and might be able to tell us before we put it in xarray?

Refactor _bincount_2d_vectorized to use np.sort + np.searchsorted instead of np.digitize

I think this would be preferable really, and maybe not too difficult if we can copy np.histogram? I've added it to the to-do list on my xarray PR anyway.

(though note there is a dask version of digitize but not searchsorted)

With the blockwise method of dask-ifying we don't need there to be a dask.array.searchsorted - bincount_2d_vectorized will only ever see a single chunk.

@Illviljan
Copy link

FYI, dask has added searchsorted recently dask/dask#7696.

@dougiesquire
Copy link
Contributor Author

dougiesquire commented Jun 15, 2021

(though note there is a dask version of digitize but not searchsorted)

With the blockwise method of dask-ifying we don't need there to be a dask.array.searchsorted - bincount_2d_vectorized will only ever see a single chunk.

Yup, this comment was from before the blockwise refactor. I agree using searchsorted should be pretty straightforward now.

@spencerkclark
Copy link

I have no idea if that's possible or not, but @spencerkclark has done lots with datetimes and might be able to tell us before we put it in xarray?

We have fairly well-tested ways of converting datetimes to integers or floats in xarray, which you could use to work around this issue; however, it's a little unfortunate that it is required in this case. Checking for monotonicity -- and I think the digitize operation in general -- should not need to depend on converting to floats. You could consider posting an issue in the NumPy repo about this, but if you can work around it by using searchsorted instead, that would be much cleaner!

@dougiesquire
Copy link
Contributor Author

You could consider posting an issue in the NumPy repo about this, but if you can work around it by using searchsorted instead, that would be much cleaner!

Great, thanks for the advice @spencerkclark !

@dougiesquire
Copy link
Contributor Author

I think this is ready for review. @TomNicholas, would you mind taking a quick look?

@rabernat rabernat mentioned this pull request Jun 21, 2021
@dougiesquire dougiesquire merged commit 5c12e17 into xgcm:master Jun 22, 2021
@dougiesquire dougiesquire deleted the refactor_bin_alignment_with_np branch June 22, 2021 00:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants