Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multidimensional histogram #5400

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented May 28, 2021

Initial work on integrating the multi-dimensional dask-powered histogram functionality from xhistogram into xarray. Just working on the skeleton to fit around the histogram algorithm for now, to be filled in later.

  • Closes Add histogram method #4610
  • API skeleton
  • Input checking
  • Internal blockwise algorithm from Refactor histogram to use blockwise xgcm/xhistogram#49
  • Redirect plot.hist
  • da.weighted().hist()
  • Tests added for results
  • Hypothesis tests for different chunking patterns
  • Examples in documentation
  • Examples in docstrings
  • Type hints (first time trying these so might be wrong)
  • Passes pre-commit run --all-files
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst
  • Range argument
  • Handle multidimensional bins (for a future PR? - See [WIP] Multidimensional bins xgcm/xhistogram#59)
  • Handle np.datetime64 dtypes by refactoring to use np.searchsorted (for a future PR? See discussion)
  • Fast path for uniform bin widths (for a future PR? See suggestion)

Question: da.hist() or da.histogram()?

@pep8speaks
Copy link

pep8speaks commented May 28, 2021

Hello @TomNicholas! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 4612:10: F821 undefined name 'DataArray'

Line 1990:1: W391 blank line at end of file

Comment last updated at 2021-06-10 21:28:54 UTC


Parameters
----------
darray : DataArray
Can have any number of dimensions.
Can have any number of dimensions, but will be reduced to 1 dimension
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you have plans for this already but will this handle 2d-histograms like the mean oxygen example in https://xhistogram.readthedocs.io/en/latest/tutorial.html#Averaging-a-variable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it will be able to calculate 2D (or N-D in general) histograms in xarray.hist(), it just won't be able to plot them in xarray.plot.hist(), because you need something like an xarray.plot.imshow() to plot the results of 2D histogram instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure whether you feel it is at all confusing to use the same word "hist" to refer both to the function that calculates the histogram and to the plotting function?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine. da.hist() vs. da.plot.hist() feels distinct enough for me.
I think I prefer the functions being called the proper name, da.histogram() and da.plot.histogram(), instead of the shorthand though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine. da.hist() vs. da.plot.hist() feels distinct enough for me.

Cool.

I think I prefer the functions being called the proper name

That would also be marginally easier for the current users of xhistogram, who have been using xhistogram.histogram so would only need to change the import.

However the two arguments I can see against that are:

  1. Deprecation cycle - I would like consistency between the calculation function and the plot function, but the plot function already exists and is so named because it wraps matplotlib.pyplot.hist. Changing that would require a deprecation cycle, but I think with the changes I'm proposing here we could get away without a deprecation cycle (because we're only adding optional arguments, not changing them).
  2. We have many shorthand names for high-level functions already - cov, corr, concat, it would not be unusually terse.

I don't really have a strong opinion though.

Copy link
Contributor

@Illviljan Illviljan Jun 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deprecation cycle wouldn't hinder your progress since I'm imagining simply just aliasing the method with a warning for a while.

numpy and dask also uses .histogram() for their histograms. matplotlib, pandas and hvplot uses .hist() for the plotting, holoviews uses .Histogram().
Maybe it's just easier to follow the upstream dependencies decisions then to avoid messing with the muscle memory (even though it's in my opinion "bad" muscle memory)?

And I'm not a fan of those shorthand names either, covid, corrosion, interpret? But I'm fine with just following what upstream are doing, there's something satisfying when you can just jump between the different data structures and everything just works.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You make very good points! I think I agree with you now.

covid, corrosion, interpret

🤣

@TomNicholas
Copy link
Member Author

It appears numpy has multiple discussions (here and here) about adding axis-aware histograms, and an axis argument was almost added to np.bincount in this PR. Cross-referencing here because if that ever does get implemented in numpy it would greatly simplify (and speed up) the implementation here as we wouldn't have to use the reshape approach from xhistogram.

@Illviljan
Copy link
Contributor

Not sure if it's any help but dask supports the numpy histogram functions now: dask/dask#7827

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add histogram method
4 participants