
NEP18 trouble when pint is being wrapped #878

Open
crusaderky opened this issue Sep 12, 2019 · 9 comments

@crusaderky (Contributor)

FYI @shoyer @hameerabbasi @keewis
numpy 1.17, xarray/dask/sparse/pint git tip

NEP18 doesn't seem to work correctly in several cases.
I'm still in the process of investigating what causes the issue(s).

Works:

  • pint wraps around sparse
  • pint wraps around dask.array
  • xarray wraps around pint
  • xarray wraps around sparse
  • dask.array wraps around sparse
  • xarray wraps around dask.array which wraps around sparse

Broken:

  • [1] dask.array wraps around pint, and there are 2+ chunks
  • [2] xarray wraps around pint which wraps around dask
  • [3] xarray wraps around pint which wraps around sparse

[1] dask.array wraps around pint, and there are 2+ chunks
At first sight, the legitimacy of this use case is arguable, as it feels much cleaner to always have pint wrapping around dask.array (and it saves a few headaches when dask.distributed and custom UnitRegistries get involved, too, as you never need to pickle your Quantities).

However, the problems of pint->dask and the benefits of dask->pint become clear when one wraps a pint+dask object in xarray.
There, with pint around dask, one would need to write special-case handling for pretty much every piece of xarray logic that today has special-case handling for dask (which is a lot). With dask around pint, on the other hand, I would expect everything to work out of the box as long as NEP18 compliance is respected by all libraries.
@shoyer I'd like to hear your opinion on this...

>>> import dask.array as da
>>> import pint
>>> ureg = pint.UnitRegistry()
>>> q = ureg.Quantity([1, 2], "kg")
>>> da.from_array(q).compute() # single chunk works
<Quantity([1 2], 'kilogram')>
>>> da.from_array(q, chunks=1).compute()  # With 2+ chunks, something's calling Quantity.__array__
array([1, 2])

[2] xarray wraps around pint which wraps around dask
Following the reasoning of [1], this should happen only when a user manually builds the data, as opposed to calling xarray.Dataset.chunk() - which should be rare-ish. I'm tempted to write a single piece of logic in xarray.Variable.data.setter that detects the special pint->dask case and turns it around to dask->pint.

>>> import dask.array as da
>>> import pint
>>> import xarray
>>> ureg = pint.UnitRegistry()
>>> q = ureg.Quantity(da.from_array([1, 2]), "kg")
>>> q
<Quantity(dask.array<array, shape=(2,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>, 'kilogram')>
>>> xarray.DataArray(q) # Something is calling da.Array.__array__, which computes it
<xarray.DataArray 'array-de932becc43e72c010bc91ffefe42af1' (dim_0: 2)>
<Quantity([1 2], 'kilogram')>
Dimensions without coordinates: dim_0
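For illustration, the flip described in [2] could look roughly like the following (a hypothetical sketch with made-up names; registry subtleties are glossed over, and this is not actual xarray code):

import dask.array as da
import pint

def flip_pint_dask(data):
    # Hypothetical helper: if a Quantity wraps a dask array, rebuild it as
    # a dask array whose chunks are Quantities, so that xarray's existing
    # dask special-casing applies to the outer layer.
    if isinstance(data, pint.Quantity) and isinstance(data.magnitude, da.Array):
        units = data.units
        return data.magnitude.map_blocks(lambda block: block * units)
    return data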

[3] xarray wraps around pint which wraps around sparse
This looks to be the same as [2].

>>> import numpy as np
>>> import pint
>>> import sparse
>>> import xarray
>>> ureg = pint.UnitRegistry()
>>> q = ureg.Quantity(sparse.COO(np.array([1, 2])), "kg")
>>> q
<Quantity(<COO: shape=(2,), dtype=int64, nnz=2, fill_value=0>, 'kilogram')>
>>> xarray.DataArray(q)
RuntimeError: Cannot convert a sparse array to dense automatically. To manually densify, use the todense method.
@hameerabbasi

Most likely you're missing the asarray=False flag in Dask and something similar in xarray.
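For reference, that flag is the asarray keyword of dask.array.from_array, which tells Dask not to coerce each chunk with np.asarray. Applied to the example from [1] it would look like this (though, per the next comment, it does not fix this case):

>>> da.from_array(q, chunks=1, asarray=False).compute()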

@crusaderky (Contributor, Author)

@hameerabbasi Nope, I already tried that.

@crusaderky (Contributor, Author) commented Sep 12, 2019

(and even then, the default asarray=None on paper is supposed to work)

@jthielen (Contributor) commented Sep 12, 2019

> Works:
>
>   • pint wraps around sparse
>   • pint wraps around dask.array

I'm glad to hear that at least pint wrapping sparse and pint wrapping dask.array work, since there are no tests for this (one of the main reasons I created #845). As a part of your investigations, do you happen to have tests for these already written? I was planning on writing some myself if I got the chance after the __array_function__ implementation is settled, but I might as well save the effort if you already have some ready!
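In case it helps as a starting point, a minimal sketch of such tests could look like this (hypothetical test names and assertions, not taken from pint's test suite):

import dask.array as da
import numpy as np
import pint
import sparse

ureg = pint.UnitRegistry()

def test_pint_wraps_sparse():
    # A Quantity wrapping a sparse.COO should stay sparse through arithmetic
    q = ureg.Quantity(sparse.COO(np.array([1, 2])), "kg")
    assert isinstance((q + q).magnitude, sparse.COO)

def test_pint_wraps_dask():
    # A Quantity wrapping a dask array should stay lazy through arithmetic
    q = ureg.Quantity(da.from_array([1, 2]), "kg")
    assert isinstance((q + q).magnitude, da.Array)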

> [1] dask.array wraps around pint, and there are 2+ chunks
> At first sight, the legitimacy of this use case is arguable, as it feels much cleaner to always have pint wrapping around dask.array (and it saves a few headaches when dask.distributed and custom UnitRegistries get involved, too, as you never need to pickle your Quantities).
>
> However, the problems of pint->dask and the benefits of dask->pint become clear when one wraps a pint+dask object in xarray.
> There, with pint around dask, one would need to write special-case handling for pretty much every piece of xarray logic that today has special-case handling for dask (which is a lot). With dask around pint, on the other hand, I would expect everything to work out of the box as long as NEP18 compliance is respected by all libraries.

As I've only recently gotten into the details of NEP 18, and I'm by far less experienced with all the libraries' internals, I will definitely defer to the consensus of others on this (and would like to hear @shoyer's thoughts). However, I think allowing both dask.array wrapping pint and pint wrapping dask.array is a bad idea (xref #845 (comment)). It would make the type casting graph cyclic, which makes the type casting hierarchy ill-defined and the expected result of mixed-type operations ambiguous (xref pydata/xarray#525 (comment) and the following comments for some related discussion). It would also create big problems with non-commutativity and would complicate operations with scalars, among other issues.

Based on past conversations I've seen (primarily in pydata/xarray#525), pint->dask seems to be the preferred order to allow unit math to occur at "graph construction time" rather than "runtime" (borrowing @shoyer's terminology from pydata/xarray#525 (comment)). I'd argue for this order as well, since it is almost a requirement for exploratory analysis of large datasets using unit-aware calculations (I'd want to keep track of units through intermediate steps of calculations, rather than just in the final computation).

With this in mind, I think the larger task at hand is cleaning up xarray internals to allow xarray > pint > dask.array to work as expected, since, as you pointed out, this is currently a problem area. So, instead of fixing [2] by flipping it around to [1], I would think [2] should be the target use case, and perhaps [1] should be flipped around or prohibited?

> [2] xarray wraps around pint which wraps around dask
> Following the reasoning of [1], this should happen only when a user manually builds the data, as opposed to calling xarray.Dataset.chunk() - which should be rare-ish. I'm tempted to write a single piece of logic in xarray.Variable.data.setter that detects the special pint->dask case and turns it around to dask->pint.
>
> [3] xarray wraps around pint which wraps around sparse
> This looks to be the same as [2].

I suspect these might be problems with pint, since I can't shake the feeling that the current "accidental" support of dask.array and sparse in pint is error-prone. Perhaps a thorough set of tests could catch whether some conversion to ndarray is occurring internally in pint during whatever operations xarray uses at construction time.

@crusaderky (Contributor, Author)

@jthielen No, I don't have unit tests; I just did some extremely brief manual experimentation. I agree that proper automated test suites are in order.

@hgrecco (Owner) commented Sep 12, 2019

Thanks for such a detailed discussion. This is really useful. I would like to suggest three lines of organization:

  1. Some of this and Insuring interoperability of pint with other (non-unit) array-like types #845 should go into the documentation, in a new section. It would be very valuable for the community.
  2. Inspired by some work with Rust, some time ago I created the repo https://github.com/hgrecco/pint_downstream_tester as a way to test other packages that depend on pint and might break when pint changes. I think it is worth enriching with other projects, not only because it helps pint avoid breaking downstream projects when it changes, but also because it helps find bugs in real-world applications.
  3. We need a decision about Split quantity (No __array_function__) #875 and Split Quantity into scalar and sequence classes #764 to move pint forward.

@shoyer commented Sep 12, 2019

Perhaps it would be helpful to test things with a custom dask scheduler, to see what the culprit operation is?

e.g., based on https://stackoverflow.com/questions/53289286/determine-how-many-times-dask-computed-something:

import dask

def my_scheduler(dsk, keys, **kwargs):
    raise RuntimeError('should not compute!')

with dask.config.set(scheduler=my_scheduler):
    ...
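Applied to case [2], for example, this would look something like the following (a sketch; any accidental compute during DataArray construction would then raise):

import dask
import dask.array as da
import pint
import xarray

ureg = pint.UnitRegistry()
q = ureg.Quantity(da.from_array([1, 2]), "kg")

def my_scheduler(dsk, keys, **kwargs):
    raise RuntimeError('should not compute!')

with dask.config.set(scheduler=my_scheduler):
    xarray.DataArray(q)  # raises if anything triggers a compute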

@shoyer commented Sep 12, 2019

Or actually, I guess we need something wrapping pint's Quantity. You could experiment by raising an error inside pint's __array__ method.
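One way to try that without editing pint's source is a quick monkeypatch (a sketch only; pint builds Quantity classes per registry, hence patching type(q) rather than a shared base class):

import pint

ureg = pint.UnitRegistry()
q = ureg.Quantity([1, 2], "kg")

def _raising_array(self, *args, **kwargs):
    # Hypothetical probe: make any densification attempt loud
    raise RuntimeError("Quantity.__array__ was called!")

type(q).__array__ = _raising_array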

@jthielen (Contributor) commented Sep 12, 2019

> Or actually, I guess we need something wrapping pint's Quantity. You could experiment by raising an error inside pint's __array__ method.

Following this lead, I took another quick look: pint doesn't have an explicit __array__ method right now; rather, it relies on attributes/methods starting with __array_ being caught by __getattr__. I would guess that this is where the conversion to ndarray occurs, by way of _to_magnitude:

https://github.com/andrewgsavage/pint/blob/2407e97b21dac78b0e5fc03640f37a2966aa1af1/pint/quantity.py#L1578-L1601

To hack together a possible workaround, I added an explicit __array__ method that raises an error, and altered the conditional in _to_magnitude to not call np.asarray on the value when it has an __array_function__ attribute. Running the examples from [1], [2], and [3] with these changes, I received the following output (which I believe is what was expected):

[1]

<Quantity([1 2], 'kilogram')>

[2]

<xarray.DataArray 'array-de932becc43e72c010bc91ffefe42af1' (dim_0: 2)>
<Quantity(dask.array<array, shape=(2,), dtype=int64, chunksize=(2,)>, 'kilogram')>
Dimensions without coordinates: dim_0

[3]

<xarray.DataArray (dim_0: 2)>
<Quantity(<COO: shape=(2,), dtype=int64, nnz=2, fill_value=0>, 'kilogram')>
Dimensions without coordinates: dim_0

Also, no error was raised from a call to __array__, so the coercion seems to have happened with some other __array_* attribute/method.

Overall, I think this points to _to_magnitude and its usage needing a reimplementation with NEP 18 in mind (to differentiate between forcing to ndarray and forcing to ndarray-like)?
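For illustration, the duck-array-aware conditional described above might look roughly like this (a sketch of the idea only, not pint's actual _to_magnitude, which also handles lists, strings, and other input types):

import numpy as np

def _to_magnitude(value, force_ndarray=False):
    # NEP-18 duck arrays (dask.array.Array, sparse.COO, ...) advertise
    # __array_function__; leave them alone instead of densifying them.
    if hasattr(value, "__array_function__"):
        return value
    if force_ndarray:
        return np.asarray(value)
    return value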

bors bot added a commit that referenced this issue Dec 11, 2019
905: NEP-18 Compatibility r=hgrecco a=jthielen

Building on the implementation of `__array_function__` in #764, this PR adds NEP-18 compatibility to Pint (mostly in the sense of Quantity having `__array_function__` and being wrappable as a duck array; for Quantity wrapping other duck arrays, see #845). Many tests are added covering NumPy functions used with Pint Quantities by way of `__array_function__`.

Accompanying changes that were needed as a part of this implementation include:
- a complete refactor of `__array_ufunc__` and ufunc attribute fallbacks to work in parallel with `__array_function__`
- promoting `_eq` in `quantity` to `eq` in `compat`
- preliminary handling of array-like compatibility by defining upcast types and attempting to wrap and defer to all others (a follow-up PR, or set of PRs, will be needed to completely address #845 / #878)

Closes #126 
Closes #396 
Closes #424
Closes #547 
Closes #553
Closes #617
Closes #619
Closes #682 
Closes #700
Closes #764
Closes #790
Closes #821

Co-authored-by: Jon Thielen <github@jont.cc>