NEP18 trouble when pint is being wrapped #878

crusaderky · 2019-09-12T09:05:24Z

FYI @shoyer @hameerabbasi @keewis
numpy 1.17, xarray/dask/sparse/pint git tip

NEP18 doesn't seem to work correctly in several cases.
I'm still in the process of investigating what causes the issue(s).

Works:

pint wraps around sparse
pint wraps around dask.array
xarray wraps around pint
xarray wraps around sparse
dask.array wraps around sparse
xarray wraps around dask.array which wraps around sparse

Broken:

[1] dask.array wraps around pint, and there are 2+ chunks
[2] xarray wraps around pint which wraps around dask
[3] xarray wraps around pint which wraps around sparse

[1] dask.array wraps around pint, and there are 2+ chunks
At first sight, the legitimacy of this use case is arguable, as it feels much cleaner to always have pint wrapping around dask.array (and it saves a few of headaches when dask.distributed and custom UnitRegistries get involved, too, as you never need to pickle your Quantities).

However, the problems of pint->dask and the benefits of dask->pint become clear when one wraps a pint+dask object in xarray.
There, with pint around dask, one would need to write special case handling for pretty much every piece of xarray logic that today has special case handling for dask - which is, a lot, whereas with dask around pint I would expect everything to work out of the box as long as NEP18 compliance is respected by all libraries.
@shoyer I'd like to hear your opinion on this...

>>> import dask.array as da
>>> import pint
>>> ureg = pint.UnitRegistry()
>>> q = ureg.Quantity([1, 2], "kg")
>>> da.from_array(q).compute() # single chunk works
<Quantity([1 2], 'kilogram')>
>> da.from_array(q, chunks=1).compute() . # With 2+ chunks, something's calling Quantity.__array__
array([1, 2])

[2] xarray wraps around pint which wraps around dask
Following the reasoning of [1], this should happen only when a user manually builds the data, as opposed to calling xarray.Dataset.chunk() - which should be rare-ish. I'm tempted to write a single piece of logic in xarray.Variable.data.setter that detects the special pint->dask case and turns it around to dask->pint.

>>> import dask.array as da
>>> import pint
>>> ureg = pint.UnitRegistry()
>>> q = ureg.Quantity(da.from_array([1, 2]), "kg")
>>> q
<Quantity(dask.array<array, shape=(2,), dtype=int64, chunksize=(2,), chunktype=numpy.ndarray>, 'kilogram')>
>>> xarray.DataArray(q) # Something is calling da.Array.__array__, which computes it
<xarray.DataArray 'array-de932becc43e72c010bc91ffefe42af1' (dim_0: 2)>
<Quantity([1 2], 'kilogram')>
Dimensions without coordinates: dim_0

[3] xarray wraps around pint which wraps around sparse
This looks to be the same as [2].

>>> import numpy as np
>>> import pint
>>> import sparse
>>> ureg = pint.UnitRegistry()
>>> q = ureg.Quantity(sparse.COO(np.array([1, 2])), "kg")
>>> q
<Quantity(<COO: shape=(2,), dtype=int64, nnz=2, fill_value=0>, 'kilogram')>
>>> xarray.DataArray(q)
RuntimeError: Cannot convert a sparse array to dense automatically. To manually densify, use the todense method.

The text was updated successfully, but these errors were encountered:

hameerabbasi · 2019-09-12T09:10:24Z

Most likely you're missing the asarray=False flag in Dask and something similar in xarray.

crusaderky · 2019-09-12T09:20:05Z

@hameerabbasi Nope, I already tried that.

crusaderky · 2019-09-12T09:20:57Z

(and even then, the default asarray=None on paper is supposed to work)

jthielen · 2019-09-12T10:46:02Z

Works:

pint wraps around sparse

pint wraps around dask.array

I'm glad to hear that at least pint wrapping sparse and pint wrapping dask.array work, since there are no tests for this (one of the main reasons I created #845). As a part of your investigations, do you happen to have tests for these already written? I was planning on doing so myself if I got the chance after the __array_function__ implementation is settled, but mind as well save the effort if you already have some ready!

[1] dask.array wraps around pint, and there are 2+ chunks
At first sight, the legitimacy of this use case is arguable, as it feels much cleaner to always have pint wrapping around dask.array (and it saves a few of headaches when dask.distributed and custom UnitRegistries get involved, too, as you never need to pickle your Quantities).

However, the problems of pint->dask and the benefits of dask->pint become clear when one wraps a pint+dask object in xarray.
There, with pint around dask, one would need to write special case handling for pretty much every piece of xarray logic that today has special case handling for dask - which is, a lot, whereas with dask around pint I would expect everything to work out of the box as long as NEP18 compliance is respected by all libraries.

As I've only recently gotten into the details of NEP 18 and I'm by far less experienced with all the libraries' internals, I will definitely defer to the consensus of others on this (and would like to hear @shoyer's thoughts). However, I think allowing dask.array wrapping pint and pint wrapping dask.array is a bad idea (xref #845 (comment)). This will make the type casting graph cyclic, which makes the type casting hierarchy ill-defined and the expected result of mixed-type operations ambiguous (xref pydata/xarray#525 (comment) and following comments for some discussion related to this). This would create big problems with non-commutativity and would complicate operations with scalars, among other issues.

Based on past conversations I've seen (primarily in pydata/xarray#525), pint->dask seems to be the preferred order to allow unit math to occur at "graph construction time" rather than "runtime" (borrowing @shoyer's terminology from pydata/xarray#525 (comment)). I'd argue for this order as well, since it is almost a requirement for exploratory analysis of large datasets using unit-aware calculations (I'd want to keep track of units through intermediate steps of calculations, rather than just in the final computation).

With this in mind, I think the larger task at hand is cleaning up xarray internals to allow xarray > pint > dask.array to work as expected, since as you pointed out this is currently a problem area. So, instead of fixing [2] by flipping around to [1], I would think [2] should be the target use case, and perhaps [1] should be flipped around or prohibited?

[2] xarray wraps around pint which wraps around dask
Following the reasoning of [1], this should happen only when a user manually builds the data, as opposed to calling xarray.Dataset.chunk() - which should be rare-ish. I'm tempted to write a single piece of logic in xarray.Variable.data.setter that detects the special pint->dask case and turns it around to dask->pint.

[3] xarray wraps around pint which wraps around sparse
This looks to be the same as [2].

I suspect these might be problems with pint, since I can't shake the feeling that the current "accidental" support of dask.array and sparse in pint is error-prone. Perhaps a thorough set of tests could catch if there is some conversion to ndarray occurring internally in pint with whatever operations xarray uses during construction.

crusaderky · 2019-09-12T11:14:25Z

@jthielen no I don't have unit tests; I just did an extremely brief manual experimentation. I agree that proper automated test suites are in order.

hgrecco · 2019-09-12T12:11:25Z

Thanks for such detailed discussion. This is really useful. I would like to suggest 3 organization lines:

Some of this and Insuring interoperability of pint with other (non-unit) array-like types #845 should be in the documentation in a new section. It would be very valuable for the community
Inspired by some work with Rust, some time ago I created this repo https://github.com/hgrecco/pint_downstream_tester as a way to test other packages that depend on pint and might be broken when pint changes. I think is worth enriching with other projects, not only because it helps pint not to break other projects upon change but it also helps finding bugs in real world applications.
We need a decision about Split quantity (No __array_function__) #875 and Split Quantity into scalar and sequence classes #764 to move pint forward.

shoyer · 2019-09-12T19:09:12Z

Perhaps it would be helpful to test things with a custom dask scheduler, to see what the culprit operation is?

e.g., based on https://stackoverflow.com/questions/53289286/determine-how-many-times-dask-computed-something:

def my_scheduler(dsk, keys, **kwargs):
    raise RuntimeError('should not compute!')

with dask.config.set(scheduler=my_scheduler):
    ...

shoyer · 2019-09-12T19:11:16Z

Or actually, I guess we need something wrapping pint's Quantity. I guess you could experiment by raising an error inside pint's __array__ method.

jthielen · 2019-09-12T22:21:53Z

Or actually, I guess we need something wrapping pint's Quantity. I guess you could experiment by raising an error inside pint's __array__ method.

Following this lead, I checked quick again and pint doesn't have an explicit __array__ method right now, rather, it relies on attributes/methods starting with __array_ being caught by __getattr__. I would guess that this is where the conversion to ndarray is occurring by way of _to_magnitude:

https://github.com/andrewgsavage/pint/blob/2407e97b21dac78b0e5fc03640f37a2966aa1af1/pint/quantity.py#L1578-L1601

To hack together a possible workaround, I added an explicit __array__ method that raises an error, and altered the conditional in _to_magnitude to not call np.asarray on the value when it has an __array_function__ attribute. Running the examples from [1], [2], and [3] with these changes I received the following output (which I believe is what was expected):

[1]

<Quantity([1 2], 'kilogram')>

[2]

<xarray.DataArray 'array-de932becc43e72c010bc91ffefe42af1' (dim_0: 2)>
<Quantity(dask.array<array, shape=(2,), dtype=int64, chunksize=(2,)>, 'kilogram')>
Dimensions without coordinates: dim_0

[3]

<xarray.DataArray (dim_0: 2)>
<Quantity(<COO: shape=(2,), dtype=int64, nnz=2, fill_value=0>, 'kilogram')>
Dimensions without coordinates: dim_0

Also, no error was raised from a call to __array__, so the coercion seems to have happened with some other __array_* attribute/method.

Overall, I think this points to _to_magnitude and its usage needing a reimplementation with NEP 18 in mind (to differentiate between forcing to ndarray and forcing to ndarray-like)?

905: NEP-18 Compatibility r=hgrecco a=jthielen Building off of the implementation of `__array_function__` in #764, this PR adds compatibility with NEP-18 in Pint (mostly in the sense of Quantity having `__array_function__` and being wrappable as a duck array; for Quantity wrapping other duck arrays, see #845). Many tests are added of NumPy functions being used with Pint Quantities by way of `__array_function__`. Accompanying changes that were needed as a part of this implementation include: - a complete refactor of `__array_ufunc__` and ufunc attribute fallbacks to work in parallel with `__array_function__` - promoting `_eq` in `quantity` to `eq` in `compat` - preliminary handling of array-like compatibility by defining upcast types and attempting to wrap and defer to all others (a follow-up PR, or set of PRs, will be needed to completely address #845 / #878) Closes #126 Closes #396 Closes #424 Closes #547 Closes #553 Closes #617 Closes #619 Closes #682 Closes #700 Closes #764 Closes #790 Closes #821 Co-authored-by: Jon Thielen <github@jont.cc>

jthielen mentioned this issue Sep 12, 2019

Add metpy hgrecco/pint_downstream_tester#1

Open

This was referenced Sep 15, 2019

support for units pydata/xarray#525

Open

Integration with Dask (add tests; implement the Dask collection interface on Quantity) #883

Closed

tests for arrays with units pydata/xarray#3238

Merged

This was referenced Oct 17, 2019

warnings.filterwarnings changes the behavior of np.asarray numpy/numpy#14735

Closed

setting warnings to errors changes the behaviour of np.asarray on quantities #892

Closed

jthielen mentioned this issue Oct 30, 2019

Split Quantity into scalar and sequence classes #764

Closed

jthielen mentioned this issue Nov 29, 2019

NEP-18 Compatibility #905

Merged

jthielen mentioned this issue Dec 10, 2019

Proper handling of __array_*__ attributes/methods #924

Closed

jthielen mentioned this issue Dec 11, 2019

Insuring interoperability of pint with other (non-unit) array-like types #845

Closed

jthielen mentioned this issue Dec 23, 2019

NEP-18 and python scalars #950

Closed

jthielen mentioned this issue Jul 9, 2020

What should define a "duck Dask Array"? dask/dask#6385

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

NEP18 trouble when pint is being wrapped #878

NEP18 trouble when pint is being wrapped #878

crusaderky commented Sep 12, 2019

hameerabbasi commented Sep 12, 2019

crusaderky commented Sep 12, 2019

crusaderky commented Sep 12, 2019 •

edited

Loading

jthielen commented Sep 12, 2019 •

edited

Loading

crusaderky commented Sep 12, 2019

hgrecco commented Sep 12, 2019

shoyer commented Sep 12, 2019

shoyer commented Sep 12, 2019

jthielen commented Sep 12, 2019 •

edited

Loading

NEP18 trouble when pint is being wrapped #878

NEP18 trouble when pint is being wrapped #878

Comments

crusaderky commented Sep 12, 2019

hameerabbasi commented Sep 12, 2019

crusaderky commented Sep 12, 2019

crusaderky commented Sep 12, 2019 • edited Loading

jthielen commented Sep 12, 2019 • edited Loading

crusaderky commented Sep 12, 2019

hgrecco commented Sep 12, 2019

shoyer commented Sep 12, 2019

shoyer commented Sep 12, 2019

jthielen commented Sep 12, 2019 • edited Loading

crusaderky commented Sep 12, 2019 •

edited

Loading

jthielen commented Sep 12, 2019 •

edited

Loading

jthielen commented Sep 12, 2019 •

edited

Loading