Shall we create a QuantityArray class? #1128

hgrecco · 2020-07-01T23:36:26Z

I have just released Pint-Pandas 0.1 So first, let me publicly thank @andrewgsavage, @znicholls and @sigvaldm for the awesome work you have done to code, discuss and push this package. Congratulations!

I have also released Pint 0.14 to support a few last minute changes in Pint-Pandas.

Pint-Pandas has exposed (one more time) an old discussion: Do we need a QuantityScalar and QuantityArray classes? This was something on the table at the beginning of Pint. The conclusion was to avoid it for simplicity and we found a very nice way to do it without sacrifizing functionality. From very early releases we were able to support a large number of NumPy functions without requiering numpy and to have a nice way to write array quantities ([1, 2, 3] * ureg.meter). We also made the choice to support but doe not require numpy. I still think this was the right call and we should try to stay on that path.

Pint-Pandas has exposed a some cases in which having a QuantityArray seems to be unavoidable to fully support Pandas functionality with Pint. The question is ... shall we do it? If yes, how to do it in a way which is transparent to the normal user?

PS.- Thanks again to the people behind Pint-Pandas.

The text was updated successfully, but these errors were encountered:

znicholls · 2020-07-02T00:05:04Z

I read through #875, #764 and #753 yesterday and found these super helpful background resources. Unfortunately they made things seem very complicated and I couldn't see an obvious way forward.

The big conflict I saw was that, on the one hand, numpy's separation of scalar and array classes is seen as a 'design mistake', whereas on the other hand, not having the separation is causing headaches with Pandas (particularly with how Pandas decides whether something is iterable or not).

At risk of poking the hornet's nest, perhaps the simplest solution is to enquire in Pandas about whether https://github.com/pandas-dev/pandas/blob/4743cc91719146167eefb43865413fc0b0aef1d4/pandas/_libs/lib.pyx#L992 could be updated so that it checks ndim of any classes which have such an attribute, rather than only checking ndim of numpy arrays? (Only checking if the object is actually a numpy array seems an odd choice to me which restricts the ability of Pandas to be extended...)

hgrecco · 2020-07-03T19:34:54Z

@znicholls Those discussions are indeed useful. I am pinging the people that were involved. @jthielen @crusaderky @shoyer @keewis

hgrecco · 2020-07-03T19:39:07Z

Please feel free to invite more. Anyone with contacts in the pandas community?

crusaderky · 2020-07-04T22:32:40Z

As discussed before, I am against splitting the class in two. I simply see no benefit.
Could you explain these cases with pandas that you believe can't fit in the current design?

znicholls · 2020-07-05T02:32:02Z

@andrewgsavage has done much more work on this than me, but maybe these thoughts are helpful nonetheless...

The big problem is with how pandas decides whether something is iterable or not. As @hgrecco shows here, pandas determines whether something is_list_like here (code copied below).

cdef inline bint c_is_list_like(object obj, bint allow_sets) except -1:
    return (
        isinstance(obj, abc.Iterable)
        # we do not count strings/unicode/bytes as list-like
        and not isinstance(obj, (str, bytes))
        # exclude zero-dimensional numpy arrays, effectively scalars
        and not (util.is_array(obj) and obj.ndim == 0)
        # exclude sets if allow_sets is False
        and not (allow_sets is False and isinstance(obj, abc.Set))
    )

At the moment, all Quantity instances have an __iter__ method. It is designed to raise a TypeError if the quantity isn't actually iterable (code here).

The problem is that the __iter__ method is defined, regardless of whether the Quantity instance is actually iterable or not. This is a problem because it means that, in all cases, isinstance(quantity_instance, abc.Iterable) returns True (even if quantity_instance isn't actually iterable and would return a TypeError as soon as one tried to iterate with it).

A numpy array suffers from the same issue which is where the not (util.is_array(obj) and obj.ndim == 0) in Panda's is_list_like is important, because it excludes scalars. The problem is that util.is_array(obj) returns False if obj is anything other than np.ndarray. As a result, we can't just add an ndim attribute to Quantity because the obj.ndim == 0 will never be evaluated anyway.

So, the options seem to be:

Split Quantity into scalar and array so that scalars don't have an __iter__ method and Pandas correctly recognises them as not being list-like.
See if the Pandas API can be updated in some way so that zero-dimensional, non-numpy arrays are also correctly recognised as not being iterable.

I'm happy to open an issues in Pandas if people think that's a good option (it certainly looks the easier path to me).

shoyer · 2020-07-05T02:41:49Z

I would definitely suggest bringing up this issue in the pandas GitHub issues.

…

On Sat, Jul 4, 2020 at 7:32 PM Zeb Nicholls ***@***.***> wrote: @andrewgsavage <https://github.com/andrewgsavage> has done much more work on this than me, but maybe these thoughts are helpful nonetheless... The big problem is with how pandas decides whether something is iterable or not. As @hgrecco <https://github.com/hgrecco> shows here <hgrecco/pint-pandas#33 (comment)>, pandas determines whether something is_list_like here <https://github.com/pandas-dev/pandas/blob/4743cc91719146167eefb43865413fc0b0aef1d4/pandas/_libs/lib.pyx#L992> (code copied below). cdef inline bint c_is_list_like(object obj, bint allow_sets) except -1: return ( isinstance(obj, abc.Iterable) # we do not count strings/unicode/bytes as list-like and not isinstance(obj, (str, bytes)) # exclude zero-dimensional numpy arrays, effectively scalars and not (util.is_array(obj) and obj.ndim == 0) # exclude sets if allow_sets is False and not (allow_sets is False and isinstance(obj, abc.Set)) ) At the moment, all Quantity instances have an __iter__ method. It is designed to raise a TypeError if the quantity isn't actually iterable (code here <https://github.com/hgrecco/pint/blob/4ab68ba0a00f96ed68fdf084233b61d68127c461/pint/quantity.py#L242> ). The problem is that the __iter__ method is defined, regardless of whether the Quantity instance is actually iterable or not. This is a problem because it means that, in all cases, isinstance(quantity_instance, abc.Iterable) returns True (even if quantity_instance isn't actually iterable and would return a TypeError as soon as one tried to iterate with it). A numpy array suffers from the same issue which is where the not (util.is_array(obj) and obj.ndim == 0) in Panda's is_list_like is important, because it excludes scalars. The problem is that util.is_array(obj) returns False if obj is anything other than np.ndarray. As a result, we can't just add an ndim attribute to Quantity because the obj.ndim == 0 will never be evaluated anyway. So, the options seem to be: 1. Split Quantity into scalar and array so that scalars don't have an __iter__ method and Pandas correctly recognises them as not being list-like. 2. See if the Pandas API can be updated in some way so that zero-dimensional, non-numpy arrays are also correctly recognised as not being iterable. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1128 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAJJFVSAGA32KFVLRQTVPFLRZ7Q23ANCNFSM4OOG6UDQ> .

znicholls · 2020-07-05T03:26:27Z

Thanks @shoyer. Whilst others haven't commented yet, for reference I think the most relevant Pandas issue is pandas-dev/pandas#24702 but it's not the same as the one we're having here

znicholls · 2020-07-05T05:32:13Z

I had a quick play to see what it would take to fix this. It looks like a simple change, but more experienced eyes would be very welcome! See draft PR here pandas-dev/pandas#35127

hgrecco · 2020-07-05T12:34:43Z

I just want to add that magic methods are always looked up in the class (not in the instance) and therefore it is not possible to add it or remove it depending on the type of _magnitude

andrewgsavage · 2020-07-05T16:33:59Z

I'm happy to open an issues in Pandas if people think that's a good option (it certainly looks the easier path to me).

Ya that seems the easiest path and is preferable to splitting classes up, go for it!

znicholls · 2020-07-05T23:15:31Z

Ok done in pandas-dev/pandas#35131 and pandas-dev/pandas#35127. I'm not sure if we need to ping anyone to get those moving?

znicholls · 2020-07-06T00:24:43Z

Looks like they're moving. @andrewgsavage we got a question about places where is_list_like is used if you have the time/energy to answer (I'm not familiar enough with the code base to easily know where they are).

crusaderky · 2020-07-06T14:26:15Z

We should remain aligned to the decision of the core numpy developers where they chose to use the same class for scalar and vectorized ndarrays.

Pandas should be changed to use numpy.iterable:

>>> class C:
...     def __iter__(self):
...         # Match behaviour of numpy.ndarray.__iter__
...         raise TypeError("iteration over scalar object")

>>> isinstance(C(), abc.Iterable)
True
>>> numpy.iterable(C())
False

znicholls · 2020-07-07T01:12:50Z

Having fiddled around with pandas-dev/pandas#35127 a bit more, and tried using that branch with pint-pandas, it's clear that we'd need to update the behaviour of pandas' is_list_like, is_scalar and item_from_zerodim to get things working. I'm not really sure how things are meant to all behave, but as I see it the options are:

update pandas' is_list_like, is_scalar and item_from_zerodim so that pint quantities would give the right behaviour (pros: don't have to change pint, cons: maybe the required updates would hurt pandas' performance too much given how much they rely on the C implementations of numpy? I don't know enough about these functions to comment on whether the changes would actually hurt performance or not)
identify the cases in pandas where less-strict versions of is_list_like, is_scalar and item_from_zerodim would be required for pint-pandas to pass all the tests (pros: don't have to change pint, shouldn't hurt pandas performance too much, cons: requires some effort/dedication to make it happen if pandas devs aren't keen)
update pint or pint-pandas so the quantities used in our pandas interface are sub-classes of numpy arrays (pros: don't have to update pandas, cons: I have no idea if this is even possible or desirable)

burnpanck · 2021-11-25T19:18:15Z

While there seem to be some arguments against a split Quantity, there might be one in favour of it, related to __array__ and it's warnings. IMHO, for a units package, stripping units inside __array__ is the worst possible choice, even when it's emitting a warning. The preferred behaviour however differs between the proposed QuantityScalar and QuantityArray: The former should not have __array__ at all, the latter should raise.

Why shouldn't we have __array__ on QuantityScalar? Because it prevents the reliable creation of dtype=object arrays of scalar quantities. Consider the following 4 cases, assuming a scalar quantity q=QuantityScalar(1.):

np.asarray(q): We expect array(QuantityScalar(1.), dtype=object). If __array__ raises, then the call fails, if __array__ succeeds, then we get array(1., dtype=float).
np.asarray([q]): We expect array([QuantityScalar(1.)], dtype=object). If __array__ raises, then the call fails, if __array__ succeeds, then numpy proceeds to call float(q), leading to array([1.], dtype=float) if float(q) succeeds too.
np.asarray(q,dtype=object): We expect array(QuantityScalar(1.), dtype=object). If __array__ raises, then the call fails, if __array__ succeeds, then we get array(1., dtype=object).
np.asarray([q],dtype=object): We expect array([QuantityScalar(1.)], dtype=object). If __array__ raises, then the call fails, if __array__ succeeds, then we do actually get the expected output!

Note in particular that a failure in __array__ causes np.asarray to fail, even though the result is in fact discarded (except for case 1) - I presume that numpy needs to checks the dimensionality of the result to determine the shape of the resulting array. Because of this behaviour, turning UnitStrippedWarning into errors will prevent all code from working that relies on object arrays of quantity scalars - something that is exceedingly common within (pint-)pandas.

To clarify, if QuantityScalar has no __array__ we get the expected output in all cases. For QuantityArray, raising is probably the correct behaviour.

Here is a bit of code to verify those findings above:

class MockArray:
    def __init__(self, values):
        self._values = values

    def __repr__(self):
        return f"{type(self).__name__}({self._values})"
    
class MockArrayStrip(MockArray):
    def __array__(self, t=None):
        return np.asarray(self._values)

class MockArrayRaise(MockArray):
    def __array__(self, t=None):
        raise TypeError("forbidden")


for mock in [
    MockArray(1.),
    MockArrayStrip(1.),
    MockArrayRaise(1.),
]:
    for orig in [mock, [mock]]:
        for dt in [None, "float", "object"]:
            try:
                out = np.asarray(orig,dtype=dt)
            except Exception as ex:
                out = str(ex)
            print(f"np.asarray({orig!r:21},dtype={dt!r:8}) -> {out!r}")

znicholls mentioned this issue Jul 5, 2020

ENH: Identify numpy-like zero-dimensional arrays as non-iterable pandas-dev/pandas#35131

Closed

crusaderky mentioned this issue Jul 9, 2020

Support for duck Dask Arrays pydata/xarray#4208

Closed

jthielen mentioned this issue Sep 18, 2020

Fix scalar dispatching, fixes #6631 and #6611 dask/dask#6632

Merged

4 tasks

hgrecco mentioned this issue Mar 23, 2021

How to determine if a quantity is a scalar or a list #1271

Closed

edmundsj mentioned this issue Mar 23, 2021

np.testing.assert_equal does not work with quantity lists #1272

Closed

jthielen mentioned this issue Jul 3, 2021

pint scalars have no shape, unlike numpy scalars #1342

Open

dcoukos mentioned this issue Aug 24, 2021

ENH: Change implementation of pandas.api.types.is_list_like to check for iterability in more pythonic way pandas-dev/pandas#43196

Closed

burnpanck mentioned this issue Dec 21, 2021

m_as with a non-unit quantity #1415

Open

burnpanck mentioned this issue Mar 15, 2023

Quantity class is instance of Iterable, but is actually not iterable #1719

Closed

burnpanck mentioned this issue May 10, 2023

importing pint_xarray breaks Quantity.__round__ xarray-contrib/pint-xarray#189

Closed

keewis mentioned this issue May 10, 2023

avoid setting force_ndarray_like = True xarray-contrib/pint-xarray#216

Open

This was referenced May 12, 2023

DataFrame reduction methods broken hgrecco/pint-pandas#174

Closed

Remove __array__ interface #1778

Closed

burnpanck mentioned this issue Apr 10, 2024

Adopt NEP-37 and NEP-47 (array module and Python Array API) #1592

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Shall we create a QuantityArray class? #1128

Shall we create a QuantityArray class? #1128

hgrecco commented Jul 1, 2020

znicholls commented Jul 2, 2020 •

edited

Loading

hgrecco commented Jul 3, 2020

hgrecco commented Jul 3, 2020

crusaderky commented Jul 4, 2020

znicholls commented Jul 5, 2020 •

edited

Loading

shoyer commented Jul 5, 2020 via email

znicholls commented Jul 5, 2020

znicholls commented Jul 5, 2020

hgrecco commented Jul 5, 2020

andrewgsavage commented Jul 5, 2020

znicholls commented Jul 5, 2020

znicholls commented Jul 6, 2020

crusaderky commented Jul 6, 2020

znicholls commented Jul 7, 2020 •

edited

Loading

burnpanck commented Nov 25, 2021 •

edited

Loading

Shall we create a QuantityArray class? #1128

Shall we create a QuantityArray class? #1128

Comments

hgrecco commented Jul 1, 2020

znicholls commented Jul 2, 2020 • edited Loading

hgrecco commented Jul 3, 2020

hgrecco commented Jul 3, 2020

crusaderky commented Jul 4, 2020

znicholls commented Jul 5, 2020 • edited Loading

shoyer commented Jul 5, 2020 via email

znicholls commented Jul 5, 2020

znicholls commented Jul 5, 2020

hgrecco commented Jul 5, 2020

andrewgsavage commented Jul 5, 2020

znicholls commented Jul 5, 2020

znicholls commented Jul 6, 2020

crusaderky commented Jul 6, 2020

znicholls commented Jul 7, 2020 • edited Loading

burnpanck commented Nov 25, 2021 • edited Loading

znicholls commented Jul 2, 2020 •

edited

Loading

znicholls commented Jul 5, 2020 •

edited

Loading

znicholls commented Jul 7, 2020 •

edited

Loading

burnpanck commented Nov 25, 2021 •

edited

Loading