cmping fails when attrib is a numpy array #435

Closed
meereeum opened this issue Aug 26, 2018 · 24 comments

@meereeum commented Aug 26, 2018

The generated cmp methods fail if the attrib is a numpy array (of size >1) with

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

this is because:

# equivalent to how the generated __eq__ compares fields:
bool(np.array([47, 47]) == np.array([47, 47]))    # ValueError
# vs:
(np.array([47, 47]) == np.array([47, 47])).all()  # True
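
For completeness, numpy also ships a shape-aware helper that sidesteps the ambiguity entirely (it comes up again later in this thread):

np.array_equal(np.array([47, 47]), np.array([47, 47]))  # True: same shape, same values
np.array_equal(np.array([47, 47]), np.array([47]))      # False: shapes differ, no ValueError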

I realize that this can be switched off with cmp=False, but often comparing these attribs is useful!

reproducible example:

import attr
import numpy as np

@attr.s(auto_attribs=True)
class A:
    arr: np.ndarray

foo = A(np.array([47,47]))
bar = A(np.array([47,47]))

foo == bar # ValueError

the simplest fix might be to trust user annotations about what is/isn't a numpy array and check those separately. or, maybe better, it would be awesome to be able to supply a custom cmp function, e.g.

@attr.s
class A:
    arr: np.ndarray = attr.ib(cmpf=lambda f,x,y: f(x,y).all())

where f is the standard cmp function (eq, lt, etc) and x,y are the items being compared.

@hynek (Member) commented Aug 29, 2018

Since this is the second time numpy equality has been brought up (ref #409), I think it’s best to keep it open. :)

As a workaround, you can totally write your own comparison methods by using @attr.s(cmp=False), but then you’ll have to implement all of them.
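
A minimal sketch of that workaround, assuming np.array_equal is the equality you want (class and attribute names are just for illustration):

import attr
import numpy as np

@attr.s(cmp=False)  # attrs generates no comparison methods at all
class A:
    arr = attr.ib(type=np.ndarray)

    def __eq__(self, other):
        if other.__class__ is not self.__class__:
            return NotImplemented
        return np.array_equal(self.arr, other.arr)

    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

A(np.array([47, 47])) == A(np.array([47, 47]))  # True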

We’ll have to come up with a nicer way!

hynek added the Feature label Aug 29, 2018
@jriddy commented Nov 18, 2018

I've found this wildly frustrating myself, but I think this is a problem with numpy breaking normal __eq__ semantics, and attrs actually shouldn't try and solve this. Normally the signature of __eq__ should match object.__eq__'s signature but numpy goes its own way:

# object
def __eq__(self, other: object) -> bool: ...

# numpy
def __eq__(self, other: np.ndarray) -> np.ndarray: ...

This is so you can get the equality product of two arrays easily. And every data scientist I talk to seems to think this is the most natural thing in the world. I deeply disagree with this design decision, but I think the reasoning behind this is clear: in numpy, everything operates on arrays, normal python language semantics be damned!

So in order to solve this issue in attrs, you'd need a flexible way of comparing equality on an item-per-item basis...except we already have a good way to do this, and it's the __eq__ methods, which numpy abuses to satisfy its matrix manipulation mania. That convention is what allows attrs to leverage tuple.__eq__ and dynamically write obvious code for equality. Making this overridable with some kind of alternative cmp_func attribute or some other solution like that would be overkill and would break the simplicity of the current __eq__ implementation for little gain.
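
To make the failure mode concrete, here is roughly (not literally) the shape of the generated code, and why an ndarray field breaks it:

import numpy as np

class Point:
    def __init__(self, a, b):
        self.a, self.b = a, b

    def __eq__(self, other):
        # roughly what the generated method does: a single tuple comparison
        return (self.a, self.b) == (other.a, other.b)

Point(1, np.array([1, 2])) == Point(1, np.array([1, 2]))
# tuple.__eq__ has to coerce the elementwise array result to bool -> ValueError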

But there are realistically a few ways of handling this:

  1. wrap your ndarray, then you can use it normally in other attrs classes:
import attr
import numpy as np

@attr.s(cmp=False, slots=True)
class Arr:
    arr = attr.ib(type=np.ndarray)
    def __eq__(self, other: object) -> bool:
        if type(self) is not type(other):
            return NotImplemented
        return (self.arr == other.arr).all()

@attr.s
class Bitmap:
    raw = attr.ib(type=Arr)
    colorspace = attr.ib(type=str)

Bitmap(Arr(np.array([1])), 'srgb') == Bitmap(Arr(np.array([1])), 'srgb')
  2. Use fields + metadata to generate a custom eq method
import operator

import attr
import numpy as np

@attr.s(cmp=False)
class CustomEq:
    label = attr.ib(type=str)
    tags = attr.ib(type=set)
    data = attr.ib(type=np.ndarray, metadata={'eq': np.array_equal})

    def __eq__(self, other):
        if type(self) is not type(other):
            return NotImplemented
        eq_spec = ((f.name, f.metadata.get('eq', operator.eq))
                   for f in attr.fields(type(self)))
        return all(eq(getattr(self, name), getattr(other, name))
                   for name, eq in eq_spec)

Example 2 is more like what I think attrs would have to do (or provide the option of doing) to make these things work in the more general case (maybe with a cmp_hooks=True argument to attr.s), but I think getting that actually right is trickier than that short example suggests. I don't think I've considered all the angles on it, and it's a lot of work for a specific edge case.

That said, calling numpy an edge case is a bit silly at this point, since the science wing of python is an enormous and important part of the community. But that's why I think option 1 makes the most sense. There have to be meaningful ways to drag their types into the normal python system, and light wrappers make a lot more sense than anything else I've come up with. In fact, you get the added benefit of a place to provide semantic information about things like numpy arrays, which are normally paraded around as naked data structures that reveal nothing of their intent.

I ran into many of these issues at my job and in writing+using zerial (sorry, no docs yet), where I'm trying an approach to serialization somewhere between related and cattrs. And I've found that numpy really often doesn't play well with the rest of python. But for an entire class of users, it is the very reason they use python and not something else. So we'll have to figure out a way to deal with it, and I think wrapping numpy arrays might be the way.

@hynek (Member) commented Nov 24, 2018

That said, calling numpy an edge case is a bit silly at this point, since the science wing of python is an enormous and important part of the community.

I’m painfully aware of that; I meant it’s an edge case in the sense of not being Pythonic. :|

You’ve kinda nailed how I’d hope people would approach this problem: composition and metadata. The resulting code is far too inefficient for core, though, and metadata is mostly for end users, so how should one approach that?

Some thoughts:

  1. The information about whether an attribute needs special treatment is known beforehand, so it can be taken into consideration while generating code.
  2. I guess we'd just add another block of special comparisons before doing the tuple comparison (rough sketch after this list)? That way nobody loses anything.
  3. So we mostly need to come up with a nice API to define equality-hooks, correct?
  4. What about lt/le etc?
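
A rough, purely hypothetical sketch of what point 2 could generate (hand-written here, not actual attrs output):

import attr
import numpy as np

@attr.s(cmp=False)
class Sample:
    label = attr.ib(type=str)
    data = attr.ib(type=np.ndarray)

    def __eq__(self, other):
        if other.__class__ is not self.__class__:
            return NotImplemented
        # block of special comparisons for hooked attributes...
        if not np.array_equal(self.data, other.data):
            return False
        # ...followed by the usual fast tuple comparison for the rest
        return (self.label,) == (other.label,)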

@jriddy commented Nov 24, 2018

Question: what is so much more efficient about the (self.a, self.b) == (other.a, other.b) approach? I assume you came to the conclusion that code generation was necessary because of some benchmark that showed it. Is doing dynamic attribute lookups with getattr what slows things down? I'm curious.

  2. I guess we'd just add another block of special comparisons before doing the tuple comparison? That way nobody loses anything.

I could come up with a PR doing that this weekend; it shouldn't be too hard. This use case affects my work a lot, so I could get some real-world feedback with numpy + attrs users pretty quickly.

  3. So we mostly need to come up with a nice API to define equality-hooks, correct?
  4. What about lt/le etc?

Item 4 complicates 3. If we never had to worry about doing this for ordering, then I would say make eq_fn an argument to attr.ib() and be done with it. Ordering gets more complicated because Python itself makes no assertions about the relationship between the ordering methods:

There are no other implied relationships among the comparison operators, for example, the truth of (x<y or x==y) does not imply x<=y.

I know in the numpy instance, although I want normal equality semantics, I don't really want orderability. The fact that numpy arrays can be stored row- or column-major means that we can't even use index-based priority for comparisons like we can with regular python lists, and even if we could, it doesn't make sense semantically. So in this case, we'd want to opt in to eq/ne but opt out of lt/gt/le/ge. I'll look and get back to you with a PR 😉

@jriddy commented Nov 24, 2018

Hmmm, after looking at this, I think the best place to do it would be to overload the type of the cmp attribute of Attribute from bool to a Union[bool, CmpSpec], where CmpSpec is a class that describes all this customization. The cmp attribute itself would have a converter to provide some kind of clean API for the end user, so that CmpSpec may not even have to be exposed publicly.

CmpSpec would have to be flexible enough to not only specify alternate comparison functions, but also indicate which comparisons this attribute takes part in, so, say numpy arrays could take part in equality but be neutral for lt/gt, so that they don't affect sort operations. Or float fields could implement a tolerance-based check for equality, but use standard strict lt/gt.
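
One hypothetical shape for such an object (the name and fields here are illustrative, not an attrs API):

from typing import Callable, Optional

import attr

@attr.s(frozen=True)
class CmpSpec:
    eq = attr.ib(type=Optional[Callable], default=None)  # e.g. np.array_equal; None means plain ==
    ord = attr.ib(type=bool, default=True)               # whether the field takes part in lt/le/gt/ge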

To keep roughly the same efficiency, much of this could be generated in attrs._make._make_cmp except for the most complex cases.

Overloading cmp seems preferable to me to adding another field because 1) cmp is used in very few places in attrs core (only in _make.py it seems), 2) the interactions of cmp and whatever other field would be complex (more complex than cmp and hash, for instance), and 3) it's probably easier to keep a clean, backwards-compatible API this way:

arr_compare = attr.CmpSpec(eq=np.array_equal, ord=False)

@attr.s
class DataStuff:
    name: str = attr.ib()
    arr: np.ndarray = attr.ib(cmp=arr_compare)

I'm working on it.

@hynek (Member) commented Dec 2, 2018

I think you're on to something with CmpSpec! I had this bug on my mind to switch to enums, but thinking about this particular problem, I basically came to the same conclusion: it would be nice to pass even more info.

This would allow us to finally support eq-only comparisons, which people have been requesting for a while.

I wonder how to make this fit into the bigger picture (hence my delay in response, sorry).

We've got ongoing work on the customization of reprs, but there we're aiming at Union[bool, Callable]. It would be good to take a step back and ask ourselves:

  • is *Spec a good name? I literally have no idea.
  • Does ReprSpec make any sense?
  • Does InitSpec make any sense?
  • Does HashSpec make any sense?

Those are our current feature switches, so we should strive for consistency.

@jriddy commented Dec 2, 2018

Yeah, I'm not sure exactly how it would work, but it would make it more flexible. And the CmpSpec class, or whatever it's called, wouldn't even have to be an attrs class itself, just duck-type compatible with the interface.

I don't know if Spec is a good name, but at least it's generalizable. The other name I came up with was Comparator 🙁, which I don't particularly like. I think you could use either an add-on word fragment like "Spec" or normal English agent-noun formation, with mixed results:

  • Comparer, Comparator
  • Representer, Reprator
  • Initializer, Initiator
  • Hasher, Hashator (ok that one's silly)

On names, I'd talk it over with other people.

But I do think it makes sense to use specification objects rather than small unions, because it gets harder and harder to manage the values as you have more variants and use them in more places. For example, with the repr, it would make sense to accept a Union[bool, Callable, ReprSpec] and then just convert everything uniformly to ReprSpec under the hood:

from typing import Callable

import attr

def default_repr_func(_inst, _attrib, value):
    return repr(value)

@attr.s
class ReprSpec:
    include_this = attr.ib(type=bool)
    repr_func = attr.ib(type=Callable[[object, object, object], str], default=default_repr_func)

def convert_to_repr_spec(val):
    if isinstance(val, ReprSpec):
        return val
    elif isinstance(val, bool):
        return ReprSpec(include_this=val)
    elif callable(val):
        return ReprSpec(include_this=True, repr_func=val)
    raise TypeError("expected a bool, a callable, or a ReprSpec")

But that would only be for making things uniform and allowing for easier future expansion. It almost feels like overkill in the case of reprs. The reason we need this for cmp is because "cmp" combines 2 or 6 concepts, depending on whether we only separate eq/ord or go into the entire rich comparison API.

In any case, much of this wouldn't need to be exposed to the average user, and could be wrapped around for common use cases in some kind of advanced API.

@jriddy commented Dec 2, 2018

At any rate, I would like to help out with this comparison implementation, but it may be a week or two before I'm able to follow up on that with an actual PR. Do you guys have a place where you're discussing these other expansions to repr-ability and so forth?

@wsanchez commented Dec 4, 2018

@jriddy: #212

@jamescasbon

I would like a solution to this. I created an evil and naive monkeypatch to try it out:

https://gist.github.com/jamescasbon/b0e1f2113a28e523ff3326d7b93eda19

It does clarify one issue in my mind: array equality is probably going to need policies to be chosen. Sometimes np.array_equal would do, but other times we might want np.allclose with a tolerance.
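
Both policies can be expressed as plain two-argument callables, which is presumably what any future hook would accept; a small sketch:

from functools import partial

import numpy as np

exact_eq = np.array_equal                     # exact, shape-aware equality
approx_eq = partial(np.allclose, atol=1e-8)   # tolerance-based equality

a = np.array([1.0, 2.0])
b = a + 1e-12
exact_eq(a, b)    # False: the values differ, however slightly
approx_eq(a, b)   # True: within tolerance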

@jriddy did you get anywhere with a PR?

@jriddy commented Mar 31, 2019

No, I've been busy. And in most cases this ends up being a less important use case for us. I've been getting by mostly with attribute- or class-level cmp=False for these things.

The place where it really matters to me is in testing, because we build serialization models around attr.s classes, so I have a whole suite of tests that assert that the serialized-and-deserialized object matches the original object. This is important to me, but I can build test helpers and fixtures to deal with this use case.

Otherwise...it just hasn't come up the way I thought it would. My research guys...well it never occurs to them that it's even weird that numpy arrays behave this way. Even the ones that are deep into python development (I've got a former matplotlib core dev on my team) think this is totally normal. To me, the type of all the cmp functions is like __eq__(self: object, other: object) -> bool, and that is invariant. But not for them. It almost doesn't make sense to them to define a default aggregate equality predicate because "that's really never what you want to do."

If you can help me clarify the use cases, I think I can work out a solution that could fit into the current implementation. So what are the policies we want for an attribute's cmp property?

@hynek (Member) commented Apr 6, 2019

As a small update, I think the final idea is brewing in my head, after also flying it by @glyph.

As much as I dislike subclassing, I think it would make sense to refactor the method generation into protocols. Something along the lines of:

from abc import ABCMeta, abstractmethod

class CmpMaker(metaclass=ABCMeta):
    @abstractmethod
    def make_eq(self, cls, attributes):
        pass

    @abstractmethod
    def make_ne(self, cls, attributes):
        pass

    # ...

and each returns a method for cls based on attributes. This would allow you to re-use your implementations, wrap existing ones etc. I suspect/hope that it's a more general solution to the problem. You can also easily disable comparison and just go for eq/ne.
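
To make that concrete, a hedged sketch of one such maker (assuming, purely for illustration, that attributes arrives as a list of field names; the real interface would surely differ):

import numpy as np

class NumpyAwareCmpMaker:
    def make_eq(self, cls, attributes):
        def __eq__(self, other):
            if other.__class__ is not self.__class__:
                return NotImplemented
            return all(
                np.array_equal(getattr(self, name), getattr(other, name))
                for name in attributes
            )
        return __eq__

    def make_ne(self, cls, attributes):
        eq = self.make_eq(cls, attributes)
        def __ne__(self, other):
            result = eq(self, other)
            return result if result is NotImplemented else not result
        return __ne__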

Am I missing something? I'm traveling rn, so I haven't gotten around to implementing a PoC yet.

@glyph (Contributor) commented Apr 7, 2019

@hynek Use Protocol rather than ABCMeta and then you don't need inheritance?
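
i.e. something along these lines (typing.Protocol needs Python 3.8+, or typing_extensions before that), so implementers never have to inherit from anything:

from typing import Protocol

class CmpMaker(Protocol):
    def make_eq(self, cls, attributes): ...
    def make_ne(self, cls, attributes): ...
    # ... plus the four ordering makers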

@jriddy commented Apr 9, 2019

@glyph It seems more likely that the pattern of re-use this would encourage is inheritance:

class EqOnlyMaker:
    @staticmethod
    def _notimpl(this, other):
        return NotImplemented

    def make_lt(self, cls, attributes):
        return self._notimpl

    make_le = make_gt = make_ge = make_lt

class MyEqMakerImpl(EqOnlyMaker):
    def make_eq(self, cls, attributes):
        ...

It seems that whenever you start introducing method "hooks" like this, inheritance becomes the dominant pattern. How have you guys avoided this with Twisted in recent years? The last time I used it, you had to use inheritance everywhere.

@glyph (Contributor) commented Apr 9, 2019

@jriddy I think that this might be a degenerate case of inheritance, where you have a protocol that has obvious default behavior for all of its methods and then you have a superclass that provides null implementations of all that behavior for convenience just to avoid repetition. The problem that Twisted has, which I agree is hard to avoid, is that people mistake the degenerate case for a more general one, and then start using inheritance multiple layers deep to implement different interfaces.

I'm not sure how to defend against this, since Twisted still has a decade-sized hole to dig itself out of and may never fully emerge.

@jriddy commented Apr 19, 2019

So the end result of this is a class that is created at definition time, and whose methods get called to return the six rich comparison functions, which get added to the class? I'm starting to see some sense to this. I think the only thing lacking is a way to selectively re-use the default implementation (which is clever and fast) for selected attributes, while still remaining picklable.

@botant (Contributor) commented Feb 7, 2020

I quite like @jriddy's suggestion of attr.CmpSpec, and it makes sense to me that the onus of providing suitable eq (and lt, gt, etc) is on the users, because they know what data the attribute holds.

Nothing against providing an out-of-the-box CmpSpec that can be used with numpy or pandas, of course, but without trying to guess for the user.

We could implement a protocol in the following way: CmpSpec is a simple object potentially containing the usual six comparison attributes, and they must be callables that take (self, other) and return a bool; if CmpSpec has only one function, we can add the negative of that for free; if it has two or more, we can fill in the blanks and implement what is missing. The user shouldn't even have to provide a CmpSpec class - anything that looks like it (i.e. has any of the six attributes) should be good enough.
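
The "negative for free" part could look roughly like this (a sketch, not actual PR code):

def derive_ne(eq):
    """Build a __ne__ from a user-supplied __eq__-style callable."""
    def __ne__(self, other):
        result = eq(self, other)
        return result if result is NotImplemented else not result
    return __ne__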

This seems easy enough to incorporate into _make_eq: instead of comparing a big tuple that contains all attributes, we can compare one tuple per attribute, or use the CmpSpec-provided eq callable. I don't know if that would be slower because we'd build many single tuples instead of one big tuple, but I can't imagine users being super sensitive to eq performance.

Disclaimer: I'd like very much to have a go at this, but I haven't worked with any project like this before, and I might underestimate the complexities of modifying a library with so many users.

@jriddy commented Feb 7, 2020

@botant I'd be willing to collaborate with you. I'm finally going to start to have time due to some changes in my work situation, so we can find a way to make this work. If you @-me in a WIP PR, I can help with integrating it into the existing code more seamlessly.

@botant (Contributor) commented Feb 7, 2020

That is awesome @jriddy, thanks! I'll get started on it.

@botant (Contributor) commented Feb 15, 2020

@hynek @jriddy I've created a draft PR; it is not finished yet, but there is enough in there to judge whether it is going in the right direction.

@jriddy commented Feb 16, 2020

@botant and I are working on this, but we really need some help with the name for this concept. Please see botant/attrs#2 if you have any ideas or suggestions.

hynek added a commit that referenced this issue Feb 22, 2021
* Updated implementation of comparison behaviour customization.

* Fixed version of next release, updates newsfragment and documentation.

* Fixed documentation.

* Fixed documentation.

* Fixed comments and changelog.

* Fix doctest error

* Update src/attr/_make.py

Co-authored-by: Hynek Schlawack <hs@ox.cx>

* Pass eq_key and order_key explicitly in _CountingAttr

* Merged with master and resolved conflicts after introduction of _make_method

Co-authored-by: Antonio Botelho <antonio@inhames.com>
Co-authored-by: Hynek Schlawack <hs@ox.cx>
@hynek (Member) commented Feb 22, 2021

Fixed by #627 – tell your Numpy friends. :)

hynek closed this as completed Feb 22, 2021
@belm0 (Contributor) commented Jun 8, 2021

I'm trying to follow what the solution was.

Docs for flexible attribute comparison are in this PR: #768

But there is still a cmp_using() helper missing to make this easy? #768 (comment)

It's not simply this?

from functools import partial
import numpy
from attr import attrib, attrs

@attrs(auto_attribs=True)
class Foo:
    position: numpy.ndarray = attrib(eq=numpy.array_equal, factory=partial(numpy.zeros, 3))
    flag: bool = False

@hynek (Member) commented Jun 8, 2021

The documentation can be found here: https://www.attrs.org/en/stable/comparison.html#customization
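
For later readers: with attrs 21.1.0 and newer, the documented route is the cmp_using helper, so something along these lines should work (a sketch based on the linked docs):

import attr
import numpy as np

@attr.s(auto_attribs=True)
class Foo:
    # cmp_using wraps np.array_equal so the generated __eq__ knows how to call it
    position: np.ndarray = attr.ib(eq=attr.cmp_using(eq=np.array_equal))
    flag: bool = False

Foo(np.zeros(3)) == Foo(np.zeros(3))  # True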
