Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

should in use isequal to test for containment? #9381

Closed
StefanKarpinski opened this issue Dec 16, 2014 · 139 comments
Closed

should in use isequal to test for containment? #9381

StefanKarpinski opened this issue Dec 16, 2014 · 139 comments
Assignees
Labels
breaking This change will break code help wanted Indicates that a maintainer wants help on an issue or pull request
Milestone

Comments

@StefanKarpinski
Copy link
Member

This is how Dicts work, so there's a case to me made. Came up in this discussion:

https://groups.google.com/forum/#!topic/julia-users/XZDm57yHc5M

Motivating examples:

julia> NaN in [1,1.5,2,NaN]
false

julia> NaN in Set([1,1.5,2,NaN])
true

julia> NaN in keys(Dict(1=>"one", 1.5=>"one and a half", 2=>"two", NaN=>"not a number"))
true

It seems like in for arrays is the one that's out of line here.

@StefanKarpinski StefanKarpinski added needs decision A decision on this change is needed breaking This change will break code labels Dec 16, 2014
@StefanKarpinski StefanKarpinski added this to the 0.4 milestone Dec 16, 2014
@jakebolewski
Copy link
Member

Wouldn't this necessitate two versions of Dicts / Sets, one using isequal and the other using egal for containment? I would find this behavior to be very confusing. For example is Set([1,1.0]) a single element set?

Edit. I see we already have this behavior.

@JeffBezanson
Copy link
Member

This is another case that can be justified on the basis that the IEEE standard specifies the behavior of ==, but not of in. In other words, the standard doesn't say an in function must use IEEE == to compare elements. Arguably the IEEE NaN behavior should be used as rarely as possible. The case of -0.0 is trickier; 0 in A not finding a -0.0 could be a problem.

@StefanKarpinski
Copy link
Member Author

@jakebolewski, Set([1,1.0]) is already a one-element set:

julia> Set([1,1.0])
Set([1.0])

This is a pretty straightforward consequence of using Dict as the foundation for Set. If we wanted the egal behavior, we would need an ObjectIdSet type to correspond to ObjectIdDict.

@StefanKarpinski
Copy link
Member Author

The case of -0.0 is trickier; 0 in A not finding a -0.0 could be a problem.

This kind of sucks, but I think we have to live with it – it's either that or introduce yet another kind of equality, which we can do over my cold, dead body.

@JeffBezanson
Copy link
Member

We definitely don't need another kind of equality. I think we will opt to keep you alive instead.

@StefanKarpinski
Copy link
Member Author

I'm relieved.

@StefanKarpinski
Copy link
Member Author

The other option would be to make -0.0 and 0.0 the same for isequal and hashing, but that seems bad too.

@StefanKarpinski
Copy link
Member Author

Oh, we can't make isequal(-0.0, 0.0) since that would mean that -0.0 wouldn't sort before 0.0.

@nalimilan
Copy link
Member

Is it really required that isless be consistent with isequal? Anyway with several concepts of equality, you already need to look at the docs to check what it's behavior is.

As a data point, MATLAB's isequal and ismember consider -0.0 and 0.0 as equal. R doesn't have isequal, but its identical function allows for either behavior via an argument, and its %in% function considers them as equal. Python's in does the same. So it doesn't look like distinguishing negative and positive zero is a compelling use-case: it's fine if it's only supported in special dict/set types. On the contrary, distinguishing them by default has the potential of introducing very subtle bugs for the 99% of users (source: classified survey) who didn't even think zero is not always zero.

@StefanKarpinski
Copy link
Member Author

Is it really required that isless be consistent with isequal?

It's not strictly necessary, but I think anything else would be kind of a disaster, imo.

@JeffBezanson
Copy link
Member

Interesting data point:

In [19]: np.nan == np.nan
Out[19]: False

In [20]: np.nan in [np.nan]
Out[20]: True

In [21]: -0.0
Out[21]: -0.0

In [22]: 0.0 in [-0.0]
Out[22]: True

I'm not sure how python's in is handling equality here.

@nalimilan
Copy link
Member

It's not strictly necessary, but I think anything else would be kind of a disaster, imo.

I'm not so sure. -0.0 and 0.0 are sorted next to one another, so considering them as equal does not go against the ordering, it just loses a small bit of information. (I guess there is theory to express that more generally.) In what cases would that be an issue?

@simonbyrne
Copy link
Contributor

One thing worth considering is making isless match the IEEE totalOrder predicate: the main difference is that NaNs with different bit-patterns are distinct, and negatively-signed NaNs precede -Inf: in a certain sense, this is the ordering corresponding to ===.

This could be done via

function totalorder(x::Float64,y::Float64)
    xi = reinterpret(Int64,x)
    yi = reinterpret(Int64,y)
    (xi < yi) $ ((xi < 0) & (yi < 0))
end

(which also seems to be faster than our current isless).

@StefanKarpinski
Copy link
Member Author

That sort order has always struck me as weird and useless – why would I want some NaNs to come at the beginning and some at the end? If we started doing this, I feel like we'd almost need to start printing NaNs with a sign, which would also be strange. The only advantage I can see to the IEEE ordering is that you can implement it efficiently with integer comparisons.

@JeffBezanson
Copy link
Member

you can implement it efficiently with integer comparisons.

Yep, that's how to think like a 1980s computer scientist :)

@StefanKarpinski
Copy link
Member Author

Alright, so I would be ok with making isless a finer grained ordering than isequal, namely:

  1. isequal(-0.0, 0.0) but isless(-0.0, 0.0).
  2. isequal(nan1, nan2) for any two NaN values, but have isless order NaNs by bit pattern.
  3. all other equivalence classes are the same for isequal and isless.

I'm not inclined to have some NaNs negative while others are positive. That's just silly and useless.

@JeffBezanson
Copy link
Member

Yes, that might be OK. Have to think about it.

By the way, if there are any python experts around I really would like to know what is happening in the example I posted above. It seems in is not calling __eq__ for np.nan. However it doesn't use is either, as this example shows:

In [1]: 1 is 1.0
Out[1]: False

In [2]: 1 == 1.0
Out[2]: True

In [3]: 1 in [1.0]
Out[3]: True

EDIT: Ok, my guess is that in uses is first, and returns true if it does, before calling __eq__.

@StefanKarpinski
Copy link
Member Author

Oh, Python, you so crazy.

@stevengj
Copy link
Member

Python's in is implemented by PySequence_Contains, which defaults to using PyObject_RichCompareBool(obj, item, Py_EQ), which calls the tp_richcompare slot of the type object, which for floating-point types turns into float_richcompare (whose comments begin Comparison is pretty much a nightmare) which uses the C == for two floating-point variables. However, PyObject_RichCompareBool first checks for object identity in which case it just skips the comparison. (See also the documentation for PyObject_RichCompareBool, which actually mentions this.)

So, Jeff's guess is correct.

(Yes, I've spent far too much time looking at the CPython codebase for someone who never programs in Python.)

@StefanKarpinski
Copy link
Member Author

Clearly, I used up my "oh, Python, you so crazy" way too early in this thread.

@Ismael-VC
Copy link
Contributor

This is an abstract from Python's documentation for comparison operators:

>>> help('in')

...

Comparison of objects of the same type depends on the type:

...

* The values "float('NaN')" and "Decimal('NaN')" are special. The
are identical to themselves, "x is x" but are not equal to
themselves, "x != x".  Additionally, comparing any value to a
not-a-number value will return "False".  For example, both "3 <
float('NaN')" and "float('NaN') < 3" will return "False".

...

The operators "in" and "not in" test for membership.  "x in s"
evaluates to true if *x* is a member of *s*, and false otherwise.  "x
not in s" returns the negation of "x in s".  All built-in sequences
and set types support this as well as dictionary, for which "in" tests
whether the dictionary has a given key. For container types such as
list, tuple, set, frozenset, dict, or collections.deque, the
expression "x in y" is equivalent to "any(x is e or x == e for e in
y)".

For the string and bytes types, "x in y" is true if and only if *x* is
a substring of *y*.  An equivalent test is "y.find(x) != -1".  Empty
strings are always considered to be a substring of any other string,
so """ in "abc"" will return "True".

For user-defined classes which define the "__contains__()" method, "x
in y" is true if and only if "y.__contains__(x)" is true.

For user-defined classes which do not define "__contains__()" but do
define "__iter__()", "x in y" is true if some value "z" with "x == z"
is produced while iterating over "y".  If an exception is raised
during the iteration, it is as if "in" raised that exception.

Lastly, the old-style iteration protocol is tried: if a class defines
"__getitem__()", "x in y" is true if and only if there is a non-
negative integer index *i* such that "x == y[i]", and all lower
integer indices do not raise "IndexError" exception.  (If any other
exception is raised, it is as if "in" raised that exception).

The operator "not in" is defined to have the inverse true value of
"in".

The operators "is" and "is not" test for object identity: "x is y" is
true if and only if *x* and *y* are the same object.  "x is not y"
yields the inverse truth value.

So even while 1 and 1.0 are equal, they are not the same object in memory:

In [1]: 1 is 1.0
Out[1]: False

In [2]: id(1) == id(1.0)
Out[2]: False

@stevengj
Copy link
Member

@Ismael-VC, so their documentation is wrong, because it effectively uses x is z or x == z, not x == z.

@kmsquire
Copy link
Member

Actually, the radix sort function in SortingAlgorithms.jl does use integer
comparisons (and some shenanigans) for sorting floating point numbers.

On Wednesday, December 17, 2014, Jeff Bezanson notifications@github.com
wrote:

you can implement it efficiently with integer comparisons.

Yep, that's how to think like a 1980s computer scientist :)


Reply to this email directly or view it on GitHub
#9381 (comment).

@JeffBezanson JeffBezanson modified the milestones: 0.4, 0.4.1 Feb 25, 2015
@JeffBezanson
Copy link
Member

I think -0.0 is more important than NaN: code that tries to compare with NaN is always going to have problems. Finding 0 in an array containing -0.0 would be what most people want. So I think we have to use either "== or isequal" or just ==. Using a specific function is definitely simpler, so I say we keep the current behavior.

@StefanKarpinski
Copy link
Member Author

How would you implement that efficiently for Sets? I feel that as containers arrays should behave like ordered multisets, which to me implies using isequal.

On Jun 22, 2015, at 6:27 PM, Jeff Bezanson notifications@github.com wrote:

I think -0.0 is more important than NaN: code that tries to compare with NaN is always going to have problems. Finding 0 in an array containing -0.0 would be what most people want. So I think we have to use either "== or isequal" or just ==. Using a specific function is definitely simpler, so I say we keep the current behavior.


Reply to this email directly or view it on GitHub.

@tkelman tkelman removed this from the 0.6.0 milestone Jan 19, 2017
@tkelman
Copy link
Contributor

tkelman commented Jan 19, 2017

changes here aren't in scope for 0.6

@StefanKarpinski
Copy link
Member Author

Triage decision: we should change this and in should use isequal 💥

@StefanKarpinski StefanKarpinski added help wanted Indicates that a maintainer wants help on an issue or pull request and removed needs decision A decision on this change is needed labels Jul 20, 2017
@andyferris
Copy link
Member

I should have posted this here (or maybe somewhere else?) instead of @JackDevine's PR (sorry).

I've been thinking about this as well as our current plans for null / missing, and I note that we have some interesting behavior for == which IMO has the same problems as in as described in the OP:

julia> [NaN] == [NaN]
false

julia> Set(NaN) == Set(NaN)
true

julia> using Nulls

julia> [null] == [null]
null

julia> Set(null) == Set(null)
ERROR: MethodError: no method matching length(::Nulls.Null)
Closest candidates are:
  length(::SimpleVector) at essentials.jl:528
  length(::Base.MethodList) at reflection.jl:670
  length(::MethodTable) at reflection.jl:745
  ...
Stacktrace:
 [1] union!(::Set{Any}, ::Nulls.Null) at ./set.jl:126
 [2] Set(::Nulls.Null) at ./set.jl:19

I would make the argument that "two containers are equal when they contain (tested via element in container) the same elements". I feel it would be more consistent if == on containers checked isequal on the elements (which would make the four examples above true). Anyone know if there's been another issue discussing this?

(For context, I became interested in this because I am slightly worried about the way null has "escaped" its container here in [null] == [null] whereas previously ==(::Array, ::Array) --> Bool seemed guaranteed).

@StefanKarpinski
Copy link
Member Author

StefanKarpinski commented Nov 12, 2017

So it seems like this is potentially a good coherent position to take:

  1. Make isequal(missing, missing) true in addition to isequal(NaN, NaN) whereas (missing == missing) === missing and (NaN == NaN) === false.
  2. Use isequal and isless for all generic container code like this issue proposes.
  3. Define equality of containers in terms of isequal and isless – i.e. have [NaN] == [NaN] and [missing] == [missing].

This is a containment strategy for "unusual" scalar behaviors like those of NaN and the proposed three-value logic of missing. There are some dangers here, since if you take three-valued logic seriously, missing might be in any non-empty collection and any value might be in a collection containing missing; but I think @andyferris is probably correct: that way lies madness. So the general policy would be:

  1. Use == for scalars, 3VL applies to scalars.
  2. Use isequal for containers, 3VL does not apply to containers.

It gets a little weird when you consider containers as scalars, i.e. when doing missing == [1,2,3]. Is that true, false or missing?

@nalimilan
Copy link
Member

The Set(null) issue is a red herring, the outlier is actually NaN due to eltype(::Float64) being defined (as for all numbers). Set(Date()) fails with the same error.

@StefanKarpinski's plan is reasonable and consistent, but I'm not sure that would really be a good idea. Let me stress that in practice I don't really care how == behaves on arrays, but I don't find the arguments in favor of changing it very convincing.

First, having == call isequal recursively sounds really strange and would make the language more complex: if you want isequal, why not use it instead of ==? We should just provide an operator for isequal (Unicode and ideally also ASCII). Note that by changing this you would fix [NaN] != [NaN], but you would get [0] != [-0.0] in exchange, which sounds possibly worse than the former.

Second, we have chosen the policy to propagate missing values everywhere to be on the safe side. That can be quite cumbersome in practice. Personally for my own work I'd find it more convenient for sum to skip missing values automatically (as major proprietary statistical software all do), but that could hide bugs for use cases where you would not have expected missing values to be present. That's just the same issue for array equality: if there's a missing value, you basically don't know whether the arrays are equal or not.

Third, I don't really see why == not returning Bool would be OK for scalars, but would suddenly become a problem for containers. What is so special about them? This sounds like a half-way reaction against 3VL which will only lead to inconsistency. I'd rather bite the 3VL bullet and apply it everywhere. I could be convinced if somebody exposed why containers should be treated in a specific way.

@JackDevine
Copy link
Contributor

I agree with @nalimilan here:

First, having == call isequal recursively sounds really strange and would make the language more complex: if you want isequal, why not use it instead of ==?

Quoting @andyferris from #24563

I would say that containers would be equal when they contain (i.e. element in container) the same elements.

Boiling so much information down to one sentence helps clarify the situation a lot (thanks Andy). What I don't understand about the above sentence though, is what is meant by "equal". If #24563 goes through, then we would have:

Containers are isequal when they contain (i.e. element in container) the same elements.

But we would not have:

Containers are == when they contain (i.e. element in container) the same elements.

Like Milan says, if there's a missing value, you basically don't know whether the arrays are equal or not. However, if both arrays have a missing value in the same place then they are isequal but not ==.

@kmsquire
Copy link
Member

Slightly related question: should the type of the container matter?

E.g., if I have a Dict with consecutive integer keys, and an array with the same key-value pairs, would these be isequal (or ==)? (I assume not isequal).

More interesting for me, would Set([1,2,3]) == OrderedSet([1,2,3]) (using OrderedSet from DataStructures.jl)? Here, should the ordering be considered when comparing containers of different types?

@StefanKarpinski
Copy link
Member Author

The other way to go with this is to put missing in Base and double down on 3VL everywhere.

@JeffBezanson
Copy link
Member

First, having == call isequal recursively sounds really strange

The idea here is that == and isequal should really be the same function, and therefore should be the same in as many cases as possible. However, adding missing increases the need for a distinction between == and isequal. If ([missing] == [missing]) === missing is desired, then I agree we should put missing in Base.

I do think the type of container matters. For example, we have this ugly check:

function ==(l::Associative, r::Associative)
    ...
    if isa(l,ObjectIdDict) != isa(r,ObjectIdDict)
        return false

It's pretty clear to me that we should have a keycompare(dict) function that returns the key comparison function the dict uses (isequal for Dict and === for ObjectIdDict), and that this code should check keycompare(l) != keycompare(r).

All of that is to say that the comparison function used determines the meaning of the object, and so == should take it into account. On that basis I think Set([NaN]) == Set([NaN]) is defensible, and the same would probably apply to missing.

@nalimilan
Copy link
Member

Makes sense. Maybe it's not so bad to say that in uses == for arrays and tuples, but isequal for dicts and sets? The idea being that you need to use isequal to be able to access a dict entry with its key, but that it's not required for other collections. Using == with arrays would ensure -0.0 in [0] -> true and missing in [1] -> missing, which appear as two desirable properties to me. OTOH we would have NaN in [NaN] -> false, but I find it less problematic than the -0.0 case (since NaN are not valid numbers, so they need special treatment anyways).

@JackDevine
Copy link
Contributor

Bump,

Will it be possible to make a decision on this before the feature freeze? I am not completely sure what it is that we want to do here. However, if we come up with a solid plan, then I am happy to help out.

@andyferris
Copy link
Member

I see the minimal effort version of a coherent story for container == going like:

  • Elements of AbstractSets and the keys of Associatives are compared by isequal.
  • Elements of Tuples, AbstractArrays and the values of Associatives are compared by ==.
  • Other types like ObjectIdDict can specialize these rules as necessary (I kind of see ObjectIdDict as quite special... I can't think of many cases where it would be beneficial to deviate from the above).

(It's also possible for us to state that the keys of AbstractArrays (and Tuples) are compared via isequal, because there isn't really a difference for integers).

@JackDevine
Copy link
Contributor

That sounds good, are you arguing that we make in follow the above rules?

@andyferris
Copy link
Member

Sorry, I forgot to mention in.

Yes, in the precise sense of my opening statement above - the minimal effort version is to do very little, and cover it over with a convenient and somewhat coherent story afterwards :)

Going into more detail, the data people appear to consider missing as "any one of a range of values" so that the predicate missing in [missing] could actually be seen as missing (i.e. is some unknown value contained in a vector which contains some unknown value? the answer is "maybe", it depends on the values that you don't have information about). Actually, if you take that reading of the meaning of missing to it's logical conclusion, then missing in [1,2,3] is missing, and 0 in [1,2,missing] is missing, which seems... interesting...

Sorry to wax philosophically here, but to me the choice between in using == or isequal is becoming more of a choice of whether this is a "software engineering" concern of containers, or a "data question" we are asking about values. With missing, we may end up in the situation where if a == b, if a < b, and potentially if a in B are not "safe" around missing data (things we thought were safe boolean predicates will no longer be, and Base and user packages will need to adjust; the benefits might be worth the effort since missing data seems difficult to deal with in many other languages and frameworks - we'll find out).

I think the answer is that sometimes you will be doing some software engineering, and sometimes you will be doing some data science, and in both cases you might want to test for containment. For == and < we have isequal and isless for the safe, conservative, software engineering question. For in, perhaps we also need a partner function like isin or contains, one using isequal and one using == as the test? (There's a case for saying that all the "predicates" that will be "infected" (sorry for negative choice of word) by missing will need a "safe" version that always returns Bool - if only so we can continue to use if statements :) )

The reasoning behind what I wrote in my last post is that testing whether a key exists seems more like software concern where in getindex you must find a match (or not) and return some data (or not) and it can't be ambiguous which one, but the values of dictionaries and the data inside arrays is "user data". This seemed like a "sensible default semantic", but I'd also argue for a way to control this as necessary (hence I mention introducing isin or contains for this purpose).

However, even that default semantic is tricky because sometimes keys will come from data (say a groupby operation) and sometimes (frequently) sets are "data".

@andyferris
Copy link
Member

Considering this more, I wonder if having in always use == and a new contains function always use isequal might be the simplest way forward?

It would let us use NaN and missing as a key when necessary (e.g. haskey(A, k) = contains(k, keys(A)) will work and always returns a Bool) and still let users interrogate "data" with in.

@JeffBezanson
Copy link
Member

I don't think we want both in and contains. And x in keys(dict) should always use whatever key comparison function the dict uses.

@JackDevine
Copy link
Contributor

Ok, so I think that the current plan is to have in use isequal for Sets and the keys of Associative, in uses == for Arrays and Tuples and we have to specialize as necessary for ObjectIdDict.

If people think that the above is reasonable, then I could make the changes and add a little bit to the docstring that describes the intention of in as well as the rules for people who want to create new methods for in.

@JeffBezanson
Copy link
Member

Revisiting this, I tend to think we should leave it as-is. Currently in has two kinds of behaviors:

  1. For containers with explicit ideas about what they contain (like dicts and sets), the container determines the equality test used.
  2. Everything else uses ==.

This is nice and simple. If we don't want to change rule (2) to use isequal instead of ==, but we want to change something, then we have to move to 3 kinds of behaviors instead of 2. I think that would be bad; for example x in (f(y) for y in itr) should give the same answer as x in [f(y) for y in itr].

@andyferris
Copy link
Member

andyferris commented Jan 5, 2018

Sorry - dumb question - what's not a container but supports in? (Edit: maybe I misread that slightly)

@JeffBezanson JeffBezanson added the triage This should be discussed on a triage call label Jan 8, 2018
@JeffBezanson JeffBezanson removed the triage This should be discussed on a triage call label Jan 18, 2018
@kapple19
Copy link

kapple19 commented Dec 3, 2024

I have a vector that is meant to have unique values but it has both -0.0 and 0.0 and I can't unique! them to get rid of the -0.0.

I can't use in to check if -0.0 in v && 0.0 in v then alter my vector accordingly.

If we are going to keep the two behaviours, which I'm fine with given the somewhat reasonable rationale discussed above, then users need to be able to manage both behaviours.

I don't know what the best implementation is, but something like function unique!(v, by = isequal) where we can override the comparator by.

Or any other implementation that allows me to combine -0.0 and 0.0 in my vector without making my own function to copy and paste to all my projects.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking This change will break code help wanted Indicates that a maintainer wants help on an issue or pull request
Projects
None yet
Development

Successfully merging a pull request may close this issue.