Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

== on collections containing missing breaks its contract #52484

Open
Seelengrab opened this issue Dec 11, 2023 · 15 comments
Open

== on collections containing missing breaks its contract #52484

Seelengrab opened this issue Dec 11, 2023 · 15 comments
Labels
docs This change adds or pertains to documentation equality Issues relating to equality relations: ==, ===, isequal missing data Base.missing and related functionality

Comments

@Seelengrab
Copy link
Contributor

help?> ==
search: == === !== = .= >= => <= !=

==(x, y)
[...]
For collections, missing is returned if at least one of the
operands contains a missing value and all non-missing values are equal. Use isequal or === to always get a Bool result.

Yet:

julia> (1,missing) == (missing,2)
missing

julia> [1,missing] == [missing,2]
missing

Per the docstring, these should both be false, since the non-missing values are not equal.

@Seelengrab Seelengrab added the missing data Base.missing and related functionality label Dec 11, 2023
@nalimilan
Copy link
Member

Maybe the docs could be made more explicit and say "all non-missing pairs of values".

@Seelengrab
Copy link
Contributor Author

Seelengrab commented Dec 11, 2023

To me, the docstring implies usage of a skipmissing-like behavior, i.e. "in the absence of missing, are the two equal?". If that is true, and there is at least one missing in either collection, then the result is missing. If there is no missing, the result is whatever the former returns. Changing that seems breaking to me, so IMO the implementations should be fixed.

@adienes
Copy link
Member

adienes commented Dec 11, 2023

I would expect it to be the same as all(x .== y) so I think the current behavior is correct

@nalimilan
Copy link
Member

nalimilan commented Dec 11, 2023

And at least I can assure you that it behaves as intended. ;-)

To me, the docstring implies usage of a skipmissing-like behavior, i.e. "in the absence of missing, are the two equal?". If that is true, and there is at least one missing in either collection, then the result is missing.

Actually the reasoning behind the implemented behavior is different. The idea is that missing is returned if it's not possible to determine whether the two collections are equal or different, because by replacing missing values by some values they could be made either equal or different. missing represents a quantity which could have been measured but for some reason isn't known. This is exactly the same as three-valued logic that is implemented by & and |.

Changing that seems breaking to me, so IMO the implementations should be fixed.

You mean the documentation? Existing code can rely on the implementation, but not on an interpretation of the docstring which doesn't match the actual behavior.

@Seelengrab
Copy link
Contributor Author

And at least I can assure you that it behaves as intended. ;-)

Then the docstring isn't precise enough in the behavior that is guaranteed, since the docstring is at the very least ambiguous in what is required/guaranteed.

Existing code can rely on the implementation, but not on an interpretation of the docstring which doesn't match the actual behavior.

That seems backwards to me - relying on what is documented is what has always been the go-to in Julia ("if it's not in the Manual, it's not API"), so hearing this is directly counter to what has been practiced so far. To me it feels like you're trying to justify the existing behavior post-fact.

@adienes
Copy link
Member

adienes commented Dec 11, 2023

my reading of the existing documentation matches the existing behavior

since the docstring is at the very least ambiguous

I would call it at the very most ambiguous? maybe the intersection of those two bounds makes it exactly equal ambiguous.

I certainly wouldn't agree that it's worded in a way counter to existing behavior

@Seelengrab
Copy link
Contributor Author

If swapping the order in the conditional changes the result (that is, checking for "all non-missing values are equal" first), the documentation is certainly ambiguous. You're reading the docstring like return_missing = has_missing || all_different, whereas I'm reading it like return_missing = has_missing && all_different, because of that and there, suggesting a conjunction instead of a disjunction. I don't know how you get to that disjunction when the docstring clearly says and.

@mbauman
Copy link
Member

mbauman commented Dec 11, 2023

The "all non-missing values are equal" clause could actually be read in at least three ways:

  • Every single non-missing value on both LHS and RHS must be == to eachother. In other words, both sides must only have two unique values: missing and some x.
  • After applying skipmissing to each argument independently, the resulting missing-less collections are iterated jointly and the iterated pairs of values must be equal.
  • After applying Missings.skipmissings to the arguments jointly, the resulting missing-less collections are iterated jointly and the iterated pairs of values must be equal.

Docs do double-duty: they must be both accessible to newcomers while still giving technical guidance to advanced users.

@mbauman mbauman added the docs This change adds or pertains to documentation label Dec 11, 2023
@Seelengrab
Copy link
Contributor Author

Seelengrab commented Dec 11, 2023

You can add at least two more, duplicating the last two - == doesn't say anything about iteration or order. That's why I only wrote "skipmissing-like" above, because not everything you can == is necessarily actually iterable or orderable.

@mbauman
Copy link
Member

mbauman commented Dec 11, 2023

Right, which is precisely why this isn't well-specified enough to be a "contract" that should force implementation changes counter to the very much intended design.

@Seelengrab
Copy link
Contributor Author

So what can I as a user that wants to implement == on a type refer to as a "ground truth"? What properties do I need to obey? "Check the implementation" just isn't feasible when there isn't necessarily an applicable reference implementation to compare to.

@nalimilan
Copy link
Member

Maybe the simplest way to document this would be by mentioning the result of calling == on each pair of values? Something like:

For collections, missing is returned if calling == on each pair of values in matching positions
in x and y returns at least one missing and no false. For arrays, this is equivalent to all(x .== y).

@Seelengrab
Copy link
Contributor Author

Seelengrab commented Dec 12, 2023

What does "pair of values in matching positions" mean for non-orderable collections like Set? For example:

julia> Set([missing, 1]) == Set([missing, 2])
false

julia> Set([missing, 1]) == Set([missing, 1])
true

I'd argue that this NOT being missing is correct, since the fact that there is a Set is meaningful.

@rfourquet
Copy link
Member

I'd argue that this NOT being missing is correct

I think that's intended, because Set membership is defined by isequal, and isequal(missing, missing) is true. Of course Set is a "collection", but doesn't fit in the == docstring from the OP, whose wording should be amended accordingly.

@Seelengrab
Copy link
Contributor Author

Yes, that's (partially) my point - == propagating missing or not depends on the object, not just "it's a collection" or "it wraps other objects". For the same reason, Some(missing) == Some(1) should be false, not missing.

@nsajko nsajko added the equality Issues relating to equality relations: ==, ===, isequal label Jan 26, 2024
KristofferC pushed a commit that referenced this issue Feb 8, 2024
Alright, here's my spitball at fixing
#52484.

This also clarifies the non-missing behaviors, and adds a reference to
`all` which should be a useful breadcrumb. I don't want to say `all(x
.== y)` because that has all sorts of broadcasting semantics, but we
could perhaps add a comparison to `all(splat(==), zip(x, y))` or
`all(map(==, x, y))`, but even those feel sketchy to reference as they
may miss other properties (like array shape).

---------

Co-authored-by: Jameson Nash <vtjnash@gmail.com>
Co-authored-by: inky <git@wo-class.cn>
Co-authored-by: Max Horn <max@quendi.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs This change adds or pertains to documentation equality Issues relating to equality relations: ==, ===, isequal missing data Base.missing and related functionality
Projects
None yet
Development

No branches or pull requests

6 participants