Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should allequal use == instead of isequal? #49372

Closed
CameronBieganek opened this issue Apr 15, 2023 · 13 comments
Closed

Should allequal use == instead of isequal? #49372

CameronBieganek opened this issue Apr 15, 2023 · 13 comments
Labels
equality Issues relating to equality relations: ==, ===, isequal

Comments

@CameronBieganek
Copy link
Contributor

When I worked on the PR to add allequal (#43353), I used isequal instead of ==, because unique and allunique use isequal. However, in retrospect that seems like the wrong choice. It seems to me that users looking for allequal functionality will generally want to compare values with ==. In particular, it is unfortunate that allequal does not follow three-valued logic. So, we currently have the following behavior, which seems undesirable to me:

julia> allequal([NaN, NaN])
true

julia> allequal([missing, missing])
true

julia> allequal([1, missing])
false

I think for most use cases it would be better for those to return false, missing, and missing, respectively.

In the PR discussion, I mentioned the difference between using isequal and using ==, but nobody objected to using isequal. 😅

Unfortunately this would be a breaking change. I would almost consider the failure to propagate missingness a bug, but technically it's not a bug since the docstring does say that comparisons are made using isequal.

Would this qualify as one of those "technically breaking" changes that we could slip into a minor release?

@ericphanson
Copy link
Contributor

To me, it makes sense that allequal uses isequal, and those answers look reasonable. Do you have examples/situations where the other behavior would be better?

@CameronBieganek
Copy link
Contributor Author

I don't have a particular use case in mind. I just have a gut feeling that comparison with == is more generally useful, especially since it propagates missingness.

Here's an example inspired by a question on Discourse. Suppose we have the following matrix:

m = [1        missing  NaN  3  4
     missing  missing  NaN  3  5]

And then we want to call findall(allequal, eachcol(m)). The problem is, there is no obvious right way to do it. It depends on what the user is looking for. We probably shouldn't have added allequal, because it's hard to predict which version of equality the user is looking for, whether it be ==, isequal, or something in between, like

function myequals(x, y)
    isnan(x) && isnan(y) && return true
    x == y
end

When I opened the issue and the PR, there were some objections to expanding the Base namespace, but I don't think anyone brought up the issue of ambiguity over the type of equality comparison to use.

@ericphanson
Copy link
Contributor

The way I’ve used it is for sanity-checking data, as a complement to allunique. E.g. check the id column is all unique, check some supposedly constant column is allequal (or all-equal per group after grouping), etc. In this case, isequal semantics are generally what I want and are consistent with allunique. But I agree in general different uses will need different semantics.

@udohjeremiah udohjeremiah added the equality Issues relating to equality relations: ==, ===, isequal label Apr 16, 2023
@aplavin
Copy link
Contributor

aplavin commented Apr 17, 2023

First of all, it would be extremely weird if a function named allequal used something other than isequal for comparisons.

Next is probably a matter of opinion, but would be nice not to pollute more and more fundamentally boolean function results with missings. For me, as someone who uses missings extremely rarely, it's already easy to forget and be surprised that == can return not a bool.

For those, presumably rare, cases when the difference actually matters, a slightly more verbose all(==(first(A)), A) should be fine, I think.

@CameronBieganek
Copy link
Contributor Author

First of all, it would be extremely weird if a function named allequal used something other than isequal for comparisons.

all== is not a syntactically valid identifier, so what else could all(==(first(A)), A) possibly be named? The == symbol is read as "equals", after all. Also, many beginner Julia programmers are not even aware of isequal.

Next is probably a matter of opinion, but would be nice not to pollute more and more fundamentally boolean function results with missings.

all itself already propagates missing, so it shouldn't be too surprising if allequal also propagates missing.

julia> all([true, missing])
missing

Once you buy in to three-valued logic at the language level, it's hard to justify not using three-valued logic.

@StefanKarpinski
Copy link
Member

I think this is a non-starter because == is not an equivalence relation, specifically for values like NaN it's non-reflexive. The only equality predicates that this can behave sanely with and be implemented efficiently with are the ones that are true equivalence relations and compatible with hashing, so isequal and ===.

@aplavin
Copy link
Contributor

aplavin commented Apr 17, 2023

Once you buy in to three-valued logic at the language level, it's hard to justify not using three-valued logic.

The only, or at least the default, equality operator in Julia functions is most typically isequal -- due to the reasons outlined by @StefanKarpinski. For example, current allequal is the same as length(unique(A)) <= 1, and it's nice to have this consistency.

@mikmoore
Copy link
Contributor

mikmoore commented Apr 17, 2023

Would there be room for keyword arguments like our sorting functions have? I.e.,

allequal([(+0.0,1), (-0.0,2)]; by=first, eq=(==)) == true
allequal([(+0.0,1), (-0.0,2)]; by=first, eq=isequal) == false

I suppose this would also involve expanding isequal, unique, and maybe some others with matching keywords.

@CameronBieganek
Copy link
Contributor Author

I'm fully aware that isequal is an equivalence relation and == is not. But I think the point is that the correct comparison to use depends highly on the context. In the context of data analysis and table operations, one usually expects 3VL to hold.

Let me sharpen the matrix example above to make it more concrete. I think this is the use case that originally motivated me to make the allequal PR. Suppose you are working on a predictive modeling task. You have assembled various data sources into a data frame containing all the features you wish to use. However, you typically don't want to use columns that are constant when training a machine learning model or performing a linear regression. So you use allequal to look for columns that are constant:

julia> df = DataFrame(a=[1, missing], b=[missing, missing], c=[NaN, NaN], d=[3, 3], e=[4, 5])
2×5 DataFrame
 Row │ a        b        c        d      e     
     │ Int64?   Missing  Float64  Int64  Int64 
─────┼─────────────────────────────────────────
   11  missing      NaN      3      4
   2missing  missing      NaN      3      5

julia> allequal.(eachcol(df))
5-element BitVector:
 0
 1
 1
 1
 0

If you're not paying attention, you might think that column a is non-constant. But in fact the real answer (from the perspective of a data analyst) is missing. Here is what you would see if you used == instead:

julia> allequal3VL(itr) = all(==(first(itr)), itr);

julia> allequal3VL.(eachcol(df))
5-element Vector{Union{Missing, Bool}}:
      missing
      missing
 false
  true
 false

(Granted, in this scenario you probably also need to keep an eye out for columns that are all NaN. That would have to be a separate check.)

Would there be room for keyword arguments like our sorting functions have?

That seems reasonable to me. There is already an open PR to add a predicate version, allequal(f, itr), PR #47679. If we have the predicate version, we wouldn't need the by keyword argument, so your example would become allequal(first, itr; eq=(==)).

Are there any objections to adding an eq keyword argument? It does parallel nicely with the lt keyword argument in sort.

@LilithHafner
Copy link
Member

This seems like a large API for a small feature. Is allequal(x; eq=(==)) that much better than all(==(first(x)), x)?

@CameronBieganek
Copy link
Contributor Author

No, it's not much better. But it's a slippery slope. Once you have allequal, it starts to make sense to have an eq keyword argument.

I think I'm now of the opinion that we (I) shouldn't have added allequal to Base in the first place. 😬

@ericphanson
Copy link
Contributor

This might not be very helpful, but I like allequal and didn't comment on the PR about which comparison because I thought isequal was obviously the right one to use, and I assume others did the same.

@CameronBieganek
Copy link
Contributor Author

Ok, at least this issue has verified that some people prefer it the way it is now, which is good to know.

@vtjnash vtjnash closed this as completed Feb 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
equality Issues relating to equality relations: ==, ===, isequal
Projects
None yet
Development

No branches or pull requests

8 participants