Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default comparison for data classes #710

Closed
josh11b opened this issue Aug 5, 2021 · 15 comments
Closed

Default comparison for data classes #710

josh11b opened this issue Aug 5, 2021 · 15 comments
Labels
leads question A question for the leads team

Comments

@josh11b
Copy link
Contributor

josh11b commented Aug 5, 2021

I think we agree that data classes (including struct types) should support equality comparisons if all of its field types do. This comparison should at least support values with the same type, including field order.

var s: {.x: i32, .y: i32} = {.x = 3, .y = 4};
Assert(s == {.x = 3, .y = 4});

Question 1: Should this also be supported if the fields are the same, but in a different order?

Assert(s == {.y = 4, .x = 3});

Question 2: Should data classes support ordering comparisons (<, <=, >, >=) using lexicographical order? Clearly the field orders would have to match in this case.

Assert(s < {.x = 5, .y = 2});

Context: proposals #561 on basic classes and #702 on comparisons, and Discord discussion starting with https://discord.com/channels/655572317891461132/708431657849585705/872603493242908772

@geoffromer
Copy link
Contributor

It's not clear to me how (or even whether) these issues will affect nominal data classes, so my comments will focus primarily on structural data classes.

I think many (most?) users' intuitive mental model for structs will be that they are an unordered collection of named fields. I think that's already the case in C++ to a large extent, and #561 further encourages this mental model by enabling users to disregard field order when reading or writing initializers. I think that mental model is not only unavoidable, it's useful, and it makes structs a very natural complement for tuple types. I think we should lean into that model, and try to avoid attaching API-visible meaning to the declaration order of struct fields.

Regarding question 1, this model implies that a struct type should be equality-comparable with any struct type that has the same field names and types (in other words, any struct type that it can be initialized from), and it should evaluate the field comparisons in an unspecified order. I'm honestly having a hard time coming up with any drawbacks to supporting heterogeneous equality comparisons that wouldn't apply just as much to heterogeneous initialization (which #561 proposes to support).

One possible concern, particularly for the homogeneous case, is that we'll presumably want the == operator to short-circuit: as it iterates through the field pairs, it should return false as soon as it finds an unequal pair, rather than perform further comparisons that cannot affect the final result. Since some field comparisons may be more expensive than others, it may sometimes be possible to optimize the average-case performance of the comparison by changing the order in which fields are compared. If we guarantee to perform the comparisons in field declaration order, then users can use the declaration order to control the comparison order; if the comparison order is unspecified, users have no such control.

However, I don't think that carries much weight: the cases where that degree of control is useful will likely be very rare, and when they occur, the user would probably be better off using a nominal class type instead, where they can achieve much more precise control over performance by hand-implementing the == operator.

Another possible concern is that if the field comparisons have side effects, then changes in the (unspecified) order of evaluation could change the behavior of the program unexpectedly, in ways that could be difficult to understand or diagnose. However, this risk seems quite remote, and to the extent we nonetheless want to address it, we could do so either by pseudo-randomizing the order so that such problems are caught quickly, or by defining a fixed order that does not depend on the declaration order (such as lexicographic by field names).

Regarding question 2, the fields-are-unordered model implies that it is unnecessary, if not outright harmful, to provide ordering comparison operators for structs: it's not meaningful to ask whether one collection of unordered field values is "less than" another, so people are unlikely to need an operator that asks that question in the first place, and are likely to be confused by any usages of that operator that they come across. On the other hand, it is meaningful and useful for a struct to provide some total order that is unspecified but fixed (at least for the life of the program), so that's what we should do instead. That will enable users to do things like perform binary search on struct values without risking the confusion that would come from naming the comparison <.

Note that this issue goes beyond data classes: choice types and most other sum types likewise have no natural concept of "less than", but can and should define a fixed but unspecified total order under a different name. So I think we will eventually need a "default total order" API whether or not we support comparison operators for data classes, so I don't think we're incurring much language-design cost if we plan on taking this route for data classes as well. Note that I think the design of that API is out of scope for #561 and #702, because I expect it to be tied to the design for user-defined comparison operators.

@zygoloid
Copy link
Contributor

zygoloid commented Aug 5, 2021

I think, for me, the primary question here is: is the field order a salient property of a struct value? Do we consider two structs that have the same field names and the same field values, but in a different order, to be importantly different (as they would be if they had different field names or different values) or only incidentally different (eg, like the difference between a vector<int> containing {1, 2, 3} whose capacity is 3 and a vector<int> containing {1, 2, 3} whose capacity is 4)?

If the field order is salient, then I think comparisons between structs with different field order should require an explicit conversion to a common type, whether they're equality or relational comparisons. {.a = 1, .b = 2} and {.b = 2, .a = 1} should not compare equal (because they are different values), and should not compare unequal (because that would be surprising), so having them not be comparable at all seems best.

If field order is non-salient, then I would expect equality comparison to work regardless of field order, and relational comparison to not work regardless of field order (even if the field order is the same). For relational comparisons, if field order is non-salient, we should get the same result regardless of field order, so we'd need to make an arbitrary but consistent choice (such as comparing the fields in alphabetical order by name), but any such choice seems like it would be surprising. I don't think this option would or should imply that {a: Int, b: Int} and {b: Int, a: Int} are the same type, but rather that they are two different types that can represent the same set of values (like two different string implementations).

My current leaning is very slightly that field order is salient in this sense, because I think that it's better to avoid non-salient but representable differences, and that this option is more reminiscent of C++ structs. However, I think the alternative also has merit, and in the alternative I do like @geoffromer's idea of producing an unspecified-but-fixed total order for structs with a name that is not <. (There are some considerations from predictable performance goal here: we might not want the comparison to be entirely unspecified. But I think we can find a reasonable approach.)

@geoffromer
Copy link
Contributor

My current leaning is very slightly that field order is salient in this sense, because I think that it's better to avoid non-salient but representable differences, and that this option is more reminiscent of C++ structs.

"Representable" in what sense?

@josh11b
Copy link
Contributor Author

josh11b commented Aug 9, 2021

My general position is that we should the allow operations we can since they will be convenient/useful for users and not result in more bugs in user code.

On question 1: I think a useful property is that a = b results in a == b, even when a and b have different field orders.

@zygoloid
Copy link
Contributor

My current leaning is very slightly that field order is salient in this sense, because I think that it's better to avoid non-salient but representable differences, and that this option is more reminiscent of C++ structs.

"Representable" in what sense?

I think what I mean is "observable by a program".

@github-actions
Copy link

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please comment or remove the inactive label. The long term label can also be added for issues which are expected to take time.
This issue is labeled inactive because the last activity was over 90 days ago.

@github-actions github-actions bot added the inactive Issues and PRs which have been inactive for at least 90 days. label Nov 26, 2021
@chandlerc chandlerc removed the inactive Issues and PRs which have been inactive for at least 90 days. label Nov 29, 2021
@chandlerc
Copy link
Contributor

FWIW, I like the idea of "is field order salient" from @zygoloid -- and FWIW, I feel that the order is salient here (and with tuples).

@chandlerc
Copy link
Contributor

From a discussion with @zygoloid I feel like we at least have a tentative direction here.

Fundamentally, we're both pretty happy with the idea that field order is salient as @zygoloid outlined in his comment -- it is an observable aspect of the program. This isn't really a narrow thing about data classes, but a general principle. The declaration order of fields in a class represents their order in memory, similar for data classes, for tuples, etc.

I think we really did try to take to heart the idea of having more "set" semantics as @geoffromer outlined, but I think we're both a bit more comfortable being transparent about the underlying implementation here and how things would be laid out in memory if they are in fact laid out in memory.

However, there are common operations that can be especially efficiently defined heterogeneously for data classes:

  • converting assignment (and initialization) should work between types with a set of fields with the same names and convertible types.
  • data classes with a set of fields with the same name and comparable types should also be comparable for both equality and less-than.
    • However, only equality should allow different ordered fields. For less-than (and other relational comparisons), the order of field names (with comparable types) must match so we can pick a consistent and unsurprising order of comparison (even if not especially meaningful). The order we suggest is declaration order.
  • explicit cast between data classes with a set of fields with the same names and convertible types, but no implicit cast.
    • No implicit cast even if the order matches and the fields are implicit convertible.
      • We would expect tuples and data classes and fixed size arrays to all work similarly. But implicitly converting 10,000 x i8 to 100 x i64 seems likely to be surprising at best, and more likely to indicate a missing overload or other mistake, even though the fields are "in order" and implicitly convertible.
    • Maybe in the future, we allow implicit destructuring in some cases, but initially start w/ simple and restrictive model

I think I have reproduced the direction correctly, but if I've missed anything let me know. Also want to check with @KateGregory to see if all of us are actually happy with this direction despite the tradeoff it makes compared to what @geoffromer suggests above.

@chandlerc
Copy link
Contributor

After some discussions in the weekly meeting and follow-up on Discord, some tweaks to the above:

  • Let's allow implicit conversions between all of these style of aggregate types (tuples, arrays, data classes) when the members implicitly convert
    • Super important for some cases: (1, 1) converting to (i8, i8).
    • To start, keep it very simple and consistent across all of these. Can watch for surprising cases that we need to address w/ more nuanced rule.
    • Include different field orders for data classes.
  • Need to pin down which order is used when doing heterogenous comparison with data classes that have different order of fields
    • Consistently use LHS order.
    • Match the order used in assignment.

Maybe this converges us on a good initial decision? @KateGregory @zygoloid

@chandlerc
Copy link
Contributor

From chat, looks like all the leads are happy here. Closing this up and will put it in the queue that needs a proposal.

@josh11b
Copy link
Contributor Author

josh11b commented Dec 9, 2021

Just to clarify, I'd like to write down a bunch of examples so we are clear on which are allowed.

Assignment

We support both reordering fields and implicit conversions during assignment and initialization of aggregates. So this is allowed:

var s: Array({.x: i32, .y: i32}, 2) = ...;
var t: Array({.y: i64, .x: i64}, 2) = s;

Fields are assigned in the order defined by the left-hand side. So in this example: array element 0 before element 1, and within an array element .y before .x.

Equality comparison

For equality (and inequality) comparison of aggregates, we support both reordering fields and fields with different types as long as comparison is defined between those types. So this is allowed:

var s: Array({.x: i32, .y: i32}, 2) = ...;
var t: Array({.y: u32, .x: u32}, 2) = ...;
Assert(s == t);

Fields are compared in the order defined by the left-hand side. So in this example: array element 0 before element 1, and within an array element .y before .x.

Ordering comparison

Ordering comparison (<, <=, etc.) is allowed between aggregates only when the field order is consistent. The types may be different as long as ordering comparison is defined between those types. So this is allowed:

var s: Array({.x: i32, .y: i32}, 2) = ({.x = 1 .y = 2}, {.x = 3, .y = 4});
var t: Array({.x: u32, .y: u32}, 2) = ({.x = 4, .y = 3}, {.x = 2, .y = 1});
Assert(s < t);

but this is not:

var s: {.x: i32, .y: i32} = {.x = 3, .y = 4};
var t: {.y: u32, .x: u32} = {.y = 3, .x = 4};
Assert(s < t);

Data classes are compared in lexicographical order, with the first field being most significant. In the first example, [0].x is compared before [0].y before [1].x before [1].y.

Argument passing

(I'm least certain of this section, but this is my reading of what is written.)

Implicit conversion between aggregates is allowed to reorder fields and perform implicit conversions on field types, just like assignment. So this is allowed:

fn F(t: Array({.y: i64, .x: i64}, 2));
var s: Array({.x: i32, .y: i32}, 2) = ...;
F(s);

Fields are assigned according to the order in the type of the parameter from the function signature, just like an assignment of the argument value to the parameter.

@chandlerc
Copy link
Contributor

All of those examples match what I expected. @zygoloid ?

josh11b added a commit that referenced this issue Jan 6, 2022
Define initialization, assignment, comparison, and implicit conversion between data classes with different field orders and/or types.

Implements the decision made in #710 .

Co-authored-by: Richard Smith <richard@metafoo.co.uk>
zygoloid added a commit that referenced this issue May 3, 2022
Add concrete design for interfaces for comparison.

Rename interfaces for arithmetic following current thinking in #1058.

Update rules for mixed-type comparisons for data classes following #710.

Co-authored-by: Chandler Carruth <chandlerc@gmail.com>
chandlerc pushed a commit that referenced this issue Jun 28, 2022
Define initialization, assignment, comparison, and implicit conversion between data classes with different field orders and/or types.

Implements the decision made in #710 .

Co-authored-by: Richard Smith <richard@metafoo.co.uk>
chandlerc added a commit that referenced this issue Jun 28, 2022
Add concrete design for interfaces for comparison.

Rename interfaces for arithmetic following current thinking in #1058.

Update rules for mixed-type comparisons for data classes following #710.

Co-authored-by: Chandler Carruth <chandlerc@gmail.com>
@jonmeow jonmeow added the leads question A question for the leads team label Aug 10, 2022
@jonmeow
Copy link
Contributor

jonmeow commented Aug 11, 2022

I think #1178 maybe addressed the need for a proposal here?

@chandlerc
Copy link
Contributor

I think #1178 maybe addressed the need for a proposal here?

@zygoloid or @josh11b to confirm.

@josh11b
Copy link
Contributor Author

josh11b commented Aug 12, 2022

Confirmed, #981 and #1178 addressed this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
leads question A question for the leads team
Projects
None yet
Development

No branches or pull requests

5 participants