Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial design for unions #139

Closed
wants to merge 10 commits into from
Closed

Initial design for unions #139

wants to merge 10 commits into from

Conversation

geoffromer
Copy link
Contributor

Note: FIXME indicates something I intend to address before merging, but TODO indicates something I expect to be addressed by future PRs.

@geoffromer geoffromer added the WIP label Aug 7, 2020
@googlebot googlebot added the cla: yes PR meets CLA requirements according to bot. label Aug 7, 2020
@geoffromer geoffromer added the proposal A proposal label Aug 7, 2020
docs/design/unions.md Outdated Show resolved Hide resolved

```
struct SsoString {
bitfield(1) var Bool: is_small;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we should be more actively considering ways of expressing this in a way where the language is aware of the relationship between is_small and which component of the union is active. This would open the door to zero-cost static analysis that the type is being used correctly, and lower-cost dynamic analysis.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that there's no guarantee that the discriminator is in the same struct as the union, or is even reachable from the union. There are even quite plausible cases where there's no discriminator at all, and safety is achieved entirely through static reasoning.

We could consider something like that as an optional add-on, rather than something that's inherent in all unions, but because it's an add-on, I don't think it needs to be part of the initial unions design. Instead, I think we should consider it as part of the overall safety story, because until then it won't be clear how much benefit it actually provides. I've just added a proposal doc to this PR, which includes a rationale for why I want to proceed with this proposal even though the safety plan are unresolved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that there is no guarantee that you won't need to do something unsafe, but I don't think that is the common case we need to make easy, especially since it is the case where the language can give you little support. Rust is an example of a language aiming for high performance but provides a safe primitive here. If you really need something outside what the language can reason about, then use something like a byte array and casts.

I think the main thing as a team we should be doing at this stage is deciding between high level approaches. So even if we ultimately decide to go with the approach currently presented in this document, the job of this document should be to present and discuss the alternatives under consideration, and presenting rationale for one choice over others. I very much feel this document needs to at least address the safer design alternative.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you added a blurb in the proposal text about discriminated unions -- I maintain that it requires more serious consideration since I think some of the concerns raised there are quite addressable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fully agreed with @josh11b. I feel like our safe algebraic data type facility should allow the type of micro-management of storage layout necessary for SsoString presented here. Of course, there will be cases where the discriminator is not accessible easily, but I don't think that's the common case, and that type of code is highly unsafe anyway, so there's little the language can do to help.

…n matching,

update the discussion of safety, and add some section headings.
@geoffromer geoffromer added proposal rfc Proposal with request-for-comment sent out and removed WIP labels Aug 12, 2020
Copy link
Contributor

@jonmeow jonmeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding some quick comments -- I haven't gotten far, but it's the end of the day and I don't want to run into commit skew if you're making changes before I get back.

docs/design/unions.md Outdated Show resolved Hide resolved
docs/design/unions.md Outdated Show resolved Hide resolved
docs/design/unions.md Outdated Show resolved Hide resolved
docs/design/unions.md Outdated Show resolved Hide resolved
docs/design/unions.md Outdated Show resolved Hide resolved
docs/design/unions.md Outdated Show resolved Hide resolved
docs/design/unions.md Outdated Show resolved Hide resolved
docs/design/unions.md Outdated Show resolved Hide resolved
docs/design/unions.md Outdated Show resolved Hide resolved
docs/design/unions.md Outdated Show resolved Hide resolved
Copy link
Contributor

@jonmeow jonmeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finishing a pass of comments.

docs/design/unions.md Show resolved Hide resolved
docs/design/unions.md Show resolved Hide resolved
docs/design/unions.md Outdated Show resolved Hide resolved
docs/design/unions.md Outdated Show resolved Hide resolved
docs/design/unions.md Outdated Show resolved Hide resolved

### No type punning

This proposal does not permit accessing an inactive union member. This is unlike
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the designator, this behavior can't be detected, right? And so in opt builds, this probably won't be caught and would not be an error?

I wonder, is "error" an apt description for something that would only be caught in debug build modes? Is this something that we should make clear in our documentation, that "error" does not mean a program will consistently fail?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed by chat, I've switched to using the term "out of contract". I'm not sure I like the result; I'm open to alternative suggestions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with "out of contract", but can you also replace "does not permit"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better?

Comment on lines 122 to 124
However, that's a false analogy: the defining characteristic of an uninitialized
or indeterminate value is that we know nothing about its state, whereas we know
everything there is to know about the state of an inactive union. In particular,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels very similar to me as if we simply said every struct defaults members to uninit, and thus we "know everything" and don't need require struct members be initialized.

(as I point out, this very much feels like an inconsistency -- I disagree with the false analogy claim)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the next sentence: when I say we "know nothing" about the state of an uninitialized value, what I mean is that if an operation's precondition is guaranteed to be true of an uninitialized value, then that precondition is also guaranteed to be true of every possible value. To put it another way, the knowledge that a variable is uninitialized never helps you deduce that a program is correct, or deduce anything about the behavior of a correct program.

By contrast, there are preconditions that are satisfied by an inactive union, but not by an active union (in particular, the precondition of create:), and the knowledge that a union is inactive can help you deduce that a program is correct (e.g. if that union is passed to create:).

Does that make sense? If so, any suggestions for how to make this point clearer in the text?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me it also seems like a distinction without a difference.

To put it another way, the knowledge that a variable is uninitialized never helps you deduce that a program is correct, or deduce anything about the behavior of a correct program.

Certainly it does. A storage location should be uninitialized before the allocation it is a part of is deallocated, and before create, regardless of whether it is a part of a union or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A storage location should be uninitialized before the allocation it is a part of is deallocated, and before create, regardless of whether it is a part of a union or not.

I agree, but I don't understand how that relates to what I said.


### `create` and `destroy` are union-only

In this proposal, `create` and `destroy` can only be applied to union members,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these only apply to union members, perhaps union_create and union_destroy might be better names? I'm just wondering if we want to reserve these two words as keywords for such a narrow purpose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've rephrased to try to make clear why I think that would be counterproductive. If you still want to change these keywords, can you elaborate on what problem you're trying to solve, i.e. what might go wrong if we stuck with the proposed names?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are use cases for leaving local variables and struct fields uninitialized:

var Int32: x = uninitialized;
if (...) {
  create x = 42;
} else {
  create x = 0;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the proposed names, combined with reserved keywords it means that people can't write create or destroy identifiers. It's just not obvious to me that unions are common enough for that. shrug

Dmitri, for local variables, suppose you have:

var Int32: x = uninit;
if (...) {
  x = 42;
}
(some body of code generating B)
if (B) {
  x = 0;
}

If you require create, and it's enforced in the way described here, then you would need:

var Int32: x = uninit;
var bool: created = false;
if (...) {
  create x = 42;
  created = true;
}
(some body of code generating B)
if (B) {
  if (created) {
    x = 0;
  } else {
    create x = 0;
  }
}

Or some other refactoring. I think this could get to be more complex, and would ideally not be required for most types. The link of requiring users to not just initialize, but also state that they expect this is the first initialization, seems like a burden that's maybe okay in the scope of unions (as a niche feature) but would likely be problematic taken broadly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sort of gets back to the discussion of uninit vs. a union with no active member: I've been assuming that = uninit is like default-initialization in C++: it creates the object, but leaves it in a partially-formed state (i.e. a state that can only be assigned to or destroyed). Consequently, it has to be an error to apply create to that object, because you can't create an object that already exists.

I agree that there are important use cases for manually controlling the lifetime of a local variable, but I think you should have to opt into that as part of the variable's declaration, e.g.

struct ManualLifetimeInt32 {
  union {
    Int32 object;
  }
}
...
var ManualLifetimeInt32: x;
if (...) {
  create: x.object = 42;
} else {
  create: x.object = 0;
}
...
destroy x.object;

ManualLifetimeInt32 could of course be a template, and might have a nicer API. One variant to explore would be a type where the underlying object starts out live, and is destroyed by the wrapper's destructor, but can be destroyed and recreated in the middle; that seems to be the more common pattern in C++, possibly because C++ makes it easier to write.

Regarding the proposed names, combined with reserved keywords it means that people can't write create or destroy identifiers.

I expect that to be the case regardless, since operator create and operator destroy are planned to be how we spell constructors and destructors.

Comment on lines +157 to +158
at the end of a union, but it's not clear that there are any compelling use
cases for it. In particular, packing at the end of a union is not necessary to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the other doc, I think I've suggested two use cases when I saw this:

  1. Avoid asymmetric behavior that creates bitfield ordering surprises for users.
  2. Sequential bitfield unions.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In an earlier comment, I suggested another one: packing a discriminator into the trailing bits of the union.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't describe 1 as a "use case" (i.e. something a user might want to do), and as noted on your other comment, the surprises in question don't seem very serious to me. As for 2, I don't find it very compelling; it seems like a very rare corner case (note that C and C++ don't support it, and I'm not aware of any complaints about that), and there's a viable if somewhat annoying workaround (nest the second union inside the first).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for packing into the trailing bits of the union, is there a reason the user couldn't pack the discriminator into the leading bits instead?

Copy link
Contributor

@zygoloid zygoloid Aug 15, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering the example I gave before, with a PackedPtr, where we'd be reusing the low-order (alignment) bits: The high order bits aren't necessarily available in general, so one ordering of the union versus the discriminator will use at most only a mask (and if the value of the discriminator is known, as it typically will be, can directly use the pointer value or ptr-1), whereas the other ordering will use bit-shifts. Which ordering is best will depend on endianness -- for little-endian, we do want to use the leading bits not the trailing ones, but it seems like that is us getting lucky rather than things working out by design.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree 1 isn't something a user might intend to do, but it is something they might accidentally do from misunderstanding how byte packing of unions works (or just clumsy reordering of struct members). This gets to the thread on the design:

@geoffromer:

Dangerous how? As far as I can tell, if users mistakenly think that packing can happen at the end of a bitfield, the worst case scenario is that their objects are somewhat larger than they expected.

@gribozavr:

That can be pretty bad for the types of software Carbon wants to target.

To emphasize on Dmitri's point, the issue is that it would be larger than the developer expects, which would be a bug. That could mean a difficult to detect, difficult to debug issue. In goal terms, I believe this is problematic under "Adhere to the principle of least surprise," an understandability issue for a performance feature.

Regarding 2, and in general, I'm fine if you don't consider issues compelling. Please try to address them in the alternatives regardless. Simply saying "it's not clear that there are any compelling use cases for it" without offering any discussion of cost-benefit or pros-cons feels more dismissive of concerns, at least to me.

Comment on lines +160 to +161
allowing packing at the end of a union would make it more difficult to tell when
accesses to two separate fields can cause a data race.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You already do this at the start of the boundary -- why is it measurably worse at the end of the boundary?

Maybe it'd be better to allow field groups at the struct level, to determine when layout packing applies, making it explicit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You already do this at the start of the boundary -- why is it measurably worse at the end of the boundary?

Because the end of the boundary might also be the start of another boundary, which multiplies the number of fields that can conflict with each other.

Maybe it'd be better to allow field groups at the struct level, to determine when layout packing applies, making it explicit?

It sounds like you're suggesting an approach where adjacent fields would not be packed together unless they're in the same field group, but that would defeat the whole purpose of introducing field groups in the first place; for example it would mean that is_small is not packed together with small.size or large.size.

The more fundamental problem is that we're trying to address cases where the packing boundaries cut across the logical field structure boundaries. For example, for packing purposes small.size groups with is_small but not with small.buffer, but for purposes of which fields are live at the same time, small.size groups with small.buffer but not with is_small. That means that at most one of those sets of boundaries can line up with open/close curly braces. C/C++ chose to make curly braces represent packing boundaries, but that means sometimes you have to put a field inside a union even though it's supposed to always be available. I'm proposing that Carbon should make curly braces represent logical field structure, but that means we can't use curly braces to define packing boundaries.

apply `create` to a union that already has an active member, or apply `destroy`
to a union that does not have an active member. `destroy` can be thought of as a
unary operator, but a `create` statement has the syntax and semantics of a
variable declaration, with `create` taking the place of `var`, and the field
Copy link
Contributor

@zygoloid zygoloid Aug 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
variable declaration, with `create` taking the place of `var`, and the field
variable declaration, with `create` taking the place of `var T`, where `T` is the type of the field, and the field

Copy link
Contributor

@zygoloid zygoloid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, this design seems very promising to me. Treating union and group as not-exactly-type containers has some appeal, and I like that this lets us form efficient structure layouts that C++ makes painful or impossible to achieve.

Tying the possibility of a subobject having a different lifetime from its container to whether the subobject is a union member seems quite appealing. It'll be interesting to see how this plays out with (for example) vector, where we might not be able to expose raw pointers to an array of elements, because the elements are actually a union { var T: value; }. But I think this is a good base to build from.

Comment on lines 156 to 157
have the same types, names, and order, and their names are part of the names of
their fields:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
have the same types, names, and order, and their names are part of the names of
their fields:
have the same types, names, and order. The name of a field group is part of the name of
each field in the group:

docs/design/unions.md Show resolved Hide resolved
Assert(str.small.size == 0);
destroy str.small;
create: str.large = (.size = 0, .capacity = 100, .buffer = MakeBuffer(100));
Assert(str.large.capacity == 100);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is str.large here? Is it possible to refer to that without a following .field? Does that expression have a type? Can I pass it to a function?

Not supposed to be leading questions -- I think it could possible be OK for the answer to be that you must always write .something after str.large, so it's really just as if large. is a prefix of the member names. But that would potentially lead to some non-uniformity. I think it could also possibly be OK for the answer to be that you get some unspecified anonymous type -- but that would somewhat contract the statement below that field groups don't have types.

Comment on lines 169 to 173
However, field groups are not objects, and do not have types; like unions, field
groups are a way of controlling the layout of the fields of a struct. For
example, if `large` and `small` were structs, the bitfields would not save any
space, and `is_small` would have to be followed by 63 bytes of padding in order
to ensure proper alignment of `large`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possible path through the "what is the type of str.large?" question would be to generalize what we mean by "type" to include types whose representation spans some set of bits instead of some set of bytes, and possibly a non-contiguous set of bits -- or at least, a sequence of bits not starting at the first bit in their alignment unit and not ending at the last bit in such a unit.

Comment on lines 182 to 184
field group containing a single field. The ending offset of a union is the
maximum ending offset of any field in any of its field groups, rounded up to the
next whole byte.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forcing a byte boundary to the end of a union means we can't do

union {
  bitfield(63) var Int63: x;
  bitfield(63) var PackedPtr(Foo): y;
}
bitfield(1) var Int1: active;

Do we need to pad to the end of the byte here? What do we get in exchange?

Comment on lines 194 to 195
destruction of an active union member as also destroying any immediately
preceding bitfields, and then recreating them with the same contents. This
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what this is acknowledging is that any store to a bitfield may actually also write to adjacent bitfield members, because the hardware doesn't support bit-level access, and that can result in a data race.

So in some sense yes, any store to a union is more expensive than a regular store, because we will first load the entire storage unit, including any unchanged members that are nearby, then change the bitfield in question with some bitwise operations, then store back to the entire storage unit. But I think the cost is the same for a destroy+create as it would be for a regular store.

docs/design/unions.md Outdated Show resolved Hide resolved
nested subobjects), and such pointers are indistinguishable from pointers to any
other object. Reliably sanitizing accesses via such pointers would require
dynamically tracking which union member (if any) each pointer points to,
propagating that information to subobject accesses, and instrumenting every
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps. We'd need to be able to represent pointers in such a way that we can invalidate all pointers that point to a union member when it becomes inactive. We could use pointer tagging for this, at the expense of not having the tag bits available for other purposes -- and that still leaves us instrumenting every access, but perhaps in a way we were going to anyway. Or we could track the set of pointers that point to the union member, which would likely require instrumenting at least every pointer assignment that we can't statically prove doesn't point to a union member.

I think a better baseline behavior would be to say that members (union or otherwise) cannot have their address taken by default. This is a desirable property for other reasons too. That still leaves open this problem for the case where unions are explicitly made addressable, but that's at least only a (hopefully small) subset of all union members.

Comment on lines 335 to 337
**TODO:** It looks very difficult to support exposing C++ unions to Carbon in an
automated way, unless we are willing to allow type-punning via ordinary pointer
reads, and allow assignment through a pointer to implicitly destroy and create
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't entirely follow this. If a C++ union is exposed to Carbon semantically as a Carbon struct { union { ... } } (with the proper layout enforced by the interoperability layer), the Carbon semantics would presumably apply to it in Carbon code, and the C++ semantics would presumably apply in C++ code -- Carbon wouldn't be able to change the active member by assignment but C++ would.

Is the concern here about migration rather than interoperability?

Comment on lines +157 to +158
at the end of a union, but it's not clear that there are any compelling use
cases for it. In particular, packing at the end of a union is not necessary to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In an earlier comment, I suggested another one: packing a discriminator into the trailing bits of the union.

Copy link
Contributor Author

@geoffromer geoffromer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I haven't finished responding to comments)

docs/design/unions.md Show resolved Hide resolved
nested subobjects), and such pointers are indistinguishable from pointers to any
other object. Reliably sanitizing accesses via such pointers would require
dynamically tracking which union member (if any) each pointer points to,
propagating that information to subobject accesses, and instrumenting every
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Clarification: there was a race between @zygoloid's comment and mine; I was responding to the original comment)

Comment on lines +157 to +158
at the end of a union, but it's not clear that there are any compelling use
cases for it. In particular, packing at the end of a union is not necessary to
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As for packing into the trailing bits of the union, is there a reason the user couldn't pack the discriminator into the leading bits instead?

}
```

A union consists of zero or more _members_, at most one of which can be live at
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bikeshed, but since you are already using the term "field" when referring to things contained in a struct, why not use the same term here for unions as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to clarify why the two aren't synonymous.

docs/design/unions.md Outdated Show resolved Hide resolved
accessed during pattern matching, and has no effect on whether the pattern
matches. If the pattern mentions a union member, the corresponding subpattern is
matched against that member, which means that user code must ensure that pattern
matching does not reach that point unless that member is active. Note that this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like there's a consequence that we would have to document exactly how pattern matching reads from memory and finds the best match, and that would become a part of language semantics. I personally don't think we want to do that given the optimization potential. I'd rather say that union members can't participate in pattern matching; given how subtle the code handling unions is, the code should spell out the exact memory access order for readability and making the intent explicit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have to document it exactly, we just have to constrain it enough to enable some basic pattern matching usages. In particular, we have to constrain it enough that code for a user-defined type can ensure that pattern matching applies to the discriminator before it applies to the union members.

I'm extremely reluctant to say that unions can't participate in pattern matching, because I think it will actually make pattern matching a much more complicated feature. If unions can't participate in pattern matching, then we have to invent some way for pattern matching on a user-defined type to delegate to non-pattern-matching-based code (i.e. a pretty heavyweight pattern matching customization mechanism), or say that user-defined types that support pattern matching can't be implemented in terms of unions. On the other hand, if we allow pattern matching on unions, we can allow user-defined types to support pattern matching by delegating to pattern matching on their implementation details.


## Overview

Fields of a struct can be grouped into _unions_. For example:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can fields within a union have non-trivial copy, assignment, or destruction semantics?

Does a struct with a union get a compiler-generated copy, assign, or destroy operation? If yes, what does it do with the fields inside the union?

(Sorry, I can't find this point discussed anywhere in the proposal, so I'm commenting in an arbitrary place.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A union member can have any type, but structs with unions won't get compiler-generated operations. Updated the text to make those points explicit.


```
struct SsoString {
bitfield(1) var Bool: is_small;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fully agreed with @josh11b. I feel like our safe algebraic data type facility should allow the type of micro-management of storage layout necessary for SsoString presented here. Of course, there will be cases where the discriminator is not accessible easily, but I don't think that's the common case, and that type of code is highly unsafe anyway, so there's little the language can do to help.

space, and `is_small` would have to be followed by 63 bits of padding in order
to ensure proper alignment of `large`.

## Layout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An alternative that I'd like us to consider is to by default let the compiler choose the layout, but allow the user to specify the layout exactly. For example, the parts of a single bitfield should be tagged with the name of the bitfield to demand that they are packed together, members that should be laid out at specific offsets should be annotated as such, any padding must be explicitly inserted (or the following field must be annotated as unaligned) etc.

We could demand that code must always explicitly specify the layout in unions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think explicit layout control specifically needs to be part of this proposal, rather than a separate one?

We could demand that code must always explicitly specify the layout in unions.

Sure, but what problem would that solve?


Conversely, C++ is more restrictive in one respect: the members of a C++ union
must be objects, and the union's alignment must conform to the alignment
requirements of its member objects. For example, there doesn't seem to be any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced it is important in practice to expose Carbon unions to C++. Most unions of this kind in C++ are typically very well encapsulated (like SsoString above), and expose an API that does not mention unions; so I expect the same to be true in Carbon as well. My suggestion is to represent Carbon unions as opaque byte blobs in C++ if the type does happen to de exported to C++.

Furthermore, since there's a semantic gap (like the fact that Carbon requires using special create and destroy operators to change the active member), I'm not sure that C++ union is really a good match, even if we tried hard to make it work. I think a more fruitful direction would be to expose Carbon unions to C++ as classes with no public data members, and only member functions to manipulate the data. The contract of those member functions would match Carbon's language semantics exactly.

prevents the toolchain from detecting and diagnosing bugs, and may foreclose
useful optimization opportunities.

These restrictions also substantially simplifies the design of the language: we
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
These restrictions also substantially simplifies the design of the language: we
These restrictions also substantially simplify the design of the language: we

substantively, there are two cases in which union names seem like they might be
useful. However, in both cases the actual utility seems limited.

First, it's tempting to think of an inactive union as equivalent to an
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
First, it's tempting to think of an inactive union as equivalent to an
First, it's tempting to think of an union without an active member as equivalent to an

Comment on lines 129 to 131
expectation that an integer variable holds a meaningful value. There is no
corresponding default expectation that unions will always have an active member,
so there is no need for special syntax to opt out of such a default.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDK about that. Whether someone has such expectations depends on a person and their background. I, for example, was surprised that unions allow having no active member. In my opinion, it should be at least an opt-in feature (since not every union will need it).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised this passage to hopefully make the argument clearer.

Comment on lines 122 to 124
However, that's a false analogy: the defining characteristic of an uninitialized
or indeterminate value is that we know nothing about its state, whereas we know
everything there is to know about the state of an inactive union. In particular,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me it also seems like a distinction without a difference.

To put it another way, the knowledge that a variable is uninitialized never helps you deduce that a program is correct, or deduce anything about the behavior of a correct program.

Certainly it does. A storage location should be uninitialized before the allocation it is a part of is deallocated, and before create, regardless of whether it is a part of a union or not.


### `create` and `destroy` are union-only

In this proposal, `create` and `destroy` can only be applied to union members,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are use cases for leaving local variables and struct fields uninitialized:

var Int32: x = uninitialized;
if (...) {
  create x = 42;
} else {
  create x = 0;
}

Copy link
Contributor Author

@geoffromer geoffromer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed out some fixes I had in progress, but before I respond to the remaining comments I'm going to put together an overview proposal for the design direction this is part of, in order to address some of the feedback about how this fits together with discriminated unions.


## Overview

Fields of a struct can be grouped into _unions_. For example:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A union member can have any type, but structs with unions won't get compiler-generated operations. Updated the text to make those points explicit.

}
```

A union consists of zero or more _members_, at most one of which can be live at
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to clarify why the two aren't synonymous.

accessed during pattern matching, and has no effect on whether the pattern
matches. If the pattern mentions a union member, the corresponding subpattern is
matched against that member, which means that user code must ensure that pattern
matching does not reach that point unless that member is active. Note that this
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have to document it exactly, we just have to constrain it enough to enable some basic pattern matching usages. In particular, we have to constrain it enough that code for a user-defined type can ensure that pattern matching applies to the discriminator before it applies to the union members.

I'm extremely reluctant to say that unions can't participate in pattern matching, because I think it will actually make pattern matching a much more complicated feature. If unions can't participate in pattern matching, then we have to invent some way for pattern matching on a user-defined type to delegate to non-pattern-matching-based code (i.e. a pretty heavyweight pattern matching customization mechanism), or say that user-defined types that support pattern matching can't be implemented in terms of unions. On the other hand, if we allow pattern matching on unions, we can allow user-defined types to support pattern matching by delegating to pattern matching on their implementation details.

space, and `is_small` would have to be followed by 63 bits of padding in order
to ensure proper alignment of `large`.

## Layout
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think explicit layout control specifically needs to be part of this proposal, rather than a separate one?

We could demand that code must always explicitly specify the layout in unions.

Sure, but what problem would that solve?

Comment on lines 122 to 124
However, that's a false analogy: the defining characteristic of an uninitialized
or indeterminate value is that we know nothing about its state, whereas we know
everything there is to know about the state of an inactive union. In particular,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A storage location should be uninitialized before the allocation it is a part of is deallocated, and before create, regardless of whether it is a part of a union or not.

I agree, but I don't understand how that relates to what I said.


### No type punning

This proposal does not permit accessing an inactive union member. This is unlike
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better?

Comment on lines 129 to 131
expectation that an integer variable holds a meaningful value. There is no
corresponding default expectation that unions will always have an active member,
so there is no need for special syntax to opt out of such a default.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Revised this passage to hopefully make the argument clearer.


### `create` and `destroy` are union-only

In this proposal, `create` and `destroy` can only be applied to union members,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This sort of gets back to the discussion of uninit vs. a union with no active member: I've been assuming that = uninit is like default-initialization in C++: it creates the object, but leaves it in a partially-formed state (i.e. a state that can only be assigned to or destroyed). Consequently, it has to be an error to apply create to that object, because you can't create an object that already exists.

I agree that there are important use cases for manually controlling the lifetime of a local variable, but I think you should have to opt into that as part of the variable's declaration, e.g.

struct ManualLifetimeInt32 {
  union {
    Int32 object;
  }
}
...
var ManualLifetimeInt32: x;
if (...) {
  create: x.object = 42;
} else {
  create: x.object = 0;
}
...
destroy x.object;

ManualLifetimeInt32 could of course be a template, and might have a nicer API. One variant to explore would be a type where the underlying object starts out live, and is destroyed by the wrapper's destructor, but can be destroyed and recreated in the middle; that seems to be the more common pattern in C++, possibly because C++ makes it easier to write.

Regarding the proposed names, combined with reserved keywords it means that people can't write create or destroy identifiers.

I expect that to be the case regardless, since operator create and operator destroy are planned to be how we spell constructors and destructors.

@geoffromer geoffromer added WIP and removed proposal rfc Proposal with request-for-comment sent out labels Sep 11, 2020
@jonmeow jonmeow marked this pull request as draft April 20, 2021 16:19
@jonmeow jonmeow removed the WIP label Apr 20, 2021
@github-actions
Copy link

We triage inactive PRs and issues in order to make it easier to find active work. If this PR should remain active, please comment or remove the inactive label.
This PR is labeled inactive because the last activity was over 90 days ago. This PR will be closed and archived after 14 additional days without activity.

@github-actions github-actions bot added the inactive Issues and PRs which have been inactive for at least 90 days. label Jul 31, 2021
@github-actions
Copy link

We triage inactive PRs and issues in order to make it easier to find active work. If this PR should remain active or becomes active again, please reopen it.
This PR was closed and archived because there has been no new activity in the 14 days since the inactive label was added.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cla: yes PR meets CLA requirements according to bot. inactive Issues and PRs which have been inactive for at least 90 days. proposal deferred Decision made, proposal deferred proposal A proposal
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants