Initial design for unions #139
Changes from all commits
62fbdc3
dfbbd28
1cdb79e
092f159
51884a0
96e53db
a7fe9fc
8fcd122
54f3653
7a11a60
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,353 @@ | ||
# Unions | ||
|
||
<!-- | ||
Part of the Carbon Language project, under the Apache License v2.0 with LLVM | ||
Exceptions. See /LICENSE for license information. | ||
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception | ||
--> | ||
|
||
## Table of contents | ||
|
||
<!-- toc --> | ||
|
||
- [Overview](#overview) | ||
- [Union members](#union-members) | ||
- [Changing the live member](#changing-the-live-member) | ||
- [Field groups](#field-groups) | ||
- [Layout](#layout) | ||
- [Safety](#safety) | ||
- [C++ interoperability and migration](#c-interoperability-and-migration) | ||
|
||
<!-- tocstop --> | ||
|
||
## Overview | ||
|
||
Fields of a struct can be grouped into _unions_. For example: | ||
|
||
``` | ||
struct Number { | ||
union { | ||
var Int64: int_value; | ||
var Float64: float_value; | ||
} | ||
|
||
// 0 if no live member, 1 if int_value is live, 2 if float_value is live. | ||
var Int2: discriminator; | ||
} | ||
``` | ||
|
||
A union consists of zero or more _members_, at most one of which can be live at | ||
Review comment: A bikeshed, but since you are already using the term "field" when referring to things contained in a struct, why not use the same term here for unions as well?
Review comment: Updated to clarify why the two aren't synonymous. |
||
a time. A member can be either a field of any type, or a | ||
[field group](#field-groups). All members of a union share the same storage, so | ||
`Number` as a whole is 9 bytes, not 17. It is out of contract to access a member | ||
that is not live, even if it has the same type as the live member. There is no | ||
intrinsic way to determine which member of the union is live, and in fact that | ||
information will probably not be tracked in production builds. Instead, user | ||
code is responsible for doing whatever bookkeeping is necessary to ensure that | ||
only the live member is accessed. This typically takes the form of an associated | ||
discriminator field, as in `Number`, but this is not required. For example, some | ||
user code may be able to satisfy that requirement statically, with no run-time | ||
tracking at all. | ||
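
The bookkeeping described above can be sketched in C++ as a rough analogue, not as part of the proposal. `NumberCpp`, `SetInt`, and `TryGetInt` are hypothetical names introduced only for illustration, and a typical C++ ABI will pad this struct for alignment, so its size need not match the Carbon figure quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical C++ counterpart of the Carbon `Number` struct.
struct NumberCpp {
  union {
    std::int64_t int_value;
    double float_value;
  };
  // 0 if no member is live, 1 if int_value is live, 2 if float_value is live.
  std::uint8_t discriminator = 0;
};

// Makes int_value the live member and records that in the discriminator.
inline void SetInt(NumberCpp& n, std::int64_t v) {
  n.int_value = v;  // Assigning a trivial union member makes it the active one.
  n.discriminator = 1;
}

// Reads int_value only if the discriminator says it is live.
inline bool TryGetInt(const NumberCpp& n, std::int64_t* out) {
  if (n.discriminator != 1) return false;
  *out = n.int_value;
  return true;
}
```

The language does not check any of this: the discriminator is ordinary user data, and nothing stops code from reading `float_value` while `discriminator` is 1.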
|
||
> **TODO:** If Carbon supports auto-generating struct operations such as | ||
> copying, assignment, or destruction, we will need to specify that it does not | ||
> support structs that contain unions, since the generated code won't know which | ||
> field is live. | ||
|
||
> **TODO:** Carbon should also permit users to easily define "algebraic data | ||
> types", which are type-safe union-like constructs with compiler-provided | ||
> discriminators. It will also need a separate facility for reinterpreting the | ||
> representation of one type to the representation of a different type (also | ||
> known as "type punning"). When those are designed, cite them above. | ||
|
||
## Union members | ||
|
||
Unlike C++ unions, Carbon unions never have names, are not objects, and do not | ||
have types; they are a means of controlling the layout of fields in a struct. | ||
Consequently, it is not possible to form a pointer to a union. You can form a | ||
pointer to a member field of a union, but you cannot cast it to a pointer to the | ||
type of a different member. | ||
|
||
Union fields are referenced and initialized exactly like ordinary members of the | ||
enclosing struct. However, union fields can be omitted from the initializer, | ||
|
||
unlike ordinary fields. Obviously the initializer cannot initialize more than | ||
one member of a union, but it can initialize either one member or none. If no | ||
member of the union is initialized, no member of the union is initially live: | ||
|
||
``` | ||
// Neither n1.int_value nor n1.float_value is live | ||
Number: n1 = ( .discriminator = 0 ); | ||
|
||
// n2.int_value is live and holds 0 | ||
Number: n2 = ( .int_value = 0, .discriminator = 1 ); | ||
|
||
// n3.int_value is live, but not initialized | ||
Number: n3 = ( .int_value = uninit, .discriminator = 1 ); | ||
|
||
// Error: cannot initialize multiple members of a single union | ||
Number: n4 = ( .int_value = 0, .float_value = uninit, .discriminator = 1 ); | ||
``` | ||
|
||
Unions follow similar rules in pattern matching: a pattern can mention either | ||
|
||
zero or one member of a given union. If no member is mentioned, the union is not | ||
accessed during pattern matching, and has no effect on whether the pattern | ||
matches. If the pattern mentions a union member, the corresponding subpattern is | ||
matched against that member, which means that user code must ensure that pattern | ||
matching does not reach that point unless that member is live. Note that this | ||
applies even if the subpattern is a wildcard that is guaranteed to match. | ||
|
||
> **TODO:** Ensure the above remains consistent with the overall design for | ||
> struct initialization and pattern matching, and ensure pattern matching gives | ||
> enough control over evaluation order to make it possible to safely mention | ||
> union members in patterns. | ||
|
||
## Changing the live member | ||
|
||
The live member can only be changed by destroying the current live member (if | ||
any), and then constructing the new live member, using the `destroy` and | ||
`create` keywords: | ||
|
||
``` | ||
fn SetFloatValue(Ptr(Number): n, Float64: value) { | ||
if (n->discriminator == 2) { | ||
// float_value is already live, so plain assignment is fine. | ||
n->float_value = value; | ||
} else { | ||
if (n->discriminator == 1) { | ||
destroy n->int_value; | ||
} | ||
create: n->float_value = value; | ||
n->discriminator = 2; | ||
} | ||
} | ||
``` | ||
|
||
`create` and `destroy` can only be applied to union members (unlike their C++ | ||
counterparts, placement `new` and pseudo-destructor calls); the lifetimes of | ||
ordinary struct fields and variables are always tied to their scope. It is out | ||
of contract to apply `create` to a member of a union that already has a live | ||
member, or apply `destroy` to a member that is not live. `destroy` can be | ||
thought of as a unary operator, but a `create` statement has the syntax and | ||
semantics of a variable declaration, with `create` taking the place of `var` and | ||
the field type, and the field expression taking the place of the variable name. | ||
`destroy` permanently invalidates any pointers to the destroyed object; they do | ||
not become valid again if that member is re-created. | ||
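
For comparison, here is a minimal sketch of the C++ counterparts mentioned above, placement `new` and an explicit destructor call, applied to a union with a non-trivial member. The `Value` type and its methods are hypothetical and exist only for illustration. The explicit steps in `Clear` and `SetText` correspond roughly to Carbon's `destroy` and `create`.

```cpp
#include <cassert>
#include <new>
#include <string>

// Hypothetical struct whose union holds a non-trivial member.
struct Value {
  union {
    long int_value;
    std::string text_value;
  };
  int discriminator;  // 0: no live member, 1: int_value, 2: text_value.

  Value() : discriminator(0) {}
  ~Value() { Clear(); }
  Value(const Value&) = delete;
  Value& operator=(const Value&) = delete;

  // Destroys the live member, if any (the analogue of `destroy`).
  void Clear() {
    if (discriminator == 2) text_value.~basic_string();
    discriminator = 0;
  }

  // Makes text_value live (the analogue of `create`), or assigns if it is
  // already live.
  void SetText(const std::string& s) {
    if (discriminator == 2) {
      text_value = s;
      return;
    }
    Clear();
    new (&text_value) std::string(s);
    discriminator = 2;
  }

  // Makes int_value live, destroying any previously live member first.
  void SetInt(long v) {
    Clear();
    new (&int_value) long(v);
    discriminator = 1;
  }
};
```

Copying is deleted in this sketch because generated copy operations could not know which member is live.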
|
||
> **FIXME**: Does the `create` syntax make it sufficiently clear that the `=` | ||
> represents initialization, not assignment? Is it OK that the `create` syntax | ||
> omits the type? Can we do better? | ||
|
||
> **TODO:** The spellings of `create` and `destroy` are chosen for consistency | ||
> with `operator create` and `operator destroy`, which are how constructor and | ||
> destructor declarations are spelled in the currently-pending structs proposal. | ||
> They should be updated as necessary to stay consistent. | ||
|
||
## Field groups | ||
|
||
A union member can be either a field or a _group_ of fields: | ||
|
||
``` | ||
struct SsoString { | ||
bitfield(1) var Bool: is_small; | ||
Review comment: It seems like we should be more actively considering ways of expressing this in a way where the language is aware of the relationship between
Review comment: The problem is that there's no guarantee that the discriminator is in the same struct as the union, or is even reachable from the union. There are even quite plausible cases where there's no discriminator at all, and safety is achieved entirely through static reasoning. We could consider something like that as an optional add-on, rather than something that's inherent in all unions, but because it's an add-on, I don't think it needs to be part of the initial unions design. Instead, I think we should consider it as part of the overall safety story, because until then it won't be clear how much benefit it actually provides. I've just added a proposal doc to this PR, which includes a rationale for why I want to proceed with this proposal even though the safety plans are unresolved.
Review comment: I agree that there is no guarantee that you won't need to do something unsafe, but I don't think that is the common case we need to make easy, especially since it is the case where the language can give you little support. Rust is an example of a language aiming for high performance but provides a safe primitive here. If you really need something outside what the language can reason about, then use something like a byte array and casts. I think the main thing as a team we should be doing at this stage is deciding between high level approaches.
So even if we ultimately decide to go with the approach currently presented in this document, the job of this document should be to present and discuss the alternatives under consideration, and presenting rationale for one choice over others. I very much feel this document needs to at least address the safer design alternative.
Review comment: It looks like you added a blurb in the proposal text about discriminated unions -- I maintain that it requires more serious consideration since I think some of the concerns raised there are quite addressable.
Review comment: Fully agreed with @josh11b. I feel like our safe algebraic data type facility should allow the type of micro-management of storage layout necessary for |
||
|
||
union { | ||
group small { | ||
|
||
bitfield(7) var Int7: size; | ||
var FixedArray(Char, 22): buffer; | ||
} | ||
group large { | ||
bitfield(63) var Int63: size; | ||
var Int64: capacity; | ||
var UniquePtr(Array(Char)): buffer; | ||
} | ||
} | ||
} | ||
``` | ||
|
||
> **TODO:** The treatment of bitfields, arrays, and owning pointers in this | ||
> example is speculative, and should be updated to reflect the eventual design | ||
> of those features. | ||
|
||
Field groups are sets of fields (and/or unions) that can be created and | ||
destroyed as a unit. They are initialized from anonymous structs whose fields | ||
have the same types, names, and order. The name of a field group is part of the | ||
name of each field in the group: | ||
|
||
``` | ||
var SsoString: str = (.is_small = True, | ||
.small = (.size = 0, .buffer = uninitialized) | ||
); | ||
Assert(str.small.size == 0); | ||
destroy str.small; | ||
create: str.large = (.size = 0, .capacity = 100, .buffer = MakeBuffer(100)); | ||
Assert(str.large.capacity == 100); | ||
Review comment: What is Not supposed to be leading questions -- I think it could possibly be OK for the answer to be that you must always write |
||
``` | ||
|
||
However, field groups are not objects, and do not have types; like unions, field | ||
groups are a way of controlling the layout of the fields of a struct. For | ||
example, if `large` and `small` were structs, the bitfields would not save any | ||
space, and `is_small` would have to be followed by 63 bits of padding in order | ||
to ensure proper alignment of `large`. | ||
|
||
## Layout | ||
Review comment: An alternative that I'd like us to consider is to by default let the compiler choose the layout, but allow the user to specify the layout exactly. For example, the parts of a single bitfield should be tagged with the name of the bitfield to demand that they are packed together, members that should be laid out at specific offsets should be annotated as such, any padding must be explicitly inserted (or the following field must be annotated as unaligned) etc. We could demand that code must always explicitly specify the layout in unions.
Review comment: Do you think explicit layout control specifically needs to be part of this proposal, rather than a separate one?
Sure, but what problem would that solve? |
||
|
||
We express the layout of a field in terms of its starting and ending offsets, | ||
which are measured in bits in order to handle bitfields. The starting and ending | ||
offsets of each field in a field group are the same as if the fields were all | ||
direct members of the enclosing struct. A union member that is not a field group | ||
is laid out as if it were a field group containing a single field. The ending | ||
offset of a union is the maximum ending offset of any field in any of its field | ||
groups, rounded up to the next whole byte. Note that this means that bitfields | ||
after the union are not "packed" together with any bitfields at the end of the | ||
union. | ||
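
The ending-offset rule above can be illustrated with a small C++ sketch. It rests on a simplifying assumption that the text does not state: fields are packed sequentially at bit granularity with no alignment padding. `Group` and `UnionEndOffsetBits` are hypothetical names used only for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Sizes in bits of the fields of one field group, in declaration order.
using Group = std::vector<std::uint64_t>;

// Returns the ending bit offset of a union that starts at bit `start`.
// Assumption: fields are laid out back to back, with no alignment padding.
inline std::uint64_t UnionEndOffsetBits(std::uint64_t start,
                                        const std::vector<Group>& groups) {
  std::uint64_t max_end = start;
  for (const Group& group : groups) {
    std::uint64_t end = start;
    for (std::uint64_t bits : group) end += bits;  // Sequential layout.
    max_end = std::max(max_end, end);
  }
  // Round up to the next whole byte.
  return (max_end + 7) / 8 * 8;
}
```

Under that assumption, and taking `Char` as 8 bits and `UniquePtr` as 64 bits (also assumptions), the `SsoString` union starts at bit 1 after `is_small`. The `small` group ends at bit 184 and the `large` group at bit 192, so the union ends at bit 192, or 24 bytes.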
|
||
> **TODO:** This presupposes that struct fields are laid out sequentially; if | ||
> that is not the case, this algorithm will need to be revised accordingly. | ||
|
||
The C/C++ memory model (which we expect Carbon to adopt) treats any maximal | ||
contiguous sequence of bitfields as a single memory location, but C/C++ do not | ||
allow bitfields to be split across the beginning of a union. Introducing that | ||
ability in Carbon implies that the size of a memory location can change, which | ||
the memory model doesn't seem to countenance. In order to fix that | ||
inconsistency, we model the creation or destruction of a union member as also | ||
destroying any immediately preceding bitfields, and then recreating them with | ||
the same contents. This implies that you cannot access the preceding bitfields | ||
concurrently with creating or destroying a union member. | ||
|
||
A union cannot be nested directly within a union, and a field group cannot be | ||
nested directly within a struct or a field group. Unions and field groups cannot | ||
contain constructors, destructors, or methods. | ||
|
||
## Safety | ||
|
||
The safety rules for Carbon unions are easily summarized: it is always out of | ||
contract to access or destroy a union member that is not live, and the | ||
operations that can change the live member are always explicit and unambiguous. | ||
It should be quite straightforward for a sanitizer to check direct accesses to | ||
union members, by tracking the live member in shadow memory, and verifying that | ||
the member being accessed is live. | ||
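
As a minimal sketch of that idea, the following C++ class models shadow state as a map from a union's address to the index of its live member. `UnionShadow` and its methods are hypothetical. A real sanitizer would use compiler instrumentation and real shadow memory rather than a hash map, and this sketch covers direct accesses only.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical sanitizer runtime state: maps each union's address to the
// index of its live member. A union with no entry has no live member.
class UnionShadow {
 public:
  // Records a `create`; returns false if the union already has a live member.
  bool OnCreate(const void* union_addr, int member) {
    return live_.emplace(union_addr, member).second;
  }

  // Records a `destroy`; returns false if `member` is not the live member.
  bool OnDestroy(const void* union_addr, int member) {
    auto it = live_.find(union_addr);
    if (it == live_.end() || it->second != member) return false;
    live_.erase(it);
    return true;
  }

  // Checks a direct access; returns true only if `member` is live.
  bool CheckAccess(const void* union_addr, int member) const {
    auto it = live_.find(union_addr);
    return it != live_.end() && it->second == member;
  }

 private:
  std::unordered_map<const void*, int> live_;
};
```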
|
||
However, union members can also be accessed through pointers (including pointers | ||
to nested subobjects), and such pointers are indistinguishable from pointers to | ||
any other object. Reliably sanitizing accesses through such pointers would | ||
require dynamically tracking which union member (if any) each pointer points to, | ||
propagating that information to subobject accesses, and instrumenting every | ||
Review comment: Could taking a pointer require that the named target of the pointer be active, and then invalidate the pointer if the active member is destroyed? i.e., without instrumenting every pointer access.
Review comment: Yes, that seems like a better model, so I've made that change. However, I don't think that removes the need to instrument every pointer access. If anything, it suggests a need for more instrumentation: now we have to not only check that this pointer points to the active member, we also have to check that the active member wasn't destroyed and then recreated since the pointer was formed.
Review comment: Perhaps. We'd need to be able to represent pointers in such a way that we can invalidate all pointers that point to a union member when it becomes inactive. We could use pointer tagging for this, at the expense of not having the tag bits available for other purposes -- and that still leaves us instrumenting every access, but perhaps in a way we were going to anyway. Or we could track the set of pointers that point to the union member, which would likely require instrumenting at least every pointer assignment that we can't statically prove doesn't point to a union member. I think a better baseline behavior would be to say that members (union or otherwise) cannot have their address taken by default. This is a desirable property for other reasons too. That still leaves open this problem for the case where unions are explicitly made addressable, but that's at least only a (hopefully small) subset of all union members.
Review comment: (Clarification: there was a race between @zygoloid's comment and mine; I was responding to the original comment) |
||
pointer access in the program to determine whether it is accessing a union, and | ||
if so whether the currently live member matches the one the pointer value was | ||
created with. This would definitely be too costly for a hardened production | ||
build mode, and might even be too costly (relative to its benefit) to be useful | ||
as a sanitizer. | ||
|
||
> **TODO:** Figure out whether and how to sanitize invalid union accesses, | ||
> probably in the context of an overall design for temporal memory safety. | ||
|
||
## C++ interoperability and migration | ||
|
||
Carbon unions are more restrictive than C++ unions in several respects: | ||
|
||
- C++ unions can be types, and can have names. | ||
- C++ permits the active member to be implicitly changed by assigning to an | ||
inactive member. | ||
- C++ permits accessing fields of an inactive member if they are part of a | ||
"common initial sequence" of fields that's shared with the active member. | ||
- In practice, C++ permits accessing inactive members even in ways that | ||
violate the "common initial sequence" rule, with the semantics that the | ||
object representation of the active member is reinterpreted as a | ||
representation of the inactive member. This is formally undefined behavior, | ||
but broadly supported and fairly common in practice. | ||
|
||
Conversely, C++ is more restrictive in one respect: the members of a C++ union | ||
must be objects, and the union's alignment must conform to the alignment | ||
requirements of its member objects. For example, there doesn't seem to be any | ||
Review comment: I'm not convinced it is important in practice to expose Carbon unions to C++. Most unions of this kind in C++ are typically very well encapsulated (like Furthermore, since there's a semantic gap (like the fact that Carbon requires using special |
||
way of expressing this Carbon struct in terms of C++ unions while preserving | ||
both its structure and its layout: | ||
|
||
``` | ||
struct S { | ||
union { | ||
group g1 { | ||
Int32: a; | ||
Int16: b; | ||
Int8: c; | ||
} | ||
group g2 {} | ||
Review comment: Should empty groups like this be disallowed?
Review comment: I don't think so; disallowing them could make it harder to reduce test cases, search for bugs by commenting out code, etc, and wouldn't provide any benefit that I can see.
Review comment: I wonder here... The benefit would be that you'd prevent people from using active, inoperative code. Allowing this could be a readability issue, especially if members in the middle are commented out in a way that obscures a group actually being empty. Regarding comments, they could just block-comment including the group. I'm not clear how test cases relate. To me, this is similar to not allowing groups nested within groups: if that's disallowed because it doesn't offer structural benefits to outweigh costs, I don't see how an empty group is better. |
||
} | ||
union { | ||
group g3 { | ||
Int8: d; | ||
Int32: e; | ||
} | ||
group g4 {} | ||
} | ||
} | ||
``` | ||
|
||
If that structure were naively translated to C++, `g3` would be required to | ||
have 4-byte alignment, which would force the addition of a byte of padding | ||
between the unions. That would in turn force the addition of 3 bytes of padding | ||
between `d` and `e`, in order to properly align `e`. | ||
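
The padding described above can be checked with a short C++ sketch. The numbers assume a typical ABI in which `int32_t` has 4-byte alignment. `NaiveS`, `FlatS`, and the helper structs are hypothetical names. `FlatS` approximates the layout where all fields are direct members of the struct, which is how the Layout section defines field offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct G1 { std::int32_t a; std::int16_t b; std::int8_t c; };
struct G3 { std::int8_t d; std::int32_t e; };
struct Empty {};

// Naive translation: each Carbon union becomes a C++ union of structs.
struct NaiveS {
  union { G1 g1; Empty g2; };
  union { G3 g3; Empty g4; };
};

// Reference layout: every field is a direct member of the struct.
struct FlatS {
  std::int32_t a;
  std::int16_t b;
  std::int8_t c;
  std::int8_t d;
  std::int32_t e;
};
```

On such an ABI, `d` sits at byte 7 and `e` at byte 8 in `FlatS`, for 12 bytes in total. In `NaiveS`, `g3` starts at byte 8 and `e` lands at byte 12, for 16 bytes in total.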
|
||
We can preserve the layout by expanding all groups so that they contain all | ||
Review comment: Noting that bitfield alignment is deliberately different with Carbon, should there be a method for aligning bitfields consistently?
Review comment: Can you be more specific about what you mean by "a method for aligning bitfields consistently"? If you mean "a way to make a C++ struct match the alignment of a corresponding Carbon struct", that's exactly what this passage is trying to describe.
Review comment: This section doesn't mention The issue I see with |
||
prior members of the struct, and hence all start at offset 0: | ||
|
||
```c++ | ||
struct S { | ||
union { | ||
struct /* g3 */ { | ||
union { | ||
struct { | ||
int32_t a; | ||
int16_t b; | ||
int8_t c; | ||
} g1; | ||
struct {} g2; | ||
}; | ||
int8_t d; | ||
int32_t e; | ||
} g3; | ||
struct /* g4 */ { | ||
union { | ||
struct { | ||
int32_t a; | ||
int16_t b; | ||
int8_t c; | ||
} g1; | ||
struct {} g2; | ||
}; | ||
} g4; | ||
}; | ||
}; | ||
``` | ||
|
||
However, this doesn't preserve the naming structure of the fields: `s.g1.a` in | ||
Carbon would become either `s.g3.g1.a` or `s.g4.g1.a` in C++ (the two are | ||
equivalent because they are part of a common initial sequence), and every | ||
additional union in the struct compounds this problem. Rather than expose this | ||
complexity to users, the members that form the actual data layout will be made | ||
private, and the original members will be exposed through methods that return | ||
references, so that `s.g1.a` becomes `s.g1().a()`. | ||
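
The accessor pattern can be sketched as follows. This is an illustration of the general shape only: it covers just the first union, does not attempt to reproduce the expanded layout, and every name in it (`G1Fields`, `G1Ref`, `create_g1`, `Layout`) is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <new>

struct G1Fields { std::int32_t a; std::int16_t b; std::int8_t c; };
struct EmptyFields {};

class S {
 public:
  // Accessor object standing in for the Carbon field group `g1`.
  class G1Ref {
   public:
    explicit G1Ref(G1Fields& fields) : fields_(fields) {}
    std::int32_t& a() const { return fields_.a; }
    std::int16_t& b() const { return fields_.b; }
    std::int8_t& c() const { return fields_.c; }

   private:
    G1Fields& fields_;
  };

  // Hypothetical counterpart of `create: s.g1 = (...)` with zeroed fields.
  void create_g1() { new (&layout_.g1) G1Fields(); }

  // Exposes the group through a method instead of a data member.
  G1Ref g1() { return G1Ref(layout_.g1); }

 private:
  // The members that form the actual data layout stay private.
  struct Layout {
    union { G1Fields g1; EmptyFields g2; };
  };
  Layout layout_;
};
```

With this shape, the Carbon expression `s.g1.a` is written `s.g1().a()` in C++, and the private `Layout` can change without affecting callers.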
Review comment: Does
Review comment: Strictly speaking this reasoning doesn't apply to
Review comment: Couldn't it return a struct for |
||
|
||
> **TODO:** With this scheme, all data members up to and including the last | ||
> union in a struct must be exposed through methods. The structs design will | ||
> need to determine whether any subsequent members are exposed in C++ as data | ||
> members, methods or both. | ||
|
||
Note that this mapping doesn't quite preserve concurrency semantics: Carbon code | ||
can safely access the first union while concurrently creating or destroying a | ||
member of the second union, but in C++ the corresponding operation would be | ||
undefined behavior. | ||
|
||
> **FIXME:** Does this matter in practice, and can we do anything to avoid the | ||
> undefined behavior? | ||
|
||
> **TODO:** The design of the Carbon memory model will need to address this | ||
> inconsistency. While we generally intend to adopt the C/C++ memory model, it's | ||
> unclear exactly what that means in cases like this one, where Carbon code | ||
> creates situations that are inexpressible in C/C++. Note that it's not | ||
> entirely clear whether the undefined behavior in question is a data race per | ||
> se, because it's not clear whether invoking a trivial constructor or | ||
> destructor actually modifies the memory locations containing the object. | ||
|
||
> **TODO:** It looks very difficult to support exposing C++ unions to Carbon in | ||
> an automated way, unless we are willing to allow type-punning through ordinary | ||
> pointer reads, and allow assignment through a pointer to implicitly destroy | ||
> and create objects. It may be possible to support partial or full automation | ||
> in cases where the union is sufficiently encapsulated, but this will require | ||
> further research about what encapsulation patterns are common. |
Review comment: Can fields within a union have non-trivial copy, assignment, or destruction semantics?
Does a struct with a union get a compiler-generated copy, assign, or destroy operation? If yes, what does it do with the fields inside the union?
(Sorry, I can't find this point discussed anywhere in the proposal, so I'm commenting in an arbitrary place.)
Review comment: A union member can have any type, but structs with unions won't get compiler-generated operations. Updated the text to make those points explicit.