Initial design for unions #139

Closed
wants to merge 10 commits into from
353 changes: 353 additions & 0 deletions docs/design/unions.md
@@ -0,0 +1,353 @@
# Unions

<!--
Part of the Carbon Language project, under the Apache License v2.0 with LLVM
Exceptions. See /LICENSE for license information.
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-->

## Table of contents

<!-- toc -->

- [Overview](#overview)
- [Union members](#union-members)
- [Changing the live member](#changing-the-live-member)
- [Field groups](#field-groups)
- [Layout](#layout)
- [Safety](#safety)
- [C++ interoperability and migration](#c-interoperability-and-migration)

<!-- tocstop -->

## Overview

Fields of a struct can be grouped into _unions_. For example:
Contributor

Can fields within a union have non-trivial copy, assignment, or destruction semantics?

Does a struct with a union get a compiler-generated copy, assign, or destroy operation? If yes, what does it do with the fields inside the union?

(Sorry, I can't find this point discussed anywhere in the proposal, so I'm commenting in an arbitrary place.)

Contributor Author

A union member can have any type, but structs with unions won't get compiler-generated operations. Updated the text to make those points explicit.


```
struct Number {
  union {
    var Int64: int_value;
    var Float64: float_value;
  }

  // 0 if no live member, 1 if int_value is live, 2 if float_value is live.
  var Int2: discriminator;
}
```

A union consists of zero or more _members_, at most one of which can be live at
Contributor

A bikeshed, but since you are already using the term "field" when referring to things contained in a struct, why not use the same term here for unions as well?

Contributor Author

Updated to clarify why the two aren't synonymous.

a time. A member can be either a field of any type, or a
[field group](#field-groups). All members of a union share the same storage, so
`Number` as a whole is 9 bytes, not 17. It is out of contract to access a member
that is not live, even if it has the same type as the live member. There is no
intrinsic way to determine which member of the union is live, and in fact that
information will probably not be tracked in production builds. Instead, user
code is responsible for doing whatever bookkeeping is necessary to ensure that
only the live member is accessed. This typically takes the form of an associated
discriminator field, as in `Number`, but this is not required. For example, some
user code may be able to satisfy that requirement statically, with no run-time
tracking at all.
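
For readers coming from C++, the bookkeeping described above can be sketched with an anonymous union plus a hand-maintained discriminator. This is only an analogy, not Carbon semantics: the names `Live`, `MakeInt`, and `AsDouble` are hypothetical, and C++ alignment padding typically makes this struct larger than the 9 bytes that Carbon's `Number` would occupy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical discriminator values mirroring the comment in `Number`.
enum class Live : std::uint8_t { kNone = 0, kInt = 1, kFloat = 2 };

struct Number {
  union {  // Both members share the same storage.
    std::int64_t int_value;
    double float_value;
  };
  Live discriminator;
};

// Assumes 8-byte members: the union then contributes 8 bytes, not 8 + 8.
static_assert(sizeof(std::int64_t) == 8 && sizeof(double) == 8,
              "sketch assumes 8-byte members");
static_assert(offsetof(Number, discriminator) == 8,
              "discriminator follows the shared storage");

// Makes int_value the live member and records that in the discriminator.
inline Number MakeInt(std::int64_t v) {
  Number n;
  n.int_value = v;
  n.discriminator = Live::kInt;
  return n;
}

// User code consults the discriminator before touching any member.
inline double AsDouble(const Number& n) {
  switch (n.discriminator) {
    case Live::kInt:
      return static_cast<double>(n.int_value);
    case Live::kFloat:
      return n.float_value;
    case Live::kNone:
      break;
  }
  assert(false && "no live member");
  return 0.0;
}
```

Nothing in the language forces `AsDouble` to check `discriminator`; as in the Carbon design, keeping the discriminator in sync with the live member is entirely the responsibility of user code.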

> **TODO:** If Carbon supports auto-generating struct operations such as
> copying, assignment, or destruction, we will need to specify that it does not
> support structs that contain unions, since the generated code won't know which
> field is live.

> **TODO:** Carbon should also permit users to easily define "algebraic data
> types", which are type-safe union-like constructs with compiler-provided
> discriminators. It will also need a separate facility for reinterpreting the
> representation of one type to the representation of a different type (also
> known as "type punning"). When those are designed, cite them above.

## Union members

Unlike C++ unions, Carbon unions never have names, are not objects, and do not
have types; they are a means of controlling the layout of fields in a struct.
Consequently, it is not possible to form a pointer to a union. You can form a
pointer to a member field of a union, but you cannot cast it to a pointer to the
type of a different member.

Union fields are referenced and initialized exactly like ordinary members of the
enclosing struct. However, union fields can be omitted from the initializer,
unlike ordinary fields. Obviously the initializer cannot initialize more than
one member of a union, but it can initialize either one member or none. If no
member of the union is initialized, no member of the union is initially live:

```
// Neither n1.int_value nor n1.float_value is live
Number: n1 = ( .discriminator = 0 );

// n2.int_value is live and holds 0
Number: n2 = ( .int_value = 0, .discriminator = 1 );

// n3.int_value is live, but not initialized
Number: n3 = ( .int_value = uninit, .discriminator = 1 );

// Error: cannot initialize multiple members of a single union
Number: n4 = ( .int_value = 0, .float_value = uninit, .discriminator = 1 );
```

Unions follow similar rules in pattern matching: a pattern can mention either
zero or one member of a given union. If no member is mentioned, the union is not
accessed during pattern matching, and has no effect on whether the pattern
matches. If the pattern mentions a union member, the corresponding subpattern is
matched against that member, which means that user code must ensure that pattern
matching does not reach that point unless that member is live. Note that this
applies even if the subpattern is a wildcard that is guaranteed to match.

> **TODO:** Ensure the above remains consistent with the overall design for
> struct initialization and pattern matching, and ensure pattern matching gives
> enough control over evaluation order to make it possible to safely mention
> union members in patterns.

## Changing the live member

The live member can only be changed by destroying the current live member (if
any), and then constructing the new live member, using the `destroy` and
`create` keywords:

```
fn SetFloatValue(Ptr(Number): n, Float64: value) {
  if (n->discriminator == 2) {
    // float_value is already live, so ordinary assignment is fine.
    n->float_value = value;
  } else {
    if (n->discriminator == 1) {
      destroy n->int_value;
    }
    create: n->float_value = value;
    n->discriminator = 2;
  }
}
```

`create` and `destroy` can only be applied to union members (unlike their C++
counterparts, placement `new` and pseudo-destructor calls); the lifetimes of
ordinary struct fields and variables are always tied to their scope. It is out
of contract to apply `create` to a member of a union that already has a live
member, or apply `destroy` to a member that is not live. `destroy` can be
thought of as a unary operator, but a `create` statement has the syntax and
semantics of a variable declaration, with `create` taking the place of `var` and
the field type, and the field expression taking the place of the variable name.
`destroy` permanently invalidates any pointers to the destroyed object; they do
not become valid again if that member is re-created.
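
As a point of comparison, the C++ counterparts mentioned above (an explicit destructor call and placement `new`) can be sketched as follows. The `Value` type and its `Clear`, `SetText`, and `SetInt` members are hypothetical names used only for illustration; `Clear` plays the role of `destroy`, and the placement `new` expressions play the role of `create`.

```cpp
#include <cassert>
#include <new>
#include <string>
#include <utility>

// Hypothetical holder whose union has a non-trivial member, so changing the
// live member needs an explicit destructor call followed by placement new.
struct Value {
  enum class Live { kNone, kInt, kText };

  union {
    int int_value;
    std::string text_value;
  };
  Live live = Live::kNone;

  Value() {}  // Starts with no live member.
  ~Value() { Clear(); }
  Value(const Value&) = delete;
  Value& operator=(const Value&) = delete;

  // Analogue of `destroy`: ends the lifetime of whichever member is live.
  void Clear() {
    if (live == Live::kText) text_value.~basic_string();
    live = Live::kNone;
  }

  // Analogue of `destroy` followed by `create: text_value = ...`.
  void SetText(std::string s) {
    if (live == Live::kText) {
      text_value = std::move(s);  // Already live: plain assignment.
      return;
    }
    Clear();
    new (&text_value) std::string(std::move(s));
    live = Live::kText;
  }

  // Analogue of `destroy` followed by `create: int_value = ...`.
  void SetInt(int v) {
    Clear();
    new (&int_value) int(v);
    live = Live::kInt;
  }
};
```

Unlike the Carbon rules, C++ would also accept a plain `int_value = v;` while `text_value` is live, silently leaking the string; the explicit two-step form is what the `destroy`/`create` pair makes mandatory.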

> **FIXME**: Does the `create` syntax make it sufficiently clear that the `=`
> represents initialization, not assignment? Is it OK that the `create` syntax
> omits the type? Can we do better?

> **TODO:** The spellings of `create` and `destroy` are chosen for consistency
> with `operator create` and `operator destroy`, which are how constructor and
> destructor declarations are spelled in the currently-pending structs proposal.
> They should be updated as necessary to stay consistent.

## Field groups

A union member can be either a field or a _group_ of fields:

```
struct SsoString {
  bitfield(1) var Bool: is_small;
Contributor

It seems like we should be more actively considering ways of expressing this in a way where the language is aware of the relationship between is_small and which component of the union is active. This would open the door to zero-cost static analysis that the type is being used correctly, and lower-cost dynamic analysis.

Contributor Author

The problem is that there's no guarantee that the discriminator is in the same struct as the union, or is even reachable from the union. There are even quite plausible cases where there's no discriminator at all, and safety is achieved entirely through static reasoning.

We could consider something like that as an optional add-on, rather than something that's inherent in all unions, but because it's an add-on, I don't think it needs to be part of the initial unions design. Instead, I think we should consider it as part of the overall safety story, because until then it won't be clear how much benefit it actually provides. I've just added a proposal doc to this PR, which includes a rationale for why I want to proceed with this proposal even though the safety plans are unresolved.

Contributor

I agree that there is no guarantee that you won't need to do something unsafe, but I don't think that is the common case we need to make easy, especially since it is the case where the language can give you little support. Rust is an example of a language that aims for high performance but still provides a safe primitive here. If you really need something outside what the language can reason about, then use something like a byte array and casts.

I think the main thing as a team we should be doing at this stage is deciding between high level approaches. So even if we ultimately decide to go with the approach currently presented in this document, the job of this document should be to present and discuss the alternatives under consideration, and presenting rationale for one choice over others. I very much feel this document needs to at least address the safer design alternative.

Contributor

It looks like you added a blurb in the proposal text about discriminated unions -- I maintain that it requires more serious consideration since I think some of the concerns raised there are quite addressable.

Contributor

Fully agreed with @josh11b. I feel like our safe algebraic data type facility should allow the type of micro-management of storage layout necessary for SsoString presented here. Of course, there will be cases where the discriminator is not accessible easily, but I don't think that's the common case, and that type of code is highly unsafe anyway, so there's little the language can do to help.


  union {
    group small {
      bitfield(7) var Int7: size;
      var FixedArray(Char, 22): buffer;
    }
    group large {
      bitfield(63) var Int63: size;
      var Int64: capacity;
      var UniquePtr(Array(Char)): buffer;
    }
  }
}
```

> **TODO:** The treatment of bitfields, arrays, and owning pointers in this
> example is speculative, and should be updated to reflect the eventual design
> of those features.

Field groups are sets of fields (and/or unions) that can be created and
destroyed as a unit. They are initialized from anonymous structs whose fields
have the same types, names, and order. The name of a field group is part of the
name of each field in the group:

```
var SsoString: str = (.is_small = True,
                      .small = (.size = 0, .buffer = uninit)
);
Assert(str.small.size == 0);
destroy str.small;
create: str.large = (.size = 0, .capacity = 100, .buffer = MakeBuffer(100));
Assert(str.large.capacity == 100);
Contributor

What is str.large here? Is it possible to refer to that without a following .field? Does that expression have a type? Can I pass it to a function?

Not supposed to be leading questions -- I think it could possibly be OK for the answer to be that you must always write .something after str.large, so it's really just as if large. is a prefix of the member names. But that would potentially lead to some non-uniformity. I think it could also possibly be OK for the answer to be that you get some unspecified anonymous type -- but that would somewhat contradict the statement below that field groups don't have types.

```

However, field groups are not objects, and do not have types; like unions, field
groups are a way of controlling the layout of the fields of a struct. For
example, if `large` and `small` were structs, the bitfields would not save any
space, and `is_small` would have to be followed by 63 bits of padding in order
to ensure proper alignment of `large`.

## Layout
Contributor

An alternative that I'd like us to consider is to by default let the compiler choose the layout, but allow the user to specify the layout exactly. For example, the parts of a single bitfield should be tagged with the name of the bitfield to demand that they are packed together, members that should be laid out at specific offsets should be annotated as such, any padding must be explicitly inserted (or the following field must be annotated as unaligned) etc.

We could demand that code must always explicitly specify the layout in unions.

Contributor Author

Do you think explicit layout control specifically needs to be part of this proposal, rather than a separate one?

> We could demand that code must always explicitly specify the layout in unions.

Sure, but what problem would that solve?


We express the layout of a field in terms of its starting and ending offsets,
which are measured in bits in order to handle bitfields. The starting and ending
offsets of each field in a field group are the same as if the fields were all
direct members of the enclosing struct. A union member that is not a field group
is laid out as if it were a field group containing a single field. The ending
offset of a union is the maximum ending offset of any field in any of its field
groups, rounded up to the next whole byte. Note that this means that bitfields
after the union are not "packed" together with any bitfields at the end of the
union.
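
The offset rule above can be sketched as a small calculation. This is an illustration under stated assumptions, not part of the design: the `Field`, `Group`, `GroupEnd`, and `UnionEnd` names are hypothetical, fields are assumed to be laid out sequentially, and each field is assumed to start at the next multiple of its alignment (1 bit for bitfields), which the text does not specify.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// All sizes, alignments, and offsets are measured in bits.
struct Field {
  std::uint64_t size_bits;
  std::uint64_t align_bits;  // 1 for bitfields (an assumption of this sketch).
};
using Group = std::vector<Field>;

inline std::uint64_t RoundUp(std::uint64_t value, std::uint64_t multiple) {
  assert(multiple != 0);
  return (value + multiple - 1) / multiple * multiple;
}

// Ending offset of a field group that begins at `start_bits`, laying its
// fields out as if they were direct members of the enclosing struct.
inline std::uint64_t GroupEnd(const Group& group, std::uint64_t start_bits) {
  std::uint64_t offset = start_bits;
  for (const Field& f : group) {
    offset = RoundUp(offset, f.align_bits) + f.size_bits;
  }
  return offset;
}

// Ending offset of a union: the maximum ending offset over all of its field
// groups, rounded up to the next whole byte.
inline std::uint64_t UnionEnd(const std::vector<Group>& groups,
                              std::uint64_t start_bits) {
  std::uint64_t end = start_bits;
  for (const Group& g : groups) {
    end = std::max(end, GroupEnd(g, start_bits));
  }
  return RoundUp(end, 8);
}
```

For `SsoString`, assuming 1-byte `Char` and 64-bit pointers, the union starts at bit 1 (after `is_small`); `small` ends at bit 184 and `large` at bit 192, so the union ends at bit 192, which is 24 bytes.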

> **TODO:** This presupposes that struct fields are laid out sequentially; if
> that is not the case, this algorithm will need to be revised accordingly.

The C/C++ memory model (which we expect Carbon to adopt) treats any maximal
contiguous sequence of bitfields as a single memory location, but C/C++ do not
allow bitfields to be split across the beginning of a union. Introducing that
ability in Carbon implies that the size of a memory location can change, which
the memory model doesn't seem to countenance. In order to fix that
inconsistency, we model the creation or destruction of a union member as also
destroying any immediately preceding bitfields, and then recreating them with
the same contents. This implies that you cannot access the preceding bitfields
concurrently with creating or destroying a union member.

A union cannot be nested directly within a union, and a field group cannot be
nested directly within a struct or a field group. Unions and field groups cannot
contain constructors, destructors, or methods.

## Safety

The safety rules for Carbon unions are easily summarized: it is always out of
contract to access or destroy a union member that is not live, and the
operations that can change the live member are always explicit and unambiguous.
It should be quite straightforward for a sanitizer to check direct accesses to
union members, by tracking the live member in shadow memory, and verifying that
the member being accessed is live.
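
A minimal sketch of that kind of check, assuming a side table keyed by the address of the union's storage, is shown below. The `ShadowMemory` class and its `OnCreate`, `OnDestroy`, and `CheckAccess` hooks are hypothetical; a real sanitizer would use compiler-inserted instrumentation and a far more compact shadow encoding.

```cpp
#include <cassert>
#include <stdexcept>
#include <unordered_map>

// Hypothetical shadow table: union address -> index of the live member.
// An absent entry means that no member of that union is live.
class ShadowMemory {
 public:
  // Called where a `create` would execute.
  void OnCreate(const void* union_addr, int member) {
    assert(union_addr != nullptr);
    if (live_.count(union_addr) != 0) {
      throw std::logic_error("create: union already has a live member");
    }
    live_[union_addr] = member;
  }

  // Called where a `destroy` would execute.
  void OnDestroy(const void* union_addr, int member) {
    CheckAccess(union_addr, member);
    live_.erase(union_addr);
  }

  // Called before every direct access to a union member.
  void CheckAccess(const void* union_addr, int member) const {
    auto it = live_.find(union_addr);
    if (it == live_.end() || it->second != member) {
      throw std::logic_error("access to a union member that is not live");
    }
  }

 private:
  std::unordered_map<const void*, int> live_;
};
```

This covers only accesses that name the union member directly, where the instrumentation knows both the union's address and the member index at the access site.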

However, union members can also be accessed through pointers (including pointers
to nested subobjects), and such pointers are indistinguishable from pointers to
any other object. Reliably sanitizing accesses through such pointers would
require dynamically tracking which union member (if any) each pointer points to,
propagating that information to subobject accesses, and instrumenting every
Contributor

Could taking a pointer require that the named target of the pointer be active, and then invalidate the pointer if the active member is destroyed? i.e., without instrumenting every pointer access.

Contributor Author

Yes, that seems like a better model, so I've made that change. However, I don't think that removes the need to instrument every pointer access. If anything, it suggests a need for more instrumentation: now we have to not only check that this pointer points to the active member, we also have to check that the active member wasn't destroyed and then recreated since the pointer was formed.

Contributor

Perhaps. We'd need to be able to represent pointers in such a way that we can invalidate all pointers that point to a union member when it becomes inactive. We could use pointer tagging for this, at the expense of not having the tag bits available for other purposes -- and that still leaves us instrumenting every access, but perhaps in a way we were going to anyway. Or we could track the set of pointers that point to the union member, which would likely require instrumenting at least every pointer assignment that we can't statically prove doesn't point to a union member.

I think a better baseline behavior would be to say that members (union or otherwise) cannot have their address taken by default. This is a desirable property for other reasons too. That still leaves open this problem for the case where unions are explicitly made addressable, but that's at least only a (hopefully small) subset of all union members.

Contributor Author

(Clarification: there was a race between @zygoloid's comment and mine; I was responding to the original comment)

pointer access in the program to determine whether it is accessing a union, and
if so whether the currently live member matches the one the pointer value was
created with. This would definitely be too costly for a hardened production
build mode, and might even be too costly (relative to its benefit) to be useful
as a sanitizer.

> **TODO:** Figure out whether and how to sanitize invalid union accesses,
> probably in the context of an overall design for temporal memory safety.

## C++ interoperability and migration

Carbon unions are more restrictive than C++ unions in several respects:

- C++ unions can be types, and can have names.
- C++ permits the active member to be implicitly changed by assigning to an
inactive member.
- C++ permits accessing fields of an inactive member if they are part of a
"common initial sequence" of fields that's shared with the active member.
- In practice, C++ permits accessing inactive members even in ways that
violate the "common initial sequence" rule, with the semantics that the
object representation of the active member is reinterpreted as a
representation of the inactive member. This is formally undefined behavior,
but broadly supported and fairly common in practice.

Conversely, C++ is more restrictive in one respect: the members of a C++ union
must be objects, and the union's alignment must conform to the alignment
requirements of its member objects. For example, there doesn't seem to be any
Contributor

I'm not convinced it is important in practice to expose Carbon unions to C++. Most unions of this kind in C++ are typically very well encapsulated (like SsoString above), and expose an API that does not mention unions; so I expect the same to be true in Carbon as well. My suggestion is to represent Carbon unions as opaque byte blobs in C++ if the type does happen to be exported to C++.

Furthermore, since there's a semantic gap (like the fact that Carbon requires using special create and destroy operators to change the active member), I'm not sure that C++ union is really a good match, even if we tried hard to make it work. I think a more fruitful direction would be to expose Carbon unions to C++ as classes with no public data members, and only member functions to manipulate the data. The contract of those member functions would match Carbon's language semantics exactly.

way of expressing this Carbon struct in terms of C++ unions while preserving
both its structure and its layout:

```
struct S {
  union {
    group g1 {
      Int32: a;
      Int16: b;
      Int8: c;
    }
    group g2 {}
Contributor

Should empty groups like this be disallowed?

Contributor Author

I don't think so; disallowing them could make it harder to reduce test cases, search for bugs by commenting out code, etc, and wouldn't provide any benefit that I can see.

Contributor

I wonder here...

The benefit would be that you'd prevent people from using active, inoperative code. Allowing this could be a readability issue, especially if members in the middle are commented out in a way that obscures a group actually being empty.

Regarding comments, they could just block-comment including the group. I'm not clear how test cases relate.

To me, this is similar to not allowing groups nested within groups: if that's disallowed because it doesn't offer structural benefits to outweigh costs, I don't see how an empty group is better.

  }
  union {
    group g3 {
      Int8: d;
      Int32: e;
    }
    group g4 {}
  }
}
```

If that structure were naively translated to C++, `g3` would be required to
have 4-byte alignment, which would force the addition of a byte of padding
between the unions. That would in turn force the addition of 3 bytes of padding
between `d` and `e`, in order to properly align `e`.
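
The padding described above can be observed with `offsetof`. The sketch below is one possible naive translation, using named unions and the hypothetical type names `G1`, `G3`, `Empty`, `U1`, `U2`, and `NaiveS` so that it stays within standard C++; the values noted afterwards assume a mainstream ABI on which `int32_t` is 4-byte aligned.

```cpp
#include <cstddef>
#include <cstdint>

// Naive translation: each Carbon field group becomes a C++ struct, and each
// Carbon union becomes a C++ union of those structs.
struct G1 {
  std::int32_t a;
  std::int16_t b;
  std::int8_t c;
};
struct G3 {
  std::int8_t d;
  std::int32_t e;
};
struct Empty {};

union U1 {
  G1 g1;
  Empty g2;
};
union U2 {
  G3 g3;
  Empty g4;
};

struct NaiveS {
  U1 first;
  U2 second;
};

// Offsets in bytes, as computed by the compiler for the target ABI.
constexpr std::size_t kSecondUnionOffset = offsetof(NaiveS, second);
constexpr std::size_t kEOffsetInG3 = offsetof(G3, e);
constexpr std::size_t kNaiveSize = sizeof(NaiveS);
```

Under that ABI assumption, `kSecondUnionOffset` is 8 (one byte of padding after the 7 bytes of `g1`), `kEOffsetInG3` is 4 (three bytes of padding after `d`), and `kNaiveSize` is 16.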

We can preserve the layout by expanding all groups so that they contain all
Contributor

Noting that bitfield alignment is deliberately different with Carbon, should there be a method for aligning bitfields consistently?

Contributor Author

Can you be more specific about what you mean by "a method for aligning bitfields consistently"? If you mean "a way to make a C++ struct match the alignment of a corresponding Carbon struct", that's exactly what this passage is trying to describe.

Contributor

This section doesn't mention bitfield or bit alignment, and so I don't think it addresses the case.

The issue I see with bitfield is, what if a Carbon user wants to write code that uses bitfields, such as the example under "field groups" and have that code called from C++? As you state there, the alignment is inconsistent with C++. What if a user wants to have essentially the same code, but still have it work with C++?

prior members of the struct, and hence all start at offset 0:

```c++
struct S {
  union {
    struct /* g3 */ {
      union {
        struct {
          int32_t a;
          int16_t b;
          int8_t c;
        } g1;
        struct {} g2;
      };
      int8_t d;
      int32_t e;
    } g3;
    struct /* g4 */ {
      union {
        struct {
          int32_t a;
          int16_t b;
          int8_t c;
        } g1;
        struct {} g2;
      };
    } g4;
  };
};
```

However, this doesn't preserve the naming structure of the fields: `s.g1.a` in
Carbon would become either `s.g3.g1.a` or `s.g4.g1.a` in C++ (the two are
equivalent because they are part of a common initial sequence), and every
additional union in the struct compounds this problem. Rather than expose this
complexity to users, the members that form the actual data layout will be made
private, and the original members will be exposed through methods that return
references, so that `s.g1.a` becomes `s.g1().a()`.
Contributor

Does a() really need to be a method? Could you get by with only g1() and g3() (only adding methods for field groups, not members), leading to s.g1().a, and s.g3().d?

Contributor Author

d() definitely needs to be a method. s.g3() can't return the g3 struct depicted above, because it contains members that aren't logically part of g3. Instead, s.g3() will need to return a proxy object that holds a pointer to the g3 struct, and exposes methods for the members that should be user-visible.

Strictly speaking this reasoning doesn't apply to a, but it seems simpler to make them always be methods, rather than try to teach the rules for when a field becomes a method.

Contributor

Couldn't it return a struct for g3 that pointed at the correct address and had the members laid out to map correctly, even though it's not the g3 that you picture above? I think the trade-offs may be something like struct implementation complexity vs performance of introducing the proxy object?


> **TODO:** With this scheme, all data members up to and including the last
> union in a struct must be exposed through methods. The structs design will
> need to determine whether any subsequent members are exposed in C++ as data
> members, methods or both.

Note that this mapping doesn't quite preserve concurrency semantics: Carbon code
can safely access the first union while concurrently creating or destroying a
member of the second union, but in C++ the corresponding operation would be
undefined behavior.

> **FIXME:** Does this matter in practice, and can we do anything to avoid the
> undefined behavior?

> **TODO:** The design of the Carbon memory model will need to address this
> inconsistency. While we generally intend to adopt the C/C++ memory model, it's
> unclear exactly what that means in cases like this one, where Carbon code
> creates situations that are inexpressible in C/C++. Note that it's not
> entirely clear whether the undefined behavior in question is a data race per
> se, because it's not clear whether invoking a trivial constructor or
> destructor actually modifies the memory locations containing the object.

> **TODO:** It looks very difficult to support exposing C++ unions to Carbon in
> an automated way, unless we are willing to allow type-punning through ordinary
> pointer reads, and allow assignment through a pointer to implicitly destroy
> and create objects. It may be possible to support partial or full automation
> in cases where the union is sufficiently encapsulated, but this will require
> further research about what encapsulation patterns are common.
2 changes: 2 additions & 0 deletions proposals/README.md
@@ -28,8 +28,10 @@ request:
- [0044 - Proposal tracking](p0044.md)
- [Decision](p0044-decision.md)
- [0051 - Goals](p0051.md)
- [Decision](p0051-decision.md)
- [0074 - Change comment/decision timelines in proposal process](p0074.md)
- [Decision](p0074-decision.md)
- [0083 - In-progress design overview](p0083.md)
- [0139 - Unions](p0139.md)

<!-- endproposals -->