Clang should lower unions to LLVM as byte array, not structure type #107239
I think it's okay to switch to a byte array to make things more robust and obvious. But I do want to point out that whenever some piece of code emits a load/store of an aggregate type, it's that code that is wrong. E.g. looking at #53710, it's a bug in the NVPTX backend that it copies byval arguments using a load+store pair instead of using memcpy. Generally, the only information you're allowed to take from a byval argument is its size and alignment.
@llvm/issue-subscribers-clang-codegen Author: Michael Kuron (mkuron)
Consider a C++ union like this:
```c++
struct S1 {
  char s;
  int i;
};
struct S2 {
  int a;
  int b;
};
union U {
  S1 s1;
  S2 s2;
};
```
Because LLVM has no concept of unions, Clang lowers it to LLVM IR like this:
```llvm
%union.U = type { %struct.S1 }
%struct.S1 = type { i8, i32 }
```
This is dangerous whenever one of the structs contains alignment padding that doesn't coincide with the other structs' padding. In the above example, LLVM's perspective is that the second, third, and fourth bytes of the union are unused, but if the union happens to contain an `S2`, they are in fact used. Whenever something in LLVM iterates over structure members, there is thus a danger of data loss. This has come up before in various contexts (#53710, #64081, #76017, probably others that I haven't seen), and the workaround tends to be to make LLVM locally reinterpret the structure type as an opaque byte array or to fix up its size. This does not scale, however, and there are likely more places that need this workaround but don't have it yet.
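To make the overlap concrete, here is a minimal C++ sketch that counts the bytes of `U` that are padding from `S1`'s point of view but hold data when the union contains an `S2`. The helper names (`coveredByS1Field`, `coveredByS2Field`, `bytesAtRisk`) are made up for illustration, and the sketch assumes a typical ABI with a 4-byte, 4-byte-aligned `int`.

```cpp
#include <cassert>
#include <cstddef>

struct S1 { char s; int i; };
struct S2 { int a; int b; };
union U { S1 s1; S2 s2; };

// Assumption: a typical ABI where int is 4 bytes with 4-byte alignment.
static_assert(sizeof(int) == 4 && alignof(int) == 4, "sketch assumes 4-byte int");

// True if the byte at `offset` within U holds part of a field of S1.
bool coveredByS1Field(std::size_t offset) {
  assert(offset < sizeof(U));
  const bool inS = offset < sizeof(char);  // S1::s sits at offset 0
  const bool inI = offset >= offsetof(S1, i) &&
                   offset < offsetof(S1, i) + sizeof(int);
  return inS || inI;
}

// True if the byte at `offset` within U holds part of a field of S2.
bool coveredByS2Field(std::size_t offset) {
  assert(offset < sizeof(U));
  return offset < sizeof(S2);  // two adjacent ints, no padding under the assumed ABI
}

// Number of bytes that are padding in S1 but carry data in S2.
std::size_t bytesAtRisk() {
  std::size_t n = 0;
  for (std::size_t off = 0; off < sizeof(U); ++off)
    if (!coveredByS1Field(off) && coveredByS2Field(off)) ++n;
  return n;
}
```

Under the stated ABI assumption, the bytes at offsets 1 to 3 are the ones at risk: they are part of `S2::a` but invisible in `{ i8, i32 }`.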
@Artem-B pointed to the relevant documentation on #53710 and suggested I open this issue. Clang's way of lowering unions as structure types thus seems to violate the assumptions that LLVM makes about structure types. Instead of applying workarounds throughout LLVM, the preferable solution would be to change Clang's union lowering to use byte arrays instead of structure types. The above example would then be represented as `%union.U = type { [8 x i8] }`, which does not entice LLVM to assume that certain bytes might be padding.
The code in Clang that is responsible for lowering unions is in `CGRecordLowering::lowerUnion`: llvm-project/clang/lib/CodeGen/CGRecordLayoutBuilder.cpp, lines 313 to 378 in 4497ec2.
The code even says that the complicated heuristic for deciding which of the union's members the type should be based on is unnecessary. It already contains some fallback paths where it generates an opaque byte array if it can't identify an appropriate member.
Pinging the people involved in #64081 and #76017: @jacobly0, @nikic, @efriedma-quic, @ivafanas.
The way forward here is to just remove these types from the IR completely, I think. `AllocaInst::getAllocatedType()` and `Argument::getParamByValType()` aren't returning useful information; if you look at the LangRef rules, only the size is semantically significant. So we should just be using `AllocaInst::getAllocationSize()` etc. Once we do that, clang's lowering of unions becomes irrelevant. We could change the way clang emits these types in the meantime, I guess, but we don't get much benefit out of it. (Also, this isn't at all relevant to #76017.)
+1. This seems to be the simplest consistent way to keep everyone happy.
The semantics of loads/stores of aggregates are currently documented for LLVM IR and as such should be valid, no?
Aggregate loads/stores work as documented... the issue is that the documented semantics aren't the semantics you want for copying an alloca/byval value.
I agree that there's a disconnect, but I think I'm missing something here. The user's IR says "here's an aggregate of type T". The IR spec says: you're free to load/store that aggregate, but you only get the fields, not the padding. Presumably, if one loads/stores/loads that aggregate, the values of both loads will be the same, but not necessarily the padding. Clang clearly assumes that everything must be copied. I want to understand what that assumption is based on, because LLVM does not seem to make such a promise. At least I have not found it in the docs so far.
IR loads and stores of aggregates do not copy padding. However, clang will not actually (or at least should not) emit load/stores of aggregate type, so this is not a problem for union lowering. Our frontend guidelines explicitly discourage the creation of values of aggregate type unless strictly necessary, e.g. for return ABI handling, and clang follows that. The type argument of alloca and byval arguments is only meaningful for its size and alignment.
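A rough C++ model of the difference, as a sketch rather than a statement of what any LLVM pass actually does: `copyAsS1Fields` stands in for a load+store pair of `{ i8, i32 }` that transfers only the field bytes, and `copyAllBytes` stands in for a memcpy of the whole allocation. All helper names are hypothetical. The model zero-fills the padding in the destination, whereas in real IR those bytes are simply not guaranteed to be preserved. It works on raw byte buffers so that no inactive union member is read.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

struct S1 { char s; int i; };
struct S2 { int a; int b; };
union U { S1 s1; S2 s2; };

using Bytes = std::array<unsigned char, sizeof(U)>;

// Write an S2 into raw union-sized storage.
Bytes storeS2(const S2 &v) {
  Bytes b{};
  std::memcpy(b.data(), &v, sizeof(S2));
  return b;
}

// Read an S2 back out of raw union-sized storage.
S2 loadS2(const Bytes &b) {
  S2 v;
  std::memcpy(&v, b.data(), sizeof(S2));
  return v;
}

// Models a load+store of { i8, i32 }: only the field bytes are transferred.
// Padding bytes in dst stay zero here (an assumption of this model).
Bytes copyAsS1Fields(const Bytes &src) {
  Bytes dst{};
  std::memcpy(dst.data() + offsetof(S1, s), src.data() + offsetof(S1, s), sizeof(char));
  std::memcpy(dst.data() + offsetof(S1, i), src.data() + offsetof(S1, i), sizeof(int));
  return dst;
}

// Models a memcpy of the full allocation: size is the only thing that matters.
Bytes copyAllBytes(const Bytes &src) {
  Bytes dst{};
  std::memcpy(dst.data(), src.data(), src.size());
  return dst;
}

// True if an S2 survives being stored, copied with `copy`, and reloaded.
bool preservesS2(Bytes (*copy)(const Bytes &), const S2 &v) {
  const S2 out = loadS2(copy(storeS2(v)));
  return out.a == v.a && out.b == v.b;
}

// Usage: the byte copy keeps the S2 intact, the field-wise copy does not.
void demo() {
  const S2 v{0x11223344, 0x55667788};
  assert(preservesS2(copyAllBytes, v));
  assert(!preservesS2(copyAsS1Fields, v));
}
```

In this model `S2::b` survives either way because it overlaps `S1::i` exactly; only `S2::a`, which overlaps `S1`'s padding, is damaged by the field-wise copy.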
Yes, I agree.
OK, so when clang emits initial unoptimized IR, the IR is fine. But it's just the beginning of the story of those aggregate values.