Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SROA conflicts with the way clang lowers unions #64081

Closed
jacobly0 opened this issue Jul 24, 2023 · 7 comments
Closed

SROA conflicts with the way clang lowers unions #64081

jacobly0 opened this issue Jul 24, 2023 · 7 comments

Comments

@jacobly0
Copy link
Contributor

int puts(const char *);
static void f(int cond) {
    union {
        struct {
            unsigned _BitInt(6) a;
            char b;
        } a;
        char b[2];
    } u;
    char a[2];
    __builtin_memcpy(&u.b, "a", 2);
    if (cond) {
        u.a.a = '!';
        u.a.b = 0;
    }
    __builtin_memcpy(a, &u, 2);
    puts(a);
}
int main() {
    f(0);
}
$ clang --version
clang version 16.0.6
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/lib/llvm/16/bin
Configuration file: /etc/clang/clang.cfg
$ clang -O0 repro.c; ./a.out
a
$ clang -O1 repro.c; ./a.out
!

godbolt

I initially reduced to a simpler LLVM IR repro for this bug, but it's not clear to me if that is a bug in SROA or just invalid IR, so I backported it to C to demonstrate that clang is able to produce equivalent IR.

The issue appears to be that clang selects the union member with the largest alignment as the representation for the union, which in this case is arbitrary since they have the same alignment, and indeed swapping the order of the union fields avoids the bug. Clang then uses this as the type of the alloca, which SROA takes to mean that the top two bits of u.a.a are not meaningful, when the source language says that those bits are perfectly meaningful when accessing u.b[0].

Since this was reduced from another language, I can either change the way unions are lowered in that compiler (which currently uses the same method as clang, and I'm not sure what a better way would look like), or wait for an LLVM release that doesn't miscompile when lowering to the current IR.

@nikic
Copy link
Contributor

nikic commented Jul 24, 2023

Reduced: https://llvm.godbolt.org/z/hEbPGz69q

; RUN: opt -S -passes=sroa
@.str = global [2 x i8] [i8 97, i8 97]

define void @f(i1 %cond) {
entry:
  %u = alloca { i6, i8 }, align 1
  %a = alloca [2 x i8], align 1
  call void @llvm.memcpy.p0.p0.i64(ptr align 1 %u, ptr align 1 @.str, i64 2, i1 false)
  br i1 %cond, label %if.then, label %if.end

if.then:                                          ; preds = %entry
  store i6 -31, ptr %u, align 1
  %b = getelementptr inbounds i8, ptr %u, i64 1
  store i8 0, ptr %b, align 1
  br label %if.end

if.end:                                           ; preds = %if.then, %entry
  call void @llvm.memcpy.p0.p0.i64(ptr align 1 %a, ptr align 1 %u, i64 2, i1 false)
  call void @puts(ptr %a)
  ret void
}

declare void @llvm.memcpy.p0.p0.i64(ptr, ptr, i64, i1)

declare void @puts(ptr)

From a brief glance at the code,

using the type store size might be culprit.

@efriedma-quic
Copy link
Collaborator

Since this was reduced from another language, I can either change the way unions are lowered in that compiler (which currently uses the same method as clang, and I'm not sure what a better way would look like)

The issue here looks like SROA is specifically getting confused by integer types that aren't multiples of 8; you could probably make your frontend avoid emitting allocas with such types as a workaround. (The "largest alignment" thing is just to try to make the generated types look simpler; it's not actually necessary for correctness.)


Really, almost every place that's accessing AllocaInst::getAllocatedType() should be using AllocaInst::getAllocationSize() instead.

@jacobly0
Copy link
Contributor Author

I have reduced yet another issue to this bug:

%A = type { i32 }
%B = type { i1, i3 }

define i32 @main() {
  %1 = alloca %A, align 4
  %2 = alloca %B, align 1
  %3 = alloca i8, align 1
  store i8 42, ptr %3, align 1
  %4 = load i8, ptr %3, align 1
  %5 = trunc i8 %4 to i7
  %6 = getelementptr inbounds { i1, i8 }, ptr %2, i64 0, i32 1
  store i7 %5, ptr %6, align 1
  %7 = getelementptr inbounds { <{ %B, [2 x i8] }>, i1, [3 x i8] }, ptr %1, i64 0, i32 0, i32 0
  call void @llvm.memcpy.p0.p0.i64(ptr align 1 %7, ptr align 1 %2, i64 2, i1 false)
  %8 = call i8 @getC(ptr nonnull readonly align 4 %1)
  %9 = zext i8 %8 to i32
  ret i32 %9
}

define i8 @getC(ptr nonnull readonly align 4 %0) {
  %2 = getelementptr inbounds %A, ptr %0, i32 0, i32 0
  %3 = getelementptr inbounds %B, ptr %2, i32 0, i32 1
  %4 = load i8, ptr %3, align 1
  ret i8 %4
}

declare void @llvm.memcpy.p0.p0.i64(ptr noalias nocapture writeonly, ptr noalias nocapture readonly, i64, i1 immarg)
$ opt -version
LLVM (http://llvm.org/):
  LLVM version 17.0.4+libcxx
  Optimized build.
  Default target: x86_64-pc-linux-gnu
  Host CPU: znver4
$ opt --passes= repro.ll | lli; echo $?
42
$ opt --passes=sroa repro.ll | lli; echo $?
2

@bjope
Copy link
Collaborator

bjope commented Jan 4, 2024

We got some regressions downstream after the fix in 54067c5

I think this test program show the problem (or parts of it):
https://godbolt.org/z/WfrfEErPE

#define X 70

typedef struct {
  _BitInt(X) a;
  _BitInt(X) b;
} S;

extern S bar(_BitInt(X), _BitInt(X));

S foo(int *p) {
  S s;
  s.a = 0;
  s.b = 0;
  for (int i = 0; i < 100; ++i) {
    S t = s;
    s = bar(t.a, t.b);
  }
  return s;
}

When the frontend is emitting memcpy to perform an aggregate copy (this could for example be a EmitAggExprToLValue) such a memcpy would end up copying padding. That doesn't mean that the padding is of much interest, and the IR might just be reading the aggregate members without any type punning(?) as in this issue. But with this patch such memcpy:s seem to prevent SROA from eliminating the alloca. And we also get an annoying memcpy inside the loop.

Not quite sure how to deal with this.

What you say is that the type used in the alloca shouldn't really matter, it's only the store size that is of importance. In other words, it is fully legal to do things like

  %p = alloca i1
  store i6 255, ptr %p

and

  %p = alloca i6
  ...
  %x = load i2, ptr %p

Thus, SROA can't look at the alloca type to know how the temporary variable is accessed. But maybe it can look at the users to figure out how the stack slot is used. And if it is accessed using a single single-value-type, there won't be a need to use memcpy?

@bjope
Copy link
Collaborator

bjope commented Jan 4, 2024

Hm. Looks like the above example wasn't really regressed by this patch but maybe something else that has changed since clang 17.

But if changing X to 28 like this: https://godbolt.org/z/35eqh5EG8 then it will show a difference depending on if I revert 54067c5 or not.

@nikic
Copy link
Contributor

nikic commented Jan 4, 2024

Hm. Looks like the above example wasn't really regressed by this patch but maybe something else that has changed since clang 17.

This is probably due to introduction of writable, which isn't emitted by clang yet. I'll try to implement this soon.

@nikic
Copy link
Contributor

nikic commented Jan 5, 2024

Hm. Looks like the above example wasn't really regressed by this patch but maybe something else that has changed since clang 17.

This is probably due to introduction of writable, which isn't emitted by clang yet. I'll try to implement this soon.

Done in #77116. I confirmed that this removes the memcpy in your example.

bjope added a commit to bjope/llvm-project that referenced this issue Jan 10, 2024
… types

The bugfix in commit 54067c5, related to
  llvm#64081
limited the ability of SROA to handle non byte-sized types
when used in aggregates that are memcpy'd.

Adding some test cases involving tbaa.struct annotations
that might allow us to eliminate more memcpy/alloca.
bjope added a commit to bjope/llvm-project that referenced this issue Jan 11, 2024
…a.struct

The bugfix in commit 54067c5, related to
  llvm#64081
limited the ability of SROA to handle non byte-sized types
when used in aggregates that are memcpy'd.

Main problem was that the LLVM types used in an alloca doesn't
always reflect if the stack slot can be used to for multiple
types (typically unions). So even if we for example have
  %p = alloca i6
that stack slot may be used for other types than i6. And it
would be legal to for example store an i8 to that stack slot.

Thus, if %p was dereferenced in a memcpy we needed to consider
that also padding bits (seen from the i6 perspective) could be
of importance.

The limitation added to SROA in commit 54067c5 resulted
in huge regressions for a downstream target. Since the frontend
typically emit memcpy for aggregate copy it seems quite normal
that one end up with a memcpy that is copying padding bits even
when there are no unions or type punning. So that limitation
seem unfortunate in general.

In this patch we try to lift the restrictions a bit. If the
memcpy is decorated with tbaa.struct metadata we look at that
metadata. If we find out if the slice used by our new alloca is
touching memory described by a single scalar type according to
the tbaa.struct, then we assume that the type derived for the
new alloca is correct for all accesses made to the stack slot.
And then we can allow replacing the memcpy by regular load/store
instructions operating on that type (disregarding any padding
bits).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants