Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

missed optimization with C constructors of sparse structures #40122

Open
arndb mannequin opened this issue Feb 19, 2019 · 7 comments
Open

missed optimization with C constructors of sparse structures #40122

arndb mannequin opened this issue Feb 19, 2019 · 7 comments
Labels
bugzilla Issues migrated from bugzilla missed-optimization

Comments

@arndb
Copy link
Mannequin

arndb mannequin commented Feb 19, 2019

Bugzilla Link 40776
Version trunk
OS Linux
Blocks #4440
CC @Meinersbur,@nickdesaulniers,@zygoloid,@stephenhines

Extended Description

I stumbled over a file with excessive stack usage in the Linux kernel when building with clang, but not with gcc, reduced the test case to:

struct i2c_adapter {
  char name[48];
  int a[1000];
} hexium_attach_hexium;
void hexium_attach(void) {
    hexium_attach_hexium = (struct i2c_adapter){"hexium gemini"};
}

$ clang-8 -Werror -c hexium-gemini.c -Wframe-larger-than=100 -O2
hexium-gemini.c:5:6: error: stack frame size of 4056 bytes in function 'hexium_attach' [-Werror,-Wframe-larger-than=]

Comparing the output shows that clang not only allocates stack space for the extra copy of the structure, but also fails to optimize the string assignment:

https://godbolt.org/z/7jpy_y

I worked around it in the kernel sources by using an explict memset and strscpy, but this seems like a generally useful optimization that could be added.

@arndb
Copy link
Mannequin Author

arndb mannequin commented Feb 19, 2019

After playing around with it a little more, I see that clang avoids the stack allocation and memcpy as well if I initialize any other member besides the string to a nonzero value:

struct i2c_adapter {
  char name[48];
  int a[1000];
} s;
void hexium_attach(void) {
    s = (struct i2c_adapter){"h", {1}};
}

https://godbolt.org/z/-zgFHU

@nickdesaulniers
Copy link
Member

I think this is bug very early on in the optimization pipeline; within SROA.

Comparing Arnd's second example w/ 1 (result optimized) vs explicit 0 (result unoptimized) or implicit 0 (same, unoptimized).

$ clang -O0 -Xclang -disable-O0-optnone x0.c -emit-llvm -c -g0 -S -o x.0.ll
$ clang -O0 -Xclang -disable-O0-optnone x1.c -emit-llvm -c -g0 -S -o x.1.ll
$ opt -O2 x.0.ll -S -print-after-all 2> x.0.txt >/dev/null
$ opt -O2 x.1.ll -S -print-after-all 2> x.1.txt >/dev/null
$ meld x.0.txt x.1.txt

It looks like the after the 4th pass (of many) things start to diverge:
*** IR Dump After SROA ***
; x.0.ll
%1 = alloca %struct.i2c_adapter, align 4
; x.1.ll
%.sroa.0 = alloca [48 x i8], align 4
%.sroa.4 = alloca [999 x i32]

For x.1.ll:
SROA is able to split the local allocation into 2 alloca's, 1 memset, and 2 memcpys. Later optimizations are able to prune this down to 0 alloca's, 1 memset, and 1 memcpy.

For x.0.ll:
SROA is not able to split the local allocation into 2. After optimizations, we're left with 1 (large) alloca, 1 memset, and 2 memcpys (alloca a local, memset the zero's, memcpy the string, memcpy the local to the global).

So the next thing for me to dig into is, "why is SROA not able to correctly split the large alloca in this case?"

@nickdesaulniers
Copy link
Member

x.0.txt

@nickdesaulniers
Copy link
Member

x.1.txt

@nickdesaulniers
Copy link
Member

x0.ll

@nickdesaulniers
Copy link
Member

x1.ll

@nickdesaulniers
Copy link
Member

nickdesaulniers commented May 6, 2020

I see Arnd submitted a few more kernel patches for this, as it seems to be a recurring issue.

Playing with the example more, I think another issue is the assignment of char[] members via a string literal. I think that bug may be earlier, in Clang's codegen of LLVM IR.

For example:

struct foo {
    char bar [1000];
} my_foo;

void bad_lazy_init(void) {
    my_foo = (struct foo){
        .bar = "h",
    };
}

produces a 4000B .asciiz string "h<3999 zeros>" but

.bar = {104, 0}

produces much more concise code (and LLVM IR). For an AST node like:

StringLiteral 0xde7fde8 <line:7:16> 'char [1000]' "h"

in a compound literal, I wonder if we should try to optimize that in clang codegen, or tackle this strictly in SROA?

@llvmbot llvmbot transferred this issue from llvm/llvm-bugzilla-archive Dec 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugzilla Issues migrated from bugzilla missed-optimization
Projects
None yet
Development

No branches or pull requests

2 participants