[NativeAOT] Link-time-optimize unmanaged portions of the runtime on Linux #86083

Open
MichalStrehovsky opened this issue May 11, 2023 · 2 comments
Labels
area-NativeAOT-coreclr help wanted [up-for-grabs] Good issue for external contributors

@MichalStrehovsky
Member

We currently compile the unmanaged portions of the runtime without LTO or PGO because the runtime ships as an .a file that is linked by whatever linker exists on the user's machine. LTO requires a linker that knows how to interpret the bitcode in the object files, which are not ELF.

We can, however, apply LTO to .a files. See David's prototype at https://gist.github.com/davidwrighton/385035ffd24b88c39c2e7d5cf0274907.

How to use from David:

First, compile FileA.c and FileB.c via command lines like:
clang -O3 -flto -c FileB.c
clang -O3 -flto -c FileA.c

Those commands produce FileA.o and FileB.o, which are NOT ELF object files but are instead in the Bitcode file format.

Then run a command line like…
LtoOptimize --plugin /usr/lib/llvm-14/lib/libLTO.so -o FileA.o -o FileB.o -O Optimized.o --symbol=CanOnlyAlwaysReturn2WithLTO

This produces an ELF file, Optimized.o, containing a function CanOnlyAlwaysReturn2WithLTO that can only be optimized to its optimal form with LTO.

I also added the ability to dump the set of symbol names in FileA and FileB, as well as the ability to compile ALL symbols from both FileA and FileB.

So the theory is that we could:

  1. Build the unmanaged portion of the runtime with LTO enabled.
  2. Run a tool similar to the one from the gist to perform LTO on the library and produce an optimized .o file
  3. Pack the .o back into an .a (now an .a like any other, with no bitcode)
  4. Profit
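Steps 1 to 3 can be sketched as a small dry-run script that only builds the command lines and executes nothing. This is a minimal sketch under assumptions: the clang and LtoOptimize arguments are copied from the example above, `ar rcs` is the usual way to create a static archive, and the function name `lto_archive_commands` and the archive name `libRuntime.a` are hypothetical. The issue does not give the flag for compiling all symbols, so the sketch passes the single documented `--symbol=` form.

```python
import shlex
from pathlib import Path


def lto_archive_commands(sources, plugin, symbol, archive, optimized="Optimized.o"):
    """Return argv lists for the compile, LTO, and archive steps (dry run only)."""
    commands = []
    objects = []
    for src in sources:
        # Step 1: compile each source with LTO enabled (emits bitcode, not ELF).
        commands.append(["clang", "-O3", "-flto", "-c", src])
        objects.append(Path(src).with_suffix(".o").name)

    # Step 2: run the prototype tool over the bitcode objects to get one ELF object.
    # Argument shape follows the LtoOptimize example in the issue.
    lto = ["LtoOptimize", "--plugin", plugin]
    for obj in objects:
        lto += ["-o", obj]
    lto += ["-O", optimized, f"--symbol={symbol}"]
    commands.append(lto)

    # Step 3: pack the ELF object into an ordinary static archive.
    commands.append(["ar", "rcs", archive, optimized])
    return commands


if __name__ == "__main__":
    for argv in lto_archive_commands(
        ["FileA.c", "FileB.c"],
        plugin="/usr/lib/llvm-14/lib/libLTO.so",
        symbol="CanOnlyAlwaysReturn2WithLTO",
        archive="libRuntime.a",  # hypothetical archive name
    ):
        print(shlex.join(argv))
```

Keeping the steps as data rather than running them makes it easy to inspect the exact command lines before wiring them into a real build.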

We'd need to measure whether this is indeed profitable and worth the engineering cost; success is not guaranteed. It might be better to first just enable LTO locally and use a linker that can handle it end to end (i.e. turn on LTO and compile with ILC as usual, expecting the linker step to do the LTO) and get some measurements. GC perf would be the most interesting to measure, so run something that stresses the GC and compare with and without LTO.

@MichalStrehovsky MichalStrehovsky added help wanted [up-for-grabs] Good issue for external contributors area-NativeAOT-coreclr labels May 11, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label May 11, 2023
@ghost

ghost commented May 11, 2023

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

@MichalStrehovsky MichalStrehovsky added this to the 8.0.0 milestone May 11, 2023
@ghost ghost removed the untriaged New issue has not been triaged by the area owner label May 11, 2023
@agocke agocke added this to AppModel May 22, 2023
@agocke agocke modified the milestones: 8.0.0, Future Jul 5, 2023
@MichalStrehovsky
Member Author

I did a local experiment where I simply passed -flto to the runtime build and also to the linker invocation at publish time.

For TodosApi the improvement was about 1% in RPS, with a small improvement in latency as well.

In theory, this could also be addressed with #83611.

| | Before 1 | Before 2 | Before 3 | Before 4 | After 1 | After 2 | After 3 | After 4 |
|---|---|---|---|---|---|---|---|---|
| Requests/sec | 210,819 | 213,503 | 210,256 | 209,506 | 218,472 | 211,535 | 212,683 | 211,815 |
| Mean latency (ms) | 1.36 | 1.34 | 1.36 | 1.38 | 1.31 | 1.36 | 1.34 | 1.36 |
| Max latency (ms) | 21.05 | 28.89 | 25.83 | 30.28 | 24.52 | 27.99 | 25.17 | 24.58 |
| Max Time in GC (%) | 19 | 18 | 19 | 17 | 18 | 19 | 18 | 19 |
| Max Working Set (MB) | 95 | 98 | 101 | 100 | 102 | 101 | 93 | 97 |
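
As a quick check of the "about 1%" figure, the run averages from the table above can be compared directly. This is a minimal sketch: the numbers are copied from the table, and the helper name `percent_change` is hypothetical.

```python
from statistics import mean

# Figures copied from the table above (four runs each).
rps_before = [210_819, 213_503, 210_256, 209_506]
rps_after = [218_472, 211_535, 212_683, 211_815]
latency_before = [1.36, 1.34, 1.36, 1.38]
latency_after = [1.31, 1.36, 1.34, 1.36]


def percent_change(before, after):
    """Relative change of mean(after) versus mean(before), in percent."""
    base = mean(before)
    return (mean(after) - base) / base * 100


rps_change = percent_change(rps_before, rps_after)
latency_change = percent_change(latency_before, latency_after)

print(f"Requests/sec: {rps_change:+.2f}%")
print(f"Mean latency: {latency_change:+.2f}%")
```

Both averages move by a little over 1% in the favorable direction. Three of the four After runs still fall inside the range of the Before runs, so more runs may be needed to separate the effect from noise.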
