[NativeAOT] Link-time-optimize unmanaged portions of the runtime on Linux #86083

MichalStrehovsky · 2023-05-11T05:27:01Z

We currently compile unmanaged portions of the runtime without LTO or PGO because the runtime is placed in an .a file that gets linked using an unknown linker that exists on the user machine. LTO requires a linker that knows how to interpret the bitcode in non-ELF object files.

We can however apply LTO on .a files. See David's prototype at https://gist.github.com/davidwrighton/385035ffd24b88c39c2e7d5cf0274907.

How to use from David:

First, compile FileA.c and FileB.c into FileA.o and FileB.o with command lines like:
clang -O3 -flto -c FileB.c
clang -O3 -flto -c FileA.c

Those commands produce FileA.o and FileB.o, which are NOT ELF object files; they are in the LLVM bitcode file format instead.

Then run a command line like…
LtoOptimize --plugin /usr/lib/llvm-14/lib/libLTO.so -o FileA.o -o FileB.o -O Optimized.o --symbol=CanOnlyAlwaysReturn2WithLTO

and produce an ELF file Optimized.o that has a function CanOnlyAlwaysReturn2WithLTO which was optimized in a manner which requires LTO to produce optimal output.

I also added the ability to dump the set of symbol names in FileA and FileB, as well as the ability to compile ALL symbols from both FileA and FileB.

So the theory is that we could:

1. Build the unmanaged portion of the runtime with LTO enabled.
2. Run a tool similar to the one from the gist to perform LTO on the library and produce an optimized .o file.
3. Pack the .o back into an .a (now an .a like any other, with no bitcode).
4. Profit.

We'd need to measure if this is indeed profitable and worth the engineering costs. Success not guaranteed. Might be better to first just enable LTO locally and use a linker that can handle it E2E (i.e. turn on LTO and compile with ILC as usual, expecting the linker step to do the LTO) and get some measurements. GC perf would be the most interesting to measure, so do something that stresses the GC and measure with/without LTO.

ghost · 2023-05-11T05:27:08Z

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas
See info in area-owners.md if you want to be subscribed.

Issue Details

Author:	MichalStrehovsky
Assignees:	-
Labels:	`help wanted`, `area-NativeAOT-coreclr`
Milestone:	-

MichalStrehovsky · 2024-02-28T11:55:13Z

I did a local experiment where I simply passed -flto to the runtime build and also to the linker invocation at publish time.

For TodosApi the improvement would be about 1% in RPS and a small improvement to latency as well.

In theory, this could also be addressed with #83611.

| | Before 1 | Before 2 | Before 3 | Before 4 | After 1 | After 2 | After 3 | After 4 |
|---|---|---|---|---|---|---|---|---|
| Requests/sec | 210,819 | 213,503 | 210,256 | 209,506 | 218,472 | 211,535 | 212,683 | 211,815 |
| Mean latency (ms) | 1.36 | 1.34 | 1.36 | 1.38 | 1.31 | 1.36 | 1.34 | 1.36 |
| Max latency (ms) | 21.05 | 28.89 | 25.83 | 30.28 | 24.52 | 27.99 | 25.17 | 24.58 |
| Max Time in GC (%) | 19 | 18 | 19 | 17 | 18 | 19 | 18 | 19 |
| Max Working Set (MB) | 95 | 98 | 101 | 100 | 102 | 101 | 93 | 97 |
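To make the "about 1%" figure easy to check, here is a minimal sketch that averages the four runs on each side and reports the relative change. The numbers are copied from the table; the names `before`, `after`, and `percent_change` are hypothetical helpers for illustration and are not part of any benchmark tooling.

```python
from statistics import mean

# Per-run figures copied from the table above (4 runs without LTO, 4 with LTO).
before = {
    "Requests/sec": [210_819, 213_503, 210_256, 209_506],
    "Mean latency (ms)": [1.36, 1.34, 1.36, 1.38],
    "Max latency (ms)": [21.05, 28.89, 25.83, 30.28],
}
after = {
    "Requests/sec": [218_472, 211_535, 212_683, 211_815],
    "Mean latency (ms)": [1.31, 1.36, 1.34, 1.36],
    "Max latency (ms)": [24.52, 27.99, 25.17, 24.58],
}


def percent_change(before_runs, after_runs):
    """Relative change of the mean, in percent (positive means 'after' is larger)."""
    b = mean(before_runs)
    a = mean(after_runs)
    return (a - b) / b * 100.0


if __name__ == "__main__":
    for metric in before:
        change = percent_change(before[metric], after[metric])
        print(f"{metric}: {mean(before[metric]):.2f} -> "
              f"{mean(after[metric]):.2f} ({change:+.2f}%)")
```

On these numbers the mean RPS goes up by a little over 1% and the mean latency drops by a similar amount. The per-run ranges of the two configurations overlap, so more runs would likely be needed before treating the difference as significant.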

MichalStrehovsky added the `help wanted` and `area-NativeAOT-coreclr` labels May 11, 2023

ghost added the `untriaged` label May 11, 2023

MichalStrehovsky added this to the 8.0.0 milestone May 11, 2023

ghost removed the `untriaged` label May 11, 2023

agocke added this to AppModel May 22, 2023

agocke modified the milestones: 8.0.0, Future Jul 5, 2023
