Add PGO+LTO Makefile #45641
Conversation
Can we use …?
I tried …
It's not just compile-time improvements, but also runtime:

```console
$ cat alloc.jl
@time for i in 1:1000000000
    string(i)
end
$ julia ./alloc.jl
```
How does this affect runtime?
Might be that we misconfigured it :)
From …

The symbols with … So it seems …
OK, so this is the C side of the runtime which gets optimised, but the code generated by Julia should still be the same, right?
Force-pushed from 0f4e2a6 to 30d0057.
Overall, this looks really nice! Excellent work! I think there are a few more pieces that would be really helpful for this:

- Tracking down why `Clang_jll` doesn't work. I'll be happy to help investigate whether we're building it wrong or what. Being able to use that would improve the ergonomics of this significantly, IMO.
- A smoke-test script that can be run to build Julia, generate an example trace, then rebuild Julia with that profile data (a rough sketch follows below). We can, for example, run that on CI to ensure that we don't break this in the future.
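A minimal sketch of what such a smoke test could look like, assuming the `stage1`/`stage2` targets and the `stage1.build` output directory from this PR (the `stage2.build` path and the choice of `Example` as workload are assumptions, not part of this PR):

```bash
#!/usr/bin/env bash
# Hypothetical PGO+LTO smoke test: build the instrumented stage1, collect a
# small profile, then rebuild with that profile and run the result once.
set -euo pipefail

cd contrib/pgo-lto
make -j"$(nproc)" stage1
make clean-profiles

# Any small but representative workload works; Pkg.test of a tiny package is
# just an example here.
./stage1.build/julia -O3 -e 'using Pkg; Pkg.add("Example"); Pkg.test("Example")'

make -j"$(nproc)" stage2

# Sanity check that the PGO+LTO build starts and runs code.
./stage2.build/julia -e 'println("smoke test ok")'
```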
`contrib/pgo-lto/Makefile` (outdated):
```make
stage1: export CFLAGS=-fprofile-generate=$(PROFILE_DIR) -Xclang -mllvm -Xclang -vp-counters-per-site=$(COUNTERS_PER_SITE)
stage1: export CXXFLAGS=-fprofile-generate=$(PROFILE_DIR) -Xclang -mllvm -Xclang -vp-counters-per-site=$(COUNTERS_PER_SITE)
stage1: export LDFLAGS=-fuse-ld=lld -flto=thin -fprofile-generate=$(PROFILE_DIR)
```
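For anyone less familiar with these flags, here is a minimal standalone sketch of the instrument/run/merge/reuse cycle they drive, outside of Julia's build (file names are made up; `llvm-profdata` is the stock LLVM tool; the `-mllvm -vp-counters-per-site=...` knob simply raises the number of value-profiling counters recorded per call site):

```console
# 1. Build instrumented and run a representative workload.
$ clang -O2 -flto=thin -fuse-ld=lld -fprofile-generate=./profiles app.c -o app
$ ./app

# 2. Merge the raw profiles into a single indexed profile.
$ llvm-profdata merge -output=merged.prof ./profiles/*.profraw

# 3. Rebuild using the collected profile.
$ clang -O2 -flto=thin -fuse-ld=lld -fprofile-use=merged.prof app.c -o app-pgo
```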
I actually didn't know you could attach environment variables to targets like this in Make! This is very cool!
Quick demonstration for anyone else watching who wants to understand better how this interacts with rules and dependencies:

```console
$ cat Makefile
all: foo bar foobar

# This rule will have `$FOO` defined within it
foo:
	@echo "[foo] FOO: $${FOO}"

# This rule will not
bar:
	@echo "[bar] FOO: $${FOO}"

# Even though this rule depends on `foo`, it won't have `$FOO` defined.
foobar: foo bar
	@echo "[foobar] FOO: $${FOO}"

# Attach an environment variable to `foo`
foo: export FOO=foo

$ make
[foo] FOO: foo
[bar] FOO:
[foobar] FOO:
```
Yeah, this is neat! :D
It wasn't so neat in the end, because the variables are also set on prerequisites.
Regarding …: when comparing system clang/lld vs Julia's clang/lld, it seems the system version generates just a …

Edit: okay, so the issue is potentially that the static lib has a …
Possibly related: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100896 ?
That's probably it...

```c
// ctors.c
#include <stdio.h>

static void ctor() { puts("ctor"); }
static void dtor() { puts("dtor"); }

static void (*const ctors []) ()
  __attribute__ ((used, section (".ctors"), aligned (sizeof (void *))))
  = { ctor };

static void (*const dtors []) ()
  __attribute__ ((used, section (".dtors"), aligned (sizeof (void *))))
  = { dtor };
```

```c
// init_array.c
#include <stdio.h>

static void init() { puts ("init_array"); }
static void fini () { puts ("fini_array"); }

static void (*const init_array []) ()
  __attribute__ ((used, section (".init_array"), aligned (sizeof (void *))))
  = { init };

static void (*const fini_array []) ()
  __attribute__ ((used, section (".fini_array"), aligned (sizeof (void *))))
  = { fini };
```

```c
// main.c
#include <stdio.h>

int main() { puts("hello world"); }
```

```console
$ clang -fuse-ld=ld main.c ctors.c init_array.c -o with_ld
$ ./with_ld
ctor
init_array
hello world
fini_array
dtor
$ clang -fuse-ld=lld main.c ctors.c init_array.c -o with_lld
$ ./with_lld
init_array
hello world
fini_array
```
So the solution is to configure GCC < 11 targeting Linux with …

Edit: https://github.com/JuliaBinaryWrappers/GCCBootstrap_jll.jl/releases uses …
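As an aside, a quick way to see which constructor scheme an object file or static archive ended up with is to look at its section headers; a small sketch (the archive name is just an example):

```console
# .ctors/.dtors vs .init_array/.fini_array show up in the section table of
# each archive member.
$ objdump -h libclang_rt.profile-x86_64.a | grep -E '\.(ctors|dtors|init_array|fini_array)'
```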
Will this be something used in the official or nightly binaries, or will it be available only to people that build julia on their own? Asking as I do not know exactly how the …
If you have a recent clang and lld on your system, it might be best to try commit c949fc0 and go through the 6 steps at the top of this PR. In the newer commits the idea is to use a patched Yggdrasil version of clang so you no longer need clang installed, but currently make gets stuck in an infinite loop, and I haven't had time to check why yet.
`contrib/pgo-lto/Makefile` (outdated):
```make
	$(MAKE) -C $(STAGE0_BUILD)/deps install-clang install-llvm install-llvm-tools
	# Turn [cd]tors into init/fini_array sections in libclang_rt, since lld
	# doesn't do that, and otherwise the profile constructor is not executed
	find $< -name 'libclang_rt.profile-*.a' -exec objcopy --rename-section .ctors=.init_array --rename-section .dtors=.fini_array {} +
```
Do these not have opposite ordering?
That could be, need to check. It's likely there's only one global, so it might not be an issue.
```console
$ nm --defined-only ./clang/14.0.5/lib/linux/libclang_rt.profile-x86_64.a | grep GLOBAL
0000000000000000 t _GLOBAL__sub_I_InstrProfilingRuntime.cpp
```
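To back up the "likely only one global" guess, one could count the compiler-generated static constructors in the archive; a sketch against the same file as above:

```console
# Each _GLOBAL__sub_I_* symbol is one static constructor, so ordering between
# constructors only matters if this prints more than 1.
$ nm --defined-only ./clang/14.0.5/lib/linux/libclang_rt.profile-x86_64.a | grep -c '_GLOBAL__sub_I_'
1
```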
It's not certain yet; we'll need to do pretty extensive testing to ensure that it doesn't, e.g., speed up some workloads but slow down others. Most likely this will be used for application-specific Julia builds: you have a workload and you want Julia to run 10% faster on that workload, so you can profile it on exactly that workload.
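Concretely, following the steps from this PR's description but profiling your own application instead of the LoopVectorization example would look roughly like this (the workload path is a placeholder):

```console
$ cd contrib/pgo-lto
$ make -j$(nproc) stage1
$ make clean-profiles
# Run the application you actually care about on the instrumented build.
$ ./stage1.build/julia -O3 /path/to/my_workload.jl
$ make -j$(nproc) stage2
```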
Force-pushed from 229819b to 6f56ffd.
Adds a convenient way to enable PGO+LTO on Julia and LLVM together:

1. `cd contrib/pgo-lto`
2. `make -j$(nproc) stage1`
3. `make clean-profiles`
4. `./stage1.build/julia -O3 -e 'using Pkg; Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")'`
5. `make -j$(nproc) stage2`

This results quite often in spectacular speedups for time to first X, as it reduces the time spent in LLVM optimization passes by 25% or even 30%.

Example 1:

```julia
using LoopVectorization

function f!(a, b)
    @turbo for i in eachindex(a)
        a[i] *= b[i]
    end
    return a
end

f!(rand(1), rand(1))
```

```console
$ time ./julia -O3 lv.jl
```

Without PGO+LTO: 14.801s
With PGO+LTO: 11.978s (-19%)

Example 2:

```console
$ time ./julia -e 'using Pkg; Pkg.test("Unitful");'
```

Without PGO+LTO: 1m47.688s
With PGO+LTO: 1m35.704s (-11%)

Example 3 (taken from issue JuliaLang#45395, which is almost only LLVM):

```console
$ JULIA_LLVM_ARGS=-time-passes ./julia script-45395.jl
```

Without PGO+LTO:

```
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 101.0130 seconds (98.6253 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  53.6961 ( 54.7%)   0.1050 (  3.8%)  53.8012 ( 53.3%)  53.8045 ( 54.6%)  Unroll loops
  25.5423 ( 26.0%)   0.0072 (  0.3%)  25.5495 ( 25.3%)  25.5444 ( 25.9%)  Global Value Numbering
   7.1995 (  7.3%)   0.0526 (  1.9%)   7.2521 (  7.2%)   7.2517 (  7.4%)  Induction Variable Simplification
   5.0541 (  5.1%)   0.0098 (  0.3%)   5.0639 (  5.0%)   5.0561 (  5.1%)  Combine redundant instructions JuliaLang#2
```

With PGO+LTO:

```
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 72.6507 seconds (70.1337 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  36.0894 ( 51.7%)   0.0825 (  2.9%)  36.1719 ( 49.8%)  36.1738 ( 51.6%)  Unroll loops
  16.5713 ( 23.7%)   0.0129 (  0.5%)  16.5843 ( 22.8%)  16.5794 ( 23.6%)  Global Value Numbering
   5.9047 (  8.5%)   0.0395 (  1.4%)   5.9442 (  8.2%)   5.9438 (  8.5%)  Induction Variable Simplification
   4.7566 (  6.8%)   0.0078 (  0.3%)   4.7645 (  6.6%)   4.7575 (  6.8%)  Combine redundant instructions JuliaLang#2
```

Or -28% time spent in LLVM.

---

Finally there's a significant reduction in binary sizes. For libLLVM.so:

```
79M usr/lib/libLLVM-13jl.so (before)
67M usr/lib/libLLVM-13jl.so (after)
```

And it can be reduced by another 2MB with `--icf=safe` when using LLD as a linker anyways.

Squashed commit messages:
- Turn into makefile
- Newline
- Use two out of source builds
- Ignore profiles + build dirs
- Add --icf=safe
- stage0 setup
- prebuilt clang with [cd]tors->init/fini patch
Force-pushed from 6f56ffd to dbffe91.
This should now build Julia with BB's LLVM.
I'm getting an error building stage1 here (on a fresh clone): …
I've seen that before when a different …
I think so?
Tracing execution in GDB, it looks like we're dispatching to the codegen stubs instead of the actual compiler.

Ah, this is once more caused by the libstdc++ we helpfully put there (I'm on Arch, so have a recent libc):

Removing that makes the build continue.
Yeah, Julia's libstdc++ detection looks wrong with clang: it uses the default Fortran compiler, assumes it's GCC, and uses the libstdc++ in its install prefix. But clang will search for the most recent GCC on the system and use the libstdc++ shipped with that. @staticfloat maybe it's more reliable to use something along the lines of …
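To see which GCC installation (and hence which libstdc++) clang actually picks on a given machine, its driver output is handy; a small sketch (the path in the sample output is just an example):

```console
# clang prints the GCC installation it selected while setting up the driver.
$ clang++ -v -x c++ /dev/null -fsyntax-only 2>&1 | grep -i 'gcc installation'
Selected GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/12

# And the libstdc++ that goes with it:
$ clang++ -print-file-name=libstdc++.so.6
```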
I can reproduce the speed-up, but at least some of it seems to come from the LLVM source build that's involved (either the fact that it's a source build, or the …
Probably …
A comment by @gbaraldi is that we should check whether the PGO-attained performance benefit is portable across systems. The easiest way for that is if we could trick the buildbots into generating PGO-optimized binaries, and run PkgEval on that. For testing purposes, maybe we could commit a merged profile trace (e.g. from running …).
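For reference, raw profiles from an instrumented run can be merged into the single `profiles/merged.prof` file this Makefile expects using the stock LLVM tooling; a sketch (the `*.profraw` glob and the tarball name are assumptions):

```console
# Merge all raw profile files collected under profiles/ into one indexed profile.
$ llvm-profdata merge -output=profiles/merged.prof profiles/*.profraw

# Archive it so it could be committed or handed to CI, as suggested above.
$ tar -czf pgo-profiles.tar.gz -C profiles merged.prof
```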
Here's a … Without changing the Makefile:

```console
cd contrib/pgo-lto
make stage0 -j$(nproc)
touch stage1
mkdir profiles
curl -Lfs https://github.com/JuliaLang/julia/files/9048880/data.tar.gz | tar -zxf- -C profiles/
touch profiles/merged.prof
make stage2 -j$(nproc)
```
Turns out this is because …

Workaround: use …

Another issue is that Julia, when built with LTO and PGO, drops the …
bump?

Any update on this?

IMO we should merge this. Tagging triage to confirm.

Triage thinks this is a good idea, provided there is a roadmap to eventually turn it on by default. Feel free to merge when ready.

I was just wondering if this PR would be added before the 1.11 feature freeze? Thank you.

yes.

Is it on the roadmap for this to become the default way julia is built?

if someone makes the PR to do that :)
```make
AFTER_STAGE1_MESSAGE:='Run `make clean-profiles` to start with a clean slate. $\
Then run Julia to collect realistic profile data, for example: `$(STAGE1_BUILD)/julia -O3 -e $\
'\''using Pkg; Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")'\''`. This $\
```
Just a reminder to whoever potentially pursues this in the future: LoopVectorization is somewhat sunsetting, and a different default should probably be picked for profiling. If someone among the core devs has good suggestions, I would be happy to set up some of the future PRs related to fixing this and making PGO/LTO the default.
We should at least have a CI pipeline that checks that this doesn't break.
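If LoopVectorization is retired as the example workload, any reasonably compile-heavy package test should do for collecting profiles; a hypothetical replacement (the package choice here is purely illustrative):

```console
$ make clean-profiles
$ ./stage1.build/julia -O3 -e 'using Pkg; Pkg.add("JSON"); Pkg.test("JSON")'
$ make -j$(nproc) stage2
```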
I seem to be able to compile Julia with the default Makefile, but trying to run … leads to a hash mismatch for a downloaded file. Not sure where to look for that:
For version 10.0, we were able to derive a PKGBUILD (an Arch Linux package description) that makes full use of PGO, following the steps of the first post in this thread: https://github.com/CachyOS/CachyOS-PKGBUILDS/blob/master/julia/PKGBUILD
Nice, I didn't realize this got merged. It did not bitrot?
Adds a convenient way to enable PGO+LTO on Julia and LLVM together:

1. `cd contrib/pgo-lto`
2. `make -j$(nproc) stage1`
3. `make clean-profiles`
4. `./stage1.build/julia -O3 -e 'using Pkg; Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")'`
5. `make -j$(nproc) stage2`

* Output looks roughly as follows: …

This results quite often in spectacular speedups for time to first X, as it reduces the time spent in LLVM optimization passes by 25% or even 30%.

Example 1:

```console
$ time ./julia -O3 lv.jl
```

Without PGO+LTO: 14.801s
With PGO+LTO: 11.978s (-19%)

Example 2:

```console
$ time ./julia -e 'using Pkg; Pkg.test("Unitful");'
```

Without PGO+LTO: 1m47.688s
With PGO+LTO: 1m35.704s (-11%)

Example 3 (taken from issue #45395, which is almost only LLVM):

```console
$ JULIA_LLVM_ARGS=-time-passes ./julia script-45395.jl
```

Without PGO+LTO: …
With PGO+LTO: …

Or -28% time spent in LLVM. `perf` reports show this is mostly fewer instructions and a reduction in icache misses.

Finally there's a significant reduction in binary sizes. For libLLVM.so: …

And it can be reduced by another 2MB with `--icf=safe` when using LLD as a linker anyways.
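For anyone curious how `--icf=safe` is passed in practice: it is an lld option, so it goes through the linker driver; a minimal sketch with made-up inputs:

```console
# Safe identical-code folding when linking with lld via the clang driver.
$ clang -fuse-ld=lld -Wl,--icf=safe main.o libfoo.a -o main
```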