Add PGO+LTO Makefile #45641
Conversation
Can we use …?
I tried …
It's not just compile-time improvements, but also runtime:

```console
$ cat alloc.jl
@time for i in 1:1000000000
    string(i)
end
$ julia ./alloc.jl
```
How does this affect runtime?
Might be that we misconfigured it :)
From …

The symbols with … So it seems …
OK, so this is the C side of the runtime which gets optimised, but the code generated by Julia should still be the same, right?
Force-pushed from 0f4e2a6 to 30d0057.
Overall, this looks really nice! Excellent work! I think there are a few more pieces that would be really helpful for this:

- Tracking down why `Clang_jll` doesn't work. I'll be happy to help investigate whether we're building it wrong or what. Being able to use that would improve the ergonomics of this significantly, IMO.
- A smoke-test script that can be run to build Julia, generate an example trace, then rebuild Julia with that profile data (a rough sketch follows below). We can, for example, run that on CI to ensure that we don't break this in the future.
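A minimal sketch of what such a smoke test could look like, assuming the `stage1`/`stage2` targets and the `stage1.build` output directory from this PR (the `stage2.build` path and the choice of `Example` as workload are assumptions, not part of this PR):

```bash
#!/usr/bin/env bash
# Hypothetical PGO+LTO smoke test: build the instrumented stage1, collect a
# small profile, then rebuild with that profile and run the result once.
set -euo pipefail

cd contrib/pgo-lto
make -j"$(nproc)" stage1
make clean-profiles

# Any small but representative workload works; Pkg.test of a tiny package is
# just an example here.
./stage1.build/julia -O3 -e 'using Pkg; Pkg.add("Example"); Pkg.test("Example")'

make -j"$(nproc)" stage2

# Sanity check that the PGO+LTO build starts and runs code.
./stage2.build/julia -e 'println("smoke test ok")'
```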
`contrib/pgo-lto/Makefile` (outdated):
```make
stage1: export CFLAGS=-fprofile-generate=$(PROFILE_DIR) -Xclang -mllvm -Xclang -vp-counters-per-site=$(COUNTERS_PER_SITE)
stage1: export CXXFLAGS=-fprofile-generate=$(PROFILE_DIR) -Xclang -mllvm -Xclang -vp-counters-per-site=$(COUNTERS_PER_SITE)
stage1: export LDFLAGS=-fuse-ld=lld -flto=thin -fprofile-generate=$(PROFILE_DIR)
```
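For anyone less familiar with these flags, here is a minimal standalone sketch of the instrument/run/merge/reuse cycle they drive, outside of Julia's build (file names are made up; `llvm-profdata` is the stock LLVM tool; the `-mllvm -vp-counters-per-site=...` knob simply raises the number of value-profiling counters recorded per call site):

```console
# 1. Build instrumented and run a representative workload.
$ clang -O2 -flto=thin -fuse-ld=lld -fprofile-generate=./profiles app.c -o app
$ ./app

# 2. Merge the raw profiles into a single indexed profile.
$ llvm-profdata merge -output=merged.prof ./profiles/*.profraw

# 3. Rebuild using the collected profile.
$ clang -O2 -flto=thin -fuse-ld=lld -fprofile-use=merged.prof app.c -o app-pgo
```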
I actually didn't know you could attach environment variables to targets like this in Make! This is very cool!
Quick demonstration for anyone else watching who wants to understand better how this interacts with rules and dependencies:

```console
$ cat Makefile
all: foo bar foobar

# This rule will have `$FOO` defined within it
foo:
	@echo "[foo] FOO: $${FOO}"

# This rule will not
bar:
	@echo "[bar] FOO: $${FOO}"

# Even though this rule depends on `foo`, it won't have `$FOO` defined.
foobar: foo bar
	@echo "[foobar] FOO: $${FOO}"

# Attach an environment variable to `foo`
foo: export FOO=foo

$ make
[foo] FOO: foo
[bar] FOO:
[foobar] FOO:
```
Yeah, this is neat! :D
It wasn't so neat in the end, because the variables are also set on prerequisites.
Regarding …: when comparing system clang/lld vs Julia's clang/lld, it seems the system version generates just a …

Edit: okay, so the issue is potentially that the static lib has a …
Possibly related: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100896 ?
That's probably it...

```c
// ctors.c
#include <stdio.h>

static void ctor() { puts("ctor"); }
static void dtor() { puts("dtor"); }

static void (*const ctors []) ()
  __attribute__ ((used, section (".ctors"), aligned (sizeof (void *))))
  = { ctor };

static void (*const dtors []) ()
  __attribute__ ((used, section (".dtors"), aligned (sizeof (void *))))
  = { dtor };
```

```c
// init_array.c
#include <stdio.h>

static void init() { puts ("init_array"); }
static void fini () { puts ("fini_array"); }

static void (*const init_array []) ()
  __attribute__ ((used, section (".init_array"), aligned (sizeof (void *))))
  = { init };

static void (*const fini_array []) ()
  __attribute__ ((used, section (".fini_array"), aligned (sizeof (void *))))
  = { fini };
```

```c
// main.c
#include <stdio.h>

int main() { puts("hello world"); }
```

```console
$ clang -fuse-ld=ld main.c ctors.c init_array.c -o with_ld
$ ./with_ld
ctor
init_array
hello world
fini_array
dtor
$ clang -fuse-ld=lld main.c ctors.c init_array.c -o with_lld
$ ./with_lld
init_array
hello world
fini_array
```
So the solution is to configure GCC < 11 targeting Linux with …

Edit: https://github.com/JuliaBinaryWrappers/GCCBootstrap_jll.jl/releases uses …
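As an aside, a quick way to see which constructor scheme an object file or static archive ended up with is to look at its section headers; a small sketch (the archive name is just an example):

```console
# .ctors/.dtors vs .init_array/.fini_array show up in the section table of
# each archive member.
$ objdump -h libclang_rt.profile-x86_64.a | grep -E '\.(ctors|dtors|init_array|fini_array)'
```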
Will this be something used in the official or nightly binaries, or will it be available only to people that build julia on their own? Asking as I do not know exactly how the …
If you have a recent clang and lld on your system, it might be best to try commit c949fc0 and go through the 6 steps at the top of this PR. In the newer commits the idea is to use a patched Yggdrasil version of clang so you no longer need clang installed, but currently make gets stuck in an infinite loop, and I haven't had time to check why yet.
`contrib/pgo-lto/Makefile` (outdated):
```make
	$(MAKE) -C $(STAGE0_BUILD)/deps install-clang install-llvm install-llvm-tools
	# Turn [cd]tors into init/fini_array sections in libclang_rt, since lld
	# doesn't do that, and otherwise the profile constructor is not executed
	find $< -name 'libclang_rt.profile-*.a' -exec objcopy --rename-section .ctors=.init_array --rename-section .dtors=.fini_array {} +
```
Do these not have opposite ordering?
That could be, need to check. It's likely there's only one global, so it might not be an issue.
```console
$ nm --defined-only ./clang/14.0.5/lib/linux/libclang_rt.profile-x86_64.a | grep GLOBAL
0000000000000000 t _GLOBAL__sub_I_InstrProfilingRuntime.cpp
```
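To back up the "likely only one global" guess, one could count the compiler-generated static constructors in the archive; a sketch against the same file as above:

```console
# Each _GLOBAL__sub_I_* symbol is one static constructor, so ordering between
# constructors only matters if this prints more than 1.
$ nm --defined-only ./clang/14.0.5/lib/linux/libclang_rt.profile-x86_64.a | grep -c '_GLOBAL__sub_I_'
1
```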
It's not certain yet; we'll need to do pretty extensive testing to ensure that it doesn't, e.g., speed up some workloads but slow down others. Most likely this will be used for application-specific Julia builds: you have a workload and you want Julia to run 10% faster on that workload, so you can profile it on exactly that workload.
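Concretely, following the steps from this PR's description but profiling your own application instead of the LoopVectorization example would look roughly like this (the workload path is a placeholder):

```console
$ cd contrib/pgo-lto
$ make -j$(nproc) stage1
$ make clean-profiles
# Run the application you actually care about on the instrumented build.
$ ./stage1.build/julia -O3 /path/to/my_workload.jl
$ make -j$(nproc) stage2
```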
Force-pushed from 229819b to 6f56ffd.
Adds a convenient way to enable PGO+LTO on Julia and LLVM together:

1. `cd contrib/pgo-lto`
2. `make -j$(nproc) stage1`
3. `make clean-profiles`
4. `./stage1.build/julia -O3 -e 'using Pkg; Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")'`
5. `make -j$(nproc) stage2`

This results quite often in spectacular speedups for time to first X, as it reduces the time spent in LLVM optimization passes by 25% or even 30%.

Example 1:

```julia
using LoopVectorization

function f!(a, b)
    @turbo for i in eachindex(a)
        a[i] *= b[i]
    end
    return a
end

f!(rand(1), rand(1))
```

```console
$ time ./julia -O3 lv.jl
```

Without PGO+LTO: 14.801s
With PGO+LTO: 11.978s (-19%)

Example 2:

```console
$ time ./julia -e 'using Pkg; Pkg.test("Unitful");'
```

Without PGO+LTO: 1m47.688s
With PGO+LTO: 1m35.704s (-11%)

Example 3 (taken from issue JuliaLang#45395, which is almost only LLVM):

```console
$ JULIA_LLVM_ARGS=-time-passes ./julia script-45395.jl
```

Without PGO+LTO:

```
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 101.0130 seconds (98.6253 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  53.6961 ( 54.7%)   0.1050 (  3.8%)  53.8012 ( 53.3%)  53.8045 ( 54.6%)  Unroll loops
  25.5423 ( 26.0%)   0.0072 (  0.3%)  25.5495 ( 25.3%)  25.5444 ( 25.9%)  Global Value Numbering
   7.1995 (  7.3%)   0.0526 (  1.9%)   7.2521 (  7.2%)   7.2517 (  7.4%)  Induction Variable Simplification
   5.0541 (  5.1%)   0.0098 (  0.3%)   5.0639 (  5.0%)   5.0561 (  5.1%)  Combine redundant instructions JuliaLang#2
```

With PGO+LTO:

```
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 72.6507 seconds (70.1337 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  36.0894 ( 51.7%)   0.0825 (  2.9%)  36.1719 ( 49.8%)  36.1738 ( 51.6%)  Unroll loops
  16.5713 ( 23.7%)   0.0129 (  0.5%)  16.5843 ( 22.8%)  16.5794 ( 23.6%)  Global Value Numbering
   5.9047 (  8.5%)   0.0395 (  1.4%)   5.9442 (  8.2%)   5.9438 (  8.5%)  Induction Variable Simplification
   4.7566 (  6.8%)   0.0078 (  0.3%)   4.7645 (  6.6%)   4.7575 (  6.8%)  Combine redundant instructions JuliaLang#2
```

Or -28% time spent in LLVM.

---

Finally there's a significant reduction in binary sizes. For libLLVM.so:

```
79M usr/lib/libLLVM-13jl.so (before)
67M usr/lib/libLLVM-13jl.so (after)
```

And it can be reduced by another 2MB with `--icf=safe` when using LLD as a linker anyways.

Squashed commit messages:
- Turn into makefile
- Newline
- Use two out of source builds
- Ignore profiles + build dirs
- Add --icf=safe
- stage0 setup
- prebuilt clang with [cd]tors->init/fini patch
Force-pushed from 6f56ffd to dbffe91.
This should now build Julia with BB's LLVM.
I'm getting an error building stage1 here (on a fresh clone): …
I've seen that before when a different …
I think so?
Tracing execution in GDB, it looks like we're dispatching to the codegen stubs instead of the actual compiler.

Ah, this is once more caused by the libstdc++ we helpfully put there (I'm on Arch, so have a recent libc):

Removing that makes the build continue.
Yeah, Julia's libstdc++ detection looks wrong with clang: it uses the default Fortran compiler, assumes it's GCC, and uses the libstdc++ in its install prefix. But clang will search for the most recent GCC on the system and use the libstdc++ shipped with that. @staticfloat maybe it's more reliable to use something along the lines of …
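To see which GCC installation (and hence which libstdc++) clang actually picks on a given machine, its driver output is handy; a small sketch (the path in the sample output is just an example):

```console
# clang prints the GCC installation it selected while setting up the driver.
$ clang++ -v -x c++ /dev/null -fsyntax-only 2>&1 | grep -i 'gcc installation'
Selected GCC installation: /usr/lib/gcc/x86_64-pc-linux-gnu/12

# And the libstdc++ that goes with it:
$ clang++ -print-file-name=libstdc++.so.6
```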
I can reproduce the speed-up, but at least some of it seems to come from the LLVM source build that's involved (either the fact that it's a source build, or the …
Probably …
A comment by @gbaraldi is that we should check whether the PGO-attained performance benefit is portable across systems. The easiest way for that is if we could trick the buildbots into generating PGO-optimized binaries, and run PkgEval on that. For testing purposes, maybe we could commit a merged profile trace (e.g. from running …).
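For reference, raw profiles from an instrumented run can be merged into the single `profiles/merged.prof` file this Makefile expects using the stock LLVM tooling; a sketch (the `*.profraw` glob and the tarball name are assumptions):

```console
# Merge all raw profile files collected under profiles/ into one indexed profile.
$ llvm-profdata merge -output=profiles/merged.prof profiles/*.profraw

# Archive it so it could be committed or handed to CI, as suggested above.
$ tar -czf pgo-profiles.tar.gz -C profiles merged.prof
```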
Here's a … Without changing the Makefile:

```console
cd contrib/pgo-lto
make stage0 -j$(nproc)
touch stage1
mkdir profiles
curl -Lfs https://github.com/JuliaLang/julia/files/9048880/data.tar.gz | tar -zxf- -C profiles/
touch profiles/merged.prof
make stage2 -j$(nproc)
```
Turns out this is because …

Workaround: use …

Another issue is that Julia, when built with LTO and PGO, drops the …
bump?

Any update on this?

IMO we should merge this. Tagging triage to confirm.

Triage thinks this is a good idea, provided there is a roadmap to eventually turn it on by default. Feel free to merge when ready.

I was just wondering if this PR would be added before the 1.11 feature freeze? Thank you.

yes.

Is it on the roadmap for this to become the default way julia is built?

if someone makes the PR to do that :)
```make
AFTER_STAGE1_MESSAGE:='Run `make clean-profiles` to start with a clean slate. $\
Then run Julia to collect realistic profile data, for example: `$(STAGE1_BUILD)/julia -O3 -e $\
'\''using Pkg; Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")'\''`. This $\
```
Just a reminder to whoever potentially pursues this in the future: LoopVectorization is somewhat sunsetting, and a different default should probably be picked for profiling. If someone among the core devs has good suggestions, I would be happy to set up some of the future PRs related to fixing this and making PGO/LTO the default.
We should at least have a CI pipeline that checks that this doesn't break.
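If LoopVectorization is retired as the example workload, any reasonably compile-heavy package test should do for collecting profiles; a hypothetical replacement (the package choice here is purely illustrative):

```console
$ make clean-profiles
$ ./stage1.build/julia -O3 -e 'using Pkg; Pkg.add("JSON"); Pkg.test("JSON")'
$ make -j$(nproc) stage2
```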
I seem to be able to compile Julia with the default Makefile, but trying to run … leads to a hash mismatch for a downloaded file. Not sure where to look for that:
For version 10.0, we were able to derive a PKGBUILD (an Arch Linux package description) that makes full use of PGO, following the steps of the first post in this thread: https://github.com/CachyOS/CachyOS-PKGBUILDS/blob/master/julia/PKGBUILD
Nice, I didn't realize this got merged. It did not bitrot?
Adds a convenient way to enable PGO+LTO on Julia and LLVM together:

1. `cd contrib/pgo-lto`
2. `make -j$(nproc) stage1`
3. `make clean-profiles`
4. `./stage1.build/julia -O3 -e 'using Pkg; Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")'`
5. `make -j$(nproc) stage2`

* Output looks roughly as follows: …

This results quite often in spectacular speedups for time to first X, as it reduces the time spent in LLVM optimization passes by 25% or even 30%.

Example 1:

```console
$ time ./julia -O3 lv.jl
```

Without PGO+LTO: 14.801s
With PGO+LTO: 11.978s (-19%)

Example 2:

```console
$ time ./julia -e 'using Pkg; Pkg.test("Unitful");'
```

Without PGO+LTO: 1m47.688s
With PGO+LTO: 1m35.704s (-11%)

Example 3 (taken from issue #45395, which is almost only LLVM):

```console
$ JULIA_LLVM_ARGS=-time-passes ./julia script-45395.jl
```

Without PGO+LTO: …
With PGO+LTO: …

Or -28% time spent in LLVM. `perf` reports show this is mostly fewer instructions and a reduction in icache misses.

Finally there's a significant reduction in binary sizes. For libLLVM.so: …

And it can be reduced by another 2MB with `--icf=safe` when using LLD as a linker anyways.
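For anyone curious how `--icf=safe` is passed in practice: it is an lld option, so it goes through the linker driver; a minimal sketch with made-up inputs:

```console
# Safe identical-code folding when linking with lld via the clang driver.
$ clang -fuse-ld=lld -Wl,--icf=safe main.o libfoo.a -o main
```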