Parallelize LLVM image generation #47797
Conversation
(force-pushed from 3a28ebe to c8dd1ef)
Sounds great! If this means that threading would be used for pkgimage generation during package precompilation, then we might need a slightly cleverer load-balancing approach during parallel precompilation. I can imagine an approach that could nicely get away from the long-tail effect there.
This reduces the long tail.
Yeah, we were particularly interested in reducing the long tail here, where single very large packages take excessive time.
Just out of curiosity: would it be possible to reduce peak memory usage for building large sysimages by using more chunks than threads with some basic scheduler? (It may also help to distribute the load for CPUs with fewer threads.)
It's possible to chop up modules in this way, but judging by the build failures we actually use the most memory when we dump the sysimage bytes at the end, rather than during optimization. If that stops being an issue, we may want to revisit this.
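To illustrate what "more chunks than threads with some basic scheduler" could look like, here is a hedged sketch in Julia; nothing like this exists in the PR, and the per-function cost estimates and chunk count are made up for illustration:

```julia
# Hypothetical sketch: greedy longest-processing-time partitioning of estimated
# per-function compile costs into more chunks than threads, so a simple scheduler
# could keep threads busy while limiting how much work is in flight at once.
function partition_greedy(costs::Vector{Int}, nchunks::Int)
    chunks = [Int[] for _ in 1:nchunks]   # function indices assigned to each chunk
    loads  = zeros(Int, nchunks)          # running total cost per chunk
    for i in sortperm(costs; rev = true)  # place the most expensive items first
        j = argmin(loads)                 # always into the currently lightest chunk
        push!(chunks[j], i)
        loads[j] += costs[i]
    end
    return chunks
end

# Example: 200 functions with made-up costs, 4 chunks per available thread.
costs  = rand(1:1_000, 200)
chunks = partition_greedy(costs, 4 * Threads.nthreads())
```

In such a scheme only about `nthreads` chunks would be live at any time, which is where a peak-memory reduction would have to come from.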
(force-pushed from e38c366 to 38c454a)
I just tried this rebased to have pkgimages, and via the old code-load precompilation I don't see >100% CPU.
Is there a print along the lines of the debug timing output that I should be looking for?
All I see is this (and single-thread utilization):
Does it at least appear in the system image building after rebase? If not, then something might have gone wrong there. In any case, I'm working on adapting to the pkgimage changes separately, so we'll see how that goes.
It does.
(force-pushed from 06f01ed to 594fa00)
@IanButterworth could you try with this set of commits? I personally don't see any of the debug timing dumping I'm doing during the package load process, but I do if I set the relevant environment variable.
It does appear to be working. On some packages the LLVM phase during precompilation is quite short with this speedup, so it can be easy to miss in a CPU monitor. I also decided to try unrestricting this during Pkg.precompile (by reverting JuliaLang/Pkg.jl#3273): given that the LLVM stage is now a short fraction of a package's precompilation with this PR, the likelihood of many threads clashing from packages precompiling in parallel and overwhelming the system seems low.
master:
This PR + JuliaLang/Pkg.jl#3273 reverted:
So that's a minute faster, i.e. it now takes 2/3 of the time master takes, and it behaved well; the animation remained smooth. For completeness, during the julia build, outputting the sysimage on master takes 120s, while this PR takes 21s. Seeing how Windows behaves during the same test would be interesting, I think, so I pushed JuliaLang/julia@
That looks like a decent time savings; do we see similar time savings with PackageCompiler custom system images?
Using an environment with only
master (de73c26):
This PR (9647a58):
It's still far from v1.8 times, but slightly faster than master.
@IanButterworth After thinking a little more, I think we'll probably run into bad situations if we precompile a project with multiple heavy top-level dependencies, rather than just one. Right now we only use half the system threads for parallel precompilation, so if there are 3 or more heavy packages at the top level we'll probably see those thread clashes become more problematic.
GNU Make 4.4 has a nice new feature:
Is there any hope of implementing something similar (aiming to maintain a given load average, rather than using a fixed number of threads)? I guess the challenge would also be to do it for all platforms.
We pick the number of threads at start time and partition the work based on that thread count, rather than having a fixed job list with dependencies the way make does. We could potentially use the system load to select a more aggressive fraction of the core count for these jobs, but I don't know if system load will respond quickly enough for it to be a useful heuristic here.
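As a rough illustration of that idea (not something this PR does), a load-aware thread-count heuristic in Julia might look like the sketch below; `fraction` is a made-up parameter, and as noted above the 1-minute load average may react too slowly to track short image-generation bursts:

```julia
# Hypothetical sketch: pick an image-thread count from the current system load.
# Sys.loadavg() returns the 1-, 5-, and 15-minute load averages.
function image_threads_from_load(; fraction = 0.5)
    total = Sys.CPU_THREADS
    busy  = Sys.loadavg()[1]                   # 1-minute load average
    slack = max(1, floor(Int, total - busy))   # cores that currently look idle
    cap   = max(1, floor(Int, fraction * total))  # never exceed this fraction of cores
    return min(slack, cap)
end
```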
I think the most likely useful approach is going to be similar to -j in make; because we know each thread will approximately saturate a single core, we know that the ideal number of threads across all processes is roughly the number of cores in the system. To solve this, we'd need a way to coordinate thread creation across processes, e.g. we create a Julia subprocess and refuse to let it create threads if the parent already has N threads alive. This could actually be useful for normal Julia workloads as well, but in any case, I think it's fine to overload the system for a minute or two for now. This is a big improvement, and overloading the system in the short term should not be considered a blocker, IMO.
A question from the peanut gallery: are you saying the system will be overloaded in terms of CPU usage, or in terms of memory? I'm asking because when parallel precompilation was enabled some time ago, there were a lot of novices who were confused about why Julia was crashing with out-of-memory errors on their systems. I have students who have to start Julia with a single thread, otherwise precompilation of DiffEq and something else at the same time would definitely cause the OOM killer to kill programs on their device. Either way, looking forward to this being merged. Just trying to make sure I know what to tell students with machines with no more than 8 GB of RAM.
@staticfloat I'm now thinking that memory limiting should be out of scope here. It looks like Pkg.jl will be controlling the number of threads for precompile, and therefore Pkg should be responsible for any memory limiting. The other case where this is useful is when building system images, in which case I don't think it's unreasonable to ask people to set the environment variable if that's an issue. What do you think about this?
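For example, a driver like Pkg could in principle cap image threads per precompile worker through the environment, along these lines (a hypothetical sketch, not Pkg.jl's actual implementation; `pkg`, `project`, and `njobs` are placeholder names):

```julia
# Hypothetical sketch: give each of `njobs` concurrent precompile workers a slice
# of the machine, mirroring the "half the system threads" policy mentioned above.
function precompile_worker_cmd(pkg::String, project::String, njobs::Int)
    per_job = max(1, Sys.CPU_THREADS ÷ (2 * njobs))
    # Importing the package in a fresh process triggers its precompilation if stale.
    cmd = `$(Base.julia_cmd()) --project=$project -e "import $pkg"`
    return addenv(cmd, "JULIA_IMAGE_THREADS" => string(per_job))
end

# run(precompile_worker_cmd("Plots", ".", 4))
```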
@nanosoldier
Your package evaluation job has completed - possible new issues were detected.
Are the pkgeval runs equivalent here apart from the commit? Or is one run under rr, etc.?
The last failures were caused by unfortunate timing: a linked change between master and JET.jl went through in the last 3 days that broke packages dependent on JET. Rebasing eliminated that error, but I want to make sure nothing else has changed since then.
@nanosoldier
If this comes back without obvious related failures I think we can call this clean from a testing perspective.
Your package evaluation job has completed - possible new issues were detected.
Tests of Scalpels.jl basically fail because two different environments are used in the two runs of the tests, and only the one for this PR has an old buggy version of Polynomials.jl. @maleadt Using different environments makes apples-to-apples comparisons hard and can add needless noise like in this case; is there a way to avoid that?
Both tests use Polynomials 2.0.25 during package installation, and v1.2.1 during testing; I don't see a difference between the two runs? FWIW, PkgEval.jl checks out and fixes the registry during set-up, so there aren't multiple environments at play.
Uhm, ok, I got confused comparing the environments of the two runs and didn't realise the installation and test environments were different. Scalpels.jl sets needlessly restrictive compat bounds for the test environment, so I opened a PR to remove that: RvSpectML/Scalpels.jl#4. It's still unclear why the warning which causes the test failure is triggered only in one case, but that had been fixed in Polynomials.jl by JuliaMath/Polynomials.jl#442.
This looks good from my side.
LGTM!
Woohoo! Thanks @pchintalapudi, looking forward to seeing this in action.
From 454 seconds to 67 seconds, in one fell swoop, you've reduced overall build CI times by 30%. (The assert build is even more impressive, from 700 seconds to 101 seconds.) The buildbots salute you, sir.
@pchintalapudi, I noticed my M1 Pro only seems to be using 4 cores for generating the sysimage, but the M1 Pro has 8 (or 6) high-performance cores, and manually bumping it using the env variable improves performance (42s -> 33s for the sysimage). Of course I can just set that env variable, but maybe the defaults could be tweaked a bit?
I'm hesitant to generally increase the number of cores allocated, as users with low-power systems will probably want to be doing something else with the rest of their compute while waiting for image compilation. Also, keeping the core count at ~half means we are much less likely to see oversubscription in Pkg.jl runs, since even if two packages overlap we will not have oversubscribed the system. That being said, for the system image or PackageCompiler in particular I think we can just set JULIA_IMAGE_THREADS to Sys.CPU_THREADS, since we don't expect underpowered users to be running those commands anyway.
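For a one-off sysimage or PackageCompiler build, that suggestion would look roughly like the following (a hedged example: the package list and output path are placeholders, and it assumes PackageCompiler's `create_sysimage` API; the env var must be set before the build subprocess is spawned):

```julia
# Use all hardware threads for LLVM image generation in this one-off build.
ENV["JULIA_IMAGE_THREADS"] = string(Sys.CPU_THREADS)

using PackageCompiler
create_sysimage(["Example"]; sysimage_path = "sys_example.so")
```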
for (auto &F : M.functions()) {
    if (!F.isDeclaration()) {
        if (!Preserve.contains(&F)) {
            F.deleteBody();
I am not sure this results in legal IR, since there may still be attributes attached to the module for the declaration (such as DISubprogram and llvm.dbg.cu) which should not be added there. We used to have a function (prepare_call) for getting a proper declaration for a definition, but I think we removed that at some point. I thought maybe OrcJIT would have it, but it doesn't look like it includes support for handling debug information correctly either: https://llvm.org/doxygen/CompileOnDemandLayer_8cpp_source.html
Can we just drop all metadata on the function? Since we're still attaching metadata to the real definition, the linker might save us, right?
We probably don't need the metadata, though note this is module metadata, not metadata on the function.
Module metadata seems more complicated since we're not allowed to drop module flags. However, we could go back to using CloneModule either before or after serialization, at a time cost. Presumably CloneModule should handle the debug info correctly.
We emitted them into separate modules to avoid this issue. Can we keep them separate?
It requires a bit more care to keep the modules separate since GPUCompiler also uses this codepath and probably assumes a single module, which I'm more hesitant to mess with.
For DISubprogram, the LangRef says it's legal to have it on a declaration for callsite debug info, so I think it's valid to leave it there despite deleting the function body. I suspect the same reasoning applies to global variable declarations, though the LangRef does not mention that, but stripping them also shouldn't affect correctness. llvm.dbg.cu doesn't have much documentation, but I don't see anything obvious that LLVM doesn't already do (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Utils/CloneFunction.cpp#L280 copies all the compile units, irrespective of whether they're related to the function that was just cloned), so I'm not really seeing anything obviously wrong here.
It requires a bit more care to keep the modules separate since GPUCompiler also uses this codepath and probably assumes a single module, which I'm more hesitant to mess with.
We could merge it in GPUCompiler.jl; generally I would be in favor of exposing the work list to GPUC, since we might want to implement caching differently for Enzyme vs CUDA.
By chopping up the LLVM module into N smaller modules, and doing some of the multiversioning work upfront, we can achieve large speedups from throwing multiple threads at sysimage building. Should probably wait for #47184 to avoid large merge conflicts.
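At a very high level, the parallel pattern is roughly the following (a Julia-flavored sketch only; the real implementation lives in the C++ AOT compiler, and `optimize_and_emit!` and `submodules` are placeholder names):

```julia
# Sketch of the overall pattern: split the image into N submodules and let each
# thread optimize and emit its own piece independently.
function emit_image_parallel(submodules::Vector, optimize_and_emit!)
    results = Vector{Any}(undef, length(submodules))
    Threads.@threads for i in eachindex(submodules)
        results[i] = optimize_and_emit!(submodules[i])
    end
    return results   # per-piece outputs to be linked into the final image
end
```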
@vchuravy is there a good test for these changes on a package image workload?
TODO:
- [ ] figure out why target uses of relocation slots are not being picked up in the pre-optimization scanner