Parallelize LLVM image generation #47797
Conversation
(force-pushed from 3a28ebe to c8dd1ef)
Sounds great! If this means that threading would be used for pkgimage generation during package precompilation, then we might need a slightly cleverer load-balancing approach during parallel precompilation. I can imagine an approach that could nicely get away from the long-tail effect there.
This reduces the long tail.
Yeah, we were particularly interested in reducing the long tail here, where single very large packages take excessive time.
Just out of curiosity: would it be possible to reduce peak memory usage for building large sysimages by using more chunks than threads with some basic scheduler? (It may also help to distribute the load for CPUs with fewer threads.)
It's possible to chop up modules in this way, but judging by the build failures we actually use the most memory when we dump the sysimage bytes at the end, rather than during optimization. If that stops being an issue, we may want to revisit this.
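To illustrate what "more chunks than threads with some basic scheduler" could look like, here is a hedged sketch in Julia; nothing like this exists in the PR, and the per-function cost estimates and chunk count are made up for illustration:

```julia
# Hypothetical sketch: greedy longest-processing-time partitioning of estimated
# per-function compile costs into more chunks than threads, so a simple scheduler
# could keep threads busy while limiting how much work is in flight at once.
function partition_greedy(costs::Vector{Int}, nchunks::Int)
    chunks = [Int[] for _ in 1:nchunks]   # function indices assigned to each chunk
    loads  = zeros(Int, nchunks)          # running total cost per chunk
    for i in sortperm(costs; rev = true)  # place the most expensive items first
        j = argmin(loads)                 # always into the currently lightest chunk
        push!(chunks[j], i)
        loads[j] += costs[i]
    end
    return chunks
end

# Example: 200 functions with made-up costs, 4 chunks per available thread.
costs  = rand(1:1_000, 200)
chunks = partition_greedy(costs, 4 * Threads.nthreads())
```

In such a scheme only about `nthreads` chunks would be live at any time, which is where a peak-memory reduction would have to come from.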
(force-pushed from e38c366 to 38c454a)
I just tried this rebased to have pkgimages, and via the old code-load precompilation I don't see >100% CPU.
Is there a print along the lines of the debug timing output that I should be looking for?
All I see is this (and single-thread utilization):
Does it at least appear in the system image building after rebase? If not, then something might have gone wrong there. In any case, I'm working on adapting to the pkgimage changes separately, so we'll see how that goes.
It does.
(force-pushed from 06f01ed to 594fa00)
@IanButterworth could you try with this set of commits? I personally don't see any of the debug timing dumping I'm doing during the package load process, but I do if I set the relevant environment variable.
It does appear to be working. On some packages the LLVM phase during precompilation is quite short with this speedup, so it can be easy to miss in a CPU monitor. I also decided to try unrestricting this during Pkg.precompile (by reverting JuliaLang/Pkg.jl#3273): given that the LLVM stage is now a short fraction of a package's precompilation with this PR, the likelihood of many threads clashing from packages precompiling in parallel and overwhelming the system seems low.
master:
This PR + JuliaLang/Pkg.jl#3273 reverted:
So that's a minute faster, i.e. it now takes 2/3 of the time master takes, and it behaved well; the animation remained smooth. For completeness, during the julia build, outputting the sysimage on master takes 120s, while this PR takes 21s. Seeing how Windows behaves during the same test would be interesting, I think, so I pushed JuliaLang/julia@
That looks like a decent time savings; do we see similar time savings with PackageCompiler custom system images?
Using an environment with only
master (de73c26):
This PR (9647a58):
It's still far from v1.8 times, but slightly faster than master.
@IanButterworth After thinking a little more, I think we'll probably run into bad situations if we precompile a project with multiple heavy top-level dependencies, rather than just one. Right now we only use half the system threads for parallel precompilation, so if there are 3 or more heavy packages at the top level we'll probably see those thread clashes become more problematic.
GNU Make 4.4 has a nice new feature:
Is there any hope of implementing something similar (aiming to maintain a given load average, rather than using a fixed number of threads)? I guess the challenge would also be to do it for all platforms.
We pick the number of threads at start time and partition the work based on that thread count, rather than having a fixed job list with dependencies the way make does. We could potentially use the system load to select a more aggressive fraction of the core count for these jobs, but I don't know if system load will respond quickly enough for it to be a useful heuristic here.
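As a rough illustration of that idea (not something this PR does), a load-aware thread-count heuristic in Julia might look like the sketch below; `fraction` is a made-up parameter, and as noted above the 1-minute load average may react too slowly to track short image-generation bursts:

```julia
# Hypothetical sketch: pick an image-thread count from the current system load.
# Sys.loadavg() returns the 1-, 5-, and 15-minute load averages.
function image_threads_from_load(; fraction = 0.5)
    total = Sys.CPU_THREADS
    busy  = Sys.loadavg()[1]                   # 1-minute load average
    slack = max(1, floor(Int, total - busy))   # cores that currently look idle
    cap   = max(1, floor(Int, fraction * total))  # never exceed this fraction of cores
    return min(slack, cap)
end
```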
I think the most likely useful approach is going to be similar to -j in make; because we know each thread will approximately saturate a single core, we know that the ideal number of threads across all processes is roughly the number of cores in the system. To solve this, we'd need a way to coordinate thread creation across processes, e.g. we create a Julia subprocess and refuse to let it create threads if the parent already has N threads alive. This could actually be useful for normal Julia workloads as well, but in any case, I think it's fine to overload the system for a minute or two for now. This is a big improvement, and overloading the system in the short term should not be considered a blocker, IMO.
A question from the peanut gallery: are you saying the system will be overloaded in terms of CPU usage, or in terms of memory? I'm asking because when parallel precompilation was enabled some time ago, there were a lot of novices who were confused about why Julia was crashing with out-of-memory errors on their systems. I have students who have to start Julia with a single thread, otherwise precompilation of DiffEq and something else at the same time would definitely cause the OOM killer to kill programs on their device. Either way, looking forward to this being merged. Just trying to make sure I know what to tell students with machines with no more than 8 GB of RAM.
@staticfloat I'm now thinking that memory limiting should be out of scope here. It looks like Pkg.jl will be controlling the number of threads for precompile, and therefore Pkg should be responsible for any memory limiting. The other case where this is useful is when building system images, in which case I don't think it's unreasonable to ask people to set the environment variable if that's an issue. What do you think about this?
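For example, a driver like Pkg could in principle cap image threads per precompile worker through the environment, along these lines (a hypothetical sketch, not Pkg.jl's actual implementation; `pkg`, `project`, and `njobs` are placeholder names):

```julia
# Hypothetical sketch: give each of `njobs` concurrent precompile workers a slice
# of the machine, mirroring the "half the system threads" policy mentioned above.
function precompile_worker_cmd(pkg::String, project::String, njobs::Int)
    per_job = max(1, Sys.CPU_THREADS ÷ (2 * njobs))
    # Importing the package in a fresh process triggers its precompilation if stale.
    cmd = `$(Base.julia_cmd()) --project=$project -e "import $pkg"`
    return addenv(cmd, "JULIA_IMAGE_THREADS" => string(per_job))
end

# run(precompile_worker_cmd("Plots", ".", 4))
```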
@nanosoldier
Your package evaluation job has completed - possible new issues were detected.
Are the pkgeval runs equivalent here apart from the commit? Or is one run under rr, etc.?
The last failures were caused by unfortunate timing: a linked change between master and JET.jl went through in the last 3 days that broke packages dependent on JET. Rebasing eliminated that error, but I want to make sure nothing else has changed since then.
@nanosoldier
If this comes back without obvious related failures I think we can call this clean from a testing perspective.
Your package evaluation job has completed - possible new issues were detected.
Tests of Scalpels.jl basically fail because two different environments are used in the two runs of the tests, and only the one for this PR has an old buggy version of Polynomials.jl. @maleadt Using different environments makes apples-to-apples comparisons hard and can add needless noise like in this case; is there a way to avoid that?
Both tests use Polynomials 2.0.25 during package installation, and v1.2.1 during testing; I don't see a difference between the two runs? FWIW, PkgEval.jl checks out and fixes the registry during set-up, so there aren't multiple environments at play.
Uhm, ok, I got confused comparing the environments of the two runs and didn't realise the installation and test environments were different. Scalpels.jl sets needlessly restrictive compat bounds for the test environment, so I opened a PR to remove that: RvSpectML/Scalpels.jl#4. It's still unclear why the warning which causes the test failure is triggered only in one case, but that had been fixed in Polynomials.jl by JuliaMath/Polynomials.jl#442.
This looks good from my side.
LGTM!
Woohoo! Thanks @pchintalapudi, looking forward to seeing this in action.
From 454 seconds to 67 seconds, in one fell swoop, you've reduced overall build CI times by 30%. (The assert build is even more impressive, from 700 seconds to 101 seconds.) The buildbots salute you, sir.
@pchintalapudi, I noticed my M1 Pro only seems to be using 4 cores for generating the sysimage, but the M1 Pro has 8 (or 6) high-performance cores, and manually bumping it using the env variable improves performance (42s -> 33s for the sysimage). Of course I can just set that env variable, but maybe the defaults could be tweaked a bit?
I'm hesitant to generally increase the number of cores allocated, as users with low-power systems will probably want to be doing something else with the rest of their compute while waiting for image compilation. Also, keeping the core count at ~half means we are much less likely to see oversubscription in Pkg.jl runs, since even if two packages overlap we will not have oversubscribed the system. That being said, for the system image or PackageCompiler in particular I think we can just set JULIA_IMAGE_THREADS to Sys.CPU_THREADS, since we don't expect underpowered users to be running those commands anyway.
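For a one-off sysimage or PackageCompiler build, that suggestion would look roughly like the following (a hedged example: the package list and output path are placeholders, and it assumes PackageCompiler's `create_sysimage` API; the env var must be set before the build subprocess is spawned):

```julia
# Use all hardware threads for LLVM image generation in this one-off build.
ENV["JULIA_IMAGE_THREADS"] = string(Sys.CPU_THREADS)

using PackageCompiler
create_sysimage(["Example"]; sysimage_path = "sys_example.so")
```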
for (auto &F : M.functions()) {
    if (!F.isDeclaration()) {
        if (!Preserve.contains(&F)) {
            F.deleteBody();
I am not sure this results in legal IR, since there may still be attributes attached to the module for the declaration (such as DISubprogram and llvm.dbg.cu) which should not be added there. We used to have a function (prepare_call) for getting a proper declaration for a definition, but I think we removed that at some point. I thought maybe OrcJIT would have it, but it doesn't look like it includes support for handling debug information correctly either: https://llvm.org/doxygen/CompileOnDemandLayer_8cpp_source.html
Can we just drop all metadata on the function? Since we're still attaching metadata to the real definition, the linker might save us, right?
We probably don't need the metadata, though note this is module metadata, not metadata on the function.
Module metadata seems more complicated since we're not allowed to drop module flags. However, we could go back to using CloneModule either before or after serialization, at a time cost. Presumably CloneModule should handle the debug info correctly.
We emitted them into separate modules to avoid this issue. Can we keep them separate?
It requires a bit more care to keep the modules separate since GPUCompiler also uses this codepath and probably assumes a single module, which I'm more hesitant to mess with.
For DISubprogram, the LangRef says it's legal to have it on a declaration for callsite debug info, so I think it's valid to leave it there despite deleting the function body. I suspect the same reasoning applies to global variable declarations, though the LangRef does not mention that, but stripping them also shouldn't affect correctness. llvm.dbg.cu doesn't have much documentation, but I don't see anything obvious that LLVM doesn't already do (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Utils/CloneFunction.cpp#L280 copies all the compile units, irrespective of whether they're related to the function that was just cloned), so I'm not really seeing anything obviously wrong here.
It requires a bit more care to keep the modules separate since GPUCompiler also uses this codepath and probably assumes a single module, which I'm more hesitant to mess with.
We could merge it in GPUCompiler.jl; generally I would be in favor of exposing the work list to GPUC, since we might want to implement caching differently for Enzyme vs CUDA.
By chopping up the LLVM module into N smaller modules, and doing some of the multiversioning work upfront, we can achieve large speedups from throwing multiple threads at sysimage building. Should probably wait for #47184 to avoid large merge conflicts.
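At a very high level, the parallel pattern is roughly the following (a Julia-flavored sketch only; the real implementation lives in the C++ AOT compiler, and `optimize_and_emit!` and `submodules` are placeholder names):

```julia
# Sketch of the overall pattern: split the image into N submodules and let each
# thread optimize and emit its own piece independently.
function emit_image_parallel(submodules::Vector, optimize_and_emit!)
    results = Vector{Any}(undef, length(submodules))
    Threads.@threads for i in eachindex(submodules)
        results[i] = optimize_and_emit!(submodules[i])
    end
    return results   # per-piece outputs to be linked into the final image
end
```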
@vchuravy is there a good test for these changes on a package image workload?
TODO:
- [ ] figure out why target uses of relocation slots are not being picked up in the pre-optimization scanner