-
Notifications
You must be signed in to change notification settings - Fork 12.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use thinlto and pgo for x86_64 windows release packaging #71067
Conversation
+cc @pbo-linaro This has been on the todo list a long time. Why not x86 also? We can't do bootstrapping thinlto because the linker will run out of memory. I think it's okay to focus on x86_64 for performance. Maybe we can even stop building the 32-bit binaries eventually. |
Why stop with LTO+PGO? BOLTing Clang/LLD should yield another sizeable improvement. Added my recent BOLT stats for Clang/LLD-18 (on Linux): |
That would be a nice addition @zmodem. Regarding the external source file used for profiling ( Finally, aarch64 build could benefit from that too, but that can be done later. |
My understanding was the BOLT for PE/COFF binaries isn't production ready.
Thinking about this some more, using I've pushed a new commit to train by building Clang's Sema.cpp instead. It requires running CMake one more time, but I think that's okay. The performance seems to be the same. (To answer your questions anyway, the origin of pgo_training-1.ii is documented here: https://source.chromium.org/chromium/chromium/src/+/main:tools/clang/scripts/build.py;l=1044 The story of the -2 version is here: https://bugs.chromium.org/p/chromium/issues/detail?id=984067)
Build times on my workstation:
Performance, using the clang produced above to do a release build of clang (metrics are best-of-two):
I was surprised that PGO helped so much by itself. Given that ThinLTO builds are so much slower to produce, I'm tempted to say we should just go with PGO for now. What do you think?
Yes, and hopefully we can refactor some of the code to be shared at that point. |
@maksfb or @aaupov might have some insights to share about PE/COEFF binary support of BOLT or might be interested in a BOLT-optimized Windows release package. |
Thanks for taking the time to check the different configurations. Using LLVM source code for profiling data is a good solution IMHO. The price for ThinLTO is surprisingly high (in memory usage too I guess), and compared to PGO alone, the added value is pretty low in this case. Considering the numbers, I agree with you to only do PGO, as it's almost "free". If we go with the PGO only version, it could be nice to directly factorize this for all architecture ( |
There's no BOLT support for PE/COFF, nor any plans on our end. |
Part of the high cost is that it applies not just to clang, lld, etc., but all the binaries, including the ones only used for testing.
Agreed, but can we do it in a follow-up when applying this to more targets? (For example, the PGO part should work for x86 too.) |
Oh yes, I missed that. So memory usage might be more a problem when linking (with thinLTO) all binaries. Better to not enable this for now. No problem to do it in another PR. Indeed, all the architectures (x86, x64, arm64) could benefit from PGO. |
A quick side question regarding PGO (while listening in on this topic); how portable is the profile information across builds? If I e.g. build Clang on Linux, then do some profiling with that, I can use that to do a new Clang build for Linux with PGO. But can I use the same profile information for PGO when I, still on Linux, cross compile Clang to run on Windows? |
Apologies for the late reply; I got distracted by other things.
It can work across targets, but e.g. if the function names from the profile and the current build don't match because of mangling, those functions will not benefit from the profile info. |
I extracted the profile generation into a separate "function". It doesn't work on 32-bit x86 because the linker runs out of memory, but it would be interesting to hear whether it works for arm64! |
Ah, thanks! That makes sense. So between Linux and MinGW x86_64, arm and aarch64, it should mostly match. (I guess that the exception is for cases where types are defined differently, and those types are visible in function signatures.) Cases where functions take C++ standard library types probably won't match. Linux and MSVC builds probably wouldn't match at all as the C++ symbol name mangling is different. One interesting case is where the mangling mostly matches but one case has an extra symbol prefix while the other one doesn't - like Windows/i386, and macOS on all architectures. It would be neat if it would be possible to apply a transformation on the profile information, to add/remove such a prefix, to make it match the other target better. CC @ZequanWu - how feasible would this be? |
I used the new version of the script to package the 17.0.5 binaries, so if someone wants to try the PGO-built toolchain, it's here: https://github.com/llvm/llvm-project/releases/download/llvmorg-17.0.5/LLVM-17.0.5-win64.exe |
IIUC, you want to remove such mangled name prefix when using profile from x64 but targeting to windows i386. I'm not sure about this. @xur-llvm might know this. BTW, what's the extra symbol prefix? Is it prefixed to all symbol on Windows i386? |
My primary usecase would be to profile on x86_64 Linux, and apply this profile when I cross-build binaries for Windows i386. (The symbol names should mostly match without other tricks when cross building binaries for Windows for other architectures.) The same mechanism would help if profiling on x86_64 Windows and using the profile for a build for Windows i386 as well, or profiling on Linux but using the profile on macOS, or vice versa as well.
Almost, yes. On macOS, essentially all extern symbols have an extra underscore at the start. On Windows i386, all cdecl symbols have an extra underscore at the start as well. MSVC C++ ABI symbols don't have such an extra underscore on i386 though. But the Itanium ABI/name mangling used in MinGW should be essentially the same as on other Itanium platforms, modulo an extra leading underscore. And the same goes for Linux<->macOS. |
Applying this to 17.0.4 makes the toolchain 22% faster (as measured by building clang).
Applying this to 17.0.4 makes the installer 5% smaller (298 MB vs. 315 MB) and the toolchain 20% faster (as measured by building clang).