-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement function multi versioning in sysimg #21849
Conversation
I agree with your choice of AVX2 giving the most benefit as a cutoff, if implementing 2 choices of instruction sets is much simpler to do than 3 or more. Haswell was introduced in 2013 (though are there any low-end Haswell or newer processors that have AVX2 disabled?). For Sandy Bridge (2011) and Ivy Bridge (2012) it might be useful to have AVX, but maybe not worth it for now. How does LLVM handle the operating system software needing to be new enough for AVX? Does it work correctly if the system the binaries get built on is very old? |
Implementing more than 2 is not hard. Just want to get the necessary LLVM pieces together first... I don't think there's anything too much needed from LLVM, though we might want to disable avx/avx2 for old LLVM versions (if we don't drop them first) since we currently do that for some llvm versions Line 7180 in b1f668d
The multiversioning implemented here is purely an optimization and doesn't change external interface. (That'll be future work.) So there shouldn't be any requirement on system libs. LLVM seems to be pretty good at avoiding AVX - SSE transition panelty so calling SSE library from AVX code shouldn't cause performance issue, though it's another story on KNL, which is why KNL support will need to clone almost every functions....... As for the default and binary release. I think the binary release can use sse2(default),avx,avx2+fma on x86_64. Not sure if we need to change default build. We can also include avx512 and knl if people are fine with the additional code size and if we actually have users. |
There are no avx512 kernels in openblas yet (OpenMathLib/OpenBLAS#991) so I suspect most people on KNL are going to want source builds anyway. The effect on code size will be interesting to measure. |
K.... I guess when they have that it'll be a good time to revisite this. There'll probably also be Xeon's with AVX512 at that point and it'll probably make sense to enable avx512 support without KNL support (the two have very different cost model, avx512-skylake will likely not need cloning all functions). Yeah, being able to replace them is essentially the clean up work. |
You can get one on Google Cloud right now. |
5aa6111
to
a0c2543
Compare
src/processor.cpp
Outdated
#include "julia.h" | ||
#include "julia_internal.h" | ||
|
||
extern "C" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
am I missing something, or why have the file be .cpp
instead of .c
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There'll be string manipulation functions here for parsing the command line args, parsing /proc/cpuinfo
, generating LLVM feature string and I don't really want to do that in C (especially not for the output part). Basically I want to put the processor detection logic together instead of in 3+ files. This include the codegen initialization code at
Line 7089 in 3d5685a
static inline std::string getNativeTarget() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah okay. there won't be a C or C++ API somewhere in llvm or libuv for whatever you plan on using /proc/cpuinfo
for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LLVM does have it but it's not exposing all the details (e.g. I find no way to access what features is enabled by a cpu, which may or may not be needed). More importantly we should be able to load the multi-versioning sysimg with LLVM disabled (or we need a antiprocessor.c
............)
OpenBLAS also has it, though obviously not exposed in a useful way.
I'm not aware of such code in libuv.....
0aefb4e
to
75c5056
Compare
base/sysimg.jl
Outdated
return s | ||
end | ||
|
||
test_clone_g(Float64[], 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe hide these in module Test
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be gone in the final version. It's just here for testing. It's pretty hard to actually come up with a useful test case that handles all machines the test can run on....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very true. Another nice one to add support for with this infrastructure is @llvm.fma.f64
(e.g. move the
Line 222 in a204bfd
# Disable LLVM's fma if it is incorrect, e.g. because LLVM falls back |
src/llvm-mv.cpp
Outdated
// This file is a part of Julia. License is MIT: https://julialang.org/license | ||
|
||
// Function multi-versioning | ||
#define DEBUG_TYPE "julia_mv" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's actually call this file multiversioning
or ifunc
, mv is too cryptic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ifunc is pretty jargony, +1 for multiversioning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was expecting comments on this. Yeah, this will be renamed... mv
was just simple to type....
src/processor.cpp
Outdated
|
||
// Cache of information recovered from jl_cpuid. | ||
// In a multithreaded environment, there will be races on subnormal_flags, | ||
// but they are harmless idempotent races. If we ever embrace C11, then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
c++11
, and don't we already require that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do. And I've already replaced this with a global variable with dynamic initialization on my local version.
src/dump.c
Outdated
if (nfunc && fptrs && fidxs) { | ||
for (size_t i = 0; i < nfunc; i++) { | ||
size_t fi = fidxs[i]; | ||
sysimg_fvars[fi] = fptrs[i]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would rather leave this as a shared, read-only table and instead emit N tables (e.g. base table + (sparse?) overlay tables for each architecture). Otherwise we leak memory here, since we'll never want to reference it again after finishing loading. dispatchf
seems still useful, to return the index of the appropriate overlay table to use (or, in the future, perhaps an ordered list of overlay tables).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't fully decide on this yet but in any case I'll probably make the big table read only.
I'll probably make a separate relocation table that's mutable and use that for dispatch instead and just add another loop in debuginfo.cpp. The memory usage can be further reduced using double indirection during the call but since how little the memory overhead actually is I prefer to keep it using only one indirection.
4839391
to
693fd81
Compare
src/dump.c
Outdated
// TODO make sure the sysimg and the JIT agrees on the ABI. | ||
// This shouldn't be a problem for any required C types on any platforms we support | ||
// but could be a problem from optional types. In particular, we need to make sure | ||
// the two agrees on the usuable register sizes so that functions that takes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
usable
functions that take a vector
02f5024
to
ae49faf
Compare
@@ -61,15 +61,6 @@ let exename = `$(Base.julia_cmd()) --precompiled=yes --startup-file=no` | |||
@test !success(`$exename -L`) | |||
@test !success(`$exename --load`) | |||
|
|||
# --cpu-target | |||
# NOTE: this test only holds true if image_file is a shared library. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why don't we want to run this test still?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It won't make sense anymore. This was testing that a cpu target different from the one used to compile sysimg will throw an error but it will no longer be the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't an invalid setting for that option still error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. It could be an CPU name that we don't recognize but llvm does so it shouldn't be blocked at this level and should be passed to LLVM. Unfortunately LLVM does not throw an error in that case and I'm not aware of any method to ask LLVM if it is a supported target name at runtime.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not aware of any method to ask LLVM if it is a supported target name at runtime
You mean jl_TargetMachine->getMCSubtargetInfo()->isCPUStringValid()
😜
(added in https://llvm.org/svn/llvm-project/llvm/trunk@223147)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't seem to be a way to do that without looking up the CPU first?
Just a status update. This is still largely WIP but a few boring part are done.
|
So the 32bit linux segfault is really funny. LLVM 3.9 didn't realize that xmm8 is not available in 32bit mode so it wanted to emit vcvtsi2sdl %ecx, %xmm8, %xmm3
vmovsd %xmm2, -8(%esi)
vmovsd %xmm3, (%esi) which it illegally encoded as c5 bb 2a d9 vcvtsi2sdl %ecx, %xmm8, %xmm3
c5 fb 11 56 f8 vmovsd %xmm2, -8(%esi)
c5 fb 11 1e vmovsd %xmm3, (%esi) However, the correct decoding of that is actually c5 bb 2a d9 c5 fb ldsl -70919894(%ebx), %edi
11 56 f8 adcl %edx, -8(%esi)
c5 fb 11 1e vmovsd %xmm3, (%esi) Which modifies the data segment register and causes the next instruction (as well as almost any other memory ops) to segfault (since it has a, well, faulty segment....) Seems to be fixed in 4.0 so disabling avx on LLVM < 4.0 seems to be the correct solution..... |
And for the reference the simplified code that I used to reproduce this is A = reshape(1.0:4.0, 2, 2)
B = Array{Complex{Float64}}(2, 2)
B .= 1 + 2im
@noinline f() = error()
function _broadcast2!(B::AbstractArray, A, A2, iter1, iter2)
for I2 in iter2
for I1 in iter1
s1, s2 = A.dims
(((1 <= I1) & (I1 <= s1)) && ((1 <= I2) & (I2 <= s2))) || f()
ele = A.parent[I1 + (I2 - 1) * 2]
bele = ele + A2
B[I1, I2] = bele
end
end
end
_broadcast2!(B, A, Complex(false, false), 1:2, 1:2) It's not particularly reliable since it likely strongly depend on register pressure. |
src/processor.cpp
Outdated
const char *start = ext_features.c_str(); | ||
for (const char *p = start; *p; p++) { | ||
if (*p == ',' || *p == '\0') { | ||
features.emplace_back(start, p - start); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
*p
can't == '\0'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is a bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
auto dir = opendir("/sys/devices/system/cpu"); | ||
if (!dir) | ||
return; | ||
while (auto entry = readdir(dir)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use readdir_r
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deprecated.
if (maxidx >= 0) { | ||
list.erase(std::remove_if(list.begin(), list.end(), [&] (std::pair<uint32_t,CPUID> &ele) { | ||
int idx = find(ele.first); | ||
return idx != -1 && idx < maxidx; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But why not just write idx != maxidx
?
insert_before); | ||
} | ||
else { | ||
jl_safe_printf("Unknown const use."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's very bad to let the wrong version running.
worse than calling abort
(instead of, say, a warning)?
Alright, one review comment open (in processor.cpp) and I think this is ready to merge. |
I've also got some local updates that tweaks the x86 handling a little. Will push later. Handling of directly exported function is still missing, which should just mean that use of |
…ti-versioning and improved cpu selection/detection. * Remove cpuid specific binary To be replaced by function multi-versioning * Remove old CPU target handling in sysimg The feature check is not complete since X86 features bits are also in many other CPUID results, including ones that we are actually interested in like AVX2. * Add invalid CPU name and help handling in codegen initialization. Update `--cpu-target` test since target mismatch will not raise an error anymore. CPU names unknown to LLVM would.
Use C++ instead of C since there will be a lot of string handling and indirect interaction with LLVM in this file. The meta-programing capability is also pretty useful.
Also implement the API needed by other part of the runtime, especially LLVM JIT, cloning pass and sysimg initialization. Most of the detection (both name and features) code is copied from LLVM. However, it's very hard for us to use LLVM because there are informations we need that is not provided by LLVM (some of which is non-trivial to expose in a target independent way), including, 1. Feature dependencies 2. Features related to a CPU name 3. Feature name validation (needed when we expose this to julia code) 4. Effect on ABI (vector register size) 5. Serialization format 6. Cloning heuristic Additionally, the detection code itself is only a small part of the code for each arch and we need to support multiple LLVM versions so copying the LLVM code shouldn't cause too much problem by itself.
28a215d
to
8d0b26b
Compare
Also implement internal runtime API. The detection code avoid using `/proc/cpuinfo` whenever possible and should be much more reliable than the one in LLVM. It also contains a much larger CPUID table to decode CPU names. Compare to X86, the feature encoding/decoding is more complex due to the way LLVM takes attributes. Certain information (arch version) also needs to be moved between name and feature list.
Fallback to LLVM for CPU name and feature detection. Still allow cloning but is restricted to exact name matching dispatch.
Remove ARM specific workaround and do not assume the CPU name contains all information about the code. This should now allow disassemble of code that's not required by the base ISA or even not supported on the current CPU. This is especially important on ARM, which does not disassemble many armv7 instructions previously.
This is currently ignoring the sysimg data. Will be fixed when we have the actual cloning pass.
* Implementing function cloning pass. * Hook up debug info lookup and sysimg loading to the processor initialization API.
* Enable test function multiversioning on the CI We can't do too much cloning on the CI before hitting the timeout or memory limit... Also avoid turning on cloning on circle CI since we seem to be very close to the memory limit. * Add devdoc
8d0b26b
to
ded8d46
Compare
This broke the linux32 buildog tests: https://buildog.julialang.org/#/builders/41/builds/239 |
Current TODO:
Implement more flexible cloning check
Handle "PLT" cloning
Generate dispatch data
Update sysimg loading
Fix
code_native
? Figure out what's wrong with AVX on x86.
Original post
This is a WIP PR implementing function multi versioning in sysimg. It's still WIP but I'd like to get some feedback.
Current limitation
x86 only.
AArch64 support is easy to add though harder to test.
ARM with new enough glibc can also support this.
AVX2 only.
Because this is what I care about the most. The pass needs to be improved to handle multiple versions instead of only two.
Improve cloning heuristics.
This would be needed for KNL, which might need cloning the whole code section. It shouldn't be that bad. Current heuristic clones < 10% of the functions. The code section is still much smaller than the rodata section FWIW....
User interface
-C
option should probably be extended to support this. We might want to parse it ourselves instead of letting LLVM do it since we'll need to emit feature detection code and also sort them.This should also replace/extend the current cpuid logic in
dump.c
Make JIT code use native features too.
What works.
Sysimg generation finishes without crash
With
JULIA_CPU_TARGET=x86-64
the sysimg correctly generates code for avx2 + fma and gets loaded correctly.code_native
works automatically 🎉No dispatch overhead between cloned functions. (i.e. a direct call is generated from
test_clone_g
totest_clone_f
below. These two functions will obviously be removed in the final version)