Implement function multi versioning in sysimg #21849

yuyichao · 2017-05-13T17:49:48Z

Current TODO:

Implement more flexible cloning check
Handle "PLT" cloning
Generate dispatch data
Update sysimg loading
Fix code_native
? Figure out what's wrong with AVX on x86.

Original post

This is a WIP PR implementing function multi versioning in sysimg. It's still WIP but I'd like to get some feedback.

Current limitation

x86 only.

AArch64 support is easy to add though harder to test.
ARM with new enough glibc can also support this.
AVX2 only.

Because this is what I care about the most. The pass needs to be improved to handle multiple versions instead of only two.
Improve cloning heuristics.

This would be needed for KNL, which might need cloning the whole code section. It shouldn't be that bad. Current heuristic clones < 10% of the functions. The code section is still much smaller than the rodata section FWIW....
User interface

-C option should probably be extended to support this. We might want to parse it ourselves instead of letting LLVM do it since we'll need to emit feature detection code and also sort them.
This should also replace/extend the current cpuid logic in dump.c
Make JIT code use native features too.

What works.

Sysimg generation finishes without crash
With JULIA_CPU_TARGET=x86-64 the sysimg correctly generates code for avx2 + fma and gets loaded correctly.
code_native works automatically 🎉
No dispatch overhead between cloned functions. (i.e. a direct call is generated from test_clone_g to test_clone_f below. These two functions will obviously be removed in the final version)

tkelman · 2017-05-13T18:13:48Z

I agree with your choice of AVX2 giving the most benefit as a cutoff, if implementing 2 choices of instruction sets is much simpler to do than 3 or more. Haswell was introduced in 2013 (though are there any low-end Haswell or newer processors that have AVX2 disabled?). For Sandy Bridge (2011) and Ivy Bridge (2012) it might be useful to have AVX, but maybe not worth it for now. How does LLVM handle the operating system software needing to be new enough for AVX? Does it work correctly if the system the binaries get built on is very old?

yuyichao · 2017-05-13T19:07:42Z

Implementing more than 2 is not hard. Just want to get the necessary LLVM pieces together first...

I don't think there's anything too much needed from LLVM, though we might want to disable avx/avx2 for old LLVM versions (if we don't drop them first) since we currently do that for some llvm versions

julia/src/codegen.cpp

Line 7180 in b1f668d

HostFeatures["avx2"] = false;

The multiversioning implemented here is purely an optimization and doesn't change external interface. (That'll be future work.) So there shouldn't be any requirement on system libs. LLVM seems to be pretty good at avoiding AVX - SSE transition panelty so calling SSE library from AVX code shouldn't cause performance issue, though it's another story on KNL, which is why KNL support will need to clone almost every functions.......

As for the default and binary release. I think the binary release can use sse2(default),avx,avx2+fma on x86_64. Not sure if we need to change default build. We can also include avx512 and knl if people are fine with the additional code size and if we actually have users.

tkelman · 2017-05-13T19:17:13Z

There are no avx512 kernels in openblas yet (OpenMathLib/OpenBLAS#991) so I suspect most people on KNL are going to want source builds anyway. The effect on code size will be interesting to measure.

closes #14995 and could maybe replace some of #19486 ?

yuyichao · 2017-05-13T19:27:14Z

no avx512 kernels in openblas

K.... I guess when they have that it'll be a good time to revisite this. There'll probably also be Xeon's with AVX512 at that point and it'll probably make sense to enable avx512 support without KNL support (the two have very different cost model, avx512-skylake will likely not need cloning all functions).

closes #14995 and could maybe replace some of #19486 ?

Yeah, being able to replace them is essentially the clean up work.

Keno · 2017-05-13T20:00:39Z

K.... I guess when they have that it'll be a good time to revisite this. There'll probably also be Xeon's with AVX512 at that point and it'll probably make sense to enable avx512 support without KNL support (the two have very different cost model, avx512-skylake will likely not need cloning all functions).

You can get one on Google Cloud right now.

tkelman · 2017-05-13T20:53:32Z

src/processor.cpp

+#include "julia.h"
+#include "julia_internal.h"
+
+extern "C" {


am I missing something, or why have the file be .cpp instead of .c ?

There'll be string manipulation functions here for parsing the command line args, parsing /proc/cpuinfo, generating LLVM feature string and I don't really want to do that in C (especially not for the output part). Basically I want to put the processor detection logic together instead of in 3+ files. This include the codegen initialization code at

julia/src/codegen.cpp

Line 7089 in 3d5685a

static inline std::string getNativeTarget()

ah okay. there won't be a C or C++ API somewhere in llvm or libuv for whatever you plan on using /proc/cpuinfo for?

LLVM does have it but it's not exposing all the details (e.g. I find no way to access what features is enabled by a cpu, which may or may not be needed). More importantly we should be able to load the multi-versioning sysimg with LLVM disabled (or we need a antiprocessor.c............)

OpenBLAS also has it, though obviously not exposed in a useful way.

I'm not aware of such code in libuv.....

vtjnash · 2017-05-15T16:29:32Z

base/sysimg.jl

+    return s
+end
+
+test_clone_g(Float64[], 1)


maybe hide these in module Test?

This will be gone in the final version. It's just here for testing. It's pretty hard to actually come up with a useful test case that handles all machines the test can run on....

very true. Another nice one to add support for with this infrastructure is @llvm.fma.f64 (e.g. move the

julia/base/floatfuncs.jl

Line 222 in a204bfd

# Disable LLVM's fma if it is incorrect, e.g. because LLVM falls back

hack into the compiler)

vtjnash · 2017-05-15T16:32:13Z

src/llvm-mv.cpp

+// This file is a part of Julia. License is MIT: https://julialang.org/license
+
+// Function multi-versioning
+#define DEBUG_TYPE "julia_mv"


let's actually call this file multiversioning or ifunc, mv is too cryptic

ifunc is pretty jargony, +1 for multiversioning

I was expecting comments on this. Yeah, this will be renamed... mv was just simple to type....

vtjnash · 2017-05-15T16:33:51Z

src/processor.cpp

+
+// Cache of information recovered from jl_cpuid.
+// In a multithreaded environment, there will be races on subnormal_flags,
+// but they are harmless idempotent races.  If we ever embrace C11, then


c++11, and don't we already require that?

We do. And I've already replaced this with a global variable with dynamic initialization on my local version.

vtjnash · 2017-05-15T16:43:50Z

src/dump.c

+            if (nfunc && fptrs && fidxs) {
+                for (size_t i = 0; i < nfunc; i++) {
+                    size_t fi = fidxs[i];
+                    sysimg_fvars[fi] = fptrs[i];


I think I would rather leave this as a shared, read-only table and instead emit N tables (e.g. base table + (sparse?) overlay tables for each architecture). Otherwise we leak memory here, since we'll never want to reference it again after finishing loading. dispatchf seems still useful, to return the index of the appropriate overlay table to use (or, in the future, perhaps an ordered list of overlay tables).

I haven't fully decide on this yet but in any case I'll probably make the big table read only.
I'll probably make a separate relocation table that's mutable and use that for dispatch instead and just add another loop in debuginfo.cpp. The memory usage can be further reduced using double indirection during the call but since how little the memory overhead actually is I prefer to keep it using only one indirection.

tkelman · 2017-05-28T23:06:32Z

src/dump.c

+    // TODO make sure the sysimg and the JIT agrees on the ABI.
+    // This shouldn't be a problem for any required C types on any platforms we support
+    // but could be a problem from optional types. In particular, we need to make sure
+    // the two agrees on the usuable register sizes so that functions that takes


usable

functions that take a vector

tkelman · 2017-06-01T05:13:01Z

test/cmdlineargs.jl

@@ -61,15 +61,6 @@ let exename = `$(Base.julia_cmd()) --precompiled=yes --startup-file=no`
    @test !success(`$exename -L`)
    @test !success(`$exename --load`)

-    # --cpu-target
-    # NOTE: this test only holds true if image_file is a shared library.


why don't we want to run this test still?

It won't make sense anymore. This was testing that a cpu target different from the one used to compile sysimg will throw an error but it will no longer be the case.

shouldn't an invalid setting for that option still error?

No. It could be an CPU name that we don't recognize but llvm does so it shouldn't be blocked at this level and should be passed to LLVM. Unfortunately LLVM does not throw an error in that case and I'm not aware of any method to ask LLVM if it is a supported target name at runtime.

I'm not aware of any method to ask LLVM if it is a supported target name at runtime

You mean jl_TargetMachine->getMCSubtargetInfo()->isCPUStringValid() 😜

(added in https://llvm.org/svn/llvm-project/llvm/trunk@223147)

Doesn't seem to be a way to do that without looking up the CPU first?

yuyichao · 2017-06-01T15:41:48Z

Just a status update. This is still largely WIP but a few boring part are done.

X86 arch and feature detection is complete. This can detect as many features as LLVM-svn (sgx is disabled for now since it's hard to do and AFAICT LLVM doesn't do it correctly either)
ARM and AArch64 feature bits are done. CPU model detection have also gathered enough data. Thanks to people who provide CPU info of their phones and especially @staticfloat who provides one (or two?) kyro CPU id that isn't found elsewhere.
The crc32c change is a test for the new internal feature test API. It'll also be good to implement similar optimization for aarch64 especially since every aarch64 core I've seen so far supports crc32c.
CI runs are to make sure that the new arch/cloning is working on x86. The only issue left there seems to be the segfault on linux 32bit. I am able to reproduce it locally on llvm 3.9.1 but not 4.0. Maybe I'll just disable avx on 32bit linux for llvm < 4.0. It's unlikely going to matter much anyway....

yuyichao · 2017-06-03T04:37:57Z

So the 32bit linux segfault is really funny. LLVM 3.9 didn't realize that xmm8 is not available in 32bit mode so it wanted to emit

vcvtsi2sdl      %ecx, %xmm8, %xmm3
vmovsd  %xmm2, -8(%esi)
vmovsd  %xmm3, (%esi)

which it illegally encoded as

c5 bb 2a d9     vcvtsi2sdl      %ecx, %xmm8, %xmm3
c5 fb 11 56 f8  vmovsd  %xmm2, -8(%esi)
c5 fb 11 1e     vmovsd  %xmm3, (%esi)

However, the correct decoding of that is actually

c5 bb 2a d9 c5 fb       ldsl    -70919894(%ebx), %edi
11 56 f8        adcl    %edx, -8(%esi)
c5 fb 11 1e     vmovsd  %xmm3, (%esi)

Which modifies the data segment register and causes the next instruction (as well as almost any other memory ops) to segfault (since it has a, well, faulty segment....)

Seems to be fixed in 4.0 so disabling avx on LLVM < 4.0 seems to be the correct solution.....

yuyichao · 2017-06-03T04:41:54Z

And for the reference the simplified code that I used to reproduce this is

A = reshape(1.0:4.0, 2, 2)
B = Array{Complex{Float64}}(2, 2)
B .= 1 + 2im

@noinline f() = error()
function _broadcast2!(B::AbstractArray, A, A2, iter1, iter2)
    for I2 in iter2
        for I1 in iter1
            s1, s2 = A.dims
            (((1 <= I1) & (I1 <= s1)) && ((1 <= I2) & (I2 <= s2))) || f()
            ele = A.parent[I1 + (I2 - 1) * 2]
            bele = ele + A2
            B[I1, I2] = bele
        end
    end
end
_broadcast2!(B, A, Complex(false, false), 1:2, 1:2)

It's not particularly reliable since it likely strongly depend on register pressure.

vtjnash · 2017-10-11T16:43:56Z

src/processor.cpp

+    const char *start = ext_features.c_str();
+    for (const char *p = start; *p; p++) {
+        if (*p == ',' || *p == '\0') {
+            features.emplace_back(start, p - start);


*p can't == '\0'

Yes, this is a bug.

vtjnash · 2017-10-11T21:14:22Z

src/processor_arm.cpp

+    auto dir = opendir("/sys/devices/system/cpu");
+    if (!dir)
+        return;
+    while (auto entry = readdir(dir)) {


use readdir_r?

Deprecated.

vtjnash · 2017-10-12T03:58:37Z

src/processor_arm.cpp

+    if (maxidx >= 0) {
+        list.erase(std::remove_if(list.begin(), list.end(), [&] (std::pair<uint32_t,CPUID> &ele) {
+                    int idx = find(ele.first);
+                    return idx != -1 && idx < maxidx;


But why not just write idx != maxidx?

vtjnash · 2017-10-12T04:00:58Z

src/llvm-multiversioning.cpp

+                                                insert_before);
+        }
+        else {
+            jl_safe_printf("Unknown const use.");


It's very bad to let the wrong version running.

worse than calling abort (instead of, say, a warning)?

ensures that arguments will be passed through a posix shell correctly, (including those such as being added in #21849 for -C) fix #10120

vtjnash · 2017-10-12T16:57:40Z

Alright, one review comment open (in processor.cpp) and I think this is ready to merge.

yuyichao · 2017-10-12T17:04:32Z

I've also got some local updates that tweaks the x86 handling a little. Will push later.

Handling of directly exported function is still missing, which should just mean that use of jl_extern_c can't have multiversioning. A fix with slight semantic change should be possible (essentially emit a trampoline in this pass, not in codegen directly since this pass might want to clone that function too....) That should be fairly independent of other changes can be fixed later.

…ti-versioning and improved cpu selection/detection. * Remove cpuid specific binary To be replaced by function multi-versioning * Remove old CPU target handling in sysimg The feature check is not complete since X86 features bits are also in many other CPUID results, including ones that we are actually interested in like AVX2. * Add invalid CPU name and help handling in codegen initialization. Update `--cpu-target` test since target mismatch will not raise an error anymore. CPU names unknown to LLVM would.

Use C++ instead of C since there will be a lot of string handling and indirect interaction with LLVM in this file. The meta-programing capability is also pretty useful.

Also implement the API needed by other part of the runtime, especially LLVM JIT, cloning pass and sysimg initialization. Most of the detection (both name and features) code is copied from LLVM. However, it's very hard for us to use LLVM because there are informations we need that is not provided by LLVM (some of which is non-trivial to expose in a target independent way), including, 1. Feature dependencies 2. Features related to a CPU name 3. Feature name validation (needed when we expose this to julia code) 4. Effect on ABI (vector register size) 5. Serialization format 6. Cloning heuristic Additionally, the detection code itself is only a small part of the code for each arch and we need to support multiple LLVM versions so copying the LLVM code shouldn't cause too much problem by itself.

Also implement internal runtime API. The detection code avoid using `/proc/cpuinfo` whenever possible and should be much more reliable than the one in LLVM. It also contains a much larger CPUID table to decode CPU names. Compare to X86, the feature encoding/decoding is more complex due to the way LLVM takes attributes. Certain information (arch version) also needs to be moved between name and feature list.

Fallback to LLVM for CPU name and feature detection. Still allow cloning but is restricted to exact name matching dispatch.

Remove ARM specific workaround and do not assume the CPU name contains all information about the code. This should now allow disassemble of code that's not required by the base ISA or even not supported on the current CPU. This is especially important on ARM, which does not disassemble many armv7 instructions previously.

This is currently ignoring the sysimg data. Will be fixed when we have the actual cloning pass.

* Implementing function cloning pass. * Hook up debug info lookup and sysimg loading to the processor initialization API.

* Enable test function multiversioning on the CI We can't do too much cloning on the CI before hitting the timeout or memory limit... Also avoid turning on cloning on circle CI since we seem to be very close to the memory limit. * Add devdoc

vtjnash · 2017-11-06T18:45:58Z

This broke the linux32 buildog tests: https://buildog.julialang.org/#/builders/41/builds/239

yuyichao force-pushed the yyc/codegen/clone branch 2 times, most recently from 5aa6111 to a0c2543 Compare May 13, 2017 20:50

tkelman reviewed May 13, 2017

View reviewed changes

yuyichao force-pushed the yyc/codegen/clone branch 2 times, most recently from 0aefb4e to 75c5056 Compare May 13, 2017 22:36

vtjnash reviewed May 15, 2017

View reviewed changes

yuyichao force-pushed the yyc/codegen/clone branch 2 times, most recently from 4839391 to 693fd81 Compare May 28, 2017 22:41

tkelman reviewed May 28, 2017

View reviewed changes

yuyichao force-pushed the yyc/codegen/clone branch 12 times, most recently from 02f5024 to ae49faf Compare June 1, 2017 04:04

tkelman reviewed Jun 1, 2017

View reviewed changes

vtjnash reviewed Oct 11, 2017

View reviewed changes

vtjnash reviewed Oct 12, 2017

View reviewed changes

vtjnash added a commit that referenced this pull request Oct 12, 2017

correct posixly shell quoting

ef47e60

ensures that arguments will be passed through a posix shell correctly, (including those such as being added in #21849 for -C) fix #10120

yuyichao added 3 commits October 13, 2017 16:18

Move CPU feature detection functions to processor.cpp

cc1d4de

Use C++ instead of C since there will be a lot of string handling and indirect interaction with LLVM in this file. The meta-programing capability is also pretty useful.

yuyichao force-pushed the yyc/codegen/clone branch 2 times, most recently from 28a215d to 8d0b26b Compare October 13, 2017 22:29

yuyichao added 6 commits October 13, 2017 18:30

Implement fallback CPU detection and dispatch API

f9885df

Fallback to LLVM for CPU name and feature detection. Still allow cloning but is restricted to exact name matching dispatch.

Use processor.cpp API to initialize LLVM JIT

229864e

This is currently ignoring the sysimg data. Will be fixed when we have the actual cloning pass.

Implement function multi versioning in sysimg

6ebdd79

* Implementing function cloning pass. * Hook up debug info lookup and sysimg loading to the processor initialization API.

Finish up function multiversioning support

ded8d46

* Enable test function multiversioning on the CI We can't do too much cloning on the CI before hitting the timeout or memory limit... Also avoid turning on cloning on circle CI since we seem to be very close to the memory limit. * Add devdoc

yuyichao force-pushed the yyc/codegen/clone branch from 8d0b26b to ded8d46 Compare October 13, 2017 22:31

yuyichao merged commit 7e1dd6d into master Oct 18, 2017

yuyichao deleted the yyc/codegen/clone branch October 18, 2017 00:34

yuyichao mentioned this pull request Nov 17, 2017

segfault on Libdl.dllist() on 32-bit Linux #24643

Closed

Sacha0 mentioned this pull request Nov 18, 2017

Option to rebuild system image at installation #24646

Closed

yuyichao mentioned this pull request Nov 21, 2017

rr 5.0.0 can no longer trace Julia #24693

Closed

vchuravy mentioned this pull request Feb 28, 2018

Add cpu feature detection support for PPC #26252

Open

JeffBezanson mentioned this pull request Apr 23, 2018

output of -C help should be more helpful #26886

Open

vchuravy mentioned this pull request May 17, 2018

Unable to find compatible target in system image. when using valgrind #27131

Closed

yuyichao mentioned this pull request Jun 5, 2018

Raise an error for unsupported MARCH or JULIA_CPU_TARGET #27439

Closed

ararslan mentioned this pull request Aug 17, 2018

Updates for Julia 1.0 invenia/Memento.jl#101

Merged

JeffBezanson removed this from the 1.1 milestone Sep 27, 2018

yuyichao mentioned this pull request Sep 11, 2020

Multiversioning: support for aliases (from @ccallable) #37530

Merged

Implement function multi versioning in sysimg #21849

Implement function multi versioning in sysimg #21849

Conversation

yuyichao commented May 13, 2017 • edited Loading

tkelman commented May 13, 2017

yuyichao commented May 13, 2017

tkelman commented May 13, 2017

yuyichao commented May 13, 2017

Keno commented May 13, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tkelman May 28, 2017 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yuyichao commented Jun 1, 2017

yuyichao commented Jun 3, 2017

yuyichao commented Jun 3, 2017

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vtjnash commented Oct 12, 2017

yuyichao commented Oct 12, 2017

vtjnash commented Nov 6, 2017

yuyichao commented May 13, 2017 •

edited

Loading

tkelman May 28, 2017 •

edited

Loading