Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement function multi versioning in sysimg #21849

Merged
merged 9 commits into from
Oct 18, 2017
Merged

Conversation

yuyichao
Copy link
Contributor

@yuyichao yuyichao commented May 13, 2017

Current TODO:

  • Implement more flexible cloning check

  • Handle "PLT" cloning

  • Generate dispatch data

  • Update sysimg loading

  • Fix code_native

  • ? Figure out what's wrong with AVX on x86.


Original post

This is a WIP PR implementing function multi versioning in sysimg. It's still WIP but I'd like to get some feedback.

Current limitation

  • x86 only.

    AArch64 support is easy to add though harder to test.
    ARM with new enough glibc can also support this.

  • AVX2 only.

    Because this is what I care about the most. The pass needs to be improved to handle multiple versions instead of only two.

  • Improve cloning heuristics.

    This would be needed for KNL, which might need cloning the whole code section. It shouldn't be that bad. Current heuristic clones < 10% of the functions. The code section is still much smaller than the rodata section FWIW....

  • User interface

    -C option should probably be extended to support this. We might want to parse it ourselves instead of letting LLVM do it since we'll need to emit feature detection code and also sort them.
    This should also replace/extend the current cpuid logic in dump.c

  • Make JIT code use native features too.

What works.

  • Sysimg generation finishes without crash

  • With JULIA_CPU_TARGET=x86-64 the sysimg correctly generates code for avx2 + fma and gets loaded correctly.

  • code_native works automatically 🎉

  • No dispatch overhead between cloned functions. (i.e. a direct call is generated from test_clone_g to test_clone_f below. These two functions will obviously be removed in the final version)

@tkelman
Copy link
Contributor

tkelman commented May 13, 2017

I agree with your choice of AVX2 giving the most benefit as a cutoff, if implementing 2 choices of instruction sets is much simpler to do than 3 or more. Haswell was introduced in 2013 (though are there any low-end Haswell or newer processors that have AVX2 disabled?). For Sandy Bridge (2011) and Ivy Bridge (2012) it might be useful to have AVX, but maybe not worth it for now. How does LLVM handle the operating system software needing to be new enough for AVX? Does it work correctly if the system the binaries get built on is very old?

@yuyichao
Copy link
Contributor Author

Implementing more than 2 is not hard. Just want to get the necessary LLVM pieces together first...

I don't think there's anything too much needed from LLVM, though we might want to disable avx/avx2 for old LLVM versions (if we don't drop them first) since we currently do that for some llvm versions

HostFeatures["avx2"] = false;

The multiversioning implemented here is purely an optimization and doesn't change external interface. (That'll be future work.) So there shouldn't be any requirement on system libs. LLVM seems to be pretty good at avoiding AVX - SSE transition panelty so calling SSE library from AVX code shouldn't cause performance issue, though it's another story on KNL, which is why KNL support will need to clone almost every functions.......

As for the default and binary release. I think the binary release can use sse2(default),avx,avx2+fma on x86_64. Not sure if we need to change default build. We can also include avx512 and knl if people are fine with the additional code size and if we actually have users.

@tkelman
Copy link
Contributor

tkelman commented May 13, 2017

There are no avx512 kernels in openblas yet (OpenMathLib/OpenBLAS#991) so I suspect most people on KNL are going to want source builds anyway. The effect on code size will be interesting to measure.

closes #14995 and could maybe replace some of #19486 ?

@yuyichao
Copy link
Contributor Author

no avx512 kernels in openblas

K.... I guess when they have that it'll be a good time to revisite this. There'll probably also be Xeon's with AVX512 at that point and it'll probably make sense to enable avx512 support without KNL support (the two have very different cost model, avx512-skylake will likely not need cloning all functions).

closes #14995 and could maybe replace some of #19486 ?

Yeah, being able to replace them is essentially the clean up work.

@Keno
Copy link
Member

Keno commented May 13, 2017

K.... I guess when they have that it'll be a good time to revisite this. There'll probably also be Xeon's with AVX512 at that point and it'll probably make sense to enable avx512 support without KNL support (the two have very different cost model, avx512-skylake will likely not need cloning all functions).

You can get one on Google Cloud right now.

@yuyichao yuyichao force-pushed the yyc/codegen/clone branch 2 times, most recently from 5aa6111 to a0c2543 Compare May 13, 2017 20:50
#include "julia.h"
#include "julia_internal.h"

extern "C" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

am I missing something, or why have the file be .cpp instead of .c ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There'll be string manipulation functions here for parsing the command line args, parsing /proc/cpuinfo, generating LLVM feature string and I don't really want to do that in C (especially not for the output part). Basically I want to put the processor detection logic together instead of in 3+ files. This include the codegen initialization code at

static inline std::string getNativeTarget()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah okay. there won't be a C or C++ API somewhere in llvm or libuv for whatever you plan on using /proc/cpuinfo for?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LLVM does have it but it's not exposing all the details (e.g. I find no way to access what features is enabled by a cpu, which may or may not be needed). More importantly we should be able to load the multi-versioning sysimg with LLVM disabled (or we need a antiprocessor.c............)

OpenBLAS also has it, though obviously not exposed in a useful way.

I'm not aware of such code in libuv.....

@yuyichao yuyichao force-pushed the yyc/codegen/clone branch 2 times, most recently from 0aefb4e to 75c5056 Compare May 13, 2017 22:36
base/sysimg.jl Outdated
return s
end

test_clone_g(Float64[], 1)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe hide these in module Test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be gone in the final version. It's just here for testing. It's pretty hard to actually come up with a useful test case that handles all machines the test can run on....

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very true. Another nice one to add support for with this infrastructure is @llvm.fma.f64 (e.g. move the

# Disable LLVM's fma if it is incorrect, e.g. because LLVM falls back
hack into the compiler)

src/llvm-mv.cpp Outdated
// This file is a part of Julia. License is MIT: https://julialang.org/license

// Function multi-versioning
#define DEBUG_TYPE "julia_mv"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's actually call this file multiversioning or ifunc, mv is too cryptic

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ifunc is pretty jargony, +1 for multiversioning

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was expecting comments on this. Yeah, this will be renamed... mv was just simple to type....


// Cache of information recovered from jl_cpuid.
// In a multithreaded environment, there will be races on subnormal_flags,
// but they are harmless idempotent races. If we ever embrace C11, then
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

c++11, and don't we already require that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We do. And I've already replaced this with a global variable with dynamic initialization on my local version.

src/dump.c Outdated
if (nfunc && fptrs && fidxs) {
for (size_t i = 0; i < nfunc; i++) {
size_t fi = fidxs[i];
sysimg_fvars[fi] = fptrs[i];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would rather leave this as a shared, read-only table and instead emit N tables (e.g. base table + (sparse?) overlay tables for each architecture). Otherwise we leak memory here, since we'll never want to reference it again after finishing loading. dispatchf seems still useful, to return the index of the appropriate overlay table to use (or, in the future, perhaps an ordered list of overlay tables).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't fully decide on this yet but in any case I'll probably make the big table read only.
I'll probably make a separate relocation table that's mutable and use that for dispatch instead and just add another loop in debuginfo.cpp. The memory usage can be further reduced using double indirection during the call but since how little the memory overhead actually is I prefer to keep it using only one indirection.

@yuyichao yuyichao force-pushed the yyc/codegen/clone branch 2 times, most recently from 4839391 to 693fd81 Compare May 28, 2017 22:41
src/dump.c Outdated
// TODO make sure the sysimg and the JIT agrees on the ABI.
// This shouldn't be a problem for any required C types on any platforms we support
// but could be a problem from optional types. In particular, we need to make sure
// the two agrees on the usuable register sizes so that functions that takes
Copy link
Contributor

@tkelman tkelman May 28, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

usable

functions that take a vector

@yuyichao yuyichao force-pushed the yyc/codegen/clone branch 12 times, most recently from 02f5024 to ae49faf Compare June 1, 2017 04:04
@@ -61,15 +61,6 @@ let exename = `$(Base.julia_cmd()) --precompiled=yes --startup-file=no`
@test !success(`$exename -L`)
@test !success(`$exename --load`)

# --cpu-target
# NOTE: this test only holds true if image_file is a shared library.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't we want to run this test still?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It won't make sense anymore. This was testing that a cpu target different from the one used to compile sysimg will throw an error but it will no longer be the case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't an invalid setting for that option still error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. It could be an CPU name that we don't recognize but llvm does so it shouldn't be blocked at this level and should be passed to LLVM. Unfortunately LLVM does not throw an error in that case and I'm not aware of any method to ask LLVM if it is a supported target name at runtime.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not aware of any method to ask LLVM if it is a supported target name at runtime

You mean jl_TargetMachine->getMCSubtargetInfo()->isCPUStringValid() 😜

(added in https://llvm.org/svn/llvm-project/llvm/trunk@223147)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't seem to be a way to do that without looking up the CPU first?

@yuyichao
Copy link
Contributor Author

yuyichao commented Jun 1, 2017

Just a status update. This is still largely WIP but a few boring part are done.

  • X86 arch and feature detection is complete. This can detect as many features as LLVM-svn (sgx is disabled for now since it's hard to do and AFAICT LLVM doesn't do it correctly either)
  • ARM and AArch64 feature bits are done. CPU model detection have also gathered enough data. Thanks to people who provide CPU info of their phones and especially @staticfloat who provides one (or two?) kyro CPU id that isn't found elsewhere.
  • The crc32c change is a test for the new internal feature test API. It'll also be good to implement similar optimization for aarch64 especially since every aarch64 core I've seen so far supports crc32c.
  • CI runs are to make sure that the new arch/cloning is working on x86. The only issue left there seems to be the segfault on linux 32bit. I am able to reproduce it locally on llvm 3.9.1 but not 4.0. Maybe I'll just disable avx on 32bit linux for llvm < 4.0. It's unlikely going to matter much anyway....

@yuyichao
Copy link
Contributor Author

yuyichao commented Jun 3, 2017

So the 32bit linux segfault is really funny. LLVM 3.9 didn't realize that xmm8 is not available in 32bit mode so it wanted to emit

vcvtsi2sdl      %ecx, %xmm8, %xmm3
vmovsd  %xmm2, -8(%esi)
vmovsd  %xmm3, (%esi)

which it illegally encoded as

c5 bb 2a d9     vcvtsi2sdl      %ecx, %xmm8, %xmm3
c5 fb 11 56 f8  vmovsd  %xmm2, -8(%esi)
c5 fb 11 1e     vmovsd  %xmm3, (%esi)

However, the correct decoding of that is actually

c5 bb 2a d9 c5 fb       ldsl    -70919894(%ebx), %edi
11 56 f8        adcl    %edx, -8(%esi)
c5 fb 11 1e     vmovsd  %xmm3, (%esi)

Which modifies the data segment register and causes the next instruction (as well as almost any other memory ops) to segfault (since it has a, well, faulty segment....)

Seems to be fixed in 4.0 so disabling avx on LLVM < 4.0 seems to be the correct solution.....

@yuyichao
Copy link
Contributor Author

yuyichao commented Jun 3, 2017

And for the reference the simplified code that I used to reproduce this is

A = reshape(1.0:4.0, 2, 2)
B = Array{Complex{Float64}}(2, 2)
B .= 1 + 2im

@noinline f() = error()
function _broadcast2!(B::AbstractArray, A, A2, iter1, iter2)
    for I2 in iter2
        for I1 in iter1
            s1, s2 = A.dims
            (((1 <= I1) & (I1 <= s1)) && ((1 <= I2) & (I2 <= s2))) || f()
            ele = A.parent[I1 + (I2 - 1) * 2]
            bele = ele + A2
            B[I1, I2] = bele
        end
    end
end
_broadcast2!(B, A, Complex(false, false), 1:2, 1:2)

It's not particularly reliable since it likely strongly depend on register pressure.

const char *start = ext_features.c_str();
for (const char *p = start; *p; p++) {
if (*p == ',' || *p == '\0') {
features.emplace_back(start, p - start);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*p can't == '\0'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a bug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

auto dir = opendir("/sys/devices/system/cpu");
if (!dir)
return;
while (auto entry = readdir(dir)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use readdir_r?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deprecated.

if (maxidx >= 0) {
list.erase(std::remove_if(list.begin(), list.end(), [&] (std::pair<uint32_t,CPUID> &ele) {
int idx = find(ele.first);
return idx != -1 && idx < maxidx;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why not just write idx != maxidx?

insert_before);
}
else {
jl_safe_printf("Unknown const use.");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's very bad to let the wrong version running.

worse than calling abort (instead of, say, a warning)?

vtjnash added a commit that referenced this pull request Oct 12, 2017
ensures that arguments will be passed through a posix shell correctly,
(including those such as being added in #21849 for -C)

fix #10120
@vtjnash
Copy link
Member

vtjnash commented Oct 12, 2017

Alright, one review comment open (in processor.cpp) and I think this is ready to merge.

@yuyichao
Copy link
Contributor Author

I've also got some local updates that tweaks the x86 handling a little. Will push later.

Handling of directly exported function is still missing, which should just mean that use of jl_extern_c can't have multiversioning. A fix with slight semantic change should be possible (essentially emit a trampoline in this pass, not in codegen directly since this pass might want to clone that function too....) That should be fairly independent of other changes can be fixed later.

…ti-versioning and

improved cpu selection/detection.

* Remove cpuid specific binary

    To be replaced by function multi-versioning

* Remove old CPU target handling in sysimg

    The feature check is not complete since X86 features bits
    are also in many other CPUID results, including ones that
    we are actually interested in like AVX2.

* Add invalid CPU name and help handling in codegen initialization.

    Update `--cpu-target` test
    since target mismatch will not raise an error anymore.
    CPU names unknown to LLVM would.
Use C++ instead of C since there will be a lot of string handling and indirect interaction
with LLVM in this file. The meta-programing capability is also pretty useful.
Also implement the API needed by other part of the runtime,
especially LLVM JIT, cloning pass and sysimg initialization.

Most of the detection (both name and features) code is copied from LLVM.
However, it's very hard for us to use LLVM because there are informations we need that
is not provided by LLVM (some of which is non-trivial to expose in a target independent way),
including,

1. Feature dependencies
2. Features related to a CPU name
3. Feature name validation (needed when we expose this to julia code)
4. Effect on ABI (vector register size)
5. Serialization format
6. Cloning heuristic

Additionally, the detection code itself is only a small part of the code for each arch
and we need to support multiple LLVM versions so copying the LLVM code shouldn't cause too
much problem by itself.
@yuyichao yuyichao force-pushed the yyc/codegen/clone branch 2 times, most recently from 28a215d to 8d0b26b Compare October 13, 2017 22:29
Also implement internal runtime API.

The detection code avoid using `/proc/cpuinfo` whenever possible and should be much more
reliable than the one in LLVM. It also contains a much larger CPUID table to decode CPU names.

Compare to X86, the feature encoding/decoding is more complex due to the way LLVM takes
attributes. Certain information (arch version) also needs to be moved between name
and feature list.
Fallback to LLVM for CPU name and feature detection.
Still allow cloning but is restricted to exact name matching dispatch.
Remove ARM specific workaround and do not assume the CPU name contains all
information about the code.
This should now allow disassemble of code that's not required by the base ISA
or even not supported on the current CPU.
This is especially important on ARM, which does not disassemble many armv7 instructions previously.
This is currently ignoring the sysimg data.
Will be fixed when we have the actual cloning pass.
* Implementing function cloning pass.
* Hook up debug info lookup and sysimg loading to the processor initialization API.
* Enable test function multiversioning on the CI

  We can't do too much cloning on the CI before hitting the timeout or memory limit...
  Also avoid turning on cloning on circle CI since we seem to be very close to the memory limit.

* Add devdoc
@yuyichao yuyichao merged commit 7e1dd6d into master Oct 18, 2017
@yuyichao yuyichao deleted the yyc/codegen/clone branch October 18, 2017 00:34
@vtjnash
Copy link
Member

vtjnash commented Nov 6, 2017

This broke the linux32 buildog tests: https://buildog.julialang.org/#/builders/41/builds/239

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler:codegen Generation of LLVM IR and native code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants