Multi-GPU support #5244
-
You should be able to use halide_mutex just fine (e.g. see scoped_mutex_lock.h); that's what the rest of the runtime does. I'm not sure what you mean by empty implementations? They're implemented in synchronization_common.h.
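As an illustration, here is a minimal sketch of how a runtime module typically guards shared state with this API. ScopedMutexLock is the RAII helper from runtime/scoped_mutex_lock.h; the state and function names below are hypothetical, and details may differ between Halide versions.

```cpp
// Minimal sketch (hypothetical names): guarding runtime state with halide_mutex,
// the same way other runtime modules do. ScopedMutexLock (runtime/scoped_mutex_lock.h)
// locks in its constructor and unlocks in its destructor.
namespace Halide { namespace Runtime { namespace Internal {

halide_mutex example_lock;   // a zero-initialized halide_mutex is a valid unlocked mutex
int example_counter = 0;     // hypothetical shared state

int example_increment(void *user_context) {
    ScopedMutexLock lock(&example_lock);  // released automatically at scope exit
    return ++example_counter;
}

}}}  // namespace Halide::Runtime::Internal
```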
-
To answer your second question: We individually compile the cpp files in src/runtime into fragments of LLVM IR. We bake these bitcode files as constant strings into libHalide (note that they're in the LLVM bitcode binary format, not the LLVM assembly format, so they're not human-readable at this point, but in principle it's the same thing). Then in LLVM_Runtime_Linker.cpp we use the Halide target string to select which of those components we want, deserialize them from the embedded strings, and combine them using the LLVM API into a single LLVM module for the given target string. We then compile that to machine code, either to use immediately in the same process (when JIT-compiling) or to output as a runtime static library. We might jam the generated code that runs the pipeline directly into the same LLVM Module (in AOT mode without the no_runtime flag), or we might compile the runtime by itself (in JIT mode or AOT with no_runtime).

One additional complexity is that in JIT mode we compile into a few different pieces of machine code instead of one big one, so that we can run slightly different target strings without making a whole new runtime. We put all the non-GPU stuff in one module, then put things like CUDA, OpenCL, etc. in their own separate modules, and just tell LLVM how to resolve symbols from each into the others.
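As a rough illustration of the deserialize-and-link step (not the actual code in LLVM_Runtime_Linker.cpp; the helper below and its inputs are hypothetical placeholders), the LLVM API usage looks something like this:

```cpp
// Sketch of combining embedded bitcode fragments into one llvm::Module, roughly
// what LLVM_Runtime_Linker.cpp does after selecting components for the target
// string. The function and its inputs are hypothetical.
#include <llvm/ADT/StringRef.h>
#include <llvm/Bitcode/BitcodeReader.h>
#include <llvm/IR/LLVMContext.h>
#include <llvm/IR/Module.h>
#include <llvm/Linker/Linker.h>
#include <llvm/Support/Error.h>
#include <llvm/Support/MemoryBuffer.h>
#include <memory>
#include <vector>

std::unique_ptr<llvm::Module> link_runtime_fragments(
    llvm::LLVMContext &ctx, const std::vector<llvm::StringRef> &bitcode_blobs) {
    std::unique_ptr<llvm::Module> result;
    for (llvm::StringRef blob : bitcode_blobs) {
        // Deserialize one bitcode fragment that was baked into libHalide as a string.
        auto parsed = llvm::parseBitcodeFile(llvm::MemoryBufferRef(blob, "runtime_fragment"), ctx);
        if (!parsed) {
            llvm::consumeError(parsed.takeError());  // real code reports the error instead
            return nullptr;
        }
        if (!result) {
            result = std::move(*parsed);
        } else if (llvm::Linker::linkModules(*result, std::move(*parsed))) {
            return nullptr;  // linkModules returns true on error
        }
    }
    return result;  // this module is then compiled to machine code (JIT or AOT)
}
```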
-
In runtime/fake_thread_pool.cpp I saw the halide_mutex_lock/unlock functions. Is there also a way for me to expose halide_mutex outside the Halide runtime (e.g., to my code which implements the pipeline)? Also, thanks for the explanation; however, I'll still need some time with it and the Halide code to fully comprehend it.
-
fake_thread_pool.cpp is aptly named. It's only used for targets where threading isn't supported or makes no sense. Currently that's WebAssembly and "no-os", which is for when you have no kernel and are running on bare metal. I'm not entirely sure what you mean by outside the Halide runtime. If you're outside the Halide runtime in C++ code, you can use std::mutex. If you're inside the runtime, HalideRuntime.h declares the API for all the mutex stuff, and any runtime module can use it.
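For reference, the relevant declarations in HalideRuntime.h look roughly like this (paraphrased from memory; check your local header for the exact definition):

```cpp
// Paraphrased from HalideRuntime.h: a cross-platform mutex that must be
// zero-initialized; any runtime module can lock and unlock it. The real header
// declares these inside an extern "C" block.
#include <stdint.h>

struct halide_mutex {
    uintptr_t _private[1];
};
extern "C" void halide_mutex_lock(struct halide_mutex *mutex);
extern "C" void halide_mutex_unlock(struct halide_mutex *mutex);
```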
-
Summary of how to forward a CUDA runtime function to external C++ host code: Declare your function in runtime/HalideRuntime.h or runtime/HalideRuntimeCuda.h (both work, but prefer the CUDA header for code cleanliness), inside the extern "C" block:
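For example, a hypothetical declaration (the name and signature below are made up for illustration):

```cpp
// Hypothetical declaration added to runtime/HalideRuntimeCuda.h, inside the
// existing extern "C" block. Both the name halide_cuda_set_active_device and
// its signature are placeholders.
extern int halide_cuda_set_active_device(void *user_context, int device_id);
```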
Implement your function in runtime/cuda.cpp:
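Continuing the same hypothetical example, the definition in runtime/cuda.cpp would follow the conventions of the existing halide_cuda_* functions there (WEAK linkage, user_context as the first argument, zero return on success); the body below is only a placeholder:

```cpp
// Hypothetical implementation in runtime/cuda.cpp. WEAK is the runtime's
// weak-linkage macro, matching the other halide_cuda_* entry points.
extern "C" WEAK int halide_cuda_set_active_device(void *user_context, int device_id) {
    // Placeholder body: e.g. record device_id in runtime state (guarded by a
    // halide_mutex) so that halide_cuda_acquire_context can later return the
    // CUcontext that belongs to that device.
    return 0;  // zero on success, an error code otherwise
}
```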
Create an extern interface in JITModule.h (this is what is called from C++ host code):
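A hedged sketch of what that interface might look like (the wrapper name is hypothetical; the point is just to have a host-callable declaration that JITModule.cpp can implement):

```cpp
// Hypothetical host-facing wrapper declared in JITModule.h. Host C++ code calls
// this; JITModule.cpp implements it by forwarding to the JIT-compiled runtime
// symbol (see the following steps).
namespace Halide {
namespace Internal {

int jit_cuda_set_active_device(int device_id);

}  // namespace Internal
}  // namespace Halide
```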
By default, only device-agnostic code (no GPUs or other devices) is exposed to host C++ through JITModule::exports(). As such, we must create a runtime module with the CUDA extern functions. Use this function in JITModule.cpp:
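A hedged sketch of that step, assuming the shared-runtime helper in JITModule.cpp (the factory function and its argument list vary between Halide versions, so treat the names here as placeholders to verify against your JITModule.h):

```cpp
// Hedged sketch: obtain the shared runtime modules for a CUDA-enabled target so
// that the CUDA runtime (and our hypothetical new symbol) ends up in some
// module's exports() map. JITSharedRuntime::get and its arguments are
// paraphrased; verify against your local JITModule.h.
Halide::Target t = Halide::get_host_target().with_feature(Halide::Target::CUDA);
std::vector<Halide::Internal::JITModule> runtime_modules =
    Halide::Internal::JITSharedRuntime::get(nullptr, t);
```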
Finally, implement the hook to the extern code, still in JITModule.cpp:
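A hedged sketch of the hook, using the exports() map mentioned above (the wrapper, the helper that returns the CUDA runtime module, and the symbol name are all the hypothetical ones from the earlier steps):

```cpp
// Hedged sketch for JITModule.cpp: implement the host-facing wrapper by looking
// up the JIT-compiled symbol in the CUDA runtime module's exports() map and
// calling through the resulting function pointer. cuda_runtime_module() is a
// hypothetical accessor for the module obtained in the previous step.
namespace Halide {
namespace Internal {

int jit_cuda_set_active_device(int device_id) {
    const std::map<std::string, JITModule::Symbol> &syms = cuda_runtime_module().exports();
    auto it = syms.find("halide_cuda_set_active_device");
    if (it == syms.end()) {
        return -1;  // symbol not present (e.g. CUDA runtime not loaded)
    }
    // Symbol::address holds the JIT-compiled entry point.
    auto fn = reinterpret_cast<int (*)(void *, int)>(it->second.address);
    return fn(/* user_context */ nullptr, device_id);
}

}  // namespace Internal
}  // namespace Halide
```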
Now it is possible to call your halide_[funcName](type1 arg, ...) from host C++ code, for example:
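A usage sketch, carrying over the hypothetical wrapper name from the steps above:

```cpp
// Hypothetical host-side usage: select the GPU before realizing a pipeline that
// was JIT-compiled for a CUDA target. width and height are placeholders.
Halide::Internal::jit_cuda_set_active_device(1);  // switch to the second GPU
output.realize({width, height});                  // subsequent GPU work uses that device
```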
-
I'm currently working with multi-GPU resources, and am thus attempting to enable multi-GPU access in Halide.
The only other reference to this feature is a Stack Overflow post (https://stackoverflow.com/questions/51810425/halide-multi-gpu-support), according to which it can be done by overriding the acquire/release context functions of runtime/cuda.cpp under AOT compilation (roughly as sketched below). Unfortunately I need to use JIT compilation for my application, so this won't work for me.
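For reference, that AOT approach boils down to replacing the weakly-linked context hooks with your own definitions, roughly as below (signatures paraphrased from runtime/cuda.cpp and the per-device bookkeeping is hypothetical; verify both against your Halide version):

```cpp
// Sketch of the AOT-style override from the Stack Overflow answer: provide your
// own halide_cuda_acquire_context / halide_cuda_release_context so the pipeline
// uses a per-device CUcontext. Signatures paraphrased; bookkeeping hypothetical.
#include <cuda.h>

static CUcontext per_device_ctx[2];          // one context per GPU, created by the app
static thread_local int current_device = 0;  // chosen by the application per thread

extern "C" int halide_cuda_acquire_context(void *user_context, CUcontext *ctx, bool create) {
    *ctx = per_device_ctx[current_device];   // hand Halide the context for the chosen GPU
    return 0;
}

extern "C" int halide_cuda_release_context(void *user_context) {
    return 0;  // contexts are owned and destroyed by the application
}
```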
At this time I'm modifying a local Halide build (from the 2019 release), mostly runtime/HalideRuntime.h and runtime/cuda.cpp.
I've implemented an array of CUDA CUcontext handles and changed the acquire/release functions to support this; however, I need to make some functions thread-safe. I believe that I cannot use halide_mutex's halide_mutex_lock/unlock since their implementations are empty. Also, I can't use pthread_mutex or C++ std::mutex, since it is impossible to include external headers in the Halide runtime sources (although I can forward-declare the lock/unlock functions in runtime/runtime_internal.h, I can't do that for the pthread_mutex_t struct).
Is there any way to include these headers in the runtime sources, or to link an external library into the runtime, in order to externalize the use of mutexes?
Also, if it's not asking too much, I'm curious about how Halide compilation works, specifically how the runtime is compiled. I'm aware of the use of an internal IR that is fed to LLVM, and of the visitor pattern used to gradually lower the pipeline representation until it is finally lowered to the target, but the runtime part still puzzles me.