Multi-GPU support #5244
-
You should be able to use halide_mutex just fine (e.g. see scoped_mutex_lock.h); that's what the rest of the runtime does. I'm not sure what you mean by empty implementations? They're implemented in synchronization_common.h.
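As an illustration, here is a minimal sketch of how a runtime module typically guards shared state with this API. ScopedMutexLock is the RAII helper from runtime/scoped_mutex_lock.h; the state and function names below are hypothetical, and details may differ between Halide versions.

```cpp
// Minimal sketch (hypothetical names): guarding runtime state with halide_mutex,
// the same way other runtime modules do. ScopedMutexLock (runtime/scoped_mutex_lock.h)
// locks in its constructor and unlocks in its destructor.
namespace Halide { namespace Runtime { namespace Internal {

halide_mutex example_lock;   // a zero-initialized halide_mutex is a valid unlocked mutex
int example_counter = 0;     // hypothetical shared state

int example_increment(void *user_context) {
    ScopedMutexLock lock(&example_lock);  // released automatically at scope exit
    return ++example_counter;
}

}}}  // namespace Halide::Runtime::Internal
```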
-
To answer your second question: We individually compile the cpp files in src/runtime into fragments of LLVM IR. We bake these bitcode files as constant strings into libHalide (note that they're in the LLVM bitcode binary format, not the LLVM assembly format, so they're not human-readable at this point, but in principle it's the same thing). Then in LLVM_Runtime_Linker.cpp we use the Halide target string to select which of those components we want, deserialize them from the embedded strings, and combine them using the LLVM API into a single LLVM module for the given target string. We then compile that to machine code, either to use immediately in the same process (when JIT-compiling) or to output as a runtime static library. We might jam the generated code that runs the pipeline directly into the same LLVM Module (in AOT mode without the no_runtime flag), or we might compile the runtime by itself (in JIT mode or AOT with no_runtime).

One additional complexity is that in JIT mode we compile into a few different pieces of machine code instead of one big one, so that we can run slightly different target strings without making a whole new runtime. We put all the non-GPU stuff in one module, then put things like CUDA, OpenCL, etc. in their own separate modules, and just tell LLVM how to resolve symbols from each into the others.
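As a rough illustration of the deserialize-and-link step (not the actual code in LLVM_Runtime_Linker.cpp; the helper below and its inputs are hypothetical placeholders), the LLVM API usage looks something like this:

```cpp
// Sketch of combining embedded bitcode fragments into one llvm::Module, roughly
// what LLVM_Runtime_Linker.cpp does after selecting components for the target
// string. The function and its inputs are hypothetical.
#include <llvm/ADT/StringRef.h>
#include <llvm/Bitcode/BitcodeReader.h>
#include <llvm/IR/LLVMContext.h>
#include <llvm/IR/Module.h>
#include <llvm/Linker/Linker.h>
#include <llvm/Support/Error.h>
#include <llvm/Support/MemoryBuffer.h>
#include <memory>
#include <vector>

std::unique_ptr<llvm::Module> link_runtime_fragments(
    llvm::LLVMContext &ctx, const std::vector<llvm::StringRef> &bitcode_blobs) {
    std::unique_ptr<llvm::Module> result;
    for (llvm::StringRef blob : bitcode_blobs) {
        // Deserialize one bitcode fragment that was baked into libHalide as a string.
        auto parsed = llvm::parseBitcodeFile(llvm::MemoryBufferRef(blob, "runtime_fragment"), ctx);
        if (!parsed) {
            llvm::consumeError(parsed.takeError());  // real code reports the error instead
            return nullptr;
        }
        if (!result) {
            result = std::move(*parsed);
        } else if (llvm::Linker::linkModules(*result, std::move(*parsed))) {
            return nullptr;  // linkModules returns true on error
        }
    }
    return result;  // this module is then compiled to machine code (JIT or AOT)
}
```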
-
In runtime/fake_thread_pool.cpp I saw the halide_mutex_lock/unlock functions. Is there also a way for me to expose halide_mutex outside the Halide runtime (e.g., to my code which implements the pipeline)? Also, thanks for the explanation; however, I'll still need some time with it and the Halide code to fully comprehend it.
-
fake_thread_pool.cpp is aptly named. It's only used for targets where threading isn't supported or makes no sense. Currently that's WebAssembly and "no-os", which is for when you have no kernel and are running on bare metal. I'm not entirely sure what you mean by outside the Halide runtime. If you're outside the Halide runtime in C++ code, you can use std::mutex. If you're inside the runtime, HalideRuntime.h declares the API for all the mutex stuff, and any runtime module can use it.
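For reference, the relevant declarations in HalideRuntime.h look roughly like this (paraphrased from memory; check your local header for the exact definition):

```cpp
// Paraphrased from HalideRuntime.h: a cross-platform mutex that must be
// zero-initialized; any runtime module can lock and unlock it. The real header
// declares these inside an extern "C" block.
#include <stdint.h>

struct halide_mutex {
    uintptr_t _private[1];
};
extern "C" void halide_mutex_lock(struct halide_mutex *mutex);
extern "C" void halide_mutex_unlock(struct halide_mutex *mutex);
```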
-
Summary of how to forward a CUDA runtime function to external C++ host code: Declare your function in runtime/HalideRuntime.h or runtime/HalideRuntimeCuda.h (both work, but prefer the CUDA header for code cleanliness), inside the extern "C" block:
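For example, a hypothetical declaration (the name and signature below are made up for illustration):

```cpp
// Hypothetical declaration added to runtime/HalideRuntimeCuda.h, inside the
// existing extern "C" block. Both the name halide_cuda_set_active_device and
// its signature are placeholders.
extern int halide_cuda_set_active_device(void *user_context, int device_id);
```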
Implement your function in runtime/cuda.cpp:
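Continuing the same hypothetical example, the definition in runtime/cuda.cpp would follow the conventions of the existing halide_cuda_* functions there (WEAK linkage, user_context as the first argument, zero return on success); the body below is only a placeholder:

```cpp
// Hypothetical implementation in runtime/cuda.cpp. WEAK is the runtime's
// weak-linkage macro, matching the other halide_cuda_* entry points.
extern "C" WEAK int halide_cuda_set_active_device(void *user_context, int device_id) {
    // Placeholder body: e.g. record device_id in runtime state (guarded by a
    // halide_mutex) so that halide_cuda_acquire_context can later return the
    // CUcontext that belongs to that device.
    return 0;  // zero on success, an error code otherwise
}
```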
Create an extern interface in JITModule.h (this is what is called from C++ host code):
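A hedged sketch of what that interface might look like (the wrapper name is hypothetical; the point is just to have a host-callable declaration that JITModule.cpp can implement):

```cpp
// Hypothetical host-facing wrapper declared in JITModule.h. Host C++ code calls
// this; JITModule.cpp implements it by forwarding to the JIT-compiled runtime
// symbol (see the following steps).
namespace Halide {
namespace Internal {

int jit_cuda_set_active_device(int device_id);

}  // namespace Internal
}  // namespace Halide
```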
By default, only device-agnostic code (no GPUs or other devices) is exposed to host C++ through JITModule::exports(). As such, we must create a runtime module with the CUDA extern functions. Use this function in JITModule.cpp:
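A hedged sketch of that step, assuming the shared-runtime helper in JITModule.cpp (the factory function and its argument list vary between Halide versions, so treat the names here as placeholders to verify against your JITModule.h):

```cpp
// Hedged sketch: obtain the shared runtime modules for a CUDA-enabled target so
// that the CUDA runtime (and our hypothetical new symbol) ends up in some
// module's exports() map. JITSharedRuntime::get and its arguments are
// paraphrased; verify against your local JITModule.h.
Halide::Target t = Halide::get_host_target().with_feature(Halide::Target::CUDA);
std::vector<Halide::Internal::JITModule> runtime_modules =
    Halide::Internal::JITSharedRuntime::get(nullptr, t);
```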
Finally, implement the hook to the extern code, still in JITModule.cpp:
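A hedged sketch of the hook, using the exports() map mentioned above (the wrapper, the helper that returns the CUDA runtime module, and the symbol name are all the hypothetical ones from the earlier steps):

```cpp
// Hedged sketch for JITModule.cpp: implement the host-facing wrapper by looking
// up the JIT-compiled symbol in the CUDA runtime module's exports() map and
// calling through the resulting function pointer. cuda_runtime_module() is a
// hypothetical accessor for the module obtained in the previous step.
namespace Halide {
namespace Internal {

int jit_cuda_set_active_device(int device_id) {
    const std::map<std::string, JITModule::Symbol> &syms = cuda_runtime_module().exports();
    auto it = syms.find("halide_cuda_set_active_device");
    if (it == syms.end()) {
        return -1;  // symbol not present (e.g. CUDA runtime not loaded)
    }
    // Symbol::address holds the JIT-compiled entry point.
    auto fn = reinterpret_cast<int (*)(void *, int)>(it->second.address);
    return fn(/* user_context */ nullptr, device_id);
}

}  // namespace Internal
}  // namespace Halide
```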
Now it is possible to call your halide_[funcName](type1 arg, ...) from host C++ code, for example:
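A usage sketch, carrying over the hypothetical wrapper name from the steps above:

```cpp
// Hypothetical host-side usage: select the GPU before realizing a pipeline that
// was JIT-compiled for a CUDA target. width and height are placeholders.
Halide::Internal::jit_cuda_set_active_device(1);  // switch to the second GPU
output.realize({width, height});                  // subsequent GPU work uses that device
```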
-
I'm currently working with multi-GPU resources, and am thus attempting to enable multi-GPU access in Halide.
The only other reference to this feature is a Stack Overflow post (https://stackoverflow.com/questions/51810425/halide-multi-gpu-support), according to which it can be done by overriding the acquire/release context functions of runtime/cuda.cpp under AOT compilation (roughly as sketched below). Unfortunately I need to use JIT compilation for my application, so this won't work for me.
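For reference, that AOT approach boils down to replacing the weakly-linked context hooks with your own definitions, roughly as below (signatures paraphrased from runtime/cuda.cpp and the per-device bookkeeping is hypothetical; verify both against your Halide version):

```cpp
// Sketch of the AOT-style override from the Stack Overflow answer: provide your
// own halide_cuda_acquire_context / halide_cuda_release_context so the pipeline
// uses a per-device CUcontext. Signatures paraphrased; bookkeeping hypothetical.
#include <cuda.h>

static CUcontext per_device_ctx[2];          // one context per GPU, created by the app
static thread_local int current_device = 0;  // chosen by the application per thread

extern "C" int halide_cuda_acquire_context(void *user_context, CUcontext *ctx, bool create) {
    *ctx = per_device_ctx[current_device];   // hand Halide the context for the chosen GPU
    return 0;
}

extern "C" int halide_cuda_release_context(void *user_context) {
    return 0;  // contexts are owned and destroyed by the application
}
```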
At this time I'm modifying a local Halide build (from the 2019 release), mostly runtime/HalideRuntime.h and runtime/cuda.cpp.
I've implemented an array of CUDA CUcontext handles and changed the acquire/release functions to support this; however, I need to make some functions thread-safe. I believe that I cannot use halide_mutex's halide_mutex_lock/unlock since their implementations are empty. Also, I can't use pthread_mutex or C++ std::mutex, since it is impossible to include external headers in the Halide runtime sources (although I can forward-declare the lock/unlock functions in runtime/runtime_internal.h, I can't do that for the pthread_mutex_t struct).
Is there any way to include these headers in the runtime sources, or to link an external library into the runtime, in order to externalize the use of mutexes?
Also, if it's not asking too much, I'm curious about how Halide compilation works, specifically how the runtime is compiled. I'm aware of the use of an internal IR that is fed to LLVM, and of the visitor pattern used to gradually lower the pipeline representation until it is finally lowered to the target, but the runtime part still puzzles me.