# discussion: porting Halide to nim #347
## Comments
As discussed, this sounds like a very interesting idea and, from a learning perspective, I'd rather implement it from scratch, just like Arraymancer started as a way to learn NN from scratch instead of reusing Tensorflow or Torch.

## Overview of Halide

Halide is a C++ domain-specific language for computational image processing. It started with the idea of separating the algorithm (the what) from its efficient hardware implementation (the how). This is similar to a (functional programming) source file vs the machine code produced by an optimizing compiler. The "how" can be hinted to assist the compiler with threading and vectorization, by specifying cache blocking parameters, unrolling factors and threading hints. Furthermore, Halide supports multiple backends (x86, ARM, Vulkan, DX12, OpenGL, OpenCL, Cuda ...) including a JIT backend that dispatches to the latest instruction set supported (for example AVX512).

The pipeline is: C++ DSL --> Halide Intermediate Representation (Computational Graph) --> LLVM (AOT or JIT) --> machine code.
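To make the algorithm/schedule separation concrete, here is a minimal plain-Nim sketch (not Halide and not a binding): the same 1D box blur written once naively ("the what") and once with a hand-applied cache-blocking "schedule" ("the how"). In Halide the second version would be derived from the first through scheduling hints (split, tile, vectorize, parallel) rather than rewritten by hand.

```nim
proc blurNaive(src: seq[float32]): seq[float32] =
  ## The "what": each interior output element is the mean of a 3-wide window.
  result = newSeq[float32](src.len)
  for i in 1 ..< src.len - 1:
    result[i] = (src[i-1] + src[i] + src[i+1]) / 3.0'f32

proc blurBlocked(src: seq[float32], blockSize = 256): seq[float32] =
  ## The "how": the same computation, iterated block by block so the working
  ## set stays in cache. Halide derives this kind of loop nest from hints.
  result = newSeq[float32](src.len)
  var start = 1
  while start < src.len - 1:
    let stop = min(start + blockSize, src.len - 1)
    for i in start ..< stop:
      result[i] = (src[i-1] + src[i] + src[i+1]) / 3.0'f32
    start = stop

when isMainModule:
  let data = newSeq[float32](1024)
  doAssert blurNaive(data) == blurBlocked(data)
```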
## Advantages

### In general

### For ML and DL

The main advantages are portability, SIMD optimisation opportunities, and computation fusion to avoid allocating and materialising intermediate results, as sketched below.
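To make the fusion point concrete, a minimal plain-Nim sketch (seqs standing in for tensors, names made up): the unfused version of `relu(x + y)` materialises an intermediate buffer, while the fused version streams through the data once and allocates nothing extra.

```nim
proc addThenRelu(x, y: seq[float32]): seq[float32] =
  ## Unfused: two passes over the data and one intermediate allocation.
  var tmp = newSeq[float32](x.len)
  for i in 0 ..< x.len: tmp[i] = x[i] + y[i]
  result = newSeq[float32](x.len)
  for i in 0 ..< x.len: result[i] = max(tmp[i], 0'f32)

proc addReluFused(x, y: seq[float32]): seq[float32] =
  ## Fused: a single pass, no intermediate tensor.
  result = newSeq[float32](x.len)
  for i in 0 ..< x.len: result[i] = max(x[i] + y[i], 0'f32)

when isMainModule:
  let x = @[1'f32, -2'f32, 3'f32]
  let y = @[0.5'f32, 1'f32, -4'f32]
  doAssert addThenRelu(x, y) == addReluFused(x, y)
```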
Parallelism in ML/DL is either straightforward (it's a single loop) or very complex and would need manual involvement:

Some DL layers that would benefit from Halide-like functionality:
## Disadvantages
Note that OpenAI Dali (not to be confused with Nvidia DALI) is apparently a compiler that plugs into dynamic-graph frameworks.
## Halide papers
## Beyond Halide

Halide has been expanded with autodiff capabilities in gradient-halide, which may greatly enhance autograd, a recurring source of slowness in Arraymancer due to the need for a DAG, which doesn't play well with reference counting. However, while gradient-halide results are promising, TVM (a fork of Halide for DL) reported perf issues.

## Related projects
- Auto-scheduling
- Polyhedral tiling
## Approaches to porting

### Wrapping

I'm not really a fan of wrapping unless we can do away with CMake and completely configure everything via Nim. This is the approach I took for the Arcade Learning Environment for reinforcement learning here, with a Nim build script that basically replaces CMake; the DLL built can then be called from another Nim program: https://github.com/numforge/agent-smith/blob/master/third_party/ale_wrap.nim. This preserves C++ name mangling, puts everything in nimcache and can be cleanly made into a nimble task.

The alternative is something like nimtorch, which requires either a build script (conda in NimTorch's case), potentially a docker image, or building the library separately and then copying it.

Once wrapped, we need a way for Nim to generate valid Halide DSL, which means removing all the fluff Nim adds to a C/C++ file. Interestingly, this is a very similar problem to generating Cuda kernels like what is done in Cudanim; this would also apply to the OpenCL code generator ideas from 2014, 2015, 2016. A minimal sketch of the "no CMake" wrapping style is shown below.
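Purely as an illustration of that wrapping style, here is a hypothetical sketch: the file, header and function names are made up, and it assumes the C++ backend (`nim cpp`).

```nim
# Hypothetical sketch: compile a C++ translation unit straight from Nim and
# import one of its symbols, with no CMake involved.

{.passC: "-std=c++14".}
{.compile: "third_party/halide_stub.cpp".}   # hypothetical C++ file

proc makeBuffer(width, height: cint): pointer
  {.importcpp: "make_buffer(@)", header: "halide_stub.h".}  # hypothetical API

when isMainModule:
  let buf = makeBuffer(640, 480)
  echo "buffer at ", cast[uint](buf)
```

The point is that the C++ compilation is driven entirely by the Nim compiler, so the artifacts land in nimcache and the whole thing can be wrapped in a nimble task, as described above.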
### Reimplementing from scratch

As said in the beginning, from a learning perspective it's the most interesting route, though the time-to-market and the testing (+ fuzzing) would probably be years away. Interestingly, it seems like Halide started as FImage and had its own assembler. The IR already looks quite similar to what we have now after only 3 months: https://github.com/halide/Halide/tree/5a2f90b7bcf4d9298e4068698f52731401e07dc8.

There are 2 big parts of Halide from an engineering perspective:

Skipping the front-end (it has to be implemented in any case) to discuss the backend.

### Intermediate representation

For the IR there are several choices: Halide IR, Glow IR, TVM IR, SPIR-V, or LLVM IR directly.

### LLVM backend

Assuming we want to support lots of frameworks, I don't see any alternative to LLVM for the backend, from a license, platform and maturity point of view. That comes with its own cost, notably compilation overhead and the space LLVM takes. Regarding the API, I'm not yet sure which one Halide uses, but LLVM offers 2:
### Written from scratch backends

#### JIT backend

Assuming we target a couple of specific platforms that share a lot of similarity in the programming model (for example x86, ARM, Cuda, ROCm), this is also possible. Furthermore, as you said, using Nim offers a tremendous metaprogramming advantage, as evidenced by the very clean way I could implement my own JIT assembler in Laser:

```nim
op_generator:
  op MOV: # MOV(dst, src) load/copy src into destination
    ## Copy 64-bit register content to another register
    [dst64, src64]: [rex(w=1), 0x89, modrm(Direct, reg = src64, rm = dst64)]
    ## Copy 32-bit register content to another register
    [dst32, src32]: [0x89, modrm(Direct, reg = src32, rm = dst32)]
    ## Copy 16-bit register content to another register
    [dst16, src16]: [0x66, 0x89, modrm(Direct, reg = src16, rm = dst16)]
    ## Copy 8-bit register content to another register
    [dst8, src8]: [0x88, modrm(Direct, reg = src8, rm = dst8)]
    ## Copy 64-bit immediate value into register
    [dst64, imm64]: [rex(w=1), 0xB8 + dst64] & imm64
    ## Copy 32-bit immediate value into register
    [dst64, imm32]: [0xB8 + dst64] & imm32
    ## Copy 16-bit immediate value into register
    [dst64, imm16]: [0x66, 0xB8 + dst64] & imm16
    ## Copy 32-bit immediate value into register
    [dst32, imm32]: [0xB8 + dst32] & imm32
    ## Copy 16-bit immediate value into register
    [dst32, imm16]: [0x66, 0xB8 + dst32] & imm16
    ## Copy 16-bit immediate value into register
    [dst16, imm16]: [0x66, 0xB8 + dst16] & imm16
    ## Copy 8-bit immediate value into register
    [dst8, imm8]: [0xB0 + dst8, imm8]
  op LEA:
    ## Load effective address of the target label into a register
    [dst64, label]: [rex(w=1), 0x8D, modrm(Direct, reg = dst64, rm = rbp)]
  op CMP:
    ## Compare 32-bit immediate with 32-bit int at memory location stored in adr register
    [adr, imm64]: [rex(w=1), 0x81, modrm(Indirect, opcode_ext = 7, rm = adr[0])] & imm64
    ## Compare 32-bit immediate with 32-bit int at memory location stored in adr register
    [adr, imm32]: [0x81, modrm(Indirect, opcode_ext = 7, rm = adr[0])] & imm32
    ## Compare 16-bit immediate with 16-bit int at memory location stored in adr register
    [adr, imm16]: [0x66, 0x81, modrm(Indirect, opcode_ext = 7, rm = adr[0])] & imm16
    ## Compare 8-bit immediate with byte at memory location stored in adr register
    [adr, imm8]: [0x80, modrm(Indirect, opcode_ext = 7, rm = adr[0]), imm8]
  op JZ:
    ## Jump to label if zero flag is set
    [label]: [0x0F, 0x84]
  op JNZ:
    ## Jump to label if zero flag is not set
    [label]: [0x0F, 0x85]
```

JIT is the approach taken for x86-only libs:
There are significant challenges though:
#### AOT backend

We can also do ahead-of-time compilation with runtime dispatch depending on the target. This is the approach I take with Laser GEMM, with a version for SSE, SSE2, SSE4.1, AVX, AVX+FMA, AVX2, AVX512, ..., and we can add Cuda, OpenCL, etc.
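As a minimal sketch of what the runtime dispatch amounts to (plain Nim; `hasAvx2` is a placeholder for a real CPUID query, and `dotAvx2` is just a stand-in with identical semantics rather than a hand-vectorized kernel):

```nim
type DotKernel = proc (a, b: openArray[float32]): float32 {.nimcall.}

proc dotGeneric(a, b: openArray[float32]): float32 =
  for i in 0 ..< min(a.len, b.len):
    result += a[i] * b[i]

proc dotAvx2(a, b: openArray[float32]): float32 =
  dotGeneric(a, b)   # stand-in with identical semantics

proc hasAvx2(): bool = false   # placeholder for a CPUID feature check

# Resolve the kernel once, at program start.
let dot: DotKernel = if hasAvx2(): dotAvx2 else: dotGeneric

when isMainModule:
  echo dot([1'f32, 2'f32, 3'f32], [4'f32, 5'f32, 6'f32])   # 32.0
```

The important property is that the dispatch decision is made once, outside the hot loop.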
---

Note that we can generate LLVM IR from Nim (or from a DSL, if we have a macro DSL => Nim); this demo shows how: https://github.com/timotheecour/vitanim/blob/master/testcases/tests/t0127.nim
---

I find the need to go through a text file quite clunky and slow in the long term. Since LLVM has a bitcode representation of its IR, I was thinking of targeting that directly using bindings akin to LLVM-D. This is also what Halide is doing. From an interface point of view, I can probably re-use the JIT Assembler for x86-64 I created here, except that instead of assembling x86 I'd be assembling LLVM IR (Usage). It can be restricted to the subset valid for Nvidia targets. Alternatively, it seems like we may be able to directly JIT an AST representation without having to go through an LLVM IR phase.
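A hedged sketch of what "assembling LLVM IR" through bindings could look like from Nim. The `LLVM*` functions below are real LLVM-C entry points, but the ad hoc declarations and the dynlib name are assumptions; a real port would ship a proper wrapper akin to LLVM-D.

```nim
type
  LLVMModuleRef = distinct pointer
  LLVMTypeRef = distinct pointer
  LLVMValueRef = distinct pointer
  LLVMBasicBlockRef = distinct pointer
  LLVMBuilderRef = distinct pointer

const llvmLib = "libLLVM.so"   # assumption: name/version differs per platform

{.push cdecl, dynlib: llvmLib.}
proc LLVMModuleCreateWithName(name: cstring): LLVMModuleRef {.importc.}
proc LLVMInt32Type(): LLVMTypeRef {.importc.}
proc LLVMFunctionType(ret: LLVMTypeRef, params: ptr LLVMTypeRef,
                      count: cuint, isVarArg: cint): LLVMTypeRef {.importc.}
proc LLVMAddFunction(m: LLVMModuleRef, name: cstring,
                     fnTy: LLVMTypeRef): LLVMValueRef {.importc.}
proc LLVMAppendBasicBlock(fn: LLVMValueRef, name: cstring): LLVMBasicBlockRef {.importc.}
proc LLVMCreateBuilder(): LLVMBuilderRef {.importc.}
proc LLVMPositionBuilderAtEnd(b: LLVMBuilderRef, blk: LLVMBasicBlockRef) {.importc.}
proc LLVMGetParam(fn: LLVMValueRef, idx: cuint): LLVMValueRef {.importc.}
proc LLVMBuildAdd(b: LLVMBuilderRef, lhs, rhs: LLVMValueRef,
                  name: cstring): LLVMValueRef {.importc.}
proc LLVMBuildRet(b: LLVMBuilderRef, v: LLVMValueRef): LLVMValueRef {.importc.}
proc LLVMPrintModuleToString(m: LLVMModuleRef): cstring {.importc.}
{.pop.}

when isMainModule:
  # Build the equivalent of: define i32 @add_i32(i32, i32) { ret a + b }
  let module = LLVMModuleCreateWithName("arraymancer_jit")
  var params = [LLVMInt32Type(), LLVMInt32Type()]
  let fnTy = LLVMFunctionType(LLVMInt32Type(), addr params[0], 2, 0)
  let fn = LLVMAddFunction(module, "add_i32", fnTy)
  let builder = LLVMCreateBuilder()
  LLVMPositionBuilderAtEnd(builder, LLVMAppendBasicBlock(fn, "entry"))
  let sum = LLVMBuildAdd(builder, LLVMGetParam(fn, 0), LLVMGetParam(fn, 1), "sum")
  discard LLVMBuildRet(builder, sum)
  echo LLVMPrintModuleToString(module)   # dump the textual IR, for inspection only
```

Everything stays in memory; the textual dump at the end is only for inspection, which is what avoids the intermediate `.ll` file.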
---

Some advances in my research (will find a place to store that in the future, probably in Laser):

## Code generation
## Alternative HPC compiler
## Semantic pass and the expression problem

Halide and PyTorch Glow are using the visitor pattern, which looks to me like a workaround for a not-expressive-enough language. We need to be able to extend functions without touching the core ones, and if possible data types as well. For example we could have the functions

Data types are probably not that critical, i.e. we could use ADTs + generics for Tensor[float32], VulkanTensor[float16], ...
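As a minimal Nim sketch of the non-visitor direction (node names made up): new operations over IR nodes can be added as methods in a separate module, without editing the core type definitions, which is exactly the extend-without-touching-the-core property mentioned above.

```nim
type
  IrNode = ref object of RootObj
  Add = ref object of IrNode
    lhs, rhs: IrNode
  Lit = ref object of IrNode
    value: float32

# Core operation, dispatched at runtime.
method eval(n: IrNode): float32 {.base.} =
  raise newException(ValueError, "unhandled node")
method eval(n: Lit): float32 = n.value
method eval(n: Add): float32 = eval(n.lhs) + eval(n.rhs)

# A later module can bolt on a new operation without touching the types above.
method render(n: IrNode): string {.base.} = "?"
method render(n: Lit): string = $n.value
method render(n: Add): string = "(" & render(n.lhs) & " + " & render(n.rhs) & ")"

when isMainModule:
  let tree = Add(lhs: Lit(value: 1.5), rhs: Lit(value: 2.5))
  echo eval(tree)    # 4.0
  echo render(tree)  # (1.5 + 2.5)
```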
In research, there are several approaches to handle that:

---
Opening this issue to track ideas for how to port Halide to Nim.
This could be done either by using Halide in Nim via:
## proposal: approach 1
use a plugin-at-CT approach as done in nimterop (nimterop/nimterop#66) to transparently compile (using Halide) a C++/Halide DSL, save it to a shared library and cache it; eg:
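Purely as a hypothetical illustration of the shape of this idea, with plain `staticExec` standing in for nimterop's plugin machinery and made-up paths, flags and symbol names (real code would also hash the inputs for caching):

```nim
# Hypothetical: build the Halide generator at Nim compile time, then load it.
const buildLog = staticExec(
  "c++ -std=c++14 -shared -fPIC -o /tmp/blur_gen.so blur_gen.cpp -lHalide")

proc blur(input, output: pointer; width, height: cint)
  {.importc: "blur", dynlib: "/tmp/blur_gen.so", cdecl.}
```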
## [EDIT] proposal: approach 2
using https://github.com/fragcolor-xyz/nimline could be another possibility, but it would probably need adaptation, as the Halide compiler needs to be run on a C++/Halide program
## related
- halide: https://en.wikipedia.org/wiki/Halide_(programming_language)
- some discussion on gitter https://gitter.im/Arraymancer/Lobby?at=5c4f1b7d13a2814df6ded8f5 + a few hours before that
- HPC optimization library: Tensor Comprehensions by Facebook #203 briefly mentions Halide but wasn't the focus of that issue
- other language integrations:
- tutorial: http://halide-lang.org/tutorials/tutorial_lesson_16_rgb_generate.html
- nlvm: https://github.com/arnetheduck/nlvm