Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

discussion: porting Halide to nim #347

Open
timotheecour opened this issue Jan 28, 2019 · 4 comments
Open

discussion: porting Halide to nim #347

timotheecour opened this issue Jan 28, 2019 · 4 comments

Comments

@timotheecour
Copy link
Contributor

timotheecour commented Jan 28, 2019

opening this issue to track ideas for how to port Halide to nim.
This could be either using halide in Nim via:

  • A1: a wrapper (eg using nimterop) or
  • A2: re-implementing some ideas in nim directly, which although a lot of work would result in a much simpler solution given nim's better DSL abilities

proposal: approach 1

use a plugin at CT approach as done in nimterop (nimterop/nimterop#66) to transparently compile (using halide) a c++/Halide DSL, save to a shared library and cache it; eg:

# this is in Nim

import halide_funs

cPluginHalide: """ // this is compiled on the fly and cached, using a hash of the string, eg saved to: /tmp/nimhalide/libplugin_12abf23.dylib
// this is C++ halide code
Func blur_3x3(Func input) { // example from https://en.wikipedia.org/wiki/Halide_(programming_language)
  Func blur_x, blur_y;
  Var x, y, xi, yi;

  // The algorithm - no storage or order
  blur_x(x, y) = (input(x-1, y) + input(x, y) + input(x+1, y))/3;
  blur_y(x, y) = (blur_x(x, y-1) + blur_x(x, y) + blur_x(x, y+1))/3;

  // The schedule - defines order, locality; implies storage
  blur_y.tile(x, y, xi, yi, 256, 32)
        .vectorize(xi, 8).parallel(y);
  blur_x.compute_at(blur_y, x).vectorize(x, 8);

  return blur_y;
}"""

proc main() =
  let image = loadImage("img.jpg")
  blur_3x3(image)

[EDIT] proposal: approach 2

using https://github.com/fragcolor-xyz/nimline could be another possiblity, but would probalby need adaptation as halide compiler needs to be run on C++/halide program

related

@mratsim
Copy link
Owner

mratsim commented Jan 31, 2019

As discussed, this sounds like a very interesting idea and from a learning perspective I'd rather implement it from scratch, just like Arraymancer started as a way to learn NN from scratch instead of reusing Tensorflow or Torch.

Overview of Halide

Halide is a C++ domain specific language for computational image processing. It started with the idea of separating the algorithm (the what) from its efficient hardware implementation (the how). This is similar to a (functional programming) source file vs machine code when passed through an optimizing compiler.

The "how" can be hinted to assist the compiler in its threading and vectorization by specifying cache blocking parameters, unrolling factor and threading hints.

Furthermore Halide supports multiple backends (x86, ARM, Vulkan, DX12, OpenGL, OpenCL, Cuda ...) including a JIT backend that will dispatch with the latest instruction set supported (for example AVX512).

The pipeline is: C++ DSL --> Halide Intermediate Representation (Computational Graph) --> LLVM (AOT or JIT) --> machine code.

## Advantages

In general

  • Short code, much faster to develop new efficient algorithms.
  • No need to fiddle with platform specific optimisations like SIMD.
  • Portability
  • Computation can be fused across functions and more parallelism can be exposed.
  • Much easier to test different parallel schedule

For ML and DL

The main advantages are the portability with SIMD optimisations opportunities and computation fusion to prevent intermediate allocation and results.

Parallelism in ML/DL is either straightforward (it's a single loop) or very complex and would need manual involvement:

  • For example Halide is memory-bound on matrix multiplication (0.5~1 TFLOP/s at most while I can reach 1.8TFLOP/s with OpenBLAS or MKLDNN) Perf reporting in Halide is confusing, I revamped it and it reaches MKL perf on single socket CPUs.
  • Halide is unsuited for trees ensembles like random forests

Some DL layers that would benefits from Halide-like functionality:

  • Convolution (apparently Halide has a schedule that reaches MKL-DNN performance)
  • Locally-Connected Layer/Unshared Convolution (Implement unshared convolution / Locally-Connected Layer #255). Though this might be faster with summed-area table/integral image and then a simple for loop.
  • Bilinear
  • Pooling layers (Max, Average, Region of Interest Pooling)

Disadvantages

  1. Halide is pretty much a static graph compiler like Tensorflow XLA which is not that great for research and debugging. It's a pain to use Tensorflow for NLP, RNNs and GANs for example.

    The main issue is the lack of control flow like for loop and if/else branches.

Note that OpenAI Dali (not to be confused with Nvidia DALI) apparently is a compiler that plugs in dynamic graphs framework.

  1. Another disadvantage is the LLVM dependency at runtime, but that only concerns JIT. LLVM is a huge library.

Halide papers

Beyond Halide

Halide has been expanded with autodiff capabilities in gradient-halide which may greatly enhance autograd which is a recurrent source of slowness in Arraymancer due to the need of a DAG which doesn't play well with reference counting. However while gradient-halide results are promising, TVM (a fork of Halide for DL) reported perf issues

Related projects

Auto-scheduling - Polyhedral tiling

Approaches to porting

Wrapping

I'm not really fan of wrapping unless we can do away with CMake and completely configure everything via Nim. This is the approach I took for the Arcade Learning Environment for reinforcement learning here with a Nim build script that basically replaces CMake and the DLL built can be called from another Nim: https://github.com/numforge/agent-smith/blob/master/third_party/ale_wrap.nim. This preserves C++ name mangling, puts everything in nimcache and can cleanly made into a nimble task.

The alternative is something like nimtorch which requires either a build script (conda in NimTorch case), potentially a docker or building the library apart and then copying it.

Once wrapped we need a way for Nim to generate valide Halide DSL, this means removing all the fluff Nim adds to a C/C++ file. Interestingly this is a very similar problem to generating Cuda kernel like what is done in Cudanim, this would also apply to OpenCL code generator idea from 2014, 2015, 2016

Reimplementing from scratch

As said in the beginning, from a learning perspective it's the most interesting though the time-to-market and the testing (+ fuzzing) would probably be years away.

Interestingly, it seems like Halide started as FImage and had it's own assembler. The IR looks already quite similar to what we have now after only 3 months: https://github.com/halide/Halide/tree/5a2f90b7bcf4d9298e4068698f52731401e07dc8.

There are 2 big parts of Halide from an engineering perspective:

  • The frontend, which converts the DSL to Halide IR.
  • A lowering step from Halide IR to LLVM IR.
  • The backend which uses LLVM to generate code for the different targets.

Skipping the front-end since that has to be implemented to discuss the backend.

Intermediate representation

For the IR there are several choices: Halide IR, Glow IR, TVM IR, SPIR-V or LLVM IR directly

LLVM backend

Assuming we want to support lots of framework, I don't see any alternative to LLVM for the backend both from a license, a platform and a maturity point of view which comes with it's own cost, notably compilation overhead and the space LLVM takes.

Regarding API, I'm not yet sure which one Halide uses but LLVM offers 2:

Written from scratch backends
Jit backend

Assuming we target a couple specific platforms that shares a lot of similarity in the programming model (for example X86, ARM, Cuda, ROCm), this is also possible. Furthermore as you said, using Nim offers a tremendous metaprogramming advantage as evidenced by the very clean way i could implement my own JIT assembler in Laser:

https://github.com/numforge/laser/blob/56df4643530ada0a01d29116feb626d93ac11379/laser/photon_jit/x86_64/x86_64_ops.nim

op_generator:
  op MOV: # MOV(dst, src) load/copy src into destination
    ## Copy 64-bit register content to another register
    [dst64, src64]: [rex(w=1), 0x89, modrm(Direct, reg = src64, rm = dst64)]
    ## Copy 32-bit register content to another register
    [dst32, src32]: [          0x89, modrm(Direct, reg = src32, rm = dst32)]
    ## Copy 16-bit register content to another register
    [dst16, src16]: [    0x66, 0x89, modrm(Direct, reg = src16, rm = dst16)]

    ## Copy  8-bit register content to another register
    [dst8,  src8]:  [          0x88, modrm(Direct, reg = src8, rm = dst8)]

    ## Copy 64-bit immediate value into register
    [dst64, imm64]: [rex(w=1), 0xB8 + dst64] & imm64
    ## Copy 32-bit immediate value into register
    [dst64, imm32]: [          0xB8 + dst64] & imm32
    ## Copy 16-bit immediate value into register
    [dst64, imm16]: [    0x66, 0xB8 + dst64] & imm16

    ## Copy 32-bit immediate value into register
    [dst32, imm32]: [          0xB8 + dst32] & imm32
    ## Copy 16-bit immediate value into register
    [dst32, imm16]: [    0x66, 0xB8 + dst32] & imm16

    ## Copy 16-bit immediate value into register
    [dst16, imm16]: [    0x66, 0xB8 + dst16] & imm16
    ## Copy  8-bit immediate value into register
    [dst8,  imm8]:  [          0xB0 + dst8, imm8]

  op LEA:
    ## Load effective address of the target label into a register
    [dst64, label]: [rex(w=1), 0x8D, modrm(Direct, reg = dst64, rm = rbp)]

  op CMP:
    ## Compare 32-bit immediate with 32-bit int at memory location stored in adr register
    [adr, imm64]: [ rex(w=1), 0x81, modrm(Indirect, opcode_ext = 7, rm = adr[0])] & imm64
    ## Compare 32-bit immediate with 32-bit int at memory location stored in adr register
    [adr, imm32]: [           0x81, modrm(Indirect, opcode_ext = 7, rm = adr[0])] & imm32
    ## Compare 16-bit immediate with 16-bit int at memory location stored in adr register
    [adr, imm16]: [     0x66, 0x81, modrm(Indirect, opcode_ext = 7, rm = adr[0])] & imm16
    ## Compare 8-bit immediate with byte at memory location stored in adr register
    [adr, imm8]:  [           0x80, modrm(Indirect, opcode_ext = 7, rm = adr[0]), imm8]

  op JZ:
    ## Jump to label if zero flag is set
    [label]: [0x0F, 0x84]
  op JNZ:
    ## Jump to label if zero flag is not set
    [label]: [0x0F, 0x85]

JIT is the approach taken for x86-only libs:

There are significant challenges though:

  • Making that robust (need fuzzing)
  • Caching (Intel and AMD are still not caching their JIT-ted OpenCL kernels for reference)
  • Thread-safe
  • Register allocations
  • Optimizing
AOT backend

We can also do ahead of time compilation with runtime dispatch depending on the target. This is the approach I take with Laser GEMM with a version for SSE, SSE2, SSE4.1, AVX, AVX+FMA, AVX2, AVX512, ... and we can add Cuda, OpenCL, etc.

@timotheecour
Copy link
Contributor Author

timotheecour commented Feb 1, 2019

note that we can generate LLVM IR from Nim (or a DSL if we have a macro DSL=>nim); see this demo https://github.com/timotheecour/vitanim/blob/master/testcases/tests/t0127.nim shows how we can generate LLVM IR from a DSL

  • the DSL could be either pure nim:
DSLtoLLVMIR:
  proc entryPoint() {.exportc.} =
    for i in 0..<10:
      echo i*i
  • or something custom for image processing, as you were referring to, eg
DSLtoLLVMIR: # front-end that converts DSL to nim is needed here
  forEach x in a, y in b, z in c:
    z = x + y

@mratsim
Copy link
Owner

mratsim commented Feb 1, 2019

I find the need to go through a text file quite clunky and slow in the long term.

Since LLVM has a bit code representation of its IR I was thinking of targeting that directly using bindings akin to LLVM-D.

This is also what Halide is doing..

From an interface point of view, I probably can re-use the JIT Assembler for x86-64 I created here except that instead of assembling x86 I'm assembling LLVM IR. (Usage).

It can be restricted to the subset valid for Nvidia targets.

Alternatively, it seems like we may be able to directly JIT an AST representation without having to go through LLVM IR phase.

@mratsim
Copy link
Owner

mratsim commented Feb 6, 2019

Some advances in my research (will find a place to store that in the future, probably in Laser)

Code generation

  • Kai Nacke, maintainer of LDC (LLVM D Compiler), did a very nice talk about what he learned by implementing his own compiler from scratch at FOSDEM 2019. Slides.

Alternative HPC compiler

Semantic pass and the expression problem

Halide and PyTorch Glow are using the visitor pattern which looks to me like a workaround for a not expressive enough language.

We need to be able to extend functions without touching the core ones and if possible data types as well.

For example we could have the functions Add and Mul defined on Tensor and CudaTensor today and a user library implement Div and also extend Add and Mul on VulkanTensor.

Data types are probably not that critical, i.e. we could use ADTs + Generics for Tensor[float32], VulkanTensor[float16], ...

In research, there are several approaches to handle that:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants