2023-12-31 - Longstanding missing features #616
Arraymancer has become a key piece of the Nim ecosystem. Unfortunately I do not have the time to develop it further, for several reasons.
Furthermore, Nim v2 has since introduced interesting new features, like builtin memory management that works with multithreading, or views, that are quite relevant to Arraymancer.
Let's go over the longstanding missing features to improve Arraymancer; we'll go over the tensor library and then the neural network library.
Tensor backend (~NumPy, ~SciPy)
Also: the need for untyped tensors.
Neural network backend (~PyTorch)
Tensor backend - 1. Mutable operations on slices
References:
Currently slicing returns immutable tensors, and you need an extra var assignment to mutate the original tensor, which is quite burdensome. I don't remember exactly what the errors were; IIRC tests gave wrong results and it was painful to debug.
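To make the ergonomic gap concrete, here is a minimal, self-contained sketch (not Arraymancer code; all names are illustrative) of what a mutable slice could look like: a view that borrows the parent's storage, so writes through the slice update the original data without a copy-back assignment.

```nim
type
  MutView[T] = object
    data: ptr UncheckedArray[T]   # borrowed storage of the parent container
    offset, len, stride: int

proc view[T](s: var seq[T]; offset, len, stride: int): MutView[T] =
  ## Create a strided window over `s` without copying it.
  MutView[T](data: cast[ptr UncheckedArray[T]](addr s[0]),
             offset: offset, len: len, stride: stride)

proc `[]=`[T](v: MutView[T]; i: int; val: T) =
  ## Writing through the view mutates the parent storage directly.
  v.data[v.offset + i * v.stride] = val

var storage = @[0, 0, 0, 0, 0, 0]
let evens = storage.view(0, 3, 2)   # a "slice" covering indices 0, 2, 4
for i in 0 ..< evens.len:
  evens[i] = 42                     # no extra var assignment, no write-back
echo storage                        # @[42, 0, 42, 0, 42, 0]
```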
Tensor backend - 2. Nested parallelism
We currently cannot benefit from nested parallelism in Arraymancer: a parallel function calling a parallel function causes oversubscription due to OpenMP limitations, for example the Frobenius inner product (Arraymancer/src/arraymancer/tensor/lapack.nim, lines 19 to 22 in f809e7f).
This also makes reduction operations like sigmoid or softmax slower the more cores you have.
The solution is switching to Weave (or Constantine's threadpool) as a backend. Related:
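For illustration, here is the kind of composable task parallelism Weave enables (adapted from Weave's README-style spawn/sync API; treat the exact calls as an assumption): every level of recursion spawns tasks onto the same fixed worker pool, so a parallel routine can call another parallel routine without creating extra threads, unlike nested OpenMP parallel regions.

```nim
import weave

proc fib(n: int): int =
  ## Nested parallelism: each level spawns more tasks, and the runtime
  ## load-balances them over a fixed pool of workers (no oversubscription).
  if n < 2:
    return n
  let x = spawn fib(n - 1)   # child task, possibly stolen by another worker
  let y = fib(n - 2)
  result = sync(x) + y       # await the child task's result

init(Weave)
echo fib(30)
exit(Weave)
```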
Tensor backend - 3. Doc generation
Doc generation is a pain point, see the wishlist in #488 (comment) and haxscramper/haxdoc#1, though having the docs auto-generated in CI is awesome: #556
Tensor backend - 4. Versioning / releases
Reserved
Tensor backend - 5. Slow transcendental functions (exp, log, sin, cos, tan, tanh, ...)
Background: #265 (comment)
There is a tradeoff between accuracy, which is very much needed for some applications (e.g. weather forecasting, where errors compound exponentially), and speed, which is enough for others, typically deep learning, where quantizing to 4-bit is actually fine. See the 8-way accuracy/speed comparison for AVX2 implementations of exponentiation: https://github.com/mratsim/laser/blob/master/benchmarks/vector_math/bench_exp_avx2.nim#L282-L470
Either this, nested parallelism, or both are a significant bottleneck for the Shakespeare char-rnn demo.
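As a concrete illustration of that tradeoff, here is a minimal sketch of a Schraudolph-style exponential: it trades several percent of relative error for a handful of integer operations by writing the IEEE-754 bit pattern directly. The constants below are the commonly published single-precision ones and should be treated as an assumption; real SIMD implementations such as the laser benchmarks add polynomial terms to regain accuracy.

```nim
import std/math

func fastExp(x: float32): float32 =
  ## Schraudolph-style approximation: scale x into the exponent field and
  ## reinterpret the integer bits as a float. Cheap, but only a few % accurate.
  let bits = int32(12102203.0'f32 * x + 1064866805.0'f32)  # ~2^23/ln(2) and a biased offset
  result = cast[float32](bits)

for x in [-2.0'f32, -0.5'f32, 0.0'f32, 0.5'f32, 2.0'f32]:
  echo x, "  exp: ", exp(x), "  fastExp: ", fastExp(x)
```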
Tensor backend - 6, 7, 8. OS-specific woes
On Windows, BLAS and LAPACK are a pain to deploy. Even in CI I did not figure out how to install LAPACK (Arraymancer/.github/workflows/ci.yml, lines 26 to 30 in f809e7f). See similar woes here: https://forum.nim-lang.org/t/10812
On macOS, Apple Clang doesn't support OpenMP and requires installing LLVM via Homebrew.
The solution to both would be to implement a pure Nim BLAS backed by a pure Nim multithreading runtime like Weave.
Lastly, on macOS, which has unified memory, it is essentially free to use the Tensor cores for matrix multiplication.
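To give an idea of what a pure Nim BLAS starts from, here is a toy sketch of a loop-blocked matrix multiplication kernel (names and the block size are illustrative; a real kernel adds register tiling, SIMD, packing and a threaded outer loop, e.g. on top of Weave, as laser does).

```nim
const BlockSize = 64   # tile so the working set stays in cache

proc gemmBlocked(a, b: seq[float32]; m, k, n: int): seq[float32] =
  ## C[m, n] = A[m, k] * B[k, n], row-major, naive loop-blocked version.
  result = newSeq[float32](m * n)
  for ii in countup(0, m - 1, BlockSize):
    for kk in countup(0, k - 1, BlockSize):
      for jj in countup(0, n - 1, BlockSize):
        for i in ii ..< min(ii + BlockSize, m):
          for p in kk ..< min(kk + BlockSize, k):
            let aip = a[i * k + p]
            for j in jj ..< min(jj + BlockSize, n):
              result[i * n + j] += aip * b[p * n + j]

# Tiny smoke test: identity * B = B.
let a = @[1.0'f32, 0.0'f32, 0.0'f32, 1.0'f32]
let b = @[1.0'f32, 2.0'f32, 3.0'f32, 4.0'f32]
echo gemmBlocked(a, b, 2, 2, 2)   # @[1.0, 2.0, 3.0, 4.0]
```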
The need for untyped tensors
When I started Arraymancer, I was at first really excited about the possibility of encoding all size information in the type system, to have a robust library where every mistake is a compile-time error. I quickly backpedaled due to ergonomic issues; see this May 2017 discussion (time flies): andreaferretti/linear-algebra#5 (comment)
Other problems surfaced later which confirmed that not putting dimensions in the type system was the better choice ergonomics-wise. But putting the element type in the type system is also problematic ergonomically:
The scientific computing world doesn't work with static types in serialization: when deserializing CSV, JSON, ONNX or TFRecords, you are expected to read the format from the file. See also https://forum.nim-lang.org/t/10223#67808
Now, it's probable that libraries depend on Arraymancer's support for static typing. So a way forward to allow ergonomic serialization/deserialization (#163, #417) from common scientific formats would be starting a new library, with a restricted set of types in mind (all the types used in NumPy / PyTorch). The internal Arraymancer typed primitives can be reused, however.
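As a sketch of what the "untyped tensor" front-end could look like (illustrative names only, not an existing Arraymancer or new-library API), one option is a runtime dtype tag plus an object variant wrapping the existing typed tensors, so a CSV/JSON/ONNX reader can pick the element type from the file at runtime:

```nim
import arraymancer
import std/[sequtils, strutils]

type
  DType = enum
    dtFloat32, dtFloat64, dtInt64

  AnyTensor = object
    ## Type-erased wrapper around the existing typed tensors.
    case dtype: DType
    of dtFloat32: f32: Tensor[float32]
    of dtFloat64: f64: Tensor[float64]
    of dtInt64:   i64: Tensor[int64]

proc loadColumn(raw: seq[string]; dtype: DType): AnyTensor =
  ## A deserializer decides the dtype at runtime, e.g. from file metadata.
  case dtype
  of dtFloat32:
    result = AnyTensor(dtype: dtFloat32,
                       f32: raw.mapIt(it.parseFloat.float32).toTensor)
  of dtFloat64:
    result = AnyTensor(dtype: dtFloat64,
                       f64: raw.mapIt(it.parseFloat).toTensor)
  of dtInt64:
    result = AnyTensor(dtype: dtInt64,
                       i64: raw.mapIt(it.parseBiggestInt.int64).toTensor)

let col = loadColumn(@["1.5", "2.5", "3.5"], dtFloat64)
case col.dtype
of dtFloat64: echo col.f64.mean()   # 2.5
else: discard
```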
Reserved
Neural network backend - 1. Nvidia CUDA
CUDA support has been broken since Nim v1.2. The culprit is nim-lang/Nim#16936. There are 3 ways to fix this:
Neural network backend - 2. Implementation woes: CPU forward, CPU backward, GPU forward, GPU backward, all optimized
Implementing neural network layers is extremely time-consuming; you need:
This requires several kinds of expertise in a team: linear algebra and high-performance computing, and both require a significant investment of time to achieve. The only way to compete is by creating a compiler: only the forward pass would need to be implemented, and the rest is automatically derived. This led me to start experimenting with a deep learning compiler called Lux, in Laser:
Then I identified that we need a multithreading runtime to power this compiler, which led to https://github.com/mratsim/weave/. Then I got pulled into other projects. Fortunately, there is another project with the same idea, https://github.com/can-lehmann/exprgrad, which went as far as generating OpenCL code.
Neural network backend - 3. Ergonomic serialization and deserialization of models
This is a very important problem at the moment. References:
The issue is that Model / Layers are statically typed, which means you need the end-user to properly type everything. The solution is to type-erase all layers and the model, via inherited ref objects for example.
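A minimal sketch of that type-erasure direction (illustrative API, not the current one): a common ref base object for layers, so a model can hold heterogeneous layers in one seq and a (de)serializer can walk them without the user spelling out every static layer type.

```nim
import arraymancer

type
  LayerBase = ref object of RootObj
    name: string

  Dense = ref object of LayerBase
    weight, bias: Tensor[float32]

  Model = object
    layers: seq[LayerBase]   # heterogeneous, type-erased storage

method numParams(l: LayerBase): int {.base.} = 0
method numParams(l: Dense): int =
  ## Dynamic dispatch recovers the concrete layer type.
  l.weight.size + l.bias.size

var model = Model()
model.layers.add Dense(name: "fc1",
                       weight: zeros[float32](784, 128),
                       bias: zeros[float32](128))

for layer in model.layers:
  echo layer.name, ": ", layer.numParams(), " parameters"
```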
Neural network backend - 4. Slowness of reduction operations like sigmoid or softmax the more cores you have
Another key problem which causes scaling issues. Reduction operations are used everywhere in deep learning, since any loss function is a reduction. See:
Weave and Constantine's threadpool take this slowness into account, see the benches:
In summary, we need to allow nested parallelism and use loop tiling / loop blocking as well.
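For illustration, a sketch of the shape such reductions need to take (using std/threadpool for brevity; Weave's parallel-for reductions do this with proper load balancing): split the input into blocks, reduce each block into a private partial sum in its own task, then combine the partials, so cores never contend on a single accumulator.

```nim
import std/[math, sequtils, threadpool]

proc blockSumExp(data: seq[float]; lo, hi: int): float =
  ## Per-task partial reduction over one block (this inner loop is also
  ## where vectorization would go).
  for i in lo ..< hi:
    result += exp(data[i])

proc sumExp(data: seq[float]; blockSize = 16_384): float =
  ## Blocked parallel reduction: one task per block, then a serial combine.
  ## (A real implementation would pass a view instead of copying `data` per task.)
  var partials: seq[FlowVar[float]]
  var lo = 0
  while lo < data.len:
    let hi = min(lo + blockSize, data.len)
    partials.add(spawn blockSumExp(data, lo, hi))
    lo = hi
  for fv in partials:
    result += ^fv

let x = newSeqWith(100_000, 0.0)   # exp(0) = 1 for every element
echo sumExp(x)                     # ~100000.0
sync()
```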
Reserved
Summary
In summary, here is how I see making more progress on Arraymancer, tensor libraries, and deep learning in Nim.
@mratsim Appreciate all the work you've done for the community man, not just for this library. Some time in Q2 I'll be able to return to a hobby project and will look into improving Arraymancer. I wonder if we can trim the need for various backends by leveraging WebGPU. It has first-class support for compute shaders, it's nearly as fast as Vulkan, portable to EVERY device imaginable, and can be run natively (it isn't just for the browser). I think that reduces a lot of the workload. For the CPU backend, 100% we'll want to use something like Weave and a pure Nim BLAS implementation (how hard can that be? :) )
For the tensor part of the library I also think there are a few important features still missing. For example, we don't support most of the common tensor manipulation functions (e.g. add, insert, roll…) and also lack some important signal processing algorithms (FFT, filtering and convolution).
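As an example of one of those missing manipulation functions, here is a sketch of the `roll` semantics (NumPy-style circular shift) on a plain seq; a Tensor version would apply the same index arithmetic along a chosen axis.

```nim
proc roll[T](s: seq[T]; shift: int): seq[T] =
  ## Circularly shift elements by `shift` positions (negative shifts left).
  let n = s.len
  if n == 0: return s
  let k = ((shift mod n) + n) mod n   # normalize into 0 ..< n
  if k == 0: return s
  result = s[n - k .. ^1] & s[0 ..< n - k]

echo roll(@[1, 2, 3, 4, 5], 2)    # @[4, 5, 1, 2, 3]
echo roll(@[1, 2, 3, 4, 5], -1)   # @[2, 3, 4, 5, 1]
```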
Is it possible to use FFTW as a backend for the FFT implementation? I want to contribute to this project, but I am not sure if the DFT algorithm needs to be written from scratch or if an existing library can be used.
FFTW cannot be distributed with Arraymancer due to its GPL license, but it's available as a separate package: https://github.com/SciNim/nimfftw3. What can be distributed with Arraymancer is https://github.com/scinim/impulse
@AngelEzquerra, see https://github.com/scinim/impulse for a low-level implementation.
I didn't know about impulse. I can have a look, but what about @arnetheduck's nim-fftr (https://github.com/arnetheduck/nim-fftr)? It's MIT licensed and according to his benchmarks it's pretty fast. I've played a bit with it (and even made a couple of PRs to it) and it is pretty easy to use...
The gap is large at the moment, and for large FFTs you really need multithreading.
(benchmark listings: fftr vs fftw)
for prime powers, fftr uses Bluestein's algorithm which is quite inefficient, i.e. it's one algorithm implementation away from being on par with fftw (if anything, https://github.com/ejmahler/RustFFT is one of the fastest ones, beating fftw too). Re multithreading, that's something you'd set up outside of the core FFT algorithm, I suspect.
For load-balancing, recursive divide-and-conquer algorithms are best because they ensure that there are plenty of tasks, so no CPU is starved. This is very easy with Cooley-Tukey written as function calls: https://github.com/mratsim/constantine/blob/bc5faaaef8e270b6e4913a704e1132f96bfe7349/research/kzg/fft_fr.nim#L91-L150, i.e. you take the recursive calls in code like the following and replace them with spawned tasks:

```nim
func simpleFT[F](
       output: var View[F],
       vals: View[F],
       rootsOfUnity: View[F]
     ) =
  # FFT is a recursive algorithm
  # This is the base-case using a O(n²) algorithm
  let L = output.len
  var last {.noInit.}, v {.noInit.}: F

  for i in 0 ..< L:
    last.prod(vals[0], rootsOfUnity[0])
    for j in 1 ..< L:
      v.prod(vals[j], rootsOfUnity[(i*j) mod L])
      last += v
    output[i] = last

func fft_internal[F](
       output: var View[F],
       vals: View[F],
       rootsOfUnity: View[F]
     ) =
  if output.len <= 4:
    simpleFT(output, vals, rootsOfUnity)
    return

  # Recursive Divide-and-Conquer
  let (evenVals, oddVals) = vals.splitAlternate()
  var (outLeft, outRight) = output.splitMiddle()
  let halfROI = rootsOfUnity.skipHalf()

  fft_internal(outLeft, evenVals, halfROI)
  fft_internal(outRight, oddVals, halfROI)

  let half = outLeft.len
  var y_times_root {.noInit.}: F

  for i in 0 ..< half:
    # FFT Butterfly
    y_times_root .prod(output[i+half], rootsOfUnity[i])
    output[i+half] .diff(output[i], y_times_root)
    output[i] += y_times_root

func fft*[F](
       desc: FFTDescriptor[F],
       output: var openarray[F],
       vals: openarray[F]): FFT_Status =
  if vals.len > desc.maxWidth:
    return FFTS_TooManyValues
  if not vals.len.uint64.isPowerOf2_vartime():
    return FFTS_SizeNotPowerOfTwo

  let rootz = desc.expandedRootsOfUnity
                  .toView()
                  .slice(0, desc.maxWidth-1, desc.maxWidth div vals.len)

  var voutput = output.toView()
  fft_internal(voutput, vals.toView(), rootz)
  return FFTS_Success
```

If using the for-loop approach like the second pseudocode on Wikipedia, we can do parallel-for loops. But an interface/concept for either spawning tasks or parallelizing loops is needed, and it then needs to be used for effective parallelization of the FFT.
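To make that concrete, here is a self-contained sketch of the divide-and-conquer parallelization on a plain complex-valued radix-2 FFT (std/threadpool stands in for whatever spawn/sync interface the backend would expose; the cutoff below which the recursion stays sequential is an arbitrary illustrative value):

```nim
import std/[complex, math, threadpool]

proc fftRec(x: seq[Complex64]): seq[Complex64] =
  ## Out-of-place recursive Cooley-Tukey FFT, power-of-two length.
  let n = x.len
  if n <= 1: return x

  var evens = newSeq[Complex64](n div 2)
  var odds  = newSeq[Complex64](n div 2)
  for i in 0 ..< n div 2:
    evens[i] = x[2*i]
    odds[i]  = x[2*i + 1]

  # The two halves are independent: spawn one as a task above a cutoff,
  # so the recursion keeps generating work for idle cores.
  var e, o: seq[Complex64]
  if n >= 4096:
    let fe = spawn fftRec(evens)
    o = fftRec(odds)
    e = ^fe
  else:
    e = fftRec(evens)
    o = fftRec(odds)

  result = newSeq[Complex64](n)
  for k in 0 ..< n div 2:
    # FFT butterfly
    let t = exp(complex64(0.0, -2.0 * PI * float(k) / float(n))) * o[k]
    result[k]           = e[k] + t
    result[k + n div 2] = e[k] - t

let signal = @[complex64(1.0, 0.0), complex64(0.0, 0.0),
               complex64(1.0, 0.0), complex64(0.0, 0.0)]
echo fftRec(signal)   # bins 0 and 2 are (2.0, 0.0), the others (0.0, 0.0)
sync()
```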
Do you have similar measurements for impulse?
Not at the moment. It would be interesting to make some.
@mratsim, I tried to install impulse via nimble but it failed. Assuming we used it to implement the FFT in Arraymancer, would you expect to embed it in Arraymancer or to install it as a dependency? If the latter, I assume it would require impulse to be a proper nimble package, right?