Micro Benchmark #425
Comments
The latest ones are at:
And it seems like I updated the benchmark result in the Python file but forgot to do the same in Julia/Nim. Here are the results:
Note that those benchmarks are on integer matrix multiplication. On floating point, Julia, Nim and Python have mostly the same speed, as they all defer to BLAS, which is pure assembly (though I have a pure-Nim floating point matrix multiplication that is as fast as OpenBLAS at https://github.com/numforge/laser/tree/master/laser/primitives/matrix_multiplication).

Regarding benchmarks, 30 min is not needed. At the minimum you need to either disable turbo and frequency scaling to ensure your CPU is at max performance, or otherwise run a warmup workload for long enough that the CPU switches to high performance before starting the benchmark. Then run the benchmarks multiple times to get some idea of the mean + standard deviation. I'm not reporting them, but the deviation is small and the scripts are available for everyone to dive in and investigate. For those benchmarks, the difference is so big that benchmark accuracy is an acceptable epsilon when you want to compare implementations.

Now if we really wanted to be accurate we would need to kill all other processes, run a workload with a known number of cycles, measure how many cycles it actually takes with RDTSC, and abort if there is too much deviation, but that's quite a lot of work.
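The warmup-then-repeat methodology described above can be sketched in a few lines of Python. This is a minimal illustration, not the scripts from the repo: `workload`, the 0.5 s warmup and the repeat count are placeholder assumptions.

```python
import statistics
import time

def workload():
    # Placeholder kernel: stands in for the integer matmul being benchmarked.
    return sum(i * i for i in range(10_000))

def benchmark(fn, warmup_s=0.5, repeats=10):
    # Warm up: keep the CPU busy long enough to leave low-power states
    # before any measurement is taken.
    deadline = time.perf_counter() + warmup_s
    while time.perf_counter() < deadline:
        fn()

    # Measure several runs and report mean + standard deviation.
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)

mean, stdev = benchmark(workload)
print(f"mean: {mean * 1e3:.3f} ms, stddev: {stdev * 1e3:.3f} ms")
```

Disabling turbo/frequency scaling (e.g. via the OS power governor) makes the warmup step less critical, but the repeated runs are still needed to quantify noise.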
Interesting, can you actually tell when that is the case, or do you simply know how long it takes on your system (if it is deterministic)?
Interesting results. You should add them to the main Arraymancer website; there is no description of a pipeline like this there.
A CPU at 3GHz can process 3 basic instructions (at 1 cycle/instruction) per nanosecond. One millisecond is a million times that, and is empirically long enough for a CPU to ramp up to max performance.
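The arithmetic behind that claim, as a quick sanity check (the 3 GHz clock and 1 cycle/instruction figures are the assumptions stated in the comment, not measured values):

```python
cpu_hz = 3_000_000_000       # assumed 3 GHz clock
instructions_per_cycle = 1   # assumed 1 cycle per basic instruction

# Instructions retired per nanosecond at this clock.
per_ns = cpu_hz * instructions_per_cycle / 1e9

# Cycles available in a 1 ms warmup window: a million times the per-ns budget.
cycles_per_ms = int(cpu_hz * 1e-3)

print(per_ns)         # 3.0
print(cycles_per_ms)  # 3000000
```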
I've updated the "under the hood" speed article at https://mratsim.github.io/Arraymancer/uth.speed.html.

Ultimately, unless people are working with int values larger than 2^53 (about 9 × 10^15), which cannot be exactly represented by float64, it is always faster to convert to float64, and Arraymancer, Julia and Python would have the same speed in that case (using the underlying BLAS).

Regarding OpenGL/Vulkan/DirectX, unfortunately they don't provide the proper primitives to do deep learning. You can hack around this by using textures to store your tensors, but it's quite impractical to develop on them, including Vulkan Compute. For now the best way forward would be a backend that produces LLVM IR in a JIT manner similar to Halide, and have LLVM produce Cuda/OpenGL/DirectX/Vulkan directly (see discussion). This is something I started working on in the Lux compiler, a Laser subproject: https://github.com/numforge/laser/tree/master/laser/lux_compiler, though I don't have time to dedicate to it at the moment.

Your last comment would require using async with Arraymancer, with scheduled retraining and model swapping, or alternatively using a deep learning model that supports "online learning", like the reinforcement learning models.
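The int-to-float64 caveat can be checked directly with nothing but the standard library: float64 has a 53-bit significand, so every integer up to 2^53 is exactly representable, and exactness is first lost at 2^53 + 1.

```python
# Largest power of two up to which float64 represents every integer exactly.
LIMIT = 2 ** 53

# At or below the limit, int -> float64 -> int round-trips exactly.
assert int(float(LIMIT - 1)) == LIMIT - 1
assert int(float(LIMIT)) == LIMIT

# One past the limit, float64 rounds back down: the round-trip is lossy.
assert int(float(LIMIT + 1)) == LIMIT

print("float64 is exact for integers up to", LIMIT)
```

So for typical integer tensors (even full-range int32), converting to float64 to reach BLAS loses nothing.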
I found micro benchmarks at https://mratsim.github.io/Arraymancer/uth.speed.html but they look old. To be objective, can you repeat them over a long period (30 min for each)?