Micro Benchmark #425

Closed
develooper1994 opened this issue Mar 29, 2020 · 4 comments

@develooper1994

I found micro benchmarks at https://mratsim.github.io/Arraymancer/uth.speed.html, but they look old. To be objective, can you rerun them for a long time (30 min each)?

@mratsim
Owner

mratsim commented Apr 2, 2020

The latest ones are at:

It seems I updated the benchmark results in the Python file but forgot to do the same in the Julia/Nim ones. Here are the results:

#########
# Results on i9-9980XE
# Skylake-X overclocked 4.1GHz all-core turbo,
# AVX2 4.0GHz all-core, AVX-512 3.5GHz all-core
# Input 1500x1500 random large int64 matrix

# Nim 1.0.4. Compilation option: "-d:danger -d:openmp"
# Julia v1.3.1
# Python 3.8.1 + Numpy-MKL 1.18.0

# Nim: 0.14s, 22.7Mb
# Julia: 1.67s, 246.5Mb
# Python Numpy: 5.69s, 75.9Mb
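
For reference, a minimal sketch of how the NumPy side of such a measurement might look (size and dtype follow the header above; the actual benchmark scripts in the repo are the authoritative version):

import time
import numpy as np

n = 1500
rng = np.random.default_rng(0)
a = rng.integers(0, 100, size=(n, n), dtype=np.int64)
b = rng.integers(0, 100, size=(n, n), dtype=np.int64)

start = time.perf_counter()
c = a @ b  # int64 matmul: NumPy has no BLAS path for integers
print(f"int64 matmul: {time.perf_counter() - start:.2f}s")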

Note that those benchmarks are on integer matrix multiplication. On floating point, Julia, Nim, and Python have mostly the same speed, as they defer to BLAS, which is pure assembly (though I have a pure Nim floating-point matrix multiplication that is as fast as OpenBLAS at https://github.com/numforge/laser/tree/master/laser/primitives/matrix_multiplication).

Regarding benchmarks, 30 min is not needed.

At a minimum, you need to either disable turbo and frequency scaling to ensure your CPU is at max performance, or otherwise run a warmup workload for long enough so that the CPU switches to high performance before starting the benchmark.
I'm doing the second one.

Then run the benchmarks multiple times to get an idea of the mean + standard deviation. I'm not reporting those here, but the spread is small and the scripts are available for everyone to dive in and investigate.
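
As a minimal sketch of that methodology (warmup first, then a few timed repetitions reporting mean and standard deviation; the helper name and repeat count are just placeholders):

import time
import statistics

def bench(workload, repeats=5):
    workload()  # warmup: let the CPU ramp up to high performance before timing
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)

# e.g. mean, stdev = bench(lambda: a @ b), reusing the matrices from the sketch above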

For those benchmarks, the difference is so big that the measurement inaccuracy is an acceptable epsilon when you want to compare implementations.

Now, if we really wanted to be accurate, we would need to kill all other processes, run a workload with a known number of cycles, measure how many cycles it actually takes with RDTSC, and abort if there is too much deviation, but that's quite a lot of work.
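
A rough sketch of that last idea, with time.perf_counter_ns() standing in for an actual RDTSC cycle count:

import time

def noise_check(fixed_workload, tolerance=0.05, repeats=5):
    # Run a workload of known cost several times; if the run-to-run spread
    # exceeds the tolerance, the machine is too noisy to benchmark reliably.
    samples = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        fixed_workload()
        samples.append(time.perf_counter_ns() - start)
    spread = (max(samples) - min(samples)) / min(samples)
    if spread > tolerance:
        raise RuntimeError(f"machine too noisy: {spread:.1%} run-to-run deviation")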

@bluenote10
Contributor

[...] or otherwise run a warmup workload for long enough so that the CPU switches to high performance before starting the benchmark. I'm doing the second one.

Interesting, can you actually tell when that is the case, or do you simply know how long it takes on your system (if it is deterministic)?

@develooper1994
Author

Interesting results. You should add them to the main Arraymancer website.
Do you need support? I am looking for a portfolio project. I am still new to GPUs. I think Nim needs good GPU support.
There is also a project on my mind: an OpenGL, Vulkan, or DirectX deep learning runtime for high-framerate applications. All processing should be done on the device (GPU), without device communication, since device communication has a huge impact on performance.
Also, there are lots of infrastructure libraries, but learning should continue while the application is running.

There is no pipeline matching this description.

@mratsim
Owner

mratsim commented Apr 19, 2020

[...] or otherwise run a warmup workload for long enough so that the CPU switches to high performance before starting the benchmark. I'm doing the second one.

Interesting, can you actually tell when that is the case, or do you simply know how long it takes on your system (if it is deterministic)?

A CPU at 3GHz can process 3 basic instructions (1 cycle/instr) per nanosecond. 1 ms is a million times that, which is empirically long enough for a CPU to ramp up to max performance.

Interesting results. You should add them to the main Arraymancer website.
Do you need support? I am looking for a portfolio project. I am still new to GPUs. I think Nim needs good GPU support.
There is also a project on my mind: an OpenGL, Vulkan, or DirectX deep learning runtime for high-framerate applications. All processing should be done on the device (GPU), without device communication, since device communication has a huge impact on performance.
Also, there are lots of infrastructure libraries, but learning should continue while the application is running.

There is no pipeline matching this description.

I've updated the under-the-hood speed article at https://mratsim.github.io/Arraymancer/uth.speed.html

Ultimately, unless people are working with int values larger than 2^53 (about 9×10^15), which cannot be exactly represented by float64, it is always faster to convert to float64, and Arraymancer, Julia, and Python would have the same speed in that case (using the underlying BLAS).
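
As a rough NumPy illustration (with values small enough for float64 to represent the products exactly), the conversion is cheap compared with the gain from hitting BLAS:

import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 100, size=(1500, 1500), dtype=np.int64)
b = rng.integers(0, 100, size=(1500, 1500), dtype=np.int64)

c_int = a @ b                                          # slow: no BLAS path for int64
c_f64 = a.astype(np.float64) @ b.astype(np.float64)    # fast: dispatched to BLAS
assert np.array_equal(c_int, c_f64.astype(np.int64))   # exact, since values are small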

Regarding OpenGL/Vulkan/DirectX, unfortunately they don't provide the proper primitives for deep learning. You can hack around it by using textures to store your tensors, but it's quite impractical to develop on them, including Vulkan Compute.

For now the best way forward for that would be to have a backend that produces LLVM IR in a JIT manner similar to Halide, and have LLVM produce CUDA/OpenGL/DirectX/Vulkan directly (see discussion). This is something that I started working on in the Lux compiler as a Laser subproject: https://github.com/numforge/laser/tree/master/laser/lux_compiler, though I don't have time to dedicate to it at the moment.

Your last point requires using async with Arraymancer, with scheduled retraining and model swapping, or alternatively using a deep learning model that supports "online learning", like the reinforcement learning models.
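
Purely as a generic illustration of the scheduled-retraining / model-swapping idea (this is not an Arraymancer API, just a sketch of the pattern): the serving loop keeps using the current model while a background job trains a replacement and swaps it in.

import threading

class ModelServer:
    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def predict(self, x):
        with self._lock:          # serve with whatever model is currently installed
            model = self._model
        return model(x)

    def swap(self, new_model):
        with self._lock:          # the retraining job installs the new model atomically
            self._model = new_model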
