Micro Benchmark #425

Closed
develooper1994 opened this issue Mar 29, 2020 · 4 comments

@develooper1994

I found micro benchmarks at https://mratsim.github.io/Arraymancer/uth.speed.html, but they look old. To be objective, can you rerun them for a long time (30 min each)?

@mratsim
Owner

mratsim commented Apr 2, 2020

The latest ones are at:

It seems I updated the benchmark results in the Python file but forgot to do the same in the Julia/Nim ones. Here are the results:

#########
# Results on i9-9980XE
# Skylake-X overclocked 4.1GHz all-core turbo,
# AVX2 4.0GHz all-core, AVX-512 3.5GHz all-core
# Input 1500x1500 random large int64 matrix

# Nim 1.0.4. Compilation option: "-d:danger -d:openmp"
# Julia v1.3.1
# Python 3.8.1 + Numpy-MKL 1.18.0

# Nim: 0.14s, 22.7Mb
# Julia: 1.67s, 246.5Mb
# Python Numpy: 5.69s, 75.9Mb
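
For reference, a minimal sketch of how the NumPy side of such a measurement might look (size and dtype follow the header above; the actual benchmark scripts in the repo are the authoritative version):

import time
import numpy as np

n = 1500
rng = np.random.default_rng(0)
a = rng.integers(0, 100, size=(n, n), dtype=np.int64)
b = rng.integers(0, 100, size=(n, n), dtype=np.int64)

start = time.perf_counter()
c = a @ b  # int64 matmul: NumPy has no BLAS path for integers
print(f"int64 matmul: {time.perf_counter() - start:.2f}s")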

Note that those benchmarks are on integer matrix multiplication. On floating point, Julia, Nim, and Python have mostly the same speed, as they defer to BLAS, which is pure assembly (though I have a pure Nim floating-point matrix multiplication that is as fast as OpenBLAS at https://github.com/numforge/laser/tree/master/laser/primitives/matrix_multiplication).

Regarding benchmarks, 30 min is not needed.

At a minimum, you need to either disable turbo and frequency scaling to ensure your CPU is at max performance, or otherwise run a warmup workload for long enough so that the CPU switches to high performance before starting the benchmark.
I'm doing the second one.

Then run the benchmarks multiple times to get an idea of the mean + standard deviation. I'm not reporting those here, but the spread is small and the scripts are available for everyone to dive in and investigate.
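
As a minimal sketch of that methodology (warmup first, then a few timed repetitions reporting mean and standard deviation; the helper name and repeat count are just placeholders):

import time
import statistics

def bench(workload, repeats=5):
    workload()  # warmup: let the CPU ramp up to high performance before timing
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)

# e.g. mean, stdev = bench(lambda: a @ b), reusing the matrices from the sketch above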

For those benchmarks, the difference is so big that the measurement inaccuracy is an acceptable epsilon when you want to compare implementations.

Now, if we really wanted to be accurate, we would need to kill all other processes, run a workload with a known number of cycles, measure how many cycles it actually takes with RDTSC, and abort if there is too much deviation, but that's quite a lot of work.
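
A rough sketch of that last idea, with time.perf_counter_ns() standing in for an actual RDTSC cycle count:

import time

def noise_check(fixed_workload, tolerance=0.05, repeats=5):
    # Run a workload of known cost several times; if the run-to-run spread
    # exceeds the tolerance, the machine is too noisy to benchmark reliably.
    samples = []
    for _ in range(repeats):
        start = time.perf_counter_ns()
        fixed_workload()
        samples.append(time.perf_counter_ns() - start)
    spread = (max(samples) - min(samples)) / min(samples)
    if spread > tolerance:
        raise RuntimeError(f"machine too noisy: {spread:.1%} run-to-run deviation")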

@bluenote10
Contributor

[...] or otherwise run a warmup workload for long enough so that the CPU switches to high performance before starting the benchmark. I'm doing the second one.

Interesting, can you actually tell when that is the case, or do you simply know how long it takes on your system (if it is deterministic)?

@develooper1994
Author

Interesting results. You should add them to the main Arraymancer website.
Do you need support? I am looking for a portfolio project. I am still new to GPUs. I think Nim needs good GPU support.
There is also a project on my mind: an OpenGL, Vulkan, or DirectX deep learning runtime for high-framerate applications. All processing should be done on the device (GPU), without device communication, since device communication has a huge impact on performance.
Also, there are lots of infrastructure libraries, but learning should continue while the application is running.

There is no pipeline matching this description.

@mratsim
Owner

mratsim commented Apr 19, 2020

[...] or otherwise run a warmup workload for long enough so that the CPU switches to high performance before starting the benchmark. I'm doing the second one.

Interesting, can you actually tell when that is the case, or do you simply know how long it takes on your system (if it is deterministic)?

A CPU at 3GHz can process 3 basic instructions (1 cycle/instr) per nanosecond. 1 ms is a million times that, which is empirically long enough for a CPU to ramp up to max performance.

Interesting results. You should add them to the main Arraymancer website.
Do you need support? I am looking for a portfolio project. I am still new to GPUs. I think Nim needs good GPU support.
There is also a project on my mind: an OpenGL, Vulkan, or DirectX deep learning runtime for high-framerate applications. All processing should be done on the device (GPU), without device communication, since device communication has a huge impact on performance.
Also, there are lots of infrastructure libraries, but learning should continue while the application is running.

There is no pipeline matching this description.

I've updated the under-the-hood speed article at https://mratsim.github.io/Arraymancer/uth.speed.html

Ultimately, unless people are working with int values larger than 2^53 (about 9×10^15), which cannot be exactly represented by float64, it is always faster to convert to float64, and Arraymancer, Julia, and Python would have the same speed in that case (using the underlying BLAS).
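
As a rough NumPy illustration (with values small enough for float64 to represent the products exactly), the conversion is cheap compared with the gain from hitting BLAS:

import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 100, size=(1500, 1500), dtype=np.int64)
b = rng.integers(0, 100, size=(1500, 1500), dtype=np.int64)

c_int = a @ b                                          # slow: no BLAS path for int64
c_f64 = a.astype(np.float64) @ b.astype(np.float64)    # fast: dispatched to BLAS
assert np.array_equal(c_int, c_f64.astype(np.int64))   # exact, since values are small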

Regarding OpenGL/Vulkan/DirectX, unfortunately they don't provide the proper primitives for deep learning. You can hack around it by using textures to store your tensors, but it's quite impractical to develop on them, including Vulkan Compute.

For now the best way forward for that would be to have a backend that produces LLVM IR in a JIT manner similar to Halide, and have LLVM produce CUDA/OpenGL/DirectX/Vulkan directly (see discussion). This is something that I started working on in the Lux compiler as a Laser subproject: https://github.com/numforge/laser/tree/master/laser/lux_compiler, though I don't have time to dedicate to it at the moment.

Your last point requires using async with Arraymancer, with scheduled retraining and model swapping, or alternatively using a deep learning model that supports "online learning", like the reinforcement learning models.
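
Purely as a generic illustration of the scheduled-retraining / model-swapping idea (this is not an Arraymancer API, just a sketch of the pattern): the serving loop keeps using the current model while a background job trains a replacement and swaps it in.

import threading

class ModelServer:
    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def predict(self, x):
        with self._lock:          # serve with whatever model is currently installed
            model = self._model
        return model(x)

    def swap(self, new_model):
        with self._lock:          # the retraining job installs the new model atomically
            self._model = new_model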
