
Using AcuteBenchmark system #5

Open
aminya opened this issue Jan 1, 2020 · 7 comments

aminya commented Jan 1, 2020

I created a package for benchmarking functions called AcuteBenchmark.
https://github.com/aminya/AcuteBenchmark.jl

I made it particularly for IntelVectorMath (VML).

If you want, we can switch the benchmarking system to AcuteBenchmark. It is very easy to use: it automatically generates random vectors based on the given limits and sizes, then benchmarks the functions and plots the results.

[Figure: IntelVectorMath performance comparison]

Other than the AcuteBenchmark docs, there is a fully working example available here: https://github.com/JuliaMath/VML.jl/blob/AcuteBenchmark/benchmark/benchmark.jl

chriselrod (Member) commented:

It would be nice to clean up the code and set up automated performance testing so I don't accidentally cause regressions.

Any chance you can support a way of defining GFLOPS as a function of size?

I'd like to be able to present performance in terms of GFLOPS (billions of floating-point operations per second).
Primarily, that has the advantage of making comparisons across sizes clearer.

It can also give you a rough idea of how well the CPU is being utilized. For example, suppose a CPU with AVX-512 runs at 3.6 GHz while performing AVX-512 operations (configurable in the BIOS):

julia> GHz = 3.6;

julia> fma_per_clock = 2;

julia> dflop_per_fma = 16;

julia> GHz * fma_per_clock * dflop_per_fma
115.2

which tells you that anything short of 115.2 GFLOPS is under-utilizing the CPU, and the question becomes just how far short the code is falling.
Aside from being (IMO) more informative across sizes, it can also be more informative across functions.
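To make that concrete, here is a minimal sketch of turning a measured runtime into an achieved-GFLOPS number that can be compared against the 115.2 peak above. This is not AcuteBenchmark's API; `measure_gflops` is an illustrative name, and the key observation is just that flops divided by nanoseconds is Gflop/s (both carry a factor of 1e9):

```julia
using LinearAlgebra

# Sketch: flops / nanoseconds == Gflop/s, since both carry a factor of 1e9.
function measure_gflops(f, flop_count; evals = 1_000)
    t0 = time_ns()
    for _ in 1:evals
        f()
    end
    t_ns = (time_ns() - t0) / evals   # average nanoseconds per call
    return flop_count / t_ns
end

n = 10_000
x, y = rand(n), rand(n)
# A dot product does one multiply and one add per element: ~2n flops.
println(measure_gflops(() -> dot(x, y), 2n), " GFLOPS")
```

A real harness would use BenchmarkTools for the timing loop; the division by nanoseconds stays the same.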


aminya commented Jan 5, 2020

@chriselrod There is the https://github.com/triscale-innov/GFlops.jl package. Does it do what you want? If so, I can integrate it into AcuteBenchmark.

If not, do you have an example code of what you want?


chriselrod commented Jan 6, 2020

Here is an example, although that function should be called gflop_gemm.
I defined the GFLOPS manually for each benchmark, so doing what I want would just require either passing an anonymous function that computes GFLOPS as a function of size, or a vector of flop counts you can divide by nanoseconds to get GFLOPS.
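That "function of size" interface could look like the following sketch. The names are purely hypothetical, not AcuteBenchmark's actual API: the caller supplies a closure from size to flop count, and the harness divides by the measured nanoseconds, which yields Gflop/s directly:

```julia
# Hypothetical sketch: divide a size-dependent flop count by nanoseconds.
gflops_from_time(flops_of_size, n, t_ns) = flops_of_size(n) / t_ns

gemm_flops = n -> 2n^3   # classic n×n matmul: n^3 multiplies + n^3 adds
gflops_from_time(gemm_flops, 200, 1_000_000)   # 200×200 gemm in 1 ms → 16.0
```

The vector-of-counts variant is the same arithmetic with the closure replaced by a precomputed `flops_of_size(n)` per benchmarked size.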

That said, your idea of using GFlops.jl would be much cooler and more convenient.
It (currently) won't work with @avx, and definitely not with the C or Fortran code. But it should really only be run for one function per size anyway -- the number of floating-point operations shouldn't change between implementations.
Perhaps the routine could be: (1) use GFlops.jl to count the flops of the first function, and then benchmark all functions?


aminya commented Apr 17, 2020

GFlops.jl returns an incorrect result for the broadcasted functions.

What should we do? Can we compute the flop count for a scalar call and then multiply it by the array length?

I did that, for example, in https://github.com/aminya/AcuteBenchmark.jl/blob/a6b5c0a4591513d06af764bd38fd1558c946f797/src/benchmarks.jl#L197
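For reference, that scalar-count workaround can be sketched as follows. This assumes GFlops.jl's `@count_ops` macro and its `GFlops.flop` counter-summing function; check the package's current API before relying on either:

```julia
using GFlops

# Scale the flop count of a single scalar call to the whole array,
# instead of instrumenting the broadcast (which miscounts here).
scaled_flops(scalar_flops, len) = scalar_flops * len

x = rand(1_000)
cnt = @count_ops exp(1.0)                         # ops for one scalar call
total = scaled_flops(GFlops.flop(cnt), length(x))
```

This stays correct as long as the function does the same work per element, which holds for the element-wise IntelVectorMath kernels.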

ffevotte commented:
> Perhaps the routine could be: (1) use GFlops.jl to count the flops of the first function, and then benchmark all functions?

That's actually what @gflops does: it first counts all the ops, then benchmarks using BenchmarkTools. But you're right: in your case it would make much more sense to count the ops for one function (and one data size) and reuse this count for all subsequent benchmarks.
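That count-once-then-reuse routine might look like this sketch. All helper names are illustrative; no real AcuteBenchmark or GFlops API is assumed beyond having obtained a single flop count up front:

```julia
# Sketch: reuse one flop count across every implementation being compared.
function time_ns_per_call(f, x; evals = 1_000)
    t0 = time_ns()
    for _ in 1:evals
        f(x)
    end
    return (time_ns() - t0) / evals
end

benchmark_all(funs, x, flop_count) =
    [(name = string(f), gflops = flop_count / time_ns_per_call(f, x)) for f in funs]

x = rand(1_000)
# Summing 1000 numbers performs 999 additions; both implementations share that count.
results = benchmark_all([sum, v -> foldl(+, v)], x, 999.0)
```

The C and Fortran implementations would slot into the same loop, since the shared flop count sidesteps instrumenting them at all.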

I'm not sure what to do about vector operations. Would you like me to add explicit support for counting SIMD.jl and/or SIMDPirates.jl vector operations?

Until now, I did not do anything, mostly for lack of use cases that would guide the design of the API. But I'm open to any suggestion if you know what kind of features you'd like to see in GFlops' API.

ffevotte commented:
> GFlops.jl returns an incorrect result for the broadcasted functions.
>
> What should we do? Can we calculate the gflops for a scalar call and then multiply it by the dimension?

Hopefully this should be fixed as soon as Cassette releases a new version (see JuliaLabs/Cassette.jl#171).

In the meantime, you could either use Julia 1.3 or locally Pkg.dev the current master of Cassette. In either case, this would allow you to develop new features in AcuteBenchmark without hitting this issue. (But you won't be able to release the feature until Cassette releases a new patch and GFlops bumps its compatibility requirements.)

chriselrod (Member) commented:

> That's actually what @gflops does: it first counts all ops, then benchmarks using BenchmarkTools. But you're right, in your case it would make much more sense to count ops for one function (and one data size), and reuse this count for all subsequent benchmarks.

Okay, that sounds great.
This would also be required for the C and Fortran benchmarks.

> I'm not sure what to do about vector operations. Would you like me to add explicit support for counting SIMD.jl and/or SIMDPirates.jl vector operations?

I would personally find SIMDPirates and LoopVectorization support very convenient!
How do you handle special functions? I'm not sure what the standard practice is with respect to them, or what meaning "floating point operations" has there.
