GEMM benchmark with CLBlast comparison #9
Hello team,

I read with interest the two papers you published on Futhark and the DNN implementation. I'm currently researching autotuners for deep learning, and Futhark stands out with its introduction of second-order array combinators such as scanomap and redomap.

I'd like to understand how Futhark fares against state-of-the-art matrix multiplication, which should be able to reach 90%+ of hardware utilisation, and which stresses compilers, autotuners, and handwritten kernels alike with its need for cache-aware (or cache-oblivious) algorithms, tiling, and careful management of register pressure. However, I cannot find a benchmark against a reference library like CLBlast.
Futhark does not fare particularly well against CLBlast (at least I'd not expect that it does). This is based on only a little data: for the PPoPP'19 experiments we implemented matrix multiplication in Futhark and with cuBLAS. As I recall, cuBLAS was at least two times faster in its comfort zone. We pretty much know why this is: Futhark only does block tiling in local memory, while cuBLAS (probably) also does register tiling. We verified this by manually applying register tiling to Futhark's generated code, which brought performance fairly close to cuBLAS. We roughly know how to implement register tiling, but it's a fair amount of work, so we haven't gotten around to it yet. It's unlikely (and not really intended) that Futhark will ever outperform expertly hand-tuned code on a primitive-for-primitive level.
You can beat hand-tuned code, or at least reach 90% of its speed, with a high-level library:

Even on CPU, I personally managed to reach BLAS speed without any assembly for matrix multiplication:

Obviously that does not compose well, and it also requires the same effort to reproduce in OpenCL or CUDA, so I'm looking into alternative representations that abstract over loops, hence my inquiries into Futhark.
Sure, but all of those still involve machine-specific hand-tuning/scheduling (or search procedures). We don't have any current plans for Futhark to directly support that level of manual optimisation, but maybe if we can figure out a way that does not complicate the language as a whole... The Halide/Tiramisu approaches are definitely inspiring. Currently, our main performance focus is on optimising composition and ensuring good application-level performance. Eventually it'd be fun to also directly compete at the level of individual primitives, but alas, time is finite!