ggml : get rid of BLAS and all it's variants #293

ggerganov · 2023-06-25T09:16:12Z

This is a big one

The only reason we use BLAS is that we don't have efficient implementation of matrix x matrix multiplication. Naively doing parallel dot products is not optimal. We need to implement some of the fundamental GEMM optimizations such as block tiling and we need to implement this in a compact way that reuses the existing dot product code and supports all quantization types

More comments on this:

The text was updated successfully, but these errors were encountered:

philipturner · 2023-06-25T21:49:56Z

supports all quantization types

The best place for dequantization would be after you've loaded data from device -> threadgroup memory, and are loading into registers. I hypothesize it will faster to have each of the four simds in a threadgroup unpack on their own, even if that duplicates the work two times.

For example, you might fork my MFA repository, then modify the line here to unpack after loading.

https://github.com/philipturner/metal-flash-attention/blob/fbfd0a028402c0ae6fa293c2f91de318b95e359b/Sources/GEMM.metal#L129

Dampfinchen · 2023-06-26T09:48:36Z

While I do understand the desire for the project to be independend of other libraries, I personally do not think removing the excellent cuBLAS implementation entirely is a good idea. One of the main benefits of this project is that you can run large models with good speeds on lower end hardware with less VRAM. For example, thanks to the cuBLAS' fast prompt processing of 15 ms/t I can enjoy a 13B parameter model with full context at around 1.6 token/sec, which is much faster than running GPTQ with cpu offloading (0,4 token/s) on my RTX 2060 laptop.

As you've said yourself, a native GEMM implementation would likely be slower than what cuBLAS is offering. Even if the performance difference is not drastic on your hardware, it can make all the difference on hardware like mine, even a slight performance difference (for example 15 ms/t to 23 ms/t) would lead to a worse experience for people who run huge LLMs too big for their systems. Not just for low to mid spec hardware, but for high end hardware trying to run 65B models as well.

Please keep this perspective in mind as you continue to develop the project. I don't think it would be a good outcome for me and many others to have to downgrade to older versions with cuBLAS and not enjoy new enhancements just because older versions run faster due to cuBLAS support.

philipturner · 2023-06-26T12:01:56Z

Why would custom code be slower than cuBLAS? There is Nvidia CUTLASS. The only things it can be are equal or faster.

Dampfinchen · 2023-06-26T12:33:57Z

Why would custom code be slower than cuBLAS? There is Nvidia CUTLASS. The only things it can be are equal or faster.

They aren't going to use CUTLASS though, because it'd be another third party lib.

CuBLAS is highly optimized for the hardware and Georgi said himself that he is aware a custom code is not going to perform as well in this comment:

ggerganov/llama.cpp#1867 (comment)

And I fully understand that it will be close to impossible to achieve the maximum performance available from dedicated libraries (such as cuBLAS, for example).

philipturner · 2023-06-26T12:36:55Z

300 lines of code, (soon to) outperform all of Apple's proprietary Metal Performance Shaders.

https://github.com/philipturner/metal-flash-attention/blob/main/Sources/GEMM.metal

If we take some code from CUTLASS, maybe just optimize for the matrix shapes that exist in LLaMA.

slaren · 2023-06-26T12:40:12Z

A custom GEMM implementation will be faster with quantized models - that's one of the goals. There may be a small performance regression with f16 and f32 models, though.

philipturner · 2023-06-26T12:41:18Z

Could we utilize the int4 hardware support in Ampere tensor cores?

slaren · 2023-06-26T12:45:17Z

From what I have seen of the way int4 works with tensor cores, I don't think so. We cannot do a matrix multiplication directly in int4, we need to dequantize to f16 or f32 first. But we can still use the tensor cores after dequantizing to float. I may be wrong, though.

Dampfinchen · 2023-06-26T12:46:17Z

A custom GEMM implementation will be faster with quantized models - that's one of the goals. There may be a small performance regression with f16 and f32 models, though.

Glad to hear that. I hope that will be the case.

Could we utilize the int4 hardware support in Ampere tensor cores?

Just a heads up: Tensor cores in Turing, Ampere and Ada Lovelace support INT4, INT8 and FP16 instructions. Ampere and Turing support INT1 as well.

While only Ampere and Ada support FP8 and FP32 in addition to that.

So ideally, the code would use FP16 or the integer instructions I mentioned to cover a wide range of hardware with tensor core support.

philipturner · 2023-06-26T13:00:16Z

Didn't they remove INT4 on Ada and Hopper?

Dampfinchen · 2023-06-26T13:03:55Z

Didn't they remove INT4 on Ada and Hopper?

Only on Hopper. INT4 is still present in consumer Ada Lovelace.

https://images.nvidia.com/aem-dam/Solutions/Data-Center/l4/nvidia-ada-gpu-architecture-whitepaper-v2.1.pdf Page 24:

Compared to Ampere, Ada delivers more than double the FP16, BF16, TF32, INT8, and INT4
Tensor TFLOPS, and also includes the Hopper FP8 Transformer Engine, delivering over 1.3
PetaFLOPS of tensor processing in the RTX 4090.

It looks like Ada removed support for INT1 though.

goerch · 2023-06-26T19:57:19Z

I'd tend to agree with @Dampfinchen : being interested in the Intel platform I don't believe we'll be able to outperform the MKL and oneAPI engineers.

JohannesGaessler · 2023-06-26T21:21:50Z

I tried implementing a (dequantization +) matrix matrix multiplication CUDA kernel but I'm struggling to get past 50% of cuBLAS performance for prompt processing. In particular, I found against my expectation that fusing dequantization + matrix matrix multiplication does not have a large impact on performance, possibly because you're limited by compute rather than memory bandwidth for large matrices (I am currently not using tensor cores).

What level of performance/sophistication is the goal for something that could possibly be merged? As of right now my implementation could already be useful for token generation since the VRAM usage will be lower compared to cuBLAS. But for prompt processing it is clearly worse. In general I think leaving cuBLAS as a compilation option would be desirable because given the small impact of tensor fusion I don't think I can realistically beat it for prompt processing performance.

philipturner · 2023-06-26T22:36:30Z

50% of cuBLAS performance for prompt processing.

If you get 50% of cuBLAS performance, then what is cuBLAS performance in ALU utilization? Perhaps both underutilize the processor, leaving much room to improve.

AlphaAtlas · 2023-06-26T23:39:37Z

If y'all progressively get rid of blas libraries, cublas is probably lowest on the totem pole? AFAIK users still need the huge cuda toolkit to run cuda inference anyway, so its hardly even getting rid of a dependency.

CLBLAST and CPU BLAS, on the other hand, can be tricky, but their implementations are open source. I just tried to get the OpenBLAS build working on an Ampere instance for a few hours... and ultimately failed.

philipturner · 2023-06-26T23:41:56Z

Why not just write the entire thing in one Mojo file.

slaren · 2023-06-27T08:24:41Z

What level of performance/sophistication is the goal for something that could possibly be merged?

Personally, I wouldn't want this enabled by default until the performance with quantized models is at least comparable to cuBLAS. It's ok if performance with f16/f32 models is worse.
But I think it could still be useful to have it merged as an option, just disabled by default. It would be a starting point, and we could keep improving it over time until we reach the performance goal.

LoganDark · 2023-06-28T03:01:22Z

Why not just write the entire thing in one Mojo file.

Because Mojo is vaporware and doesn't actually exist yet? You can't yet download or run Mojo, it is useless for now.

philipturner · 2023-06-28T03:33:24Z

There's a reason they're close-sourcing it for now, the same reason I close-sourced MFA for ~2 months. It's too buggy at the moment and will develop faster the way it is now.

Now I open-source when it is ready, and the decision pays off.

philipturner · 2023-06-28T03:58:15Z

Because Mojo is vaporware and doesn't actually exist yet?

When I learned what Modular was doing, I quit AI and shifted careers. No cap. There's nothing left for me to do because Modular is going to solve it.

LoganDark · 2023-06-28T21:29:57Z

There's a reason they're close-sourcing it for now, the same reason I close-sourced MFA for ~2 months. It's too buggy at the moment and will develop faster the way it is now.

Now I open-source when it is ready, and the decision pays off.

While I absolutely don't claim to know that it will never release, there is plenty of reason NOT to bet on it just yet considering it isn't public.

It's not just closed source, you can't even download binaries yet.

evanmiller · 2023-07-07T02:11:27Z

@ggerganov Can you clarify the current/planned threading model for CPU computation? This seems like it should be central to the discussion... BLAS is multi-threaded and works extremely well when the calling program is single-threaded. GGML appears to use threads to support concurrent execution of unrelated tasks. I wonder if the majority of workflows would be better off with a single-threaded top-level scheduler, with all cores assigned to work on individual large-ish tasks, a la the BLAS computation model.

bobqianic · 2023-08-10T15:34:05Z

I must say that this is simply not possible. I recommend reading the paper titled Anatomy of High-Performance Matrix Multiplication, written in 2008. Achieving high performance requires significant sacrifices. Have you looked at the code for Goto BLAS or OpenBLAS? They are all written in assembly! Yes, assembly language! You need to understand intimately how the hardware works and gauge exactly how far you can push it in order to achieve maximum performance. @ggerganov

JohannesGaessler · 2023-08-10T16:06:28Z

You would be correct if we were just doing FP32 + FP32 -> FP32 matrix multiplication. But we are not. The matrix is quantized to some custom data format that consists mostly of low-precision integers + some floating point scales. This data format can not be directly used by any regular BLAS library. So currently the quantized data needs to be converted to FP32 first which costs you both compute time and extra memory.

If you were to instead convert the hidden state from FP32 to q8_1 you would also be able to drastically reduce the amount of floating point instructions and replace them with SIMD integer instructions which are much faster. Consider the current state of CUDA mul_mat_q kernels: they use 700/970/1430 MiB less memory than cuBLAS for 7b/13b/33b and they are up to 2x faster (depending on hardware and quantization format). This is not because I can write the absolute best GEMM kernels but simply because I wrote GEMM kernels that take advantage of the specific ggml data format, both in terms of data types and the memory layout.

bobqianic · 2023-08-11T03:33:10Z

Great idea! I hadn't previously considered the overhead caused by the custom data format. However, I still believe that while minimizing overhead, we should use these BLAS libraries as much as possible to ensure optimal performance across different hardware. Because unlike CPUs, there are significant architectural differences between GPUs. Even products from the same company can have vast differences between generations. Every time NVIDIA introduces a new GPU architecture, CUDA has to undergo major updates to achieve the best performance on the new hardware. So, a custom-written kernel needs continuous maintenance, and the effort required is substantial. If you really don't want to use BLAS, I suggest you take a look at these:
Deep Learning Compilers
How Rammer squeezes more out of accelerator performance

philipturner · 2023-08-11T09:46:48Z

I don't think a single file can use all the hardware features of every processor, until Mojo comes around. We don't have a unified language, as CUDA only runs on gaming rigs and high-end laptops (regarding consumer hardware). Metal runs on 1 billion smartphones but is very different. Plus to use simdgroup_async_copy you have to pre-compile offline using a command-line tool from the archived Xcode 14.2 binary (dependency nightmare).

For mat-vec multiplication, it makes sense to dequantize in place. For mat-mat multiplication, dequantizing in-place increases the total number of operations while the ALU is already saturated. Plus, the proposed single-file idea will probably skip important hardware features (e.g. simdgroup_async_copy) that get full ALU saturation in the first place. I've been discussing quantized mat-mat multiplication in another AI application, and we decided on dequantizing to a small scratch MTLBuffer before calling into a pre-compiled FP16 x FP16 GEMM kernel from MFA (not MPS).

NOTE: By single-file I do not mean it has to literally be a single file. But that is the general sense of what this idea seems to be close to.

JohannesGaessler · 2023-08-11T13:21:00Z

For mat-mat multiplication, dequantizing in-place increases the total number of operations while the ALU is already saturated.

The goal is not to dequantize in place but to quantize the hidden state to q8_1 once per matrix matrix multiplication and to then do the calculations entirely using the quantized formats. This lets you replace floating point arithmetic with integer arithmetic or SIMD instructions so it should end up being faster.

philipturner · 2023-08-11T15:37:50Z

The GPU is already a SIMD architecture. Do you mean an optimization only applicable to CPU? If you're using an entire SIMD vector instruction for one scalar, that's underutilizing the SIMD ALU by a factor proportional to vector width.

JohannesGaessler · 2023-08-11T15:41:05Z

I mean to use this instead of floating point arithmetic.

rawwerks · 2023-12-13T02:27:59Z

this might be relevant: https://github.com/ashvardanian/SimSIMD

* Add tqdm to Python requirements * Remove torchvision torchaudio, add requests

jart · 2024-01-01T22:57:25Z

On Mozilla's llamafile project, we managed to get ggml-cuda.cu to not need to depend on cuBLAS any more, by whipping up an implementation of the four or so GEMM functions it needed. We called it tinyBLAS and it's worked reasonably well so far. The advantage for us has been enabling our prebuilt llama.cpp binaries to run on stock Windows installs, thus aiding distributability. See https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafile/tinyblas.cu Would you want us to upstream this? cc: @stlhood

JohannesGaessler · 2024-01-01T23:05:50Z

Well what's the performance like?

jart · 2024-01-01T23:17:57Z

For many LLMs I've tried, tinyBLAS goes 4% slower. Some measurements:

wizardcoder 13b gives me 8.45 tokens/s on geforce rtx 2080 ti w/ tinyblas
wizardcoder 13b gives me 8.80 tokens/s on geforce rtx 2080 ti w/ cublas
llava 7b gives me 64.79 tokens/s on geforce rtx 2080 ti w/ cublas
llava 7b gives me 61.67 tokens/s on geforce rtx 2080 ti w/ tinyblas
mistral 7b gives me 61 tokens/s on geforce rtx 2080 ti w/ cublas
mistral 7b gives me 59 tokens/s on geforce rtx 2080 ti w/ tinyblas

I also tested on Jetson and NVIDIA L4. The numbers were basically the same.

However, LLaVA image processing currently only goes 50% as fast as cuBLAS when using tinyBLAS. We have some changes in flight like Mozilla-Ocho/llamafile#156 for improving that. One of the benefits I see to upstreaming is that I'd love to have an opportunity to collaborate with folks here on improving that.

JohannesGaessler · 2024-01-01T23:40:44Z

What batch size are we talking about here? For a batch size of 1 you could already run e.g. LLaMA models completely without cuBLAS with moderate performance penalties. The problem is efficient matrix multiplication with large batch sizes where you are compute bound rather than I/O bound. Currently for LLaMA 2 q8_0 prompt processing using my RTX 3090 I get 3400 t/s with cuBLAS and 2300 t/s using MMQ.

jart · 2024-01-01T23:53:20Z

We learned that the hard way. Our first pass coding tinyblasGemmEx() processed batched data 1600% slower. Adding 2d blocking and __shared__ memory was what enabled us to make it only go 50% slower.

hiepxanh · 2024-01-16T14:18:57Z

oh my god, tired of mutiBLAS madness and you save my life @jart It take me 3 days to dive in ocean of runner and this one only take me 30 minutes to make sure it work.

58.44 tokens per second
with RX 6600 XT , no CPU overload

just download llamafile from release page, add .exe and voila!

./llamafile-0.6.exe -ngl 35 --gpu amd -m ../models/vinallama-2.7b-chat_q5_0.gguf

JohannesGaessler · 2024-01-16T14:43:26Z

I forgot to say: I recently looked at the cuBLAS kernels in detail and it seems that they are not optimized for consumer GPUs at all. This is because while professional GPUs typically have powers of 2 as the SM count the SM count on consumer GPUs is more awkward to optimize for (which seems intentional to me). As a consequence the last wave in cuBLAS FP16 GEMM has poor GPU utilization. The overall GPU utilization is ~15% lower at a batch size of 512 for 7b (larger models and batch sizes are less affected). So depending on the amount of invested effort it may be possible to write CUDA kernels that outperform cuBLAS FP16 GEMM on consumer GPUs.

jart · 2024-01-16T16:56:50Z

That's great to hear @hiepxanh! I see you have an AMD GPU. We also learned earlier in Mozilla-Ocho/llamafile#188 (comment) that tinyBLAS is helping people with AMD laptops, because rocBLAS is designed for the HPC market and was never tested on architectures like gfx1103, where it currently fails. tinyBLAS works, and pumping up AMD consumer support is a priority.

Also @ahgamut and I rented an NVIDIA RTX 3080 for nine hours yesterday and did some work on tinyBLAS. Here's our latest progress. Earlier, I reported tinyBLAS took 2x longer (total time) to do LLaVA image processing (which requires chewing on a meaty 512 token batch). Now tinyBLAS only takes 1.43x longer. That's thanks to our most recent iteration in Mozilla-Ocho/llamafile#205 which is helping us close the gap. Here's the executive summary:

Please help us @JohannesGaessler because I hope what you're saying is true. I would love nothing more than for us to be able to claim basic linear algebra subprogram supremacy on consumer hardware. Some of our recent changes, like C++ templatizing the tinyBLAS block kernel, I believe will help us ship a pareto optimized set of code paths tuned for popular consumer hardware, while having safe fallbacks for everything else. For example, on previous iterations, I noticed development work we did on NVIDIA A100 needed to be recalibrated to also run well on a $300 AMD GPU, possibly due to a smaller register file. We had to trade away a few percent on the A100, but it made a 20x difference for the Radeon. With templates that shouldn't be an issue anymore.

JohannesGaessler · 2024-01-16T17:04:54Z

Sorry, I don't have the time to work on TinyBLAS. I'm currently working on matrix multiplication using int8 tensor cores ggerganov/llama.cpp#4801 and I have a lot of other llama.cpp related things queued up after that which I think are more important. And even if I had more time writing better FP16 GEMM than cuBLAS for consumer GPUs would still be very difficult and time consuming.

ggerganov added refactoring Refactoring performance Speed related topics labels Jun 25, 2023

mqy mentioned this issue Jun 25, 2023

Fine tune MUL_MAT, new threading (spin+wait/notify), speedup q_f32 BLAS by splitting COMPUTE stage ggerganov/llama.cpp#1632

Closed

ggerganov mentioned this issue Jun 28, 2023

llama : add example for speculative sampling ggerganov/llama.cpp#2030

Closed

ggerganov self-assigned this Jun 28, 2023

ggerganov mentioned this issue Jun 29, 2023

dequantize + matrix multiplication CUDA kernels ggerganov/llama.cpp#2043

Closed

ggerganov mentioned this issue Jul 25, 2023

ggml : mul mat tweaks ggerganov/llama.cpp#2372

Merged

2 tasks

Green-Sky mentioned this issue Aug 21, 2023

metal-flash-attention support leejet/stable-diffusion.cpp#22

Open

ggerganov mentioned this issue Aug 22, 2023

CUDA: use mul_mat_q kernels by default ggerganov/llama.cpp#2683

Merged

jart mentioned this issue Dec 11, 2023

Support CUDA without cuBLAS Mozilla-Ocho/llamafile#82

Merged

CCLDArjun pushed a commit to CCLDArjun/ggml that referenced this issue Dec 18, 2023

Add tqdm to Python requirements (ggerganov#293)

5cb63e2

* Add tqdm to Python requirements * Remove torchvision torchaudio, add requests

jart mentioned this issue Feb 19, 2024

New feature suggestions support when possible CLBLAS, Intel GPUs, SYCL, Vulkan based GPGPU via underlying llama.cpp capabilities. Mozilla-Ocho/llamafile#245

Closed

ggerganov removed their assignment Mar 10, 2024

teleprint-me mentioned this issue May 24, 2024

OpenBLAS Linux : more thread = less performances ggerganov/llama.cpp#5534

Closed

martincerven mentioned this issue Sep 4, 2024

Why does this extension need full blown Chromium.app? continuedev/continue#2166

Open

3 tasks

ggml : get rid of BLAS and all it's variants #293

ggml : get rid of BLAS and all it's variants #293

Comments

ggerganov commented Jun 25, 2023 • edited Loading

philipturner commented Jun 25, 2023

Dampfinchen commented Jun 26, 2023 • edited Loading

philipturner commented Jun 26, 2023

Dampfinchen commented Jun 26, 2023

philipturner commented Jun 26, 2023

slaren commented Jun 26, 2023

philipturner commented Jun 26, 2023

slaren commented Jun 26, 2023

Dampfinchen commented Jun 26, 2023 • edited Loading

philipturner commented Jun 26, 2023

Dampfinchen commented Jun 26, 2023

goerch commented Jun 26, 2023

JohannesGaessler commented Jun 26, 2023

philipturner commented Jun 26, 2023

AlphaAtlas commented Jun 26, 2023 • edited Loading

philipturner commented Jun 26, 2023

slaren commented Jun 27, 2023

LoganDark commented Jun 28, 2023

philipturner commented Jun 28, 2023 • edited Loading

philipturner commented Jun 28, 2023

LoganDark commented Jun 28, 2023

evanmiller commented Jul 7, 2023

bobqianic commented Aug 10, 2023

JohannesGaessler commented Aug 10, 2023 • edited Loading

bobqianic commented Aug 11, 2023 • edited Loading

philipturner commented Aug 11, 2023 • edited Loading

JohannesGaessler commented Aug 11, 2023

philipturner commented Aug 11, 2023

JohannesGaessler commented Aug 11, 2023

rawwerks commented Dec 13, 2023

jart commented Jan 1, 2024

JohannesGaessler commented Jan 1, 2024

jart commented Jan 1, 2024

JohannesGaessler commented Jan 1, 2024

jart commented Jan 1, 2024

hiepxanh commented Jan 16, 2024

JohannesGaessler commented Jan 16, 2024

jart commented Jan 16, 2024

JohannesGaessler commented Jan 16, 2024 • edited Loading

ggerganov commented Jun 25, 2023 •

edited

Loading

Dampfinchen commented Jun 26, 2023 •

edited

Loading

Dampfinchen commented Jun 26, 2023 •

edited

Loading

AlphaAtlas commented Jun 26, 2023 •

edited

Loading

philipturner commented Jun 28, 2023 •

edited

Loading

JohannesGaessler commented Aug 10, 2023 •

edited

Loading

bobqianic commented Aug 11, 2023 •

edited

Loading

philipturner commented Aug 11, 2023 •

edited

Loading

JohannesGaessler commented Jan 16, 2024 •

edited

Loading