
ggml : get rid of BLAS and all its variants #293

Open
ggerganov opened this issue Jun 25, 2023 · 39 comments
Labels: performance (Speed related topics), refactoring (Refactoring)

Comments

@ggerganov
Owner

ggerganov commented Jun 25, 2023

This is a big one

The only reason we use BLAS is that we don't have an efficient implementation of matrix x matrix multiplication. Naively doing parallel dot products is not optimal. We need to implement some of the fundamental GEMM optimizations such as block tiling, and we need to do this in a compact way that reuses the existing dot product code and supports all quantization types.
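As a rough illustration of the block tiling idea (a generic CUDA sketch, not ggml code): one thread block computes a tile of the output and stages tiles of the inputs through shared memory, so every loaded value is reused many times, while the inner loop remains a plain dot product where the existing quantized dot-product routines could slot in.

```cuda
// Illustrative sketch only: classic shared-memory block tiling for C = A * B.
// Launch with dim3 block(TILE, TILE); each thread block computes one
// TILE x TILE tile of C, so every global load is reused TILE times.
#define TILE 32

__global__ void gemm_tiled_f32(const float *A, const float *B, float *C,
                               int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    const int row = blockIdx.y * TILE + threadIdx.y;
    const int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < K; k0 += TILE) {
        // cooperative load of one tile of A and one tile of B into shared memory
        As[threadIdx.y][threadIdx.x] =
            (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (col < N && k0 + threadIdx.y < K) ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        // the inner loop is still just a dot product over the tile
        for (int k = 0; k < TILE; ++k) {
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();
    }
    if (row < M && col < N) {
        C[row * N + col] = acc;
    }
}
```

A real kernel would add register tiling and quantization-aware loads on top of this, but the k-loop over a tile is exactly where per-block quantized dot-product routines could be reused.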

More comments on this:

@philipturner

supports all quantization types

The best place for dequantization would be after you've loaded data from device -> threadgroup memory and are loading into registers. I hypothesize it will be faster to have each of the four simds in a threadgroup unpack on their own, even if that means the work is duplicated.

For example, you might fork my MFA repository, then modify the line here to unpack after loading.

https://github.com/philipturner/metal-flash-attention/blob/fbfd0a028402c0ae6fa293c2f91de318b95e359b/Sources/GEMM.metal#L129

@Dampfinchen

Dampfinchen commented Jun 26, 2023

While I do understand the desire for the project to be independent of other libraries, I personally do not think removing the excellent cuBLAS implementation entirely is a good idea. One of the main benefits of this project is that you can run large models at good speeds on lower-end hardware with less VRAM. For example, thanks to cuBLAS' fast prompt processing of 15 ms/t, I can enjoy a 13B parameter model with full context at around 1.6 tokens/s, which is much faster than running GPTQ with CPU offloading (0.4 tokens/s) on my RTX 2060 laptop.

As you've said yourself, a native GEMM implementation would likely be slower than what cuBLAS is offering. Even if the performance difference is not drastic on your hardware, it can make all the difference on hardware like mine; even a slight performance difference (for example, 15 ms/t to 23 ms/t) would lead to a worse experience for people who run huge LLMs that are too big for their systems. This applies not just to low- to mid-spec hardware, but to high-end hardware trying to run 65B models as well.

Please keep this perspective in mind as you continue to develop the project. I don't think it would be a good outcome for me and many others to have to stay on older versions with cuBLAS and miss out on new enhancements just because those versions run faster due to cuBLAS support.

@philipturner

Why would custom code be slower than cuBLAS? There is Nvidia CUTLASS. It can only be equal or faster.

@Dampfinchen

Why would custom code be slower than cuBLAS? There is Nvidia CUTLASS. It can only be equal or faster.

They aren't going to use CUTLASS though, because it'd be another third-party lib.

cuBLAS is highly optimized for the hardware, and Georgi himself said in this comment that he is aware custom code is not going to perform as well:

ggerganov/llama.cpp#1867 (comment)

And I fully understand that it will be close to impossible to achieve the maximum performance available from dedicated libraries (such as cuBLAS, for example).

@philipturner

300 lines of code that will (soon) outperform all of Apple's proprietary Metal Performance Shaders.

https://github.com/philipturner/metal-flash-attention/blob/main/Sources/GEMM.metal

If we take some code from CUTLASS, maybe we could just optimize for the matrix shapes that exist in LLaMA.

@slaren
Collaborator

slaren commented Jun 26, 2023

A custom GEMM implementation will be faster with quantized models - that's one of the goals. There may be a small performance regression with f16 and f32 models, though.

@philipturner

Could we utilize the int4 hardware support in Ampere tensor cores?

@slaren
Collaborator

slaren commented Jun 26, 2023

From what I have seen of the way int4 works with tensor cores, I don't think so. We cannot do a matrix multiplication directly in int4; we need to dequantize to f16 or f32 first. But we can still use the tensor cores after dequantizing to float. I may be wrong, though.
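For reference, the "dequantize to f16, then use the tensor cores" path maps onto CUDA's WMMA API. A minimal sketch (one warp multiplying a single 16x16x16 tile; illustrative only, not ggml code):

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp multiplies a single 16x16x16 tile with FP16 inputs and an FP32
// accumulator (requires sm_70 or newer). A real kernel would loop over tiles.
__global__ void wmma_tile_f16(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, a, 16);           // leading dimension = 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // c += a * b on tensor cores
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```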

@Dampfinchen

Dampfinchen commented Jun 26, 2023

A custom GEMM implementation will be faster with quantized models - that's one of the goals. There may be a small performance regression with f16 and f32 models, though.

Glad to hear that. I hope that will be the case.

Could we utilize the int4 hardware support in Ampere tensor cores?

Just a heads up: Tensor cores in Turing, Ampere and Ada Lovelace support INT4, INT8 and FP16 instructions. Ampere and Turing support INT1 as well.

Only Ampere and Ada support FP8 and FP32 in addition to that.

So ideally, the code would use FP16 or the integer instructions I mentioned to cover a wide range of hardware with tensor core support.

@philipturner

Didn't they remove INT4 on Ada and Hopper?

@Dampfinchen

Didn't they remove INT4 on Ada and Hopper?

Only on Hopper. INT4 is still present in consumer Ada Lovelace.

https://images.nvidia.com/aem-dam/Solutions/Data-Center/l4/nvidia-ada-gpu-architecture-whitepaper-v2.1.pdf Page 24:

Compared to Ampere, Ada delivers more than double the FP16, BF16, TF32, INT8, and INT4 Tensor TFLOPS, and also includes the Hopper FP8 Transformer Engine, delivering over 1.3 PetaFLOPS of tensor processing in the RTX 4090.

It looks like Ada removed support for INT1 though.

@goerch
Contributor

goerch commented Jun 26, 2023

I'd tend to agree with @Dampfinchen: being interested in the Intel platform, I don't believe we'll be able to outperform the MKL and oneAPI engineers.

@JohannesGaessler
Collaborator

I tried implementing a (dequantization +) matrix x matrix multiplication CUDA kernel, but I'm struggling to get past 50% of cuBLAS performance for prompt processing. In particular, I found, against my expectation, that fusing dequantization + matrix x matrix multiplication does not have a large impact on performance, possibly because you're limited by compute rather than memory bandwidth for large matrices (I am currently not using tensor cores).

What level of performance/sophistication is the goal for something that could possibly be merged? As of right now, my implementation could already be useful for token generation since the VRAM usage will be lower compared to cuBLAS, but for prompt processing it is clearly worse. In general, I think leaving cuBLAS as a compilation option would be desirable because, given the small impact of tensor fusion, I don't think I can realistically beat it for prompt processing performance.
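For context, the fusion being described looks roughly like the sketch below: instead of writing a dequantized copy of the weight matrix to global memory and running a float GEMM on it, each thread unpacks the quantized block it just fetched while filling the shared-memory tile. The block layout here is a simplified q4-style format (32 4-bit values plus one fp16 scale), close to but not exactly ggml's q4_0.

```cuda
#include <cstdint>
#include <cuda_fp16.h>

// Simplified q4-style block: 32 4-bit values sharing one fp16 scale.
// (ggml's real q4_0 is similar but not identical, e.g. in element ordering.)
struct block_q4 {
    half    d;        // per-block scale
    uint8_t qs[16];   // 32 quants, two 4-bit values per byte
};

// Unpack one block into 32 floats. In the fused approach this is called from
// inside the GEMM's k-loop while staging the weight tile into shared memory,
// instead of in a separate dequantization pass over the whole matrix.
__device__ void dequant_block_q4(const block_q4 *b, float *dst) {
    const float d = __half2float(b->d);
    for (int i = 0; i < 16; ++i) {
        dst[2 * i + 0] = ((b->qs[i] & 0x0F) - 8) * d;  // low nibble
        dst[2 * i + 1] = ((b->qs[i] >> 4)   - 8) * d;  // high nibble
    }
}
```

As noted above, for large matrices this mainly saves memory and bandwidth rather than compute, since the multiply itself dominates once the kernel is compute bound.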

@philipturner

50% of cuBLAS performance for prompt processing.

If you get 50% of cuBLAS performance, then what is cuBLAS performance in terms of ALU utilization? Perhaps both underutilize the processor, leaving much room for improvement.

@AlphaAtlas

AlphaAtlas commented Jun 26, 2023

If y'all progressively get rid of BLAS libraries, cuBLAS is probably lowest on the totem pole? AFAIK users still need the huge CUDA toolkit to run CUDA inference anyway, so it's hardly even getting rid of a dependency.

CLBlast and CPU BLAS, on the other hand, can be tricky, but their implementations are open source. I just tried to get the OpenBLAS build working on an Ampere instance for a few hours... and ultimately failed.

@philipturner

Why not just write the entire thing in one Mojo file?

@slaren
Collaborator

slaren commented Jun 27, 2023

What level of performance/sophistication is the goal for something that could possibly be merged?

Personally, I wouldn't want this enabled by default until the performance with quantized models is at least comparable to cuBLAS. It's ok if performance with f16/f32 models is worse.
But I think it could still be useful to have it merged as an option, just disabled by default. It would be a starting point, and we could keep improving it over time until we reach the performance goal.

@LoganDark
Contributor

Why not just write the entire thing in one Mojo file?

Because Mojo is vaporware and doesn't actually exist yet? You can't yet download or run Mojo; it is useless for now.

@philipturner

philipturner commented Jun 28, 2023

There's a reason they're keeping it closed-source for now, the same reason I kept MFA closed-source for ~2 months: it's too buggy at the moment and will develop faster the way it is now.

I open-sourced it when it was ready, and the decision is paying off.

@philipturner

Because Mojo is vaporware and doesn't actually exist yet?

When I learned what Modular was doing, I quit AI and shifted careers. No cap. There's nothing left for me to do because Modular is going to solve it.

@LoganDark
Contributor

There's a reason they're keeping it closed-source for now, the same reason I kept MFA closed-source for ~2 months: it's too buggy at the moment and will develop faster the way it is now.

I open-sourced it when it was ready, and the decision is paying off.

While I absolutely don't claim to know that it will never release, there is plenty of reason NOT to bet on it just yet, considering it isn't public.

It's not just closed source; you can't even download binaries yet.

@evanmiller

@ggerganov Can you clarify the current/planned threading model for CPU computation? This seems like it should be central to the discussion... BLAS is multi-threaded and works extremely well when the calling program is single-threaded. GGML appears to use threads to support concurrent execution of unrelated tasks. I wonder if the majority of workflows would be better off with a single-threaded top-level scheduler, with all cores assigned to work on individual large-ish tasks, a la the BLAS computation model.

@bobqianic
Contributor

I must say that this is simply not possible. I recommend reading the paper titled Anatomy of High-Performance Matrix Multiplication, written in 2008. Achieving high performance requires significant sacrifices. Have you looked at the code for Goto BLAS or OpenBLAS? They are all written in assembly! Yes, assembly language! You need to understand intimately how the hardware works and gauge exactly how far you can push it in order to achieve maximum performance. @ggerganov

@JohannesGaessler
Collaborator

JohannesGaessler commented Aug 10, 2023

You would be correct if we were just doing FP32 x FP32 -> FP32 matrix multiplication. But we are not. The matrix is quantized to some custom data format that consists mostly of low-precision integers plus some floating point scales. This data format cannot be directly used by any regular BLAS library, so currently the quantized data needs to be converted to FP32 first, which costs you both compute time and extra memory.

If you were to instead convert the hidden state from FP32 to q8_1, you would also be able to drastically reduce the number of floating point instructions and replace them with SIMD integer instructions, which are much faster. Consider the current state of the CUDA mul_mat_q kernels: they use 700/970/1430 MiB less memory than cuBLAS for 7b/13b/33b, and they are up to 2x faster (depending on hardware and quantization format). This is not because I can write the absolute best GEMM kernels but simply because I wrote GEMM kernels that take advantage of the specific ggml data format, both in terms of data types and memory layout.
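On NVIDIA GPUs, the SIMD integer instructions in question are things like __dp4a, which performs a 4-way int8 dot product with accumulate in a single instruction (sm_61 and newer). A simplified sketch of a per-block dot product between a quantized weight block and a q8-quantized activation block (one float scale per block; the real q8_1 format also carries an extra term, omitted here):

```cuda
#include <cstdint>

// Dot product of two blocks of 32 signed 8-bit values, each with one scale.
// __dp4a(a, b, c) treats a and b as four packed int8 values and returns
// c + a0*b0 + a1*b1 + a2*b2 + a3*b3 in a single instruction.
__device__ float vec_dot_q8_block(const int8_t *x_qs, float x_d,
                                  const int8_t *y_qs, float y_d) {
    const int *xi = (const int *) x_qs;  // 32 int8 = 8 packed 32-bit ints
    const int *yi = (const int *) y_qs;  // (assumes 4-byte alignment)
    int sumi = 0;
#pragma unroll
    for (int i = 0; i < 8; ++i) {
        sumi = __dp4a(xi[i], yi[i], sumi);
    }
    return x_d * y_d * sumi;             // apply the two per-block scales
}
```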

@bobqianic
Contributor

bobqianic commented Aug 11, 2023

Great idea! I hadn't previously considered the overhead caused by the custom data format. However, I still believe that, while minimizing overhead, we should use these BLAS libraries as much as possible to ensure optimal performance across different hardware, because unlike CPUs, there are significant architectural differences between GPUs. Even products from the same company can have vast differences between generations. Every time NVIDIA introduces a new GPU architecture, CUDA has to undergo major updates to achieve the best performance on the new hardware. So a custom-written kernel needs continuous maintenance, and the effort required is substantial. If you really don't want to use BLAS, I suggest you take a look at these:
Deep Learning Compilers
How Rammer squeezes more out of accelerator performance

@philipturner

philipturner commented Aug 11, 2023

I don't think a single file can use all the hardware features of every processor, at least until Mojo comes around. We don't have a unified language, as CUDA only runs on gaming rigs and high-end laptops (regarding consumer hardware). Metal runs on a billion smartphones but is very different. Plus, to use simdgroup_async_copy you have to pre-compile offline using a command-line tool from the archived Xcode 14.2 binary (a dependency nightmare).

For mat-vec multiplication, it makes sense to dequantize in place. For mat-mat multiplication, dequantizing in-place increases the total number of operations while the ALU is already saturated. Plus, the proposed single-file idea will probably skip important hardware features (e.g. simdgroup_async_copy) that are needed to get full ALU saturation in the first place. I've been discussing quantized mat-mat multiplication in another AI application, and we decided on dequantizing to a small scratch MTLBuffer before calling into a pre-compiled FP16 x FP16 GEMM kernel from MFA (not MPS).

NOTE: By "single file" I do not mean it literally has to be a single file, but that is the general sense of what this idea seems to be close to.
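In CUDA terms, that two-pass scheme (as opposed to fusing the dequantization into the GEMM) would look roughly like the sketch below; the names and block layout are hypothetical, and on Metal the scratch buffer would be an MTLBuffer handed to an MFA kernel:

```cuda
#include <cstdint>
#include <cuda_fp16.h>

// Pass 1: expand quantized weights (here: 32 4-bit values plus one fp16 scale
// per block, a simplified q4-style layout) into a scratch fp16 buffer.
// One thread handles one block.
__global__ void dequantize_q4_to_f16(const uint8_t *qs, const half *scales,
                                     half *dst, int nblocks) {
    const int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b >= nblocks) return;
    const float d = __half2float(scales[b]);
    for (int i = 0; i < 16; ++i) {
        const uint8_t q = qs[b * 16 + i];
        dst[b * 32 + 2 * i + 0] = __float2half(((q & 0x0F) - 8) * d);
        dst[b * 32 + 2 * i + 1] = __float2half(((q >> 4)   - 8) * d);
    }
}

// Pass 2: hand the scratch buffer to any pre-compiled FP16 x FP16 GEMM
// (a library kernel or a custom one). The price is the scratch memory plus
// one extra round trip through global memory for the dequantized weights.
```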

@JohannesGaessler
Collaborator

For mat-mat multiplication, dequantizing in-place increases the total number of operations while the ALU is already saturated.

The goal is not to dequantize in place but to quantize the hidden state to q8_1 once per matrix matrix multiplication and then do the calculations entirely using the quantized formats. This lets you replace floating point arithmetic with integer arithmetic or SIMD instructions, so it should end up being faster.

@philipturner

The GPU is already a SIMD architecture. Do you mean an optimization only applicable to CPU? If you're using an entire SIMD vector instruction for one scalar, that's underutilizing the SIMD ALU by a factor proportional to vector width.

@JohannesGaessler
Collaborator

I mean to use this instead of floating point arithmetic.

@rawwerks

this might be relevant: https://github.com/ashvardanian/SimSIMD

@jart
Contributor

jart commented Jan 1, 2024

On Mozilla's llamafile project, we managed to get ggml-cuda.cu to no longer depend on cuBLAS by whipping up an implementation of the four or so GEMM functions it needed. We called it tinyBLAS and it has worked reasonably well so far. The advantage for us has been enabling our prebuilt llama.cpp binaries to run on stock Windows installs, thus aiding distributability. See https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafile/tinyblas.cu. Would you want us to upstream this? cc: @stlhood

@JohannesGaessler
Collaborator

Well, what's the performance like?

@jart
Contributor

jart commented Jan 1, 2024

For many LLMs I've tried, tinyBLAS goes 4% slower. Some measurements:

  • wizardcoder 13b gives me 8.45 tokens/s on geforce rtx 2080 ti w/ tinyblas
  • wizardcoder 13b gives me 8.80 tokens/s on geforce rtx 2080 ti w/ cublas
  • llava 7b gives me 64.79 tokens/s on geforce rtx 2080 ti w/ cublas
  • llava 7b gives me 61.67 tokens/s on geforce rtx 2080 ti w/ tinyblas
  • mistral 7b gives me 61 tokens/s on geforce rtx 2080 ti w/ cublas
  • mistral 7b gives me 59 tokens/s on geforce rtx 2080 ti w/ tinyblas

I also tested on Jetson and NVIDIA L4. The numbers were basically the same.

However, LLaVA image processing currently only goes 50% as fast as cuBLAS when using tinyBLAS. We have some changes in flight like Mozilla-Ocho/llamafile#156 for improving that. One of the benefits I see to upstreaming is that I'd love to have an opportunity to collaborate with folks here on improving that.

@JohannesGaessler
Collaborator

What batch size are we talking about here? For a batch size of 1 you could already run e.g. LLaMA models completely without cuBLAS with moderate performance penalties. The problem is efficient matrix multiplication with large batch sizes, where you are compute bound rather than I/O bound. Currently, for LLaMA 2 q8_0 prompt processing on my RTX 3090, I get 3400 t/s with cuBLAS and 2300 t/s using MMQ.

@jart
Contributor

jart commented Jan 1, 2024

We learned that the hard way. Our first pass at coding tinyblasGemmEx() processed batched data 1600% slower. Adding 2D blocking and __shared__ memory was what enabled us to make it only go 50% slower.

@hiepxanh

Oh my god, I was tired of the multi-BLAS madness and you saved my life, @jart. It took me 3 days to dive into the ocean of runners, and this one only took me 30 minutes to get working.

58.44 tokens per second with an RX 6600 XT, no CPU overload.


Just download llamafile from the release page, add .exe, and voila!

./llamafile-0.6.exe -ngl 35 --gpu amd -m ../models/vinallama-2.7b-chat_q5_0.gguf

@JohannesGaessler
Collaborator

I forgot to say: I recently looked at the cuBLAS kernels in detail, and it seems that they are not optimized for consumer GPUs at all. This is because, while professional GPUs typically have a power of 2 as their SM count, the SM count on consumer GPUs is more awkward to optimize for (which seems intentional to me). As a consequence, the last wave in cuBLAS FP16 GEMM has poor GPU utilization. The overall GPU utilization is ~15% lower at a batch size of 512 for 7b (larger models and batch sizes are less affected). So, depending on the amount of invested effort, it may be possible to write CUDA kernels that outperform cuBLAS FP16 GEMM on consumer GPUs.
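To make the last-wave effect concrete, here is a back-of-the-envelope sketch (it assumes one thread block resident per SM at a time, which is a simplification, and the numbers are only examples):

```cuda
#include <cstdio>

// A kernel that launches nblocks thread blocks on a GPU with nsm SMs runs in
// ceil(nblocks / nsm) waves; only the remainder occupies the final wave.
static void last_wave_utilization(int nblocks, int nsm) {
    const int waves     = (nblocks + nsm - 1) / nsm;
    const int last_wave = nblocks - (waves - 1) * nsm;
    std::printf("%d blocks on %d SMs: %d waves, last wave uses %d/%d SMs (%.0f%%)\n",
                nblocks, nsm, waves, last_wave, nsm, 100.0 * last_wave / nsm);
}

int main() {
    // e.g. a 512 x 4096 output tiled into 64x64 tiles -> 8 * 64 = 512 blocks
    last_wave_utilization(512, 64);  // SM count divides evenly: every wave is full
    last_wave_utilization(512, 68);  // awkward SM count: last wave uses 36/68 SMs
    return 0;
}
```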

@jart
Contributor

jart commented Jan 16, 2024

That's great to hear @hiepxanh! I see you have an AMD GPU. We also learned earlier in Mozilla-Ocho/llamafile#188 (comment) that tinyBLAS is helping people with AMD laptops, because rocBLAS is designed for the HPC market and was never tested on architectures like gfx1103, where it currently fails. tinyBLAS works, and pumping up AMD consumer support is a priority.

Also, @ahgamut and I rented an NVIDIA RTX 3080 for nine hours yesterday and did some work on tinyBLAS. Here's our latest progress. Earlier, I reported that tinyBLAS took 2x longer (total time) to do LLaVA image processing (which requires chewing on a meaty 512-token batch). Now tinyBLAS only takes 1.43x longer, thanks to our most recent iteration in Mozilla-Ocho/llamafile#205, which is helping us close the gap. Here's the executive summary:

[benchmark screenshot]

Please help us, @JohannesGaessler, because I hope what you're saying is true. I would love nothing more than for us to be able to claim basic linear algebra subprogram supremacy on consumer hardware. Some of our recent changes, like C++ templatizing the tinyBLAS block kernel, will I believe help us ship a Pareto-optimized set of code paths tuned for popular consumer hardware, while having safe fallbacks for everything else. For example, in previous iterations I noticed that development work we did on an NVIDIA A100 needed to be recalibrated to also run well on a $300 AMD GPU, possibly due to a smaller register file. We had to trade away a few percent on the A100, but it made a 20x difference for the Radeon. With templates, that shouldn't be an issue anymore.

@JohannesGaessler
Collaborator

JohannesGaessler commented Jan 16, 2024

Sorry, I don't have the time to work on tinyBLAS. I'm currently working on matrix multiplication using int8 tensor cores (ggerganov/llama.cpp#4801), and I have a lot of other llama.cpp-related things queued up after that which I think are more important. And even if I had more time, writing a better FP16 GEMM than cuBLAS for consumer GPUs would still be very difficult and time-consuming.
