dequantize + matrix multiplication CUDA kernels #2043
Conversation
While I was working on this, the PR for k-quant super block sizes of 64 (#2001) was merged. This option is not supported by my implementation.
I don't understand why the macOS CI build is failing. Can someone help?
It's been flaky all day. I manually rerun failed actions, which usually fixes it.
Quite good results; not a small feat, despite not matching the speed.
Making more room in VRAM matters a lot on constrained systems.
I think you might be heading in a wrong direction - let me try to explain.

Before starting work on the GPU code, we were focused on implementing a highly optimized version for CPU-only inference. The key insight was to start using the quantized data during matrix multiplication in order to reduce the memory bandwidth and improve the inference performance. To do that, we developed optimized SIMD "kernels" that compute the dot product directly on the quantized integer data. I.e. it is essential that we do not dequantize back to F16 or F32, but instead quantize the other operand as well and keep the dot product in the integer domain.

This was the main reason that, during your first CUDA PR, I asked you to investigate this same approach for the GPU. Although you showed that this approach was not efficient on older GPUs, I think we should revisit this and put some more effort into making it work. The expectation is that using it for … The existing …

The goal of ggerganov/ggml#293 will be to add exactly this type of optimization (and potentially others). We will start with the CPU implementation, and then I plan to do the same for Metal. I'm pretty confident that this is the correct approach to take and will recommend doing the same for CUDA and OpenCL.
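For reference, the CPU-side idea can be sketched as follows. This is a minimal scalar version assuming a simplified Q8-style block of 32 int8 quants plus one float scale; the real ggml kernels use SIMD intrinsics and the actual block layouts, but the structure is the same: the accumulation stays in integer arithmetic and the scales are applied only once per block.

```cpp
#include <cstdint>
#include <cstddef>

constexpr int QK = 32;

// simplified Q8-style block: 32 int8 quants + one float scale (illustrative only)
struct block_q8 {
    float  d;        // per-block scale
    int8_t qs[QK];   // quantized values
};

// scalar reference dot product: no dequantization to a temporary F16/F32 buffer
float vec_dot_q8(const block_q8 *x, const block_q8 *y, size_t nblocks) {
    float sum = 0.0f;
    for (size_t i = 0; i < nblocks; ++i) {
        int32_t acc = 0;                              // integer accumulator
        for (int j = 0; j < QK; ++j) {
            acc += (int32_t) x[i].qs[j] * (int32_t) y[i].qs[j];
        }
        sum += x[i].d * y[i].d * (float) acc;         // scales applied once per block
    }
    return sum;
}
```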
I think that dequantizing to f16 or f32 is going to be necessary to use the tensor cores, so I think this approach has merit. Dequantizing to SRAM should still be much faster than the previous dequantization to DRAM, and if the bottleneck is compute (and it should be), then tensor cores should improve performance significantly, possibly more than integer math.
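To make this concrete, here is a rough sketch of staging dequantized values in shared memory instead of writing them back to VRAM. The block layout, tile size, and kernel name are simplified placeholders (not code from this PR), and a tensor-core version would consume the same shared-memory tiles via wmma fragments instead of the plain FMA loop below.

```cuda
#include <cuda_fp16.h>
#include <cstdint>

#define QK   32
#define TILE 32

// simplified Q8-style block (illustrative only)
struct block_q8 {
    float  d;
    int8_t qs[QK];
};

// C = A * x for quantized A; launch with grid = nrows/TILE, block = dim3(QK, TILE),
// and C zero-initialized. ncols must be a multiple of QK.
__global__ void mul_mat_vec_q8_smem(const block_q8 *A, const float *x, float *C, int ncols) {
    __shared__ half tile_A[TILE][QK];   // dequantized tile lives only in SRAM

    const int row = blockIdx.x * TILE + threadIdx.y;
    float acc = 0.0f;

    for (int col0 = 0; col0 < ncols; col0 += QK) {
        // each thread dequantizes one element of the tile into shared memory
        const block_q8 &b = A[(size_t) row * (ncols / QK) + col0 / QK];
        tile_A[threadIdx.y][threadIdx.x] = __float2half(b.d * (float) b.qs[threadIdx.x]);
        __syncthreads();

        // consume the tile straight from shared memory; a tensor-core version
        // would load wmma fragments from tile_A here instead
        acc += __half2float(tile_A[threadIdx.y][threadIdx.x]) * x[col0 + threadIdx.x];
        __syncthreads();
    }

    // crude reduction across the 32 threads of the row, kept short for the sketch
    atomicAdd(&C[row], acc);
}
```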
Regarding floats vs ints: first of all, I think it should be noted that CPUs and GPUs have very different performance characteristics. In particular:
I think currently the biggest improvement to be had for quantized CUDA performance is to rearrange the data and to utilize shared memory. For CUDA to load data efficiently it needs to be aligned to 32 bytes. However, the current blocks have sizes of e.g. 18 bytes. So what I think should be done is to rearrange the data when it's transferred to VRAM so that all quants come first, followed by all scales. Then, when the data is used, the quants are memory aligned and the scales can be read into shared memory as blocks (since they take up comparatively few bytes). Of course, this is not mutually exclusive with using quantized dot products.
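As an illustration of the proposed rearrangement, here is a host-side sketch that splits the blocks into "all quants first, then all scales". It uses the simplified Q8-style block from the sketches above; the real Q4_0 block is 18 bytes (a half-precision scale plus 16 bytes of 4-bit quants), but the idea is the same.

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

constexpr int QK = 32;

// simplified Q8-style block (illustrative only)
struct block_q8 {
    float  d;
    int8_t qs[QK];
};

// structure-of-arrays layout: each block's quants start at a 32-byte offset in
// one contiguous buffer; scales are stored separately so they can be staged in
// shared memory on the device
struct packed_q8 {
    std::vector<int8_t> qs;   // nblocks * QK bytes
    std::vector<float>  d;    // nblocks scales
};

packed_q8 repack(const block_q8 *src, size_t nblocks) {
    packed_q8 dst;
    dst.qs.resize(nblocks * QK);
    dst.d.resize(nblocks);
    for (size_t i = 0; i < nblocks; ++i) {
        dst.d[i] = src[i].d;
        for (int j = 0; j < QK; ++j) {
            dst.qs[i * QK + j] = src[i].qs[j];
        }
    }
    return dst;
}
```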
Right, but the intermediate results need higher precision than int4 or even int8. On the CPU we usually do the dot products in int16 before scaling to float. I am not sure whether that's possible with tensor cores, but I don't think it is. We could do the same on the GPU, just without tensor cores, and it would likely be faster than the current implementation, but I am not sure that it would be faster than converting to f16 and using the tensor cores. The int4 ops have been deprecated in Hopper, so I don't expect that they will ever have a stable API.
According to the documentation, the accumulator for int8 tensor cores is int32.
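For illustration, a minimal wmma sketch of a 16x16x16 int8 tile multiply with an int32 accumulator (this assumes a GPU with int8 tensor-core support, roughly sm_72 and newer, and is not code from this PR; the per-block scales would be applied to the int32 results afterwards, outside the tensor-core op):

```cuda
#include <mma.h>

using namespace nvcuda;

// A: 16x16 int8 row-major, B: 16x16 int8 col-major, C: 16x16 int32 row-major
// launch with a single warp: mma_s8_s32<<<1, 32>>>(A, B, C);
__global__ void mma_s8_s32(const signed char *A, const signed char *B, int *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int> c;

    wmma::fill_fragment(c, 0);
    wmma::load_matrix_sync(a, A, 16);        // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);              // int8 x int8 -> int32 accumulate
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}
```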
Good to know. Though the 2nd tensor is usually so small that on-the-fly conversion is always possible. In my fblas-16bit I'm just down-converting the 2nd matrix inside the function; I didn't notice a performance difference. I wonder if Nvidia will actually release the fp8 functions to consumer hardware; in the past year I've had the impression that they actively work against the use of their gaming-GPU line for ML applications: reducing VRAM, restricting driver licensing, removing SLI. Their datacenter line brings in twice as much revenue, as they simply scale the price of the large-VRAM models up tenfold.
I think that merging this as an optional feature and iterating on it later would be OK, but duplicating the dequantization functions is really not good; it's going to add too much maintenance overhead in the future. My understanding was that dequantizing two values at a time was done for performance, so I am not sure why we need functions that only return one value now. If that cannot be fixed in this implementation, I would suggest leaving this open as a draft for people who may be interested in trying it or iterating on it in the future, but not merging it.
I'm fine with keeping this PR open; I plan to revise the dequantize_mul_mat_vec kernels anyway.
Also, the reason I added dequantization kernels that dequantize only one value is to get a unified interface for all quantization formats. Especially if you want to try implementations that have a tile size < 32 in one direction, I find that very convenient, since you won't have to care about the data layout of the specific quantization types. If you look at the code for …
Why is the gain in VRAM usage so high? Excluding …
The order in which the pool buffers are allocated is bad: it first allocates a small buffer, then a larger one, and then an even larger one, so you can't just reuse the previously allocated buffers and instead have to allocate new ones. In #1935 I did a hacky patch that just allocates the largest size at the start so that buffers can be reused, but I think a preferable solution would be to get rid of the buffers entirely via tensor fusion.
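A toy host-side illustration of why the allocation order matters for a first-fit pool (the sizes are made up and the pool is simplified, not the actual one): ascending requests end up allocating three buffers, while requesting the largest size first lets the later, smaller requests reuse it.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

struct Pool {
    struct Buf { size_t size; bool in_use; };
    std::vector<Buf> bufs;

    size_t alloc(size_t size) {              // returns the index of the buffer used
        for (size_t i = 0; i < bufs.size(); ++i) {
            if (!bufs[i].in_use && bufs[i].size >= size) {
                bufs[i].in_use = true;       // reuse an existing buffer
                return i;
            }
        }
        bufs.push_back({size, true});        // no fit -> allocate a new buffer
        return bufs.size() - 1;
    }
    void release(size_t i) { bufs[i].in_use = false; }
};

int main() {
    const size_t MiB = 1024 * 1024;
    Pool ascending, descending;
    for (size_t s : {1 * MiB, 4 * MiB, 16 * MiB}) { ascending.release(ascending.alloc(s)); }
    for (size_t s : {16 * MiB, 4 * MiB, 1 * MiB}) { descending.release(descending.alloc(s)); }
    std::printf("ascending order:  %zu buffers allocated\n", ascending.bufs.size());  // 3
    std::printf("descending order: %zu buffers allocated\n", descending.bufs.size()); // 1
    return 0;
}
```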
> I think currently the biggest improvement for quantized CUDA performance to be had will be to rearrange the data and to utilize shared memory. For CUDA to efficiently load data it needs to be aligned to 32 bytes.

Maybe a quick and dirty prototype of this idea could be useful to see how much improvement the 32-byte alignment would bring. Some ideas from #1073 could potentially be useful.
Speaking of which, I tried reordering the data and it made no difference, presumably because the blocks are already in cache due to the contiguous accesses. However, I have a prototype for quantizing the vector and then using SIMD intrinsics to do the dot product. On my RTX 3090 this was ~10% faster.
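The general idea can be sketched as follows, assuming both operands are already stored in a simplified Q8-style block of 32 int8 quants plus a float scale (a hedged sketch, not the actual prototype). __dp4a computes a 4-way int8 dot product with int32 accumulation in a single instruction (sm_61 and newer):

```cuda
#include <cstdint>

#define QK 32

// simplified Q8-style block (illustrative only)
struct block_q8 {
    float  d;
    int8_t qs[QK];
};

static __device__ float vec_dot_q8_q8(const block_q8 *x, const block_q8 *y) {
    // reinterpret the quants as packed 32-bit words, 4 int8 values each
    const int *xi = (const int *) x->qs;
    const int *yi = (const int *) y->qs;

    int acc = 0;
#pragma unroll
    for (int i = 0; i < QK / 4; ++i) {
        acc = __dp4a(xi[i], yi[i], acc);   // int8 dot product, int32 accumulation
    }
    return x->d * y->d * (float) acc;      // apply the block scales once
}
```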
After #2140, where not using …
This is a good find - I think this is the first time I see …
Superseded by #2160.
I implemented CUDA kernels that do dequantization and matrix-matrix multiplication in one step. This eliminates the need for temporary buffers to hold the dequantized f32 matrix given to cuBLAS. For 33b this saves at least ~600 MiB for -ngl <= 60, and at least ~1400 MiB for -ngl >= 61. As a result, a few more layers can be offloaded. In the following table, pp == prompt processing and tg128 == generation of 128 tokens with an empty prompt:
Unfortunately my matrix multiplication kernel is not very good and significantly slower than cuBLAS for prompt processing. This is particularly noticeable when there is plenty of VRAM anyway:
By default cuBLAS is still in use. The new dequantization + matrix multiplication kernels can be enabled by setting the compile option LLAMA_CUDA_DMM. The implementation works by adding a template that takes a dequantization method. I revised the dequantization to require only the index of the data value to be dequantized. I plan to eventually apply this dequantization scheme to the other templates as well, since it is simpler.
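To give readers the shape of the scheme, here is a hedged sketch of a kernel template that takes a "dequantize one value at index i" function and fuses dequantization with the multiplication, shown for the matrix-vector case to keep it short. The names and the block layout are simplified placeholders rather than the actual kernels from this PR.

```cuda
#include <cstdint>

#define QK 32

// simplified Q8-style block (illustrative only)
struct block_q8 {
    float  d;
    int8_t qs[QK];
};

// unified interface: each quantization format provides one of these
typedef float (*dequantize_one_t)(const void *vx, int64_t i);

static __device__ float dequantize_one_q8(const void *vx, int64_t i) {
    const block_q8 *x = (const block_q8 *) vx;
    return x[i / QK].d * (float) x[i / QK].qs[i % QK];
}

// y[row] = sum_col dequantize(x[row, col]) * v[col]; one warp per row,
// ncols must be a multiple of QK; no temporary f32 matrix is written to VRAM
template <dequantize_one_t dequantize_one>
static __global__ void dequantize_mul_mat_vec_sketch(const void *vx, const float *v,
                                                     float *y, int64_t ncols) {
    const int64_t row  = blockIdx.x;
    const int     lane = threadIdx.x;                // 32 threads per block

    float acc = 0.0f;
    for (int64_t col = lane; col < ncols; col += 32) {
        acc += dequantize_one(vx, row * ncols + col) * v[col];
    }
    // warp-level reduction of the partial sums
    for (int offset = 16; offset > 0; offset >>= 1) {
        acc += __shfl_down_sync(0xffffffff, acc, offset);
    }
    if (lane == 0) {
        y[row] = acc;
    }
}

// usage: dequantize_mul_mat_vec_sketch<dequantize_one_q8><<<nrows, 32>>>(X, V, Y, ncols);
```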
I don't understand why the performance of my matrix multiplication kernel is so bad. I tried several other variants but they all had even worse performance. I would very much welcome it if someone were to write a better one.