CUDA: int8 tensor core matrix multiplication #4801
Conversation
I don't know enough about this to comment on the implementation, but if the issue is scaling the result, maybe dequantizing to fp16 and multiplying as fp16 could work. It should still be faster than cuBLAS, and it would reduce the memory requirements.
I have not actually tried this but I do not expect this to work well:
Not directly the same but in a previous version I did the q8_0 -> int8 conversion in the matrix multiplication kernel. This alone took up ~25% of the kernel's runtime. On a more general note, now that I think about it my kernel is mostly just doing regular int8 matrix multiplication with some scaling at the end. But maybe it'd be possible to somehow use cuBLAS GEMM and add a small kernel for scaling afterwards; I'll investigate.
Okay, so apparently cuBLAS allows you to do (int8, int8) -> int32 matrix multiplications. But it seems that will not be of use to us. You can do two individual matrix multiplications to replace the kernel I wrote but that is slower even without an extra kernel for scaling afterwards. What we would need is an operation
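For reference, a minimal sketch of the (int8, int8) -> int32 cuBLAS path mentioned above, assuming cublasGemmEx with CUBLAS_COMPUTE_32I (not code from this PR; cuBLAS imposes extra alignment/transpose restrictions on int8 GEMMs, so check the docs before relying on this):

```cpp
#include <cublas_v2.h>
#include <cstdint>

// Column-major shapes: A is m x k, B is k x n, C is m x n.
cublasStatus_t gemm_i8_i32(cublasHandle_t handle,
                           const int8_t * A, const int8_t * B, int32_t * C,
                           int m, int n, int k) {
    const int32_t alpha = 1;   // integer alpha/beta because the compute type is 32I
    const int32_t beta  = 0;
    return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &alpha,
                        A, CUDA_R_8I,  m,
                        B, CUDA_R_8I,  k,
                        &beta,
                        C, CUDA_R_32I, m,
                        CUBLAS_COMPUTE_32I, CUBLAS_GEMM_DEFAULT);
}
```

The missing piece, as noted above, is applying the per-row/column scales, which is why a separate scaling kernel would still be needed afterwards.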
Quite a stunning speedup; it looks like for anything below the RTX 40 series this is a winner already. In case you are interested, a few weeks ago (#4310) I outlined to gg how to do native FP8 (which is twice the speed of FP16) in llama.cpp by using wrappers to transparently "force" it into ggml-cuda.cu while using the new cublasLt API, which is required for FP8.
I wonder how this would work for RDNA3 WMMA. I'd like to take advantage of WMMA.
After the bugfix in #4809 prompt processing on master has become faster, particularly so for cuBLAS. This has raised the bar that this PR will need to meet. (I think I did something wrong when I measured MMQ performance for the previous plot; in this one the performance is much better.)
Well, speed-wise, according to that graph MMI8_8 appears to be almost on par with cuBLAS, which makes it very worth it in my opinion as it does save quite a bit of memory. Sadly though, perplexity seems noticeably higher than with the other methods.
Force-pushed from 34d1513 to d44f438
I'm slowly making progress. Square tensor core fragments seem to work better than the thin fragments I was using before. Also, asynchronous data copies do indeed seem to be faster than regular data copies (even without interleaving compute and memory I/O). Unfortunately, for best performance it seems to be necessary to copy a block of at least 512 bytes (currently 128 bytes in this PR). I did some more performance optimizations for cuBLAS on master but I don't expect to find more, so cuBLAS performance should now stay constant (unless full FP16 model evaluation is implemented). Currently MMI8_8 is again slightly faster than cuBLAS while MMI8_15 is about the same speed as cuBLAS prior to the performance optimizations.
I think I've cracked the code. The problem with tensor cores is that, while each fragment is technically just a struct with individual values, you have no guarantees regarding what those values are; which specific values each thread receives varies by GPU architecture. But you can apply scalar operations to these values and the effect will be the same for all GPU architectures because each value is treated the same. So you can (in addition to the scales per row/column) define scales per 16x16 or 32x8 tensor core fragment. Then you only have to apply these scales per fragment without any expensive calls. I have a prototype for single-precision MMI8 with per-fragment scales that gets 5.8107 PPL with q8_0 (vs. 5.7985 PPL with cuBLAS). However, among the repeating weight tensors
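A condensed sketch of that idea (simplified relative to the PR's actual kernel, using 16x16x16 fragments and illustrative names): run the int8 WMMA into an int32 accumulator, then fold in the per-fragment float scale with a plain per-element loop.

```cpp
#include <mma.h>
using namespace nvcuda;

// One int8 tensor core multiply-accumulate whose int32 result is scaled by a
// per-fragment float scale before being added to a float accumulator. This
// relies on int32 and float accumulator fragments of the same shape using the
// same thread<->element mapping.
__device__ void mma_i8_scaled(const int8_t * a_tile, const int8_t * b_tile, float fragment_scale,
                              wmma::fragment<wmma::accumulator, 16, 16, 16, float> & acc) {
    wmma::fragment<wmma::matrix_a,    16, 16, 16, int8_t, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b,    16, 16, 16, int8_t, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int>                     fc_tmp;

    wmma::load_matrix_sync(fa, a_tile, 16);
    wmma::load_matrix_sync(fb, b_tile, 16);
    wmma::fill_fragment(fc_tmp, 0);
    wmma::mma_sync(fc_tmp, fa, fb, fc_tmp);

    // Purely scalar per-element work: every element is treated identically, so the
    // (architecture-dependent) element ordering inside the fragment does not matter.
#pragma unroll
    for (int l = 0; l < fc_tmp.num_elements; ++l) {
        acc.x[l] += fragment_scale * fc_tmp.x[l];
    }
}
```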
Hey @JohannesGaessler, For example:
Another option is to use LLVM builtins directly for RDNA3 WMMA support. There's generally a 1:1 mapping between mma.h/rocWMMA and LLVM builtins (except for the loading/storing part, which can get complicated depending on how much performance you want to squeeze out of it). I have sample code for that on my blog.
ggml-cuda.cu
typedef nvcuda::wmma::fragment<nvcuda::wmma::matrix_b, 16, 16, 16, int8_t, nvcuda::wmma::col_major> frag_thin_b;
typedef nvcuda::wmma::fragment<nvcuda::wmma::accumulator, 16, 16, 16, int> frag_thin_c;

typedef cuda::barrier<cuda::thread_scope::thread_scope_block> cuda_barrier;
This may break compatibility with ROCm in case rocWMMA is used for AMD RDNA3 GPU support. Could there be a portable workaround using pure C++ device code with builtin sync primitives rather than something libcudacxx-specific?
I can try to make it so that AMD tensor cores can be used but I can't guarantee that the performance would be good enough to make it worthwhile.
Good, can't wait to test that on my 7900xtx
Thanks! It's still worth keeping the support, perhaps behind a feature flag that's set to false by default in case this isn't fast enough for the time being.
@jammm one thing that I had to do to make the implementation work is using cuda::memcpy_async.
I think I found it, hipMemcpyAsync.
No, that seems to be for RAM<->VRAM or VRAM<->VRAM copies; what I need is a VRAM->SRAM copy.
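For context, the kind of copy being discussed looks roughly like this on the CUDA side (a minimal sketch with illustrative sizes, using the cuda::barrier type already defined in this PR):

```cpp
#include <cuda/barrier>
#include <cstdint>

typedef cuda::barrier<cuda::thread_scope::thread_scope_block> cuda_barrier;

// Asynchronous global memory -> shared memory copy; this toy example is meant
// to be launched with 32 threads (32 threads * 4 bytes = 128 byte tile).
__global__ void copy_tile(const int8_t * __restrict__ src, int8_t * __restrict__ dst) {
    __shared__ int8_t tile[128];
    __shared__ cuda_barrier bar;

    if (threadIdx.x == 0) {
        init(&bar, blockDim.x);   // one expected arrival per participating thread
    }
    __syncthreads();

    constexpr int bytes_per_thread = 4;
    cuda::memcpy_async(tile + threadIdx.x*bytes_per_thread,
                       src  + threadIdx.x*bytes_per_thread,
                       bytes_per_thread, bar);   // no staging through registers

    bar.arrive_and_wait();   // all copies of this block have landed in shared memory

    dst[threadIdx.x] = tile[threadIdx.x];   // dummy use of the tile
}
```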
Force-pushed from d44f438 to e2f5c0e
I've pushed a prototype that should perform at least as well as cuBLAS. What's very annoying is that I've gone back to thin tiles in order to get more fragment scales per number of hidden state elements, but that makes it much harder to write a kernel for (4096x4096) * (4096x512) matrix multiplication that does not massively suffer from tail effects. My 3090 currently does 1.04 waves for each such matrix multiplication, which means it's effectively idling 50% of the time.
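To make the "1.04 waves" remark concrete, here is the back-of-the-envelope arithmetic (a sketch, not code from the PR):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // 1.04 waves means the hardware still has to execute 2 full waves, and the
    // second one is nearly empty, so roughly half the runtime is spent almost idle.
    const double waves      = 1.04;                      // launched blocks / blocks per wave
    const double efficiency = waves / std::ceil(waves);  // useful fraction of executed waves
    std::printf("effective utilization: %.0f%%\n", 100.0*efficiency);   // ~52%
    return 0;
}
```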
Not really; according to the documentation, the direct replacement for cudaMemcpyAsync is hipMemcpyAsync.
Yes, but I'm not asking about a replacement for cudaMemcpyAsync.
Force-pushed from e2f5c0e to f22123a
It was very tedious but I managed to tune the kernel in such a way that prompt processing with a batch size of 512 or more is fast. The implementation is kind of awkward, with each of the 8 warps working on 13 tensor core fragments, but with this number you get up to 98% GPU utilization on an RTX 3090. What's annoying is that strictly speaking you would need to tune this on a GPU-by-GPU basis. So I think I'll make versions of the kernel where the dimensions are just powers of 2 as the baseline and some specialized kernels that run well specifically on an RTX 3090 (Ti). Then at runtime evaluate a heuristic to estimate which kernel would have the best performance for the given matrix dimensions. Current performance:
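One way the runtime heuristic mentioned above could look, as a sketch rather than the PR's actual logic (tile sizes, occupancy numbers, and names are made up): estimate the tail-effect efficiency of each kernel variant for the given matrix shape, using the same wave arithmetic as above, and pick the best one.

```cpp
#include <cmath>

// A hypothetical kernel variant: the output tile one thread block covers and
// how many such blocks fit on an SM concurrently (assumed known from occupancy).
struct mmi8_variant {
    int tile_m, tile_n;
    int blocks_per_sm;
};

// Fraction of the executed waves that do useful work for an (m x n) output.
static double variant_efficiency(const mmi8_variant & v, int m, int n, int n_sm) {
    const int    blocks          = ((m + v.tile_m - 1)/v.tile_m) * ((n + v.tile_n - 1)/v.tile_n);
    const int    blocks_per_wave = n_sm * v.blocks_per_sm;
    const double waves           = double(blocks) / blocks_per_wave;
    return waves / std::ceil(waves);
}

static int pick_variant(const mmi8_variant * variants, int n_variants, int m, int n, int n_sm) {
    int    best            = 0;
    double best_efficiency = 0.0;
    for (int i = 0; i < n_variants; ++i) {
        const double e = variant_efficiency(variants[i], m, n, n_sm);
        if (e > best_efficiency) {
            best_efficiency = e;
            best            = i;
        }
    }
    return best;   // index of the kernel variant to launch
}
```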
I believe (at least for RDNA GPUs) there's no HW support for that. Copying data from global memory to LDS (the AMD term for shared memory) without using intermediate VGPRs (registers) isn't supported. So as a HIP-specific workaround you may want to just copy manually using pointers or perhaps use a plain memcpy.
Right, it seems you're looking for a function that can be called from within your device kernel to copy from global memory to LDS, and not the HIP runtime function hipMemcpyAsync that dispatches a separate kernel for asynchronous copies. Sadly, AFAIK, we don't have hardware support for the former at the moment. EDIT: the MI300 ISA does indeed have support for direct loads from global to LDS (section 9.1.9, https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf).
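A portable fallback along the lines suggested above might look roughly like this (a sketch; the function name is made up, and on the CUDA side the cuda::memcpy_async path would simply be kept):

```cpp
// On AMD, stage the copy through registers with a plain per-thread loop,
// since there is no cp.async-style direct global -> LDS path on RDNA3.
#if defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)
template <int n_bytes>
static __device__ __forceinline__ void copy_global_to_lds(
        int8_t * __restrict__ dst_lds, const int8_t * __restrict__ src_global) {
#pragma unroll
    for (int i = 0; i < n_bytes; ++i) {
        dst_lds[i] = src_global[i];   // goes through a VGPR, unlike cp.async on NVIDIA
    }
}
#endif // defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)
```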
ggml-cuda.cu
const int nrows_y, const int nrows_dst) {

// #if __CUDA_ARCH__ >= CC_VOLTA && !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__))
typedef nvcuda::wmma::fragment<nvcuda::wmma::matrix_a, 32, 8, 16, int8_t, nvcuda::wmma::row_major> frag_thin_a;
Re. AMD RDNA3 support - only 16x16x16 sizes are supported for all of matrix_a, matrix_b and accumulator.
That's a problem. I'm specifically using 32x8 fragments because that way I can define more integer scales for the b fragments so the precision is better.
Hmm. Could 32x16 work?
I previously had an implementation with 16x16 fragments. It does work but the precision is worse. The increase in perplexity with single precision MMI8 compared to cuBLAS FP16 GEMM was ~40% higher.
Noob question - so the higher perplexity is because 16 scales don't provide enough precision compared to 32 scales? If so, then I wonder if 32x16 is worth a try (two 16x16's somehow, reusing the accumulator from the first for the second). But I'm unsure of how it would affect the perplexity score.
The trick to making the current single-precision implementation work is to calculate additional scales for the "b" fragments, i.e. the fragments of the hidden state. This makes it so that large values in the hidden state only affect the precision of a few values rather than all of the values in a column. So it is better to use small b fragments because you can scale the values with a finer granularity. 32x16 (a) with 16x8 (b) is better than 16x16 with 16x16 because the b fragments are smaller (the a fragment size does not matter). So 32x16 would in fact be worse than 16x16.
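As a host-side sketch of that idea (not the PR's actual code; the fragment size and names are illustrative), quantizing one 16x8 b fragment of the hidden state with its own scale looks like this:

```cpp
#include <cmath>
#include <cstdint>

// Quantize n = 16*8 = 128 hidden state values to int8 with a single float scale
// for this fragment; an outlier then only degrades the precision of these 128
// values instead of a whole column.
static void quantize_b_fragment(const float * x, int n, int8_t * q, float * scale) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) {
        amax = std::fmax(amax, std::fabs(x[i]));
    }
    *scale = amax / 127.0f;
    const float id = amax > 0.0f ? 127.0f/amax : 0.0f;
    for (int i = 0; i < n; ++i) {
        q[i] = (int8_t) std::lround(x[i] * id);   // int8 values fed to the tensor cores
    }
}
```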
ggml-cuda.cu
const int j_tile = j + threadIdx.x/4;
const int k_tile = WARP_SIZE + threadIdx.x % 4;

cuda::memcpy_async(&tile_y_qs[j_tile*(WARP_SIZE + MMI8_PADDING_Y) + k_tile],
Re. RDNA3 support - use normal memcpy or raw pointer copy
On AMD cards, how many registers are available per streaming multiprocessor (or whatever the equivalent is)? I'm asking because memcpy_async is crucial for reducing register pressure by copying data directly from VRAM to shared memory without going through registers.
Assuming an SM is equivalent to an AMD compute unit (CU), each CU has 2 SIMD32s and each SIMD32 has 1536 VGPRs (registers) for navi31 and navi32 (gfx1100/gfx1101). So each CU has 3072 VGPRs. Each shader can have a maximum of 256 VGPRs allocated to it. Assuming CU mode (which is the default for RDNA when compiling via hipcc), for full occupancy of 16 waves per SIMD32 you'd want to be at or below 96 VGPRs. For more info see section 3.3.2.1 of the RDNA3 ISA docs: https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf
FWIW, navi31/navi32 of RDNA3 have 50% more registers than navi2 of RDNA2. gfx1102 and gfx1103 (Radeon 780M) have 1024 VGPRs per SIMD, similar to navi2.
You can compile using --save-temps to get the disassembly in .s file(s) that contain metrics for register pressure, LDS usage, scratch usage, etc.
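Plugging those numbers into the occupancy arithmetic (a rough sketch that ignores allocation granularity and other limits such as LDS):

```cpp
#include <algorithm>
#include <cstdio>
#include <initializer_list>

int main() {
    const int vgprs_per_simd32 = 1536;   // navi31/navi32
    const int max_waves        = 16;     // per SIMD32 in CU mode
    for (int vgprs_per_wave : {64, 96, 128, 256}) {
        const int waves = std::min(max_waves, vgprs_per_simd32 / vgprs_per_wave);
        std::printf("%3d VGPRs/wave -> %2d waves per SIMD32\n", vgprs_per_wave, waves);
    }
    return 0;   // 96 VGPRs is the largest budget that still allows 16 waves
}
```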
ggml-cuda.cu
nvcuda::wmma::mma_sync(fc_tmp, fa, fb, fc_tmp);
#pragma unroll
for (int l = 0; l < 32*8/WARP_SIZE; ++l) {
    fc[j/8].x[l] += bs * fc_tmp.x[l];
@jammm here I'm directly accessing the data that each thread holds and scaling it. Is this something that you can do on AMD?
Yes, if you use rocWMMA, this should just work. See https://github.com/ROCm/rocWMMA/blob/7f00ff71f1d4536a86c94a963d7224a2ca9516f6/library/include/rocwmma/rocwmma.hpp#L247-L251
But I would recommend using fc.num_elements instead of 32*8/WARP_SIZE, as the number of matrix elements per fragment can vary across vendors and archs. FWIW, make sure you only exchange data between fragments of the same type. Here both fc and fc_tmp are of accumulator type, so it should be fine.
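For illustration, the scaling loop above might look roughly like this with rocWMMA (a sketch; the surrounding kernel is omitted, and the element-mapping assumption is the same one the PR already relies on):

```cpp
#include <rocwmma/rocwmma.hpp>

using frag_acc_i32 = rocwmma::fragment<rocwmma::accumulator, 16, 16, 16, int32_t>;
using frag_acc_f32 = rocwmma::fragment<rocwmma::accumulator, 16, 16, 16, float>;

// Scale the int32 result of one WMMA by the fragment's float scale and add it
// to a float accumulator; num_elements abstracts the per-thread element count.
__device__ void scale_accumulate(frag_acc_f32 & fc, const frag_acc_i32 & fc_tmp, float bs) {
#pragma unroll
    for (int l = 0; l < fc.num_elements; ++l) {
        fc.x[l] += bs * fc_tmp.x[l];   // identical scalar treatment of every element
    }
}
```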
Force-pushed from f22123a to 2daab3b
I did some more performance optimizations and added a few kernel variants with smaller tile sizes that work better for small batch sizes. The performance is universally better than cuBLAS. I will now focus on getting this PR into a state where it can be merged. In particular:
This is definitely a problem. We need to be more careful about not instantiating too many combinations of the same function; in many cases the performance difference is meaningless.
The amount of VRAM and general bloat saved by Johannes' matmul kernels in comparison to cuBLAS is not meaningless, however. This benefit far outweighs the longer compilation time in my opinion.
Easy to say when you are not the one working on the code. Anyway, that's not really relevant to this issue.
Force-pushed from 2daab3b to 979a9bf
I ran some tests using the new KL divergence code added in #5076:
Measured in KL divergence, using MMI8_8 for all tensors has roughly the same precision loss as using q6_K instead of FP16 (but I don't know if you can just treat it as additive). For q5_K_S and below, the precision loss from MMI8_8 already seems to be small compared to the precision loss from quantization.
The incoming RDNA4 GPUs have most of the stuff needed for INT8 matrix multiplications.
This PR has become obsolete.
We have it on RDNA3 too: https://gpuopen.com/learn/wmma_on_rdna3/ (INT8/INT4/FP16/BF16 ...) @JohannesGaessler or others
This PR has been obsoleted by the kernels in
After much trial and error I finally have a working prototype with competitive performance for CUDA matrix-matrix multiplication using int8 tensor core operations. At first I tried to do the implementation in the same way as with MMQ: load quantized data blocks, do the int8 arithmetic, then scale the result with the block scales. The problem with tensor cores, however, is that loading the results from them is very slow. So to get good performance you need large blocks of integers with as few floating point scales as possible. However, as discussed in #4755, large block sizes seem to lead to potentially bad results for the hidden state. The solution that I implemented for this is to use only a single scale for a row of the hidden state but to quantize it to "double precision", i.e. to use two int8 values to represent the floating point value. The total precision is 15 bits (16-1 because you need two sign bits). With double precision twice as many int8 tensor core operations need to be performed but this seems to fix the precision issues. The weights are always transformed to single-precision int8 values with a single scale per row.
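To illustrate the "double precision" encoding described above, here is one plausible way the hi/lo split could look (a sketch, not necessarily the exact encoding used in this PR):

```cpp
#include <cmath>
#include <cstdint>

// Split each hidden state value into a high and a low int8 part against a
// single row scale; the two parts feed two separate int8 tensor core GEMMs
// whose int32 results are recombined as hi*2^7 + lo.
static void quantize_row_i8_dp(const float * x, int n, int8_t * hi, int8_t * lo, float * scale) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) {
        amax = std::fmax(amax, std::fabs(x[i]));
    }
    *scale = amax / ((1 << 14) - 1);   // ~15 bits of signed precision per value
    const float id = amax > 0.0f ? ((1 << 14) - 1)/amax : 0.0f;

    for (int i = 0; i < n; ++i) {
        const int q = (int) std::lround(x[i] * id);   // q in [-16383, 16383]
        hi[i] = (int8_t) (q >> 7);                    // high part (carries the sign)
        lo[i] = (int8_t) (q - (hi[i] << 7));          // low 7 bits, in [0, 127]
    }
}

// What the combination of the two scaled GEMM results implements per value:
static float dequantize_i8_dp(int8_t hi, int8_t lo, float scale) {
    return scale * (hi*(1 << 7) + lo);
}
```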
Currently I get the following results in terms of t/s and perplexity for 7b using my RTX 3090:
Note: the MMI8 results for all quantization formats other than q8_0 are not representative in terms of performance because only q8_0 has a fast kernel for the q8_0 -> int8 conversion. For q8_0, single-precision MMI8_8 is 9% faster than cuBLAS and MMI8_15 is 11% slower. Perplexity with MMI8_8 is bad; with MMI8_15 I think it's okay (other than for q8_0, where I think there is something wrong with the conversion kernel; previously it was roughly +0.002 compared to cuBLAS). Performance for q8_0 scales as follows when the batch size is varied:
MMI8_8 is generally the fastest; at small batch sizes MMI8_15 is slightly faster than cuBLAS, at large batch sizes it's a little slower. VRAM usage for 7b q8_0 with a batch size of 512 is 7798 MiB for cuBLAS and 7642 MiB for MMI8_15.
Overall I think this PR in its current state is not worth merging. For one, MMI8_8 is simply not precise enough (for transformers at least). And while there are some cases where MMI8_15 is slightly faster than cuBLAS, it is currently not enough to warrant the additional complexity. I think there is still potential for performance optimization though: the MMI8_15 kernel achieves only 28% tensor core utilization compared to 45% utilization for the equivalent kernel used by cuBLAS.