CUDA: int8 tensor core matrix multiplication #4801
Conversation
I don't know enough about this to comment on the implementation, but if the issue is scaling the result, maybe dequantizing to fp16 and multiplying as fp16 could work. It should still be faster than cuBLAS, and it would reduce the memory requirements.
I have not actually tried this but I do not expect this to work well:
Not directly the same but in a previous version I did the q8_0 -> int8 conversion in the matrix multiplication kernel. This alone took up ~25% of the kernel's runtime. On a more general note, now that I think about it my kernel is mostly just doing regular int8 matrix multiplication with some scaling at the end. But maybe it'd be possible to somehow use cuBLAS GEMM and add a small kernel for scaling afterwards; I'll investigate.
Okay, so apparently cuBLAS allows you to do (int8, int8) -> int32 matrix multiplications. But it seems that will not be of use to us. You can do two individual matrix multiplications to replace the kernel I wrote but that is slower even without an extra kernel for scaling afterwards. What we would need is an operation
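For reference, a minimal sketch of the (int8, int8) -> int32 cuBLAS path mentioned above, assuming cublasGemmEx with CUBLAS_COMPUTE_32I (not code from this PR; cuBLAS imposes extra alignment/transpose restrictions on int8 GEMMs, so check the docs before relying on this):

```cpp
#include <cublas_v2.h>
#include <cstdint>

// Column-major shapes: A is m x k, B is k x n, C is m x n.
cublasStatus_t gemm_i8_i32(cublasHandle_t handle,
                           const int8_t * A, const int8_t * B, int32_t * C,
                           int m, int n, int k) {
    const int32_t alpha = 1;   // integer alpha/beta because the compute type is 32I
    const int32_t beta  = 0;
    return cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        &alpha,
                        A, CUDA_R_8I,  m,
                        B, CUDA_R_8I,  k,
                        &beta,
                        C, CUDA_R_32I, m,
                        CUBLAS_COMPUTE_32I, CUBLAS_GEMM_DEFAULT);
}
```

The missing piece, as noted above, is applying the per-row/column scales, which is why a separate scaling kernel would still be needed afterwards.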
Quite a stunning speedup; it looks like for anything below the RTX 40 series this is a winner already. In case you are interested, a few weeks ago (#4310) I outlined to gg how to do native FP8 (which is twice the speed of FP16) in llama.cpp by using wrappers to transparently "force" it into ggml-cuda.cu while using the new cublasLt API, which is required for FP8.
I wonder how this would work for RDNA3 WMMA. I'd like to take advantage of WMMA.
After the bugfix in #4809 prompt processing on master has become faster, particularly so for cuBLAS. This has raised the bar that this PR will need to meet. (I think I did something wrong when I measured MMQ performance for the previous plot; in this one the performance is much better.)
Well, speed-wise, according to that graph MMI8_8 appears to be almost on par with cuBLAS, which makes it very worth it in my opinion as it does save quite a bit of memory. Sadly though, perplexity seems noticeably higher than with the other methods.
Force-pushed from 34d1513 to d44f438
I'm slowly making progress. Square tensor core fragments seem to work better than the thin fragments I was using before. Also, asynchronous data copies do indeed seem to be faster than regular data copies (even without interleaving compute and memory I/O). Unfortunately, for best performance it seems to be necessary to copy a block of at least 512 bytes (currently 128 bytes in this PR). I did some more performance optimizations for cuBLAS on master but I don't expect to find more, so cuBLAS performance should now stay constant (unless full FP16 model evaluation is implemented). Currently MMI8_8 is again slightly faster than cuBLAS while MMI8_15 is about the same speed as cuBLAS prior to the performance optimizations.
I think I've cracked the code. The problem with tensor cores is that, while each fragment is technically just a struct with individual values, you have no guarantees regarding what those values are; which specific values each thread receives varies by GPU architecture. But you can apply scalar operations to these values and the effect will be the same for all GPU architectures because each value is treated the same. So you can (in addition to the scales per row/column) define scales per 16x16 or 32x8 tensor core fragment. Then you only have to apply these scales per fragment without any expensive calls. I have a prototype for single-precision MMI8 with per-fragment scales that gets 5.8107 PPL with q8_0 (vs. 5.7985 PPL with cuBLAS). However, among the repeating weight tensors
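A condensed sketch of that idea (simplified relative to the PR's actual kernel, using 16x16x16 fragments and illustrative names): run the int8 WMMA into an int32 accumulator, then fold in the per-fragment float scale with a plain per-element loop.

```cpp
#include <mma.h>
using namespace nvcuda;

// One int8 tensor core multiply-accumulate whose int32 result is scaled by a
// per-fragment float scale before being added to a float accumulator. This
// relies on int32 and float accumulator fragments of the same shape using the
// same thread<->element mapping.
__device__ void mma_i8_scaled(const int8_t * a_tile, const int8_t * b_tile, float fragment_scale,
                              wmma::fragment<wmma::accumulator, 16, 16, 16, float> & acc) {
    wmma::fragment<wmma::matrix_a,    16, 16, 16, int8_t, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b,    16, 16, 16, int8_t, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int>                     fc_tmp;

    wmma::load_matrix_sync(fa, a_tile, 16);
    wmma::load_matrix_sync(fb, b_tile, 16);
    wmma::fill_fragment(fc_tmp, 0);
    wmma::mma_sync(fc_tmp, fa, fb, fc_tmp);

    // Purely scalar per-element work: every element is treated identically, so the
    // (architecture-dependent) element ordering inside the fragment does not matter.
#pragma unroll
    for (int l = 0; l < fc_tmp.num_elements; ++l) {
        acc.x[l] += fragment_scale * fc_tmp.x[l];
    }
}
```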
Hey @JohannesGaessler, For example:
Another option is to use LLVM builtins directly for RDNA3 WMMA support. There's generally a 1:1 mapping between mma.h/rocWMMA and LLVM builtins (except for the loading/storing part, which can get complicated depending on how much performance you want to squeeze out of it). I have sample code for that on my blog.
ggml-cuda.cu
typedef nvcuda::wmma::fragment<nvcuda::wmma::matrix_b, 16, 16, 16, int8_t, nvcuda::wmma::col_major> frag_thin_b;
typedef nvcuda::wmma::fragment<nvcuda::wmma::accumulator, 16, 16, 16, int> frag_thin_c;

typedef cuda::barrier<cuda::thread_scope::thread_scope_block> cuda_barrier;
This may break compatibility with ROCm in case rocWMMA is used for AMD RDNA3 GPU support. Could there be a portable workaround using pure C++ device code with builtin sync primitives rather than something libcudacxx-specific?
I can try to make it so that AMD tensor cores can be used but I can't guarantee that the performance would be good enough to make it worthwhile.
Good, can't wait to test that on my 7900xtx
Thanks! It's still worth keeping the support, perhaps behind a feature flag that's set to false by default in case this isn't fast enough for the time being.
@jammm one thing that I had to do to make the implementation work is using cuda::memcpy_async.
I think I found it, hipMemcpyAsync.
No, that seems to be for RAM<->VRAM or VRAM<->VRAM copies; what I need is a VRAM->SRAM copy.
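For context, the kind of copy being discussed looks roughly like this on the CUDA side (a minimal sketch with illustrative sizes, using the cuda::barrier type already defined in this PR):

```cpp
#include <cuda/barrier>
#include <cstdint>

typedef cuda::barrier<cuda::thread_scope::thread_scope_block> cuda_barrier;

// Asynchronous global memory -> shared memory copy; this toy example is meant
// to be launched with 32 threads (32 threads * 4 bytes = 128 byte tile).
__global__ void copy_tile(const int8_t * __restrict__ src, int8_t * __restrict__ dst) {
    __shared__ int8_t tile[128];
    __shared__ cuda_barrier bar;

    if (threadIdx.x == 0) {
        init(&bar, blockDim.x);   // one expected arrival per participating thread
    }
    __syncthreads();

    constexpr int bytes_per_thread = 4;
    cuda::memcpy_async(tile + threadIdx.x*bytes_per_thread,
                       src  + threadIdx.x*bytes_per_thread,
                       bytes_per_thread, bar);   // no staging through registers

    bar.arrive_and_wait();   // all copies of this block have landed in shared memory

    dst[threadIdx.x] = tile[threadIdx.x];   // dummy use of the tile
}
```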
Force-pushed from d44f438 to e2f5c0e
I've pushed a prototype that should perform at least as well as cuBLAS. What's very annoying is that I've gone back to thin tiles in order to get more fragment scales per number of hidden state elements, but that makes it much harder to write a kernel for (4096x4096) * (4096x512) matrix multiplication that does not massively suffer from tail effects. My 3090 currently does 1.04 waves for each such matrix multiplication, which means it's effectively idling 50% of the time.
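To make the "1.04 waves" remark concrete, here is the back-of-the-envelope arithmetic (a sketch, not code from the PR):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    // 1.04 waves means the hardware still has to execute 2 full waves, and the
    // second one is nearly empty, so roughly half the runtime is spent almost idle.
    const double waves      = 1.04;                      // launched blocks / blocks per wave
    const double efficiency = waves / std::ceil(waves);  // useful fraction of executed waves
    std::printf("effective utilization: %.0f%%\n", 100.0*efficiency);   // ~52%
    return 0;
}
```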
Not really; according to the documentation, the direct replacement for cudaMemcpyAsync is hipMemcpyAsync.
Yes, but I'm not asking about a replacement for cudaMemcpyAsync.
Force-pushed from e2f5c0e to f22123a
It was very tedious but I managed to tune the kernel in such a way that prompt processing with a batch size of 512 or more is fast. The implementation is kind of awkward, with each of the 8 warps working on 13 tensor core fragments, but with this number you get up to 98% GPU utilization on an RTX 3090. What's annoying is that strictly speaking you would need to tune this on a GPU-by-GPU basis. So I think I'll make versions of the kernel where the dimensions are just powers of 2 as the baseline and some specialized kernels that run well specifically on an RTX 3090 (Ti). Then at runtime evaluate a heuristic to estimate which kernel would have the best performance for the given matrix dimensions. Current performance:
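One way the runtime heuristic mentioned above could look, as a sketch rather than the PR's actual logic (tile sizes, occupancy numbers, and names are made up): estimate the tail-effect efficiency of each kernel variant for the given matrix shape, using the same wave arithmetic as above, and pick the best one.

```cpp
#include <cmath>

// A hypothetical kernel variant: the output tile one thread block covers and
// how many such blocks fit on an SM concurrently (assumed known from occupancy).
struct mmi8_variant {
    int tile_m, tile_n;
    int blocks_per_sm;
};

// Fraction of the executed waves that do useful work for an (m x n) output.
static double variant_efficiency(const mmi8_variant & v, int m, int n, int n_sm) {
    const int    blocks          = ((m + v.tile_m - 1)/v.tile_m) * ((n + v.tile_n - 1)/v.tile_n);
    const int    blocks_per_wave = n_sm * v.blocks_per_sm;
    const double waves           = double(blocks) / blocks_per_wave;
    return waves / std::ceil(waves);
}

static int pick_variant(const mmi8_variant * variants, int n_variants, int m, int n, int n_sm) {
    int    best            = 0;
    double best_efficiency = 0.0;
    for (int i = 0; i < n_variants; ++i) {
        const double e = variant_efficiency(variants[i], m, n, n_sm);
        if (e > best_efficiency) {
            best_efficiency = e;
            best            = i;
        }
    }
    return best;   // index of the kernel variant to launch
}
```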
I believe (at least for RDNA GPUs) there's no HW support for that. Copying data from global memory to LDS (the AMD term for shared memory) without using intermediate VGPRs (registers) isn't supported. So as a HIP-specific workaround you may want to just copy manually using pointers or perhaps use a plain memcpy.
Right, it seems you're looking for a function that can be called from within your device kernel to copy from global memory to LDS, and not the HIP runtime function hipMemcpyAsync that dispatches a separate kernel for asynchronous copies. Sadly, AFAIK, we don't have hardware support for the former at the moment. EDIT: the MI300 ISA does indeed have support for direct loads from global to LDS (section 9.1.9, https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf).
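A portable fallback along the lines suggested above might look roughly like this (a sketch; the function name is made up, and on the CUDA side the cuda::memcpy_async path would simply be kept):

```cpp
// On AMD, stage the copy through registers with a plain per-thread loop,
// since there is no cp.async-style direct global -> LDS path on RDNA3.
#if defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)
template <int n_bytes>
static __device__ __forceinline__ void copy_global_to_lds(
        int8_t * __restrict__ dst_lds, const int8_t * __restrict__ src_global) {
#pragma unroll
    for (int i = 0; i < n_bytes; ++i) {
        dst_lds[i] = src_global[i];   // goes through a VGPR, unlike cp.async on NVIDIA
    }
}
#endif // defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__)
```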
ggml-cuda.cu
const int nrows_y, const int nrows_dst) {

// #if __CUDA_ARCH__ >= CC_VOLTA && !(defined(GGML_USE_HIPBLAS) && defined(__HIP_PLATFORM_AMD__))
typedef nvcuda::wmma::fragment<nvcuda::wmma::matrix_a, 32, 8, 16, int8_t, nvcuda::wmma::row_major> frag_thin_a;
Re. AMD RDNA3 support - only 16x16x16 sizes are supported for all of matrix_a, matrix_b and accumulator.
That's a problem. I'm specifically using 32x8 fragments because that way I can define more integer scales for the b fragments so the precision is better.
Hmm. Could 32x16 work?
I previously had an implementation with 16x16 fragments. It does work but the precision is worse. The increase in perplexity with single precision MMI8 compared to cuBLAS FP16 GEMM was ~40% higher.
Noob question - so the higher perplexity is because 16 scales don't provide enough precision compared to 32 scales? If so, then I wonder if 32x16 is worth a try (two 16x16's somehow, reusing the accumulator from the first for the second). But I'm unsure of how it would affect the perplexity score.
The trick to making the current single-precision implementation work is to calculate additional scales for the "b" fragments, i.e. the fragments of the hidden state. This makes it so that large values in the hidden state only affect the precision of a few values rather than all of the values in a column. So it is better to use small b fragments because you can scale the values with a finer granularity. 32x16 (a) with 16x8 (b) is better than 16x16 with 16x16 because the b fragments are smaller (the a fragment size does not matter). So 32x16 would in fact be worse than 16x16.
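As a host-side sketch of that idea (not the PR's actual code; the fragment size and names are illustrative), quantizing one 16x8 b fragment of the hidden state with its own scale looks like this:

```cpp
#include <cmath>
#include <cstdint>

// Quantize n = 16*8 = 128 hidden state values to int8 with a single float scale
// for this fragment; an outlier then only degrades the precision of these 128
// values instead of a whole column.
static void quantize_b_fragment(const float * x, int n, int8_t * q, float * scale) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) {
        amax = std::fmax(amax, std::fabs(x[i]));
    }
    *scale = amax / 127.0f;
    const float id = amax > 0.0f ? 127.0f/amax : 0.0f;
    for (int i = 0; i < n; ++i) {
        q[i] = (int8_t) std::lround(x[i] * id);   // int8 values fed to the tensor cores
    }
}
```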
ggml-cuda.cu
const int j_tile = j + threadIdx.x/4;
const int k_tile = WARP_SIZE + threadIdx.x % 4;

cuda::memcpy_async(&tile_y_qs[j_tile*(WARP_SIZE + MMI8_PADDING_Y) + k_tile],
Re. RDNA3 support - use normal memcpy or raw pointer copy
On AMD cards, how many registers are available per streaming multiprocessor (or whatever the equivalent is)? I'm asking because memcpy_async is crucial for reducing register pressure by copying data directly from VRAM to shared memory without going through registers.
Assuming an SM is equivalent to an AMD compute unit (CU), each CU has 2 SIMD32s and each SIMD32 has 1536 VGPRs (registers) for navi31 and navi32 (gfx1100/gfx1101). So each CU has 3072 VGPRs. Each shader can have a maximum of 256 VGPRs allocated to it. Assuming CU mode (which is the default for RDNA when compiling via hipcc), for full occupancy of 16 waves per SIMD32 you'd want to be at or below 96 VGPRs. For more info see section 3.3.2.1 of the RDNA3 ISA docs: https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf
FWIW, navi31/navi32 of RDNA3 have 50% more registers than navi2 of RDNA2. gfx1102 and gfx1103 (Radeon 780M) have 1024 VGPRs per SIMD, similar to navi2.
You can compile using --save-temps to get the disassembly in .s file(s) that contain metrics for register pressure, LDS usage, scratch usage, etc.
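Plugging those numbers into the occupancy arithmetic (a rough sketch that ignores allocation granularity and other limits such as LDS):

```cpp
#include <algorithm>
#include <cstdio>
#include <initializer_list>

int main() {
    const int vgprs_per_simd32 = 1536;   // navi31/navi32
    const int max_waves        = 16;     // per SIMD32 in CU mode
    for (int vgprs_per_wave : {64, 96, 128, 256}) {
        const int waves = std::min(max_waves, vgprs_per_simd32 / vgprs_per_wave);
        std::printf("%3d VGPRs/wave -> %2d waves per SIMD32\n", vgprs_per_wave, waves);
    }
    return 0;   // 96 VGPRs is the largest budget that still allows 16 waves
}
```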
ggml-cuda.cu
nvcuda::wmma::mma_sync(fc_tmp, fa, fb, fc_tmp);
#pragma unroll
for (int l = 0; l < 32*8/WARP_SIZE; ++l) {
    fc[j/8].x[l] += bs * fc_tmp.x[l];
@jammm here I'm directly accessing the data that each thread holds and scaling it. Is this something that you can do on AMD?
Yes, if you use rocWMMA, this should just work. See https://github.com/ROCm/rocWMMA/blob/7f00ff71f1d4536a86c94a963d7224a2ca9516f6/library/include/rocwmma/rocwmma.hpp#L247-L251
But I would recommend using fc.num_elements instead of 32*8/WARP_SIZE, as the number of matrix elements per fragment can vary across vendors and archs. FWIW, make sure you only exchange data between fragments of the same type. Here both fc and fc_tmp are of accumulator type, so it should be fine.
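For illustration, the scaling loop above might look roughly like this with rocWMMA (a sketch; the surrounding kernel is omitted, and the element-mapping assumption is the same one the PR already relies on):

```cpp
#include <rocwmma/rocwmma.hpp>

using frag_acc_i32 = rocwmma::fragment<rocwmma::accumulator, 16, 16, 16, int32_t>;
using frag_acc_f32 = rocwmma::fragment<rocwmma::accumulator, 16, 16, 16, float>;

// Scale the int32 result of one WMMA by the fragment's float scale and add it
// to a float accumulator; num_elements abstracts the per-thread element count.
__device__ void scale_accumulate(frag_acc_f32 & fc, const frag_acc_i32 & fc_tmp, float bs) {
#pragma unroll
    for (int l = 0; l < fc.num_elements; ++l) {
        fc.x[l] += bs * fc_tmp.x[l];   // identical scalar treatment of every element
    }
}
```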
Force-pushed from f22123a to 2daab3b
I did some more performance optimizations and added a few kernel variants with smaller tile sizes that work better for small batch sizes. The performance is universally better than cuBLAS. I will now focus on getting this PR into a state where it can be merged. In particular:
This is definitely a problem. We need to be more careful about not instantiating too many combinations of the same function; in many cases the performance difference is meaningless.
The amount of VRAM and general bloat saved by Johannes' matmul kernels in comparison to cuBLAS is not meaningless, however. This benefit far outweighs the longer compilation time in my opinion.
Easy to say when you are not the one working on the code. Anyway, that's not really relevant to this issue.
Force-pushed from 2daab3b to 979a9bf
I ran some tests using the new KL divergence code added in #5076:
Measured in KL divergence, using MMI8_8 for all tensors has roughly the same precision loss as using q6_K instead of FP16 (but I don't know if you can just treat it as additive). For q5_K_S and below, the precision loss from MMI8_8 already seems to be small compared to the precision loss from quantization.
The incoming RDNA4 GPUs have most of the stuff needed for INT8 matrix multiplications.
This PR has become obsolete.
We have it on RDNA3 too: https://gpuopen.com/learn/wmma_on_rdna3/ (INT8/INT4/FP16/BF16 ...) @JohannesGaessler or others
This PR has been obsoleted by the kernels in
After much trial and error I finally have a working prototype with competitive performance for CUDA matrix-matrix multiplication using int8 tensor core operations. At first I tried to do the implementation in the same way as with MMQ: load quantized data blocks, do the int8 arithmetic, then scale the result with the block scales. The problem with tensor cores, however, is that loading the results from them is very slow. So to get good performance you need large blocks of integers with as few floating point scales as possible. However, as discussed in #4755, large block sizes seem to lead to potentially bad results for the hidden state. The solution that I implemented for this is to use only a single scale for a row of the hidden state but to quantize it to "double precision", i.e. to use two int8 values to represent the floating point value. The total precision is 15 bits (16-1 because you need two sign bits). With double precision twice as many int8 tensor core operations need to be performed but this seems to fix the precision issues. The weights are always transformed to single-precision int8 values with a single scale per row.
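To illustrate the "double precision" encoding described above, here is one plausible way the hi/lo split could look (a sketch, not necessarily the exact encoding used in this PR):

```cpp
#include <cmath>
#include <cstdint>

// Split each hidden state value into a high and a low int8 part against a
// single row scale; the two parts feed two separate int8 tensor core GEMMs
// whose int32 results are recombined as hi*2^7 + lo.
static void quantize_row_i8_dp(const float * x, int n, int8_t * hi, int8_t * lo, float * scale) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) {
        amax = std::fmax(amax, std::fabs(x[i]));
    }
    *scale = amax / ((1 << 14) - 1);   // ~15 bits of signed precision per value
    const float id = amax > 0.0f ? ((1 << 14) - 1)/amax : 0.0f;

    for (int i = 0; i < n; ++i) {
        const int q = (int) std::lround(x[i] * id);   // q in [-16383, 16383]
        hi[i] = (int8_t) (q >> 7);                    // high part (carries the sign)
        lo[i] = (int8_t) (q - (hi[i] << 7));          // low 7 bits, in [0, 127]
    }
}

// What the combination of the two scaled GEMM results implements per value:
static float dequantize_i8_dp(int8_t hi, int8_t lo, float scale) {
    return scale * (hi*(1 << 7) + lo);
}
```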
Currently I get the following results in terms of t/s and perplexity for 7b using my RTX 3090:
Note: the MMI8 results for all quantization formats other than q8_0 are not representative in terms of performance because only q8_0 has a fast kernel for the q8_0 -> int8 conversion. For q8_0, single-precision MMI8_8 is 9% faster than cuBLAS and MMI8_15 is 11% slower. Perplexity with MMI8_8 is bad; with MMI8_15 I think it's okay (other than for q8_0, where I think there is something wrong with the conversion kernel; previously it was roughly +0.002 compared to cuBLAS). Performance for q8_0 scales as follows when the batch size is varied:
MMI8_8 is generally the fastest; at small batch sizes MMI8_15 is slightly faster than cuBLAS, at large batch sizes it's a little slower. VRAM usage for 7b q8_0 with a batch size of 512 is 7798 MiB for cuBLAS and 7642 MiB for MMI8_15.
Overall I think this PR in its current state is not worth merging. For one, MMI8_8 is simply not precise enough (for transformers at least). And while there are some cases where MMI8_15 is slightly faster than cuBLAS, it is currently not enough to warrant the additional complexity. I think there is still potential for performance optimization though: the MMI8_15 kernel achieves only 28% tensor core utilization compared to 45% utilization for the equivalent kernel used by cuBLAS.