More GPU threads for dequantization #1341

Closed

JohannesGaessler (Collaborator) commented May 6, 2023

On master, CUDA kernels for dequantization are launched as blocks with only a single thread each. This is an inefficient way of utilizing GPUs, since they can execute many threads in parallel. This PR changes the kernel launch configuration to use 256 times fewer blocks, with each block containing 256 threads. On my hardware (GTX 1070) this speeds up prompt processing by ~14% for 33b and a prompt with 399 tokens (navy seals copypasta). I only implemented and tested q4_0 and q5_1 because there are pending breaking quantization changes in #1305, but the change is very simple and should be easy to adapt; just tell me how you want to proceed. Notably, I was not able to 100% ensure that generations are unchanged by this PR because I am hitting a bug that affects the reproducibility of CUDA-accelerated prompt processing, as described in #1340.
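
For illustration, a minimal sketch of the launch-configuration change being described (not the exact diff): each thread derives its quantized-block index from both the CUDA block and thread IDs, and the launcher starts 256 threads per block. The loop body mirrors the q4_2 kernel quoted later in this thread; the constant 256 is written out explicitly here for clarity.

// Sketch: each thread handles one quantized block; its index now comes from
// both the CUDA block index and the thread index within the block.
static __global__ void dequantize_block_q4_0(const void * vx, float * y) {
    const block_q4_0 * x = (const block_q4_0 *) vx;

    const int i = blockIdx.x*blockDim.x + threadIdx.x; // previously just blockIdx.x

    const float d = x[i].d;
    const uint8_t * pp = x[i].qs;

    for (int l = 0; l < QK4_0; l += 2) {
        const uint8_t vi = pp[l/2];

        const int8_t vi0 = vi & 0xf;
        const int8_t vi1 = vi >> 4;

        y[i*QK4_0 + l + 0] = (vi0 - 8)*d;
        y[i*QK4_0 + l + 1] = (vi1 - 8)*d;
    }
}

// Sketch: launch 256 times fewer blocks, each with 256 threads. This first
// version assumes the number of quantized blocks is divisible by 256.
static void dequantize_row_q4_0_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
    const int nb = k / QK4_0;
    GGML_ASSERT(nb % 256 == 0);
    dequantize_block_q4_0<<<nb / 256, 256, 0, stream>>>(vx, y);
}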

slaren (Collaborator) commented May 6, 2023

Similar PR: #1221

slaren (Collaborator) commented May 6, 2023

This is good if it increases performance by 14%, but your results seem to contradict the tests from @dfyz here, which showed that increasing occupancy does not always lead to increased performance.

Ideally, though, we would use cudaOccupancyMaxPotentialBlockSize to determine the block size, and add a check in the kernels to ignore the extra elements if the number of blocks is not perfectly divisible by the block size.

JohannesGaessler (Collaborator, Author)

I assume the discrepancy between my results and the results by @dfyz is due to the differences in hardware. I used a GTX 1070 while he used an RTX 3070 and a Tesla M40. He hypothesized that, due to the latency of memory transfers, in his case the faster dequantization kernel did not yield an actual performance improvement. In other words, his hardware is sufficiently fast that either version finishes in time. But my hardware is much slower, so the dequantization is presumably slow enough that there is room for improvement (I'm assuming that latency does not depend on the GPU).

In general, GPU performance optimizations are not 100% portable because they depend on the specifics of the hardware used. On my hardware, using more GPU threads is significantly faster, and because low occupancy can lead to poor performance I assumed that this would be universally faster (I was not aware of the other PR). Of course a more sophisticated solution would be preferable, but I made this PR because it seemed like low-hanging fruit.

dfyz (Collaborator) commented May 6, 2023

my hardware is much slower so the dequantization is presumably slow enough that there is room for improvement

This is very encouraging! I think that simply changing the block size (as opposed to what I did in #1221) is a very non-invasive change, so as long as it results in speeding up kernels at least on some GPUs, it should be merged.

I actually tried something similar to your PR here (see the "hacky patch"), but the dequantize_block_q4_0 kernel actually got significantly slower on RTX 3070, so I abandoned this approach. This was with 768 threads (the result of cudaOccupancyMaxPotentialBlockSize) and with an additional check to make sure we ignore the extra elements if the number of blocks is not divisible by 768.

As I said, that was a hacky patch, so it's possible that I got something wrong. If someone with a relatively recent GPU can confirm that this PR doesn't slow down the kernels, I think we can proceed with increasing the block size for all kernels and merging the PR (though it's ultimately up to @slaren).

ggml-cuda.cu (outdated diff)
@@ -227,7 +227,8 @@ static __global__ void dequantize_block_q8_0(const void * vx, float * y) {

static void dequantize_row_q4_0_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
const int nb = k / QK4_0;
dequantize_block_q4_0<<<nb, 1, 0, stream>>>(vx, y);
GGML_ASSERT(nb % 256 == 0);

@slaren already said this, but I want to emphasize that the GPU code shouldn't assume that the number of blocks is divisible by 256. The CPU code only assumes that it is even.
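
A minimal sketch of what dropping that assumption could look like, assuming the launcher rounds the grid size up and passes the number of quantized blocks to the kernel so that surplus threads exit early (the kernel body itself is elided; it is the same as in the sketch near the top of this thread):

// Sketch: grid size rounded up; surplus threads in the last CUDA block bail out.
static __global__ void dequantize_block_q4_0(const void * vx, float * y, int nb) {
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= nb) {
        return; // this thread has no quantized block to process
    }
    // ... dequantize quantized block i as in the earlier sketch ...
}

static void dequantize_row_q4_0_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
    const int nb = k / QK4_0;
    const int grid_size = (nb + 256 - 1) / 256; // round up instead of asserting divisibility
    dequantize_block_q4_0<<<grid_size, 256, 0, stream>>>(vx, y, nb);
}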

slaren (Collaborator) commented May 6, 2023

If someone with a relatively recent GPU can confirm that this PR doesn't slow down the kernels, I think we can proceed with increasing the block size for all kernels and merging the PR (though it's ultimately up to @slaren).

I agree, and even though I appreciate the gesture, I am just a voice here; I don't "own" any code.

SlyEcho (Collaborator) commented May 7, 2023

FYI, I'm getting a 15% perf increase on my Vega64 as well.

JohannesGaessler (Collaborator, Author) commented May 7, 2023

With cudaOccupancyMaxPotentialBlockSize the block size on my GTX 1070 is set to 1024. With 33b q4_0 the speedup for prompt processing is now 15% (Edit: CUDA v12.1.0-3). Could someone with a relatively recent GPU please confirm that this PR at least does not cause a performance regression for them?

SlyEcho (Collaborator) commented May 7, 2023

I did some tests; the first line is the old code (master), the second is the patch in this PR. Running the perplexity tool on the first 103 lines from wiki.test.raw, 8 threads, no mmap, LLaMA 7B Q4_2.

RX 6900XT, Linux 5.15, ROCm 5.2.3 (based on #1087, not master)

llama_print_timings: prompt eval time = 25154.07 ms /  6144 tokens (    4.09 ms per token)
llama_print_timings: prompt eval time = 25642.30 ms /  6144 tokens (    4.17 ms per token)

RTX 2080 Ti, Linux 5.8, CUDA 11.4

llama_print_timings: prompt eval time = 27952.36 ms /  6144 tokens (    4.55 ms per token)
llama_print_timings: prompt eval time = 29715.57 ms /  6144 tokens (    4.84 ms per token)

RTX 3090, Linux 5.15, CUDA 11.4:

llama_print_timings: prompt eval time = 20707.82 ms /  6144 tokens (    3.37 ms per token)
CUDA error 700 at ~/src/llama.cpp/ggml-cuda.cu:682: an illegal memory access was encountered

The machines are different, so they are not directly comparable to each other.

JohannesGaessler (Collaborator, Author)

I think I passed the wrong argument for determining the data limit. Can you re-run the test for the 3090 and check whether it works now?

SlyEcho (Collaborator) commented May 7, 2023

RX 6900XT:

llama_print_timings: prompt eval time = 25508.54 ms /  6144 tokens (    4.15 ms per token)
llama_print_timings: prompt eval time = 26084.42 ms /  6144 tokens (    4.25 ms per token)

RTX 2080 Ti:

llama_print_timings: prompt eval time = 28185.34 ms /  6144 tokens (    4.59 ms per token)
llama_print_timings: prompt eval time = 29794.24 ms /  6144 tokens (    4.85 ms per token)

RTX 3090:

llama_print_timings: prompt eval time = 20290.44 ms /  6144 tokens (    3.30 ms per token)
llama_print_timings: prompt eval time = 18930.16 ms /  6144 tokens (    3.08 ms per token)

Same setup as before. There may be something about the older software versions. I can't update anything since I don't own these machines; I only have access to them. Also, other users are running stuff on these machines too, but today seems quieter.

I ran the tests a couple times and the numbers were pretty much the same.

Sidenote, I should automate this somehow...

JohannesGaessler (Collaborator, Author)

It seems performance isn't consistently better/worse... should we make larger CUDA blocks a compilation option?

SlyEcho (Collaborator) commented May 7, 2023

I'd like someone else to test as well; maybe it is better on CUDA 12?

slaren (Collaborator) commented May 7, 2023

RTX 3080, CUDA 12.1

Master: llama_print_timings: prompt eval time = 25057.53 ms /  6144 tokens (    4.08 ms per token)
PR:     llama_print_timings: prompt eval time = 26587.40 ms /  6144 tokens (    4.33 ms per token)

SlyEcho (Collaborator) commented May 8, 2023

I tried some hacking.

It seems like the GPU likes the early return better:

static __global__ void dequantize_block_q4_2(const void * vx, float * y, int k) {
    const block_q4_2 * x = (const block_q4_2 *) vx;

    const int i = blockIdx.x*blockDim.x + threadIdx.x;

    if (i >= k) {
        return;
    }

    const float d = x[i].d;
    const uint8_t * pp = x[i].qs;

    for (int l = 0; l < QK4_2; l += 2) {
        const uint8_t vi = pp[l/2];

        const int8_t vi0 = vi & 0xf;
        const int8_t vi1 = vi >> 4;

        const float v0 = (vi0 - 8)*d;
        const float v1 = (vi1 - 8)*d;

        y[i*QK4_2 + l + 0] = v0;
        y[i*QK4_2 + l + 1] = v1;
    }
}

Also, the max block size is only queried once, and it is capped at 256 (the block size from the original PR):

static void dequantize_row_q4_2_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
    const int nb = k / QK4_2;
    static int block_size = -1;
    if (block_size == -1) {
        int min_grid_size = -1;
        CUDA_CHECK(cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, dequantize_block_q4_2, 0, 0));
        if (block_size > 256) block_size = 256;
    }
    int grid_size = (nb + block_size - 1) / block_size; // Round up.
    dequantize_block_q4_2<<<grid_size, block_size, 0, stream>>>(vx, y, nb);
}

With that I get pretty much the same performance as with master, but I didn't test too exhaustively.

JohannesGaessler (Collaborator, Author)

Thanks everyone for the feedback. Caching the block size in particular is a good catch; on my hardware this also seems to be slightly faster, though in my case the higher occupancy had presumably offset the additional overhead. I'm surprised that the early return makes a difference, though; intuitively I would have thought that both versions look equivalent to the compiler. In any case, I integrated the suggestions. I do feel like we've ended up with a more complicated implementation of the original approach, where the block size is just set to 256...

SlyEcho (Collaborator) commented May 8, 2023

This number, 256, is just magic; maybe it should be an option in the CMake file so the user can set it? Or on the command line?

Although, I also tried 128 and 64 and they were slower.

JohannesGaessler (Collaborator, Author)

I already posted this in the PR about faster dequantization kernels, but I also quickly tried implementing a faster dequantization kernel: #1221 (comment). With that kernel the number of blocks is 32 times higher (16 for q4_2). I would intuitively assume that larger blocks would then give better performance, because there are fewer idle threads and (I think?) the GPU has more opportunity for latency hiding.

SlyEcho (Collaborator) commented May 8, 2023

I think it's also closer to how the OpenCL kernel does it.

JohannesGaessler (Collaborator, Author)

Maybe we should just wait for the quantization changes and then simultaneously optimize kernels and block size.

slaren (Collaborator) commented May 9, 2023

There is quite a bit of duplicated code now; I recommend refactoring it with a template such as:

typedef void (*to_fp32_kernel_t)(const void * x, float * y, int k);

template<to_fp32_kernel_t F, int QK>
static void launch_kernel(const void * x, float * y, int k, cudaStream_t stream) {
    const int nb = k / QK;
    static int block_size = -1;
    if (block_size == -1) {
        int min_grid_size;
        CUDA_CHECK(cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, F, 0, 0));
        block_size = std::min(block_size, GGML_CUDA_MAX_BLOCK_SIZE);
    }
    const int grid_size = (nb + block_size - 1) / block_size; // Round up.
    F<<<grid_size, block_size, 0, stream>>>(x, y, nb);
}

static to_fp32_cuda_t ggml_get_to_fp32_cuda(ggml_type type) {
    switch (type) {
        case GGML_TYPE_Q4_0:
            return launch_kernel<dequantize_block_q4_0, QK4_0>;
        case GGML_TYPE_Q4_1:
            return launch_kernel<dequantize_block_q4_1, QK4_1>;
        case GGML_TYPE_Q4_2:
            return launch_kernel<dequantize_block_q4_2, QK4_2>;
        case GGML_TYPE_Q5_0:
            return launch_kernel<dequantize_block_q5_0, QK5_0>;
        case GGML_TYPE_Q5_1:
            return launch_kernel<dequantize_block_q5_1, QK5_1>;
        case GGML_TYPE_Q8_0:
            return launch_kernel<dequantize_block_q8_0, QK8_0>;
        case GGML_TYPE_F16:
            return launch_kernel<convert_fp16_to_fp32, 1>;
        default:
            return nullptr;
    }
}
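
For context, a hedged sketch of how a caller would use the resulting function pointer; `d_quantized`, `d_fp32`, `ne`, and `stream` are illustrative placeholders for the device buffers, element count, and CUDA stream managed elsewhere in ggml-cuda.cu:

// Illustrative usage of the template-based launcher above.
const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(GGML_TYPE_Q4_0);
GGML_ASSERT(to_fp32_cuda != nullptr);          // the type must have a dequantization kernel
to_fp32_cuda(d_quantized, d_fp32, ne, stream); // dequantize ne elements asynchronously on stream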

JohannesGaessler (Collaborator, Author)

Closing this due to the merging of #1412.

slaren (Collaborator) commented May 13, 2023

Wouldn't this still be useful for the prompt processing?

JohannesGaessler (Collaborator, Author)

I think I'll just try to write better dequantization kernels instead of trying to optimize parameters for the ones on master. The kernels that I'm using in the matrix-vector multiplication template should be easy to adapt for plain dequantization.
