More GPU threads for dequantization #1341

Closed

JohannesGaessler (Collaborator) commented May 6, 2023

On master, CUDA kernels for dequantization are launched as blocks with only a single thread each. This is an inefficient way of utilizing GPUs, since they can execute many threads in parallel. This PR changes the kernel launch configuration to use 256 times fewer blocks, with each block containing 256 threads. On my hardware (GTX 1070) this speeds up prompt processing by ~14% for 33b and a prompt with 399 tokens (navy seals copypasta). I only implemented and tested q4_0 and q5_1 because there are pending breaking quantization changes in #1305, but the change is very simple and should be easy to adapt; just tell me how you want to proceed. Notably, I was not able to 100% ensure that generations are unchanged by this PR because I am hitting a bug that affects the reproducibility of CUDA-accelerated prompt processing, as described in #1340.
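
For illustration, a minimal sketch of the launch-configuration change being described (not the exact diff): each thread derives its quantized-block index from both the CUDA block and thread IDs, and the launcher starts 256 threads per block. The loop body mirrors the q4_2 kernel quoted later in this thread; the constant 256 is written out explicitly here for clarity.

// Sketch: each thread handles one quantized block; its index now comes from
// both the CUDA block index and the thread index within the block.
static __global__ void dequantize_block_q4_0(const void * vx, float * y) {
    const block_q4_0 * x = (const block_q4_0 *) vx;

    const int i = blockIdx.x*blockDim.x + threadIdx.x; // previously just blockIdx.x

    const float d = x[i].d;
    const uint8_t * pp = x[i].qs;

    for (int l = 0; l < QK4_0; l += 2) {
        const uint8_t vi = pp[l/2];

        const int8_t vi0 = vi & 0xf;
        const int8_t vi1 = vi >> 4;

        y[i*QK4_0 + l + 0] = (vi0 - 8)*d;
        y[i*QK4_0 + l + 1] = (vi1 - 8)*d;
    }
}

// Sketch: launch 256 times fewer blocks, each with 256 threads. This first
// version assumes the number of quantized blocks is divisible by 256.
static void dequantize_row_q4_0_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
    const int nb = k / QK4_0;
    GGML_ASSERT(nb % 256 == 0);
    dequantize_block_q4_0<<<nb / 256, 256, 0, stream>>>(vx, y);
}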

slaren (Collaborator) commented May 6, 2023

Similar PR: #1221

slaren (Collaborator) commented May 6, 2023

This is good if it increases performance by 14%, but your results seem to contradict the tests from @dfyz here, which showed that increasing occupancy does not always lead to increased performance.

Ideally, though, we would use cudaOccupancyMaxPotentialBlockSize to determine the block size, and add a check in the kernels to ignore the extra elements if the number of blocks is not perfectly divisible by the block size.

JohannesGaessler (Collaborator, Author)

I assume the discrepancy between my results and the results by @dfyz is due to the differences in hardware. I used a GTX 1070 while he used an RTX 3070 and a Tesla M40. He hypothesized that, due to the latency of memory transfers, in his case the faster dequantization kernel did not yield an actual performance improvement. In other words, his hardware is sufficiently fast that either version finishes in time. But my hardware is much slower, so the dequantization is presumably slow enough that there is room for improvement (I'm assuming that latency does not depend on the GPU).

In general, GPU performance optimizations are not 100% portable because they depend on the specifics of the hardware used. On my hardware, using more GPU threads is significantly faster, and because low occupancy can lead to poor performance I assumed that this would be universally faster (I was not aware of the other PR). Of course a more sophisticated solution would be preferable, but I made this PR because it seemed like low-hanging fruit.

dfyz (Collaborator) commented May 6, 2023

my hardware is much slower so the dequantization is presumably slow enough that there is room for improvement

This is very encouraging! I think that simply changing the block size (as opposed to what I did in #1221) is a very non-invasive change, so as long as it results in speeding up kernels at least on some GPUs, it should be merged.

I actually tried something similar to your PR here (see the "hacky patch"), but the dequantize_block_q4_0 kernel actually got significantly slower on RTX 3070, so I abandoned this approach. This was with 768 threads (the result of cudaOccupancyMaxPotentialBlockSize) and with an additional check to make sure we ignore the extra elements if the number of blocks is not divisible by 768.

As I said, that was a hacky patch, so it's possible that I got something wrong. If someone with a relatively recent GPU can confirm that this PR doesn't slow down the kernels, I think we can proceed with increasing the block size for all kernels and merging the PR (though it's ultimately up to @slaren).

ggml-cuda.cu (outdated diff)
@@ -227,7 +227,8 @@ static __global__ void dequantize_block_q8_0(const void * vx, float * y) {

static void dequantize_row_q4_0_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
const int nb = k / QK4_0;
dequantize_block_q4_0<<<nb, 1, 0, stream>>>(vx, y);
GGML_ASSERT(nb % 256 == 0);

@slaren already said this, but I want to emphasize that the GPU code shouldn't assume that the number of blocks is divisible by 256. The CPU code only assumes that it is even.
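
A minimal sketch of what dropping that assumption could look like, assuming the launcher rounds the grid size up and passes the number of quantized blocks to the kernel so that surplus threads exit early (the kernel body itself is elided; it is the same as in the sketch near the top of this thread):

// Sketch: grid size rounded up; surplus threads in the last CUDA block bail out.
static __global__ void dequantize_block_q4_0(const void * vx, float * y, int nb) {
    const int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i >= nb) {
        return; // this thread has no quantized block to process
    }
    // ... dequantize quantized block i as in the earlier sketch ...
}

static void dequantize_row_q4_0_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
    const int nb = k / QK4_0;
    const int grid_size = (nb + 256 - 1) / 256; // round up instead of asserting divisibility
    dequantize_block_q4_0<<<grid_size, 256, 0, stream>>>(vx, y, nb);
}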

slaren (Collaborator) commented May 6, 2023

If someone with a relatively recent GPU can confirm that this PR doesn't slow down the kernels, I think we can proceed with increasing the block size for all kernels and merging the PR (though it's ultimately up to @slaren).

I agree, and even though I appreciate the gesture, I am just a voice here; I don't "own" any code.

SlyEcho (Collaborator) commented May 7, 2023

FYI, I'm getting a 15% perf increase on my Vega64 as well.

JohannesGaessler (Collaborator, Author) commented May 7, 2023

With cudaOccupancyMaxPotentialBlockSize the block size on my GTX 1070 is set to 1024. With 33b q4_0 the speedup for prompt processing is now 15% (Edit: CUDA v12.1.0-3). Could someone with a relatively recent GPU please confirm that this PR at least does not cause a performance regression for them?

SlyEcho (Collaborator) commented May 7, 2023

I did some tests; the first line is the old code (master), the second is the patch in this PR. Running the perplexity tool on the first 103 lines from wiki.test.raw, 8 threads, no mmap, LLaMA 7B Q4_2.

RX 6900XT, Linux 5.15, ROCm 5.2.3 (based on #1087, not master)

llama_print_timings: prompt eval time = 25154.07 ms /  6144 tokens (    4.09 ms per token)
llama_print_timings: prompt eval time = 25642.30 ms /  6144 tokens (    4.17 ms per token)

RTX 2080 Ti, Linux 5.8, CUDA 11.4

llama_print_timings: prompt eval time = 27952.36 ms /  6144 tokens (    4.55 ms per token)
llama_print_timings: prompt eval time = 29715.57 ms /  6144 tokens (    4.84 ms per token)

RTX 3090, Linux 5.15, CUDA 11.4:

llama_print_timings: prompt eval time = 20707.82 ms /  6144 tokens (    3.37 ms per token)
CUDA error 700 at ~/src/llama.cpp/ggml-cuda.cu:682: an illegal memory access was encountered

The machines are different, so they are not directly comparable to each other.

JohannesGaessler (Collaborator, Author)

I think I passed the wrong argument for determining the data limit. Can you re-run the test for the 3090 and check whether it works now?

SlyEcho (Collaborator) commented May 7, 2023

RX 6900XT:

llama_print_timings: prompt eval time = 25508.54 ms /  6144 tokens (    4.15 ms per token)
llama_print_timings: prompt eval time = 26084.42 ms /  6144 tokens (    4.25 ms per token)

RTX 2080 Ti:

llama_print_timings: prompt eval time = 28185.34 ms /  6144 tokens (    4.59 ms per token)
llama_print_timings: prompt eval time = 29794.24 ms /  6144 tokens (    4.85 ms per token)

RTX 3090:

llama_print_timings: prompt eval time = 20290.44 ms /  6144 tokens (    3.30 ms per token)
llama_print_timings: prompt eval time = 18930.16 ms /  6144 tokens (    3.08 ms per token)

Same setup as before. There may be something about the older software versions. I can't update anything since I don't own these machines; I only have access to them. Also, other users are running stuff on these machines too, but today seems quieter.

I ran the tests a couple times and the numbers were pretty much the same.

Sidenote, I should automate this somehow...

JohannesGaessler (Collaborator, Author)

It seems performance isn't consistently better/worse... should we make larger CUDA blocks a compilation option?

SlyEcho (Collaborator) commented May 7, 2023

I'd like someone else to test as well; maybe it is better on CUDA 12?

slaren (Collaborator) commented May 7, 2023

RTX 3080, CUDA 12.1

Master: llama_print_timings: prompt eval time = 25057.53 ms /  6144 tokens (    4.08 ms per token)
PR:     llama_print_timings: prompt eval time = 26587.40 ms /  6144 tokens (    4.33 ms per token)

SlyEcho (Collaborator) commented May 8, 2023

I tried some hacking.

It seems like the GPU likes the early return better:

static __global__ void dequantize_block_q4_2(const void * vx, float * y, int k) {
    const block_q4_2 * x = (const block_q4_2 *) vx;

    const int i = blockIdx.x*blockDim.x + threadIdx.x;

    if (i >= k) {
        return;
    }

    const float d = x[i].d;
    const uint8_t * pp = x[i].qs;

    for (int l = 0; l < QK4_2; l += 2) {
        const uint8_t vi = pp[l/2];

        const int8_t vi0 = vi & 0xf;
        const int8_t vi1 = vi >> 4;

        const float v0 = (vi0 - 8)*d;
        const float v1 = (vi1 - 8)*d;

        y[i*QK4_2 + l + 0] = v0;
        y[i*QK4_2 + l + 1] = v1;
    }
}

Also, the max block size is only queried once, and it is capped at 256 (the block size from the original PR):

static void dequantize_row_q4_2_cuda(const void * vx, float * y, int k, cudaStream_t stream) {
    const int nb = k / QK4_2;
    static int block_size = -1;
    if (block_size == -1) {
        int min_grid_size = -1;
        CUDA_CHECK(cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, dequantize_block_q4_2, 0, 0));
        if (block_size > 256) block_size = 256;
    }
    int grid_size = (nb + block_size - 1) / block_size; // Round up.
    dequantize_block_q4_2<<<grid_size, block_size, 0, stream>>>(vx, y, nb);
}

With that I get pretty much the same performance as with master, but I didn't test too exhaustively.

JohannesGaessler (Collaborator, Author)

Thanks everyone for the feedback. Caching the block size in particular is a good catch; on my hardware this also seems to be slightly faster, though in my case the higher occupancy had presumably offset the additional overhead. I'm surprised that the early return makes a difference, though; intuitively I would have thought that both versions look equivalent to the compiler. In any case, I integrated the suggestions. I do feel like we've ended up with a more complicated implementation of the original approach, where the block size is just set to 256...

SlyEcho (Collaborator) commented May 8, 2023

This number, 256, is just magic; maybe it should be an option in the CMake file so the user can set it? Or on the command line?

Although, I also tried 128 and 64 and they were slower.

JohannesGaessler (Collaborator, Author)

I already posted this in the PR about faster dequantization kernels, but I also quickly tried implementing a faster dequantization kernel: #1221 (comment). With that kernel the number of blocks is 32 times higher (16 for q4_2). I would intuitively assume that larger blocks would then give better performance, because there are fewer idle threads and (I think?) the GPU has more opportunity for latency hiding.

SlyEcho (Collaborator) commented May 8, 2023

I think it's also closer to how the OpenCL kernel does it.

JohannesGaessler (Collaborator, Author)

Maybe we should just wait for the quantization changes and then simultaneously optimize kernels and block size.

slaren (Collaborator) commented May 9, 2023

There is quite a bit of duplicated code now; I recommend refactoring it with a template such as:

typedef void (*to_fp32_kernel_t)(const void * x, float * y, int k);

template<to_fp32_kernel_t F, int QK>
static void launch_kernel(const void * x, float * y, int k, cudaStream_t stream) {
    const int nb = k / QK;
    static int block_size = -1;
    if (block_size == -1) {
        int min_grid_size;
        CUDA_CHECK(cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size, F, 0, 0));
        block_size = std::min(block_size, GGML_CUDA_MAX_BLOCK_SIZE);
    }
    const int grid_size = (nb + block_size - 1) / block_size; // Round up.
    F<<<grid_size, block_size, 0, stream>>>(x, y, nb);
}

static to_fp32_cuda_t ggml_get_to_fp32_cuda(ggml_type type) {
    switch (type) {
        case GGML_TYPE_Q4_0:
            return launch_kernel<dequantize_block_q4_0, QK4_0>;
        case GGML_TYPE_Q4_1:
            return launch_kernel<dequantize_block_q4_1, QK4_1>;
        case GGML_TYPE_Q4_2:
            return launch_kernel<dequantize_block_q4_2, QK4_2>;
        case GGML_TYPE_Q5_0:
            return launch_kernel<dequantize_block_q5_0, QK5_0>;
        case GGML_TYPE_Q5_1:
            return launch_kernel<dequantize_block_q5_1, QK5_1>;
        case GGML_TYPE_Q8_0:
            return launch_kernel<dequantize_block_q8_0, QK8_0>;
        case GGML_TYPE_F16:
            return launch_kernel<convert_fp16_to_fp32, 1>;
        default:
            return nullptr;
    }
}
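
For context, a hedged sketch of how a caller would use the resulting function pointer; `d_quantized`, `d_fp32`, `ne`, and `stream` are illustrative placeholders for the device buffers, element count, and CUDA stream managed elsewhere in ggml-cuda.cu:

// Illustrative usage of the template-based launcher above.
const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(GGML_TYPE_Q4_0);
GGML_ASSERT(to_fp32_cuda != nullptr);          // the type must have a dequantization kernel
to_fp32_cuda(d_quantized, d_fp32, ne, stream); // dequantize ne elements asynchronously on stream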

JohannesGaessler (Collaborator, Author)

Closing this due to the merging of #1412.

slaren (Collaborator) commented May 13, 2023

Wouldn't this still be useful for the prompt processing?

JohannesGaessler (Collaborator, Author)

I think I'll just try to write better dequantization kernels instead of trying to optimize parameters for the ones on master. The kernels that I'm using in the matrix-vector multiplication template should be easy to adapt for plain dequantization.
