
Conversation

@leejet (Contributor) commented Oct 23, 2025

This PR fixes a potential CUDA grid launch issue in binbcast.cu.
Previously, only the z dimension (block_nums.z) was checked against the CUDA grid limit (65535).
However, the y dimension can also exceed this limit for large tensors, leading to an invalid grid configuration and potential launch failure.

The fix adds a check for block_nums.y > 65535 to prevent grid dimension overflow.
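
For reference, a minimal sketch of the resulting guard (simplified from binbcast.cu; the templated kernel names and their long argument lists are left as comments, so treat this as an outline of the change rather than the verbatim diff):

```cpp
// CUDA caps grid.y and grid.z at 65535 blocks, while grid.x allows up to 2^31 - 1,
// so when y or z would overflow, fall back to a 1D grid over all elements and let
// the kernel unravel the flat index itself.
if (block_nums.z > 65535 || block_nums.y > 65535) {    // before this PR, only .z was checked
    const int64_t n_elements = ne0 * ne1 * ne2 * ne3;
    const int block_num = (int) ((n_elements + block_size - 1) / block_size);
    // launch the 1D fallback kernel with <<<block_num, block_size, 0, stream>>>
} else {
    // launch the regular kernel with <<<block_nums, block_dims, 0, stream>>>
}
```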

@github-actions bot added the Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) labels on Oct 23, 2025

@jeffbolznv (Collaborator) commented:

Please add a backend test that hits this case. The bug may exist in other backends, too.

@leejet requested a review from slaren as a code owner on October 23, 2025 at 15:48

@leejet (Contributor, Author) commented Oct 23, 2025

The backend test has been added:

add_test_bin_bcast(type, {1, 1, 640, 1}, {32, 32, 1, 1});
add_test_bin_bcast(type, {5120, 1, 1, 1}, {1, 256, 1, 1});
add_test_bin_bcast(type, {640, 1, 1, 1}, {1, 1, 1, 1});
add_test_bin_bcast(type, {64, 262144, 1, 1}, {64, 1, 1, 1});

Collaborator:

Thanks. This one is kind of slow. Would it hit the same bug with smaller ne[0]?

@leejet (Contributor, Author) replied Oct 23, 2025:

A smaller ne[0] would require a correspondingly larger ne[1]: block_nums.y = ne1*ne0 / 256 when ne0 <= 128, per the launch configuration in binbcast.cu:

        const int block_size = 128;

        int64_t hne0 = std::max(ne0 / 2LL, 1LL);

        dim3 block_dims;
        block_dims.x = std::min<unsigned int>(hne0, block_size);
        block_dims.y = std::min<unsigned int>(ne1, block_size / block_dims.x);
        block_dims.z = std::min(std::min<unsigned int>(ne2 * ne3, block_size / block_dims.x / block_dims.y), 64U);

        dim3 block_nums((hne0 + block_dims.x - 1) / block_dims.x, (ne1 + block_dims.y - 1) / block_dims.y,
                        (ne2 * ne3 + block_dims.z - 1) / block_dims.z);
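
For illustration (not part of the PR), a small host-side sketch that mirrors the launch-dimension code above and confirms the new test shape trips the limit; the product ne0*ne1 has to reach roughly 65535*256, about 16.8M elements, for block_nums.y to overflow, whichever of the two dimensions carries it:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>

// Mirrors the block_dims/block_nums computation quoted above and reports
// whether a given dst shape would need the 1D fallback launch.
static bool needs_1d_fallback(int64_t ne0, int64_t ne1, int64_t ne2, int64_t ne3) {
    const int     block_size = 128;
    const int64_t hne0       = std::max(ne0 / 2, (int64_t) 1);

    const unsigned int bx = std::min<unsigned int>(hne0, block_size);
    const unsigned int by = std::min<unsigned int>(ne1, block_size / bx);
    const unsigned int bz = std::min(std::min<unsigned int>(ne2 * ne3, block_size / bx / by), 64U);

    const int64_t grid_y = (ne1 + by - 1) / by;
    const int64_t grid_z = (ne2 * ne3 + bz - 1) / bz;

    return grid_z > 65535 || grid_y > 65535;    // the condition this PR extends to grid_y
}

int main() {
    // {64, 262144, 1, 1}: bx = 32, by = 4, grid_y = 65536 > 65535 -> fallback needed
    std::printf("%d\n", needs_1d_fallback(64, 262144, 1, 1));
    // {640, 1, 1, 1}: grid_y = 1 -> regular launch
    std::printf("%d\n", needs_1d_fallback(640, 1, 1, 1));
    return 0;
}
```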

Collaborator:

Ah, OK. I'm fine with it then.

Member:

The second parameter of add_test_bin_bcast is the number of repetitions, so this results in a 4 GB tensor, and it is causing the CI to fail. Using {1,1,1,1} should be enough to test this.
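
(For scale: with repetitions {64, 1, 1, 1} the broadcast input becomes 64*64 = 4096 by 262144 elements, and 4096 * 262144 * 4 bytes is 4 GiB assuming f32, which matches the CI failure described above.)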

@leejet (Contributor, Author) replied:

Thanks for your review. I’ve now reduced the repetitions.
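
The adjusted case presumably ends up along these lines (a hypothetical reconstruction; the exact committed line is not quoted in this thread):

add_test_bin_bcast(type, {64, 262144, 1, 1}, {1, 1, 1, 1});  // repetitions reduced to {1, 1, 1, 1}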

@github-actions bot added the testing (Everything test related) label on Oct 23, 2025

@JohannesGaessler (Collaborator) commented:

The CI seems to be failing due to a timeout. I'm not familiar with how the CI was set up, so it would be useful if someone more knowledgeable could provide some context.

@JohannesGaessler merged commit 55945d2 into ggml-org:master on Oct 24, 2025
72 checks passed
wqerrewetw added a commit to wqerrewetw/llama.cpp that referenced this pull request Oct 25, 2025
* model-conversion : add trust_remote_code for orig model run [no ci] (ggml-org#16751)

This commit adds the trust_remote_code=True argument when loading models
using AutoConfig, AutoTokenizer, and AutoModelForCausalLM in the
run-original-model script.

The motivation for this is that some models require custom code to be
loaded properly, and setting trust_remote_code=True avoids a prompt
asking for user confirmation:
```console
(venv) $ make causal-run-original-model
The repository /path/to/model contains custom code which must be
executed to correctly load the model. You can inspect the repository
content at /path/to/model.

Do you wish to run the custom code? [y/N] N
```

Having this as the default seems like a safe choice as we have to clone
or download the models we convert and would be expecting to run any
custom code they have.

* webui: support q URL parameter (ggml-org#16728)

* webui: support q URL parameter

Fixes ggml-org#16722
I’ve checked that it works with Firefox’s AI tools

* webui: apply suggestions from code review

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* chore: update webui static build

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* CUDA: use CUB for arbitary size argsort (ggml-org#16754)

* ggml: fix CUDA grid launch condition for large block_nums.y in binbcast (ggml-org#16742)

* Fix CUDA grid launch condition for large block_nums.y

* add backend ops test

* reduce test repetitions

* convert : avoid dequantizing mxfp4 for GPT-OSS (ggml-org#16756)

* vulkan: Optimize SSM_SCAN (ggml-org#16645)

* vulkan: delete dead code (ggml-org#16732)

ggml_vk_create_buffer_temp is not used anywhere, and it is the only
caller for ggml_vk_pool_malloc.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

* model : set res->t_embd in PLaMo2 models (ggml-org#16766)

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Florian Badie <florianbadie@odrling.xyz>
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: leejet <leejet714@gmail.com>
Co-authored-by: compilade <git@compilade.net>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-authored-by: Shunta Saito <shunta.saito@gmail.com>
pwilkin pushed a commit to pwilkin/llama.cpp that referenced this pull request Oct 25, 2025
…st (ggml-org#16742)

* Fix CUDA grid launch condition for large block_nums.y

* add backend ops test

* reduce test repetitions
