ggml: fix CUDA grid launch condition for large block_nums.y in binbcast #16742
Conversation
Please add a backend test that hits this case. The bug may exist in other backends, too.

The backend test has been added.
tests/test-backend-ops.cpp (Outdated)
add_test_bin_bcast(type, {1, 1, 640, 1}, {32, 32, 1, 1});
add_test_bin_bcast(type, {5120, 1, 1, 1}, {1, 256, 1, 1});
add_test_bin_bcast(type, {640, 1, 1, 1}, {1, 1, 1, 1});
add_test_bin_bcast(type, {64, 262144, 1, 1}, {64, 1, 1, 1});
Thanks. This one is kind of slow. Would it hit the same bug with smaller ne[0]?
A smaller ne[0] would require a larger ne[1], since block_nums.y = ne1*ne0 / 256 when ne0 <= 128:
const int block_size = 128;
int64_t hne0 = std::max(ne0 / 2LL, 1LL);
dim3 block_dims;
block_dims.x = std::min<unsigned int>(hne0, block_size);
block_dims.y = std::min<unsigned int>(ne1, block_size / block_dims.x);
block_dims.z = std::min(std::min<unsigned int>(ne2 * ne3, block_size / block_dims.x / block_dims.y), 64U);
// for ne0 <= 128: block_dims.x = ne0/2 and block_dims.y = 128/(ne0/2),
// so block_nums.y below works out to roughly ne1*ne0/256
dim3 block_nums((hne0 + block_dims.x - 1) / block_dims.x,
                (ne1 + block_dims.y - 1) / block_dims.y,
                (ne2 * ne3 + block_dims.z - 1) / block_dims.z);
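For example, with the shape used in the new test, ne = {64, 262144, 1, 1}, these expressions give hne0 = 32 and block_dims = (32, 4, 1), so block_nums.y = 262144 / 4 = 65536, which is just over the 65535 limit.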
Ah, OK. I'm fine with it then.
The second parameter of add_test_bin_bcast is the number of repetitions, so this results in a 4 GB tensor, and it is causing the CI to fail. Using {1,1,1,1} should be enough to test this.
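(For reference, assuming F32 elements: 64 × 262144 elements per repetition × 64 repetitions × 4 bytes ≈ 4 GiB.)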
Thanks for your review. I’ve now reduced the repetitions.
The CI seems to be failing due to a timeout. I'm not familiar with how the CI was set up, so it would be useful if someone more knowledgeable could provide some context.
* model-conversion : add trust_remote_code for orig model run [no ci] (ggml-org#16751)

  This commit adds the trust_remote_code=True argument when loading models using AutoConfig, AutoTokenizer, and AutoModelForCausalLM for the run original model script. The motivation for this is that some models require custom code to be loaded properly, and setting trust_remote_code=True avoids a prompt asking for user confirmation:

  ```console
  (venv) $ make causal-run-original-model
  The repository /path/to/model contains custom code which must be executed to correctly load the model. You can inspect the repository content at /path/to/model.
  Do you wish to run the custom code? [y/N] N
  ```

  Having this as the default seems like a safe choice as we have to clone or download the models we convert and would be expecting to run any custom code they have.

* webui: support q URL parameter (ggml-org#16728)

  * webui: support q URL parameter

    Fixes ggml-org#16722. I’ve checked that it works with Firefox’s AI tools.

  * webui: apply suggestions from code review

    Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

  * chore: update webui static build

  ---------

  Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* CUDA: use CUB for arbitrary size argsort (ggml-org#16754)

* ggml: fix CUDA grid launch condition for large block_nums.y in binbcast (ggml-org#16742)

  * Fix CUDA grid launch condition for large block_nums.y
  * add backend ops test
  * reduce test repetitions

* convert : avoid dequantizing mxfp4 for GPT-OSS (ggml-org#16756)

* vulkan: Optimize SSM_SCAN (ggml-org#16645)

* vulkan: delete dead code (ggml-org#16732)

  ggml_vk_create_buffer_temp is not used anywhere, and it is the only caller for ggml_vk_pool_malloc.

  Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

* model : set res->t_embd in PLaMo2 models (ggml-org#16766)

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Florian Badie <florianbadie@odrling.xyz>
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: leejet <leejet714@gmail.com>
Co-authored-by: compilade <git@compilade.net>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
Co-authored-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-authored-by: Shunta Saito <shunta.saito@gmail.com>
ggml: fix CUDA grid launch condition for large block_nums.y in binbcast (ggml-org#16742)

* Fix CUDA grid launch condition for large block_nums.y
* add backend ops test
* reduce test repetitions
This PR fixes a potential CUDA grid launch issue in binbcast.cu. Previously, only the z dimension (block_nums.z) was checked against the CUDA grid limit (65535). However, the y dimension can also exceed this limit for large tensors, leading to an invalid grid configuration and a potential launch failure. The fix adds a check for block_nums.y > 65535 to prevent grid dimension overflow.
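For illustration, here is a minimal host-side sketch of that guard, assuming the fallback mirrors the existing block_nums.z handling; the dims3 struct, the fits_launch_limits helper, and the 1D-fallback computation are illustrative stand-ins, not the actual binbcast.cu code.

```cpp
// Sketch: CUDA caps gridDim.y and gridDim.z at 65535 (gridDim.x can be much larger),
// so both limited dimensions must be checked before a 3D launch; otherwise the work
// is flattened onto a 1D grid. Names are illustrative, not taken from binbcast.cu.
#include <algorithm>
#include <cstdint>
#include <cstdio>

struct dims3 { unsigned int x, y, z; };   // stand-in for CUDA's dim3

static bool fits_launch_limits(const dims3 & grid) {
    return grid.y <= 65535 && grid.z <= 65535;
}

int main() {
    // shape from the new backend test: ne = {64, 262144, 1, 1}
    const int64_t ne0 = 64, ne1 = 262144, ne2 = 1, ne3 = 1;
    const int block_size = 128;

    const int64_t hne0 = std::max(ne0 / 2, (int64_t) 1);
    dims3 block_dims;
    block_dims.x = (unsigned int) std::min<int64_t>(hne0, block_size);
    block_dims.y = (unsigned int) std::min<int64_t>(ne1, block_size / block_dims.x);
    block_dims.z = (unsigned int) std::min<int64_t>(
        std::min<int64_t>(ne2 * ne3, block_size / block_dims.x / block_dims.y), 64);

    const dims3 block_nums = {
        (unsigned int) ((hne0      + block_dims.x - 1) / block_dims.x),
        (unsigned int) ((ne1       + block_dims.y - 1) / block_dims.y),
        (unsigned int) ((ne2 * ne3 + block_dims.z - 1) / block_dims.z),
    };

    printf("block_nums = (%u, %u, %u)\n", block_nums.x, block_nums.y, block_nums.z);

    if (!fits_launch_limits(block_nums)) {
        // fallback: one thread per element on a 1D grid, with the index
        // un-flattened inside the kernel
        const int64_t n_elements = ne0 * ne1 * ne2 * ne3;
        const int64_t grid_x = (n_elements + block_size - 1) / block_size;
        printf("y/z grid limit exceeded -> 1D fallback launch with %lld blocks\n",
               (long long) grid_x);
    }
    return 0;
}
```

For the test shape above this prints block_nums = (1, 65536, 1), which trips the guard on the y dimension exactly as described in the PR.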