WIP: llama: Vulkan: Fix Adreno Q8_0 issues. #11

infinitalo · 2025-08-29T13:37:13Z

This MR is a work-in-progress.

The current commits are able to get inference working for Q8_0 on Adreno 830 (Samsung S25), but finetuning still crashes.

We're currently working on a fix for lora-finetuning on Adreno A830, but you can use this for testing in the meanwhile.

…a is provided

Signed-off-by: vineet <vineet.suryan@collabora.com>

…lation Signed-off-by: vineet <vineet.suryan@collabora.com>

This fixes the vkDeviceLostError on Mali

infinitalo · 2025-09-01T15:12:45Z

Steps to run the backend-ops test suite:

Set up your Android environment for testing llama.cpp. You can use this comment as a reference if you haven't built it already: Add initial LoRA finetuning support; vulkan OUT_PROD; vulkan cross-entropy-backward #5 (comment)
Configure your build with: cmake -B build -DGGML_VULKAN=1 -DCMAKE_BUILD_TYPE=Debug -DBUILD_TESTING=ON
Build llama.cpp: cmake --build build --config Debug -j2
Run the backend-ops tests: ./build/bin/test-backend-ops
You can also run tests for specific operators with the -o option, for example: ./build/bin/test-backend-ops -o MUL_MAT

This PR has a commit disabling several tests for quantized datatypes that are not currently working properly on Adreno 830.

If you run the test suite as described above with this branch, it should say 2/2 backends passing at the end, with no failing tests on A830, as the attached file shows.

test_adreno_q8_inf2.txt

* oai moe * compat with new checkpoint * add attn sink impl * add rope scaling yarn * logits match with latest transformers code * wip chat template * rm trailing space * use ggml_scale_bias * rm redundant is_swa_all * convert interleaved gate_up * graph : fix activation function to match reference (#7) * vocab : handle o200k_harmony special tokens * ggml : add attention sinks support (#1) * llama : add attn sinks * ggml : add attn sinks * cuda : add attn sinks * vulkan : add support for sinks in softmax remove unnecessary return * ggml : add fused swiglu_oai op (#11) * ggml : add fused swiglu_oai op * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update CUDA impl * cont : metal impl * add vulkan impl * test-backend-ops : more test cases, clean up * llama : remove unfused impl * remove extra lines --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> * repack mxfp4 upon conversion * clean up a bit * enable thinking * add quick hack to render only some special tokens * fix bf16 conversion * remove vocab hack * webui ok * support chat parsing for gpt-oss * fix webui * direct mapping mxfp4, FINALLY * force using mxfp4 * properly use lazy tensor * ggml : add mxfp4 ggml : use e8m0 conversion instead of powf Co-authored-by: Diego Devesa <slarengh@gmail.com> change kvalues_mxfp4 table to match e2m1 (#6) metal : remove quantization for now (not used) cuda : fix disabled CUDA graphs due to ffn moe bias vulkan : add support for mxfp4 cont : add cm2 dequant * ggml : add ggml_add_id (#13) * ggml : add ggml_add_id * add cuda impl * llama : add weight support check for add_id * perf opt * add vulkan impl * rename cuda files * add metal impl * allow in-place ggml_add_id * llama : keep biases on CPU with --cpu-moe * llama : fix compile error ggml-ci * cuda : add fallback for __nv_cvt_e8m0_to_bf16raw ggml-ci * cleanup ggml-ci * sycl : fix supports_op for MXFP4 ggml-ci * fix Unknown reasoning format * ggml-cpu : fix AVX build ggml-ci * fix hip build ggml-ci * cuda : add mxfp4 dequantization support for cuBLAS ggml-ci * ggml-cpu : fix mxfp4 fallback definitions for some architectures ggml-ci * cuda : fix version required for __nv_cvt_e8m0_to_bf16raw --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: slaren <slarengh@gmail.com>

* vulkan: fix debug mode issues * vulkan: remove broken check_results GGML_OP_SET_ROWS support

infinitalo · 2025-09-24T14:26:52Z

In this current version, the environment variable GGML_TILING_ENABLE=1 should be used when training. Wee'll remove it when cleaning up.

This makes MUL_MAT tests pass for Q8_0 when n=9 failed.

infinitalo · 2025-10-15T13:10:43Z

Closing this, see #34 for the new version

makaveli10 and others added 18 commits July 30, 2025 16:58

Add lora finetuning from adapter

d8ba38a

Add: create new lora adapter for target modules to finetune if no lor…

c8ea7b2

…a is provided

Fix identical loss over epochs; fix garbage lora initization

69c464a

Signed-off-by: vineet <vineet.suryan@collabora.com>

Remove lora training from finetune.cpp

ab0dae2

Signed-off-by: vineet <vineet.suryan@collabora.com>

Add adapter saving & other lora target modules

1b6e22d

Signed-off-by: vineet <vineet.suryan@collabora.com>

Add finetune-lora for lora finetuning in examples

c61a5af

Signed-off-by: vineet <vineet.suryan@collabora.com>

Add dequantization to out_prod cuda kernel

7799375

Signed-off-by: vineet <vineet.suryan@collabora.com>

Update README with finetune-lora

cbd4e1f

Signed-off-by: vineet <vineet.suryan@collabora.com>

Vulkan: add support for fp32 OUT_PROD op

cde18d5

CPU: add support for fp16_fp32 OUT_PROD op

b6b30de

Vulkan: add support for f16_f32 OUT_PROD op

5d4d767

Vulkan: Add Q4_0/Q8_0 OUT_PROD Vulkan support

e55225a

vulkan: Add initial cross entropy loss backward shader

c81f689

Signed-off-by: vineet <vineet.suryan@collabora.com>

vulkan: Fix cross-entropy-loss-back dispatch size and wg denominator

fc3ae66

Signed-off-by: vineet <vineet.suryan@collabora.com>

vulkan: Change uint32 cast to int32 for outprod; allows android compi…

055e30f

…lation Signed-off-by: vineet <vineet.suryan@collabora.com>

vulkan: Deallocate memory after destroying buffer

ae5ae40

vulkan: Set specialization constants to { 0 } for out_prod

794c37f

This fixes the vkDeviceLostError on Mali

vulkan: Set out_prod pipeline disable_robustness to true

36cfb76

github-actions bot added Nvidia GPU Vulkan examples ggml testing labels Aug 29, 2025

infinitalo force-pushed the italo/tether/adreno_q8_inference branch 2 times, most recently from cbea88f to 208747f Compare September 1, 2025 14:13

makaveli10 and others added 4 commits September 4, 2025 05:23

Fix out_prod; vulkan ci issues

ea37326

Add GEGLU backward (Vulkan) to enable Gemma training.

4614f2d

ggml-vulkan: Add TQ2_0 dequantize and mul_mat vec

7d0f6db

ggml-vulkan: Enable coopmat support for Android

794f94c

Italo Nicola and others added 2 commits September 24, 2025 11:07

wip vulkan crash fix

741f459

Vulkan: Fix minor debug mode issues (ggml-org#14899)

c893d44

* vulkan: fix debug mode issues * vulkan: remove broken check_results GGML_OP_SET_ROWS support

Italo Nicola added 13 commits September 26, 2025 09:05

(wip) Tests: disable non-q8 quant tests

cda1a17

(wip) Vulkan: remove packed16 optimization for Q8_0 dequant4

2a1fef4

(wip) Vulkan: disable packed16 optimizations for Q8_0 src0

0f3eb04

(wip) Vulkan: disable mulmat device->integer_dot_product optimization

dfbd6b5

This makes MUL_MAT tests pass for Q8_0 when n=9 failed.

(wip) Vulkan: stop using data_b_v4 in mul_mat_vec shader for Q8_0

8ba6f53

(wip) Tests: enable test output logs

fb893dc

(wip) Tests: add 4 mulmat Q8_0 tests for larger input sizes

fe60d66

(wip) Tests: modify test case parameters

59476cd

(wip) Tests: set tensor initial values

9da0050

(wip) Tests: disable log spam

e70ff02

(wip) Tests: add a larger test

7f4b20e

(wip) Vulkan: Adreno Q4_0 fix

d4c5db4

(wip) Vulkan: Adreno Q4_1 fix

825b09e

infinitalo force-pushed the italo/tether/adreno_q8_inference branch from 1049722 to d12255c Compare September 29, 2025 15:59

zoq and others added 2 commits September 30, 2025 14:01

(wip) Vulkan: Implement MUL_MAT tiling workaround

6b50346

(wip) Vulkan: Adreno Q6_K fix

c5b7162

infinitalo force-pushed the italo/tether/adreno_q8_inference branch from d12255c to c5b7162 Compare September 30, 2025 17:05

infinitalo closed this Oct 15, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

WIP: llama: Vulkan: Fix Adreno Q8_0 issues. #11

WIP: llama: Vulkan: Fix Adreno Q8_0 issues. #11

Uh oh!

infinitalo commented Aug 29, 2025

Uh oh!

infinitalo commented Sep 1, 2025 •

edited

Loading

Uh oh!

infinitalo commented Sep 24, 2025

Uh oh!

infinitalo commented Oct 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

WIP: llama: Vulkan: Fix Adreno Q8_0 issues. #11

WIP: llama: Vulkan: Fix Adreno Q8_0 issues. #11

Uh oh!

Conversation

infinitalo commented Aug 29, 2025

Uh oh!

infinitalo commented Sep 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

infinitalo commented Sep 24, 2025

Uh oh!

infinitalo commented Oct 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

infinitalo commented Sep 1, 2025 •

edited

Loading