Collaborator

@lhez lhez commented Aug 12, 2025

This PR adds initial mxfp4 support. It is based on the Metal kernels in #15091. We will use it as a baseline and iterate on it to improve mxfp4 performance.
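For readers unfamiliar with the format, here is a minimal Python sketch of what dequantizing one MXFP4 block involves. It follows the OCP Microscaling (MX) definition: 32 four-bit E2M1 elements share one E8M0 power-of-two scale. It is not the OpenCL kernel from this PR. The function names and the nibble layout (low nibbles hold the first half of the block, high nibbles the second half) are assumptions made for illustration.

```python
from typing import List

# E2M1 magnitudes from the OCP MX format: 1 sign bit, 2 exponent bits, 1 mantissa bit.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32  # elements that share one scale


def e8m0_to_float(e: int) -> float:
    """Decode an E8M0 scale: an unsigned 8-bit exponent with bias 127 (0xFF means NaN)."""
    if e == 0xFF:
        return float("nan")
    return 2.0 ** (e - 127)


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 value: bit 3 is the sign, bits 0..2 index the magnitude."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_VALUES[nibble & 0x7]


def dequantize_block(scale_byte: int, packed: bytes) -> List[float]:
    """Expand one block (1 scale byte + 16 packed bytes) into 32 floats."""
    half = BLOCK_SIZE // 2
    assert len(packed) == half
    scale = e8m0_to_float(scale_byte)
    out = [0.0] * BLOCK_SIZE
    for j, byte in enumerate(packed):
        # Assumed layout: low nibble -> element j, high nibble -> element j + 16.
        out[j] = decode_e2m1(byte & 0x0F) * scale
        out[j + half] = decode_e2m1(byte >> 4) * scale
    return out
```

A GPU kernel would typically fuse this expansion into the matrix multiply rather than materialize the floats, which is where most of the tuning effort tends to go.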

@lhez lhez requested a review from max-krasnyansky August 12, 2025 16:29
@github-actions github-actions bot added the labels `ggml` (changes relating to the ggml tensor library for machine learning) and `OpenCL` (issues specific to the OpenCL backend) Aug 12, 2025
@lhez lhez marked this pull request as ready for review August 12, 2025 16:38
Collaborator

@max-krasnyansky max-krasnyansky left a comment

Looks good.
Tested on my X-Elite MS Surface and it's fully functional (i.e., able to run a native MXFP4 model).
We'll need to add all the same tricks we use for Q4_0 to get better performance.

`.\build-wos\bin\llama-cli.exe --no-mmap -m C:\Users\maxk\src\gguf\gpt-oss-20b-mxfp4.gguf --ctx-size 8192 -f .\gpt-oss-cookies.txt -n 128 -t 4 -ngl 99 -no-cnv`

ggml_opencl: selected platform: 'QUALCOMM Snapdragon(TM)'
ggml_opencl: device: 'Qualcomm(R) Adreno(TM) X1-85 GPU (OpenCL 3.0 Qualcomm(R) Adreno(TM) X1-85 GPU)'
...
load_tensors: offloaded 25/25 layers to GPU
load_tensors:       OpenCL model buffer size =  9717.06 MiB
load_tensors:          CPU model buffer size =  1819.12 MiB
...
<|start|>user<|message|>What is the most popular cookie in the world right now? please break the answer by regions and include the overall top-choice.
<|start|>assistant<|channel|>analysis<|message|>The user asks: "What is the most popular cookie in the world right now? please break the answer by regions and include the overall top-choice."

We need to interpret: They want a list of the most popular cookie type in different regions and also an overall top-choice. We need to find the answer. There's no single official source. We can reference recent market research, brand popularity, or just general consensus. Likely chocolate chip cookie is most popular overall. Regionally: In US, chocolate chip; Europe maybe shortbread or biscotti; Asia maybe rice crackers or mochi? But cookies are generic; in Asia

llama_perf_sampler_print:    sampling time =      13.05 ms /   160 runs   (    0.08 ms per token, 12264.30 tokens per second)
llama_perf_context_print:        load time =    4717.25 ms
llama_perf_context_print: prompt eval time =    2983.58 ms /    32 tokens (   93.24 ms per token,    10.73 tokens per second)
llama_perf_context_print:        eval time =   19640.81 ms /   127 runs   (  154.65 ms per token,     6.47 tokens per second)
llama_perf_context_print:       total time =   22699.73 ms /   159 tokens
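
The throughput figures in the log above follow directly from the reported time and token counts (tokens per second = count × 1000 / milliseconds). A small sketch for pulling them out of a saved log when comparing runs; the regular expression and the function name are illustrative assumptions, not part of llama.cpp:

```python
import re

# Matches the "= <ms> ms / <count> tokens|runs" part of a llama_perf_* line.
PERF_LINE = re.compile(r"=\s*([\d.]+) ms /\s*(\d+) (?:tokens|runs)")


def tokens_per_second(line: str):
    """Return throughput for one perf line, or None if the line has no token count."""
    match = PERF_LINE.search(line)
    if match is None:
        return None
    elapsed_ms = float(match.group(1))
    count = int(match.group(2))
    return count * 1000.0 / elapsed_ms


prompt = "llama_perf_context_print: prompt eval time =    2983.58 ms /    32 tokens"
decode = "llama_perf_context_print:        eval time =   19640.81 ms /   127 runs"
print(round(tokens_per_second(prompt), 2))  # prompt processing rate
print(round(tokens_per_second(decode), 2))  # token generation rate
```

For this run the two rates round to the 10.73 and 6.47 tokens per second shown in the log, which gives a baseline to compare against once further optimizations land.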

@max-krasnyansky max-krasnyansky merged commit e2c1bff into ggml-org:master Aug 15, 2025
44 of 47 checks passed