Could you please guide me on how to run the MLX model Qwen2.5 Coder 32B Instruct with a large 128k context window? According to the YaRN (RoPE scaling, factor 4.0) method, the model should support a 128k context window with the following configuration:

```json
"rope_scaling": {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn"
}
```

I would like to clarify whether it is possible to run this model (Hugging Face link) with 128k token support using the mlx-lm framework, or does mlx-lm only support a 32k token context window?

Thank you for your help!
Hmm, I would recommend using one of the newer 1M models instead. Those don't require you to fiddle with the config or add the additional RoPE scaling for the model. Otherwise you will need to manually edit the Qwen 2.5 model file as well as the Hugging Face config to get it to support the YaRN scaling. For details on running with long context, you can do something like the following:
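A minimal sketch, assuming a local copy of the model whose config.json already carries the `rope_scaling` block from the question and whose mlx-lm Qwen2 implementation has been patched for YaRN; the model path and prompt file below are placeholders:

```python
# Minimal sketch: long-context generation with mlx-lm's Python API.
# Assumes a locally patched model; the path and prompt file are placeholders.
from mlx_lm import load, generate

# Local copy of Qwen2.5 Coder 32B whose config.json includes the
# "rope_scaling" block shown in the question (hypothetical path).
model, tokenizer = load("/path/to/Qwen2.5-Coder-32B-Instruct-128k")

# A prompt longer than the stock 32k window, e.g. a large code dump.
with open("long_prompt.txt") as f:
    long_input = f.read()

# Wrap the raw text in the model's chat template before generating.
messages = [{"role": "user", "content": long_input}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
```

Recent mlx-lm versions also let the `mlx_lm.generate` CLI read the prompt from stdin via `--prompt -`, which is convenient for very long inputs.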
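As for the model-file side: the Qwen2 implementation in mlx-lm would need its RoPE frequencies adjusted the way YaRN prescribes. A rough sketch of that core computation, following the standard YaRN formulation from the paper (the helper name and defaults here are illustrative, not actual mlx-lm API):

```python
# Illustrative sketch of the YaRN inverse-frequency adjustment a patched
# Qwen2 RoPE would need; not actual mlx-lm code.
import math

def yarn_inv_freqs(dim, base=1000000.0, factor=4.0,
                   original_max_position_embeddings=32768,
                   beta_fast=32.0, beta_slow=1.0):
    # Standard RoPE inverse frequencies over the rotary half of the head dim.
    inv_freq = [base ** (-2.0 * i / dim) for i in range(dim // 2)]

    # Dimension index at which a frequency completes num_rot rotations over
    # the original context window (the "correction dim" from the YaRN paper).
    def correction_dim(num_rot):
        return (dim * math.log(original_max_position_embeddings
                               / (num_rot * 2 * math.pi))) / (2 * math.log(base))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim // 2 - 1)

    scaled = []
    for i, f in enumerate(inv_freq):
        # ramp is 0 below `low` (keep f: pure extrapolation) and rises to 1
        # above `high` (divide by `factor`: pure interpolation).
        ramp = min(max((i - low) / max(high - low, 1), 0.0), 1.0)
        scaled.append(f * (1.0 - ramp) + (f / factor) * ramp)
    return scaled

# For Qwen2.5's 128-dimensional attention heads:
freqs_128k = yarn_inv_freqs(128)
```

Note that YaRN also scales attention scores by roughly 0.1 * ln(factor) + 1, so a complete patch would touch the attention computation as well; that is why editing config.json alone is not enough.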