
Conversation

@Isotr0py (Member) commented Sep 23, 2025

Purpose

Test Plan

python examples/offline_inference/basic/generate.py --model Qwen/Qwen3-30B-A3B-GPTQ-Int4
python examples/offline_inference/basic/generate.py --model jart25/Qwen3-Coder-30B-A3B-Instruct-Int4-gptq
python examples/offline_inference/basic/generate.py --model Intel/Qwen3-30B-A3B-Instruct-2507-int4-AutoRound

Test Result

Qwen3-MoE models, with or without the MoE gate quantized, can all generate normal outputs without the GPTQ patch now.


@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a more robust and cleaner way to handle unquantized modules in GPTQ models by adding the modules_in_block_to_quantize field to the configuration. This replaces the previous hacky approach and resolves existing issues, which is a great improvement for maintainability. The implementation correctly infers the modules to quantize from the model's checkpoint metadata when not explicitly provided. I have one suggestion regarding code duplication between GPTQConfig and GPTQMarlinConfig that would further enhance the maintainability of this new logic.
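
For reference, modules_in_block_to_quantize is the field inside a checkpoint's quantization_config that lists which sub-modules of each decoder block actually carry GPTQ weights. A minimal sketch of what such a config can look like, written as a Python dict; the module names and surrounding keys are illustrative for a Qwen3-MoE-style block, not copied from a real checkpoint:

    # Illustrative only: names are assumptions, not taken from an actual
    # Qwen3-MoE GPTQ checkpoint.
    quantization_config = {
        "quant_method": "gptq",
        "bits": 4,
        "group_size": 128,
        # One inner list per group of modules quantized within a block.
        # Modules not listed here (e.g. the MoE gate) stay unquantized,
        # which is what this PR now respects instead of per-model patches.
        "modules_in_block_to_quantize": [
            [
                "self_attn.q_proj",
                "self_attn.k_proj",
                "self_attn.v_proj",
                "self_attn.o_proj",
            ],
            [
                "mlp.experts.0.gate_proj",
                "mlp.experts.0.up_proj",
                "mlp.experts.0.down_proj",
            ],
        ],
    }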

Comment on lines 241 to 261
    def apply_vllm_mapper(self, hf_to_vllm_mapper):
        if self.modules_in_block_to_quantize is not None:
            self.modules_in_block_to_quantize = hf_to_vllm_mapper.apply_list(
                self.modules_in_block_to_quantize)

    def maybe_update_config(self,
                            model_name: str,
                            revision: Optional[str] = None):
        if self.modules_in_block_to_quantize:
            return

        unquant_dtypes = [torch.float16, torch.bfloat16, torch.float32]
        metadata = get_safetensors_params_metadata(model_name,
                                                   revision=revision)
        quant_layers: set[str] = {
            param_name.rsplit(".", 1)[0]
            for param_name, info in metadata.items()
            if (dtype := info.get('dtype', None))
            and _SAFETENSORS_TO_TORCH_DTYPE[dtype] not in unquant_dtypes
        }
        self.modules_in_block_to_quantize = list(quant_layers)
Severity: high

The methods apply_vllm_mapper and maybe_update_config are identical to the ones in vllm/model_executor/layers/quantization/gptq.py. This code duplication can lead to maintenance issues, where a change in one file might not be propagated to the other, potentially causing bugs.

To improve maintainability, I recommend refactoring this shared logic into a common base class or a mixin. For example, you could create a GPTQConfigMixin class to house these methods, and then have both GPTQConfig and GPTQMarlinConfig inherit from it. This would ensure that the logic for handling modules_in_block_to_quantize is defined in a single place.
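
A minimal sketch of that suggestion, assuming the method bodies stay exactly as in this diff; the mixin name and the simplified class hierarchy are illustrative, not vLLM's actual code:

    from typing import Optional


    class GPTQConfigMixin:
        """Shared handling of modules_in_block_to_quantize."""

        modules_in_block_to_quantize: Optional[list[str]] = None

        def apply_vllm_mapper(self, hf_to_vllm_mapper) -> None:
            # Remap HF-style module names to vLLM names, if a list is set.
            if self.modules_in_block_to_quantize is not None:
                self.modules_in_block_to_quantize = hf_to_vllm_mapper.apply_list(
                    self.modules_in_block_to_quantize)


    class GPTQConfig(GPTQConfigMixin):
        ...


    class GPTQMarlinConfig(GPTQConfigMixin):
        ...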

@Isotr0py requested a review from jeejeelee, September 24, 2025 04:09
@jeejeelee (Collaborator)

Can you check and clean up other models that contain _maybe_ignore_quant_config as well?

@Isotr0py (Member, Author)

> Can you check and clean up other models that contain _maybe_ignore_quant_config as well?

Removed _maybe_ignore_quant_config from keye, minicpmo, and ovis. I've confirmed that all of these models can still be loaded after the cleanup. (Ovis is broken at inference for unrelated reasons.)

@jeejeelee (Collaborator) left a comment

Thank you for the improvement. Perhaps we can apply this logic to other quantization methods as well.

@Isotr0py enabled auto-merge (squash), September 24, 2025 16:04
@github-actions bot added the "ready" label (ONLY add when PR is ready to merge/full CI is needed), Sep 24, 2025
@Isotr0py merged commit d4d9899 into vllm-project:main, Sep 26, 2025
54 checks passed
@Isotr0py deleted the gptq-skip branch, September 26, 2025 16:17
@JartX (Contributor) commented Sep 28, 2025

Hi @Isotr0py, this PR breaks model loading for the following model:

jart25/Qwen3-Next-80B-A3B-Instruct-Int4-GPTQ

The model was generated using AutoRound-GPTQ, like the previous ones.
I've verified that reverting this PR makes loading work again. Here's the error:

Loading safetensors checkpoint shards:   0% 0/9 [00:00<?, ?it/s](Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597] WorkerProc failed to start.
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597] Traceback (most recent call last):
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 571, in worker_main
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]     worker = WorkerProc(*args, **kwargs)
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 437, in __init__
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]     self.worker.load_model()
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 213, in load_model
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]     self.model_runner.load_model(eep_scale_up=eep_scale_up)
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 2712, in load_model
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]     self.model = model_loader.load_model(
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]                  ^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 50, in load_model
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]     self.load_weights(model, model_config)
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/default_loader.py", line 264, in load_weights
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]     loaded_weights = model.load_weights(
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]                      ^^^^^^^^^^^^^^^^^^^
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1185, in load_weights
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]     return loader.load_weights(weights)    raise ValueError(msg)
vllm1-1  | (Worker_TP3 pid=241) ERROR 09-28 08:46:50 [multiproc_executor.py:597] ValueError: There is no module or parameter named 'layers.10.mlp.experts.0' in Qwen3NextModel
vllm1-1  | (Worker_TP2 pid=240) ERROR 09-28 08:46:50 [multiproc_executor.py:597] WorkerProc failed to start.

vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 297, in load_weights
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]     autoloaded_weights = set(self._load_module("", self.module, weights))
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 255, in _load_module
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]     yield from self._load_module(prefix,
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/utils.py", line 228, in _load_module
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]     loaded_params = module_load_weights(weights)
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1021, in load_weights
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]     param = params_dict[name]
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597]             ~~~~~~~~~~~^^^^^^
vllm1-1  | (Worker_TP0 pid=238) ERROR 09-28 07:55:52 [multiproc_executor.py:597] KeyError: 'layers.7.self_attn.qkqkv_proj.g_idx'
Loading safetensors checkpoint shards:   0% 0/9 [00:02<?, ?it/s]
vllm1-1  | (Worker_TP1 pid=239) INFO 09-28 07:55:52 [multiproc_executor.py:558] Parent process exited, terminating worker

@Isotr0py (Member, Author)

Oops, sorry for breaking this. Let me fix it. 😅

@JartX (Contributor) commented Sep 28, 2025

@Isotr0py Don't worry, development stuff ☺️

@Isotr0py (Member, Author)

@JartX I double-checked the quant_config in config.json of jart25/Qwen3-Next-80B-A3B-Instruct-Int4-GPTQ and found that "modules_in_block_to_quantize" is missing the self-attention layer's projections, which are also quantized in the checkpoint. Perhaps there is a bug in AutoRound's GPTQ quantizer?

    "modules_in_block_to_quantize": [
      [
+++     "self_attn.q_proj",
+++     "self_attn.k_proj",
+++     "self_attn.v_proj",
+++     "self_attn.o_proj",
        "linear_attn.in_proj_qkvz",
        "linear_attn.in_proj_ba",
        "linear_attn.out_proj",

With the above changes in config.json, the model runs normally now:

$ python examples/offline_inference/basic/generate.py --model /run//user/1001/hf_model/Qwen3-Next-80B-A3B-Instruct-Int4-GPTQ/ --max-model-len 4096 --enforce-eager
INFO 09-29 00:05:18 [__init__.py:216] Automatically detected platform cuda.
INFO 09-29 00:05:20 [utils.py:233] non-default args: {'max_model_len': 4096, 'num_redundant_experts': None, 'eplb_window_size': None, 'eplb_step_interval': None, 'eplb_log_balancedness': None, 'enforce_eager': True, 'enable_lora': None, 'model': '/run//user/1001/hf_model/Qwen3-Next-80B-A3B-Instruct-Int4-GPTQ/'}
INFO 09-29 00:05:20 [model.py:552] Resolved architecture: Qwen3NextForCausalLM
`torch_dtype` is deprecated! Use `dtype` instead!
INFO 09-29 00:05:20 [model.py:1515] Using max model len 4096
INFO 09-29 00:05:21 [gptq_marlin.py:191] The model is convertible to gptq_marlin during runtime. Using gptq_marlin kernel.
INFO 09-29 00:05:21 [scheduler.py:205] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 09-29 00:05:21 [config.py:297] Hybrid or mamba-based model detected: disabling prefix caching since it is not yet supported.
INFO 09-29 00:05:21 [config.py:308] Hybrid or mamba-based model detected: setting cudagraph mode to FULL_AND_PIECEWISE in order to optimize performance.
INFO 09-29 00:05:24 [config.py:377] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
INFO 09-29 00:05:24 [config.py:398] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
INFO 09-29 00:05:24 [__init__.py:382] Cudagraph is disabled under eager mode
WARNING 09-29 00:05:24 [__init__.py:3035] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/usage/troubleshooting.html#python-multiprocessing for more information. Reasons: CUDA is initialized
INFO 09-29 00:05:26 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=1925120) INFO 09-29 00:05:29 [core.py:648] Waiting for init message from front-end.
(EngineCore_DP0 pid=1925120) INFO 09-29 00:05:29 [core.py:78] Initializing a V1 LLM engine (v0.10.2rc2.dev114+geddaafc1c) with config: model='/run//user/1001/hf_model/Qwen3-Next-80B-A3B-Instruct-Int4-GPTQ/', speculative_config=None, tokenizer='/run//user/1001/hf_model/Qwen3-Next-80B-A3B-Instruct-Int4-GPTQ/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=/run//user/1001/hf_model/Qwen3-Next-80B-A3B-Instruct-Int4-GPTQ/, enable_prefix_caching=False, chunked_prefill_enabled=True, pooler_config=None, compilation_config={"level":0,"debug_dump_path":null,"cache_dir":"","backend":"","custom_ops":[],"splitting_ops":null,"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":0,"use_cudagraph":false,"cudagraph_num_of_warmups":0,"cudagraph_capture_sizes":[],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"use_inductor_graph_partition":false,"pass_config":{},"max_capture_size":0,"local_cache_dir":null}
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=1925120) INFO 09-29 00:05:30 [parallel_state.py:1208] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=1925120) WARNING 09-29 00:05:30 [topk_topp_sampler.py:66] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(EngineCore_DP0 pid=1925120) INFO 09-29 00:05:30 [gpu_model_runner.py:2679] Starting to load model /run//user/1001/hf_model/Qwen3-Next-80B-A3B-Instruct-Int4-GPTQ/...
(EngineCore_DP0 pid=1925120) INFO 09-29 00:05:30 [gpu_model_runner.py:2711] Loading model from scratch...
(EngineCore_DP0 pid=1925120) INFO 09-29 00:05:30 [gptq_marlin.py:316] Using MacheteLinearKernel for GPTQMarlinLinearMethod
(EngineCore_DP0 pid=1925120) INFO 09-29 00:05:30 [gptq_marlin.py:316] Using MarlinLinearKernel for GPTQMarlinLinearMethod
(EngineCore_DP0 pid=1925120) `torch_dtype` is deprecated! Use `dtype` instead!
(EngineCore_DP0 pid=1925120) INFO 09-29 00:05:30 [cuda.py:347] Using Flash Attention backend on V1 engine.
Loading safetensors checkpoint shards:   0% Completed | 0/9 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  11% Completed | 1/9 [00:03<00:26,  3.36s/it]
Loading safetensors checkpoint shards:  22% Completed | 2/9 [00:06<00:24,  3.51s/it]
Loading safetensors checkpoint shards:  33% Completed | 3/9 [00:10<00:20,  3.45s/it]
Loading safetensors checkpoint shards:  44% Completed | 4/9 [00:13<00:17,  3.46s/it]
Loading safetensors checkpoint shards:  56% Completed | 5/9 [00:17<00:13,  3.45s/it]
Loading safetensors checkpoint shards:  67% Completed | 6/9 [00:18<00:08,  2.84s/it]
Loading safetensors checkpoint shards:  78% Completed | 7/9 [00:22<00:06,  3.05s/it]
Loading safetensors checkpoint shards:  89% Completed | 8/9 [00:25<00:03,  3.10s/it]
Loading safetensors checkpoint shards: 100% Completed | 9/9 [00:29<00:00,  3.22s/it]
Loading safetensors checkpoint shards: 100% Completed | 9/9 [00:29<00:00,  3.23s/it]
(EngineCore_DP0 pid=1925120) 
(EngineCore_DP0 pid=1925120) INFO 09-29 00:06:00 [default_loader.py:267] Loading weights took 29.17 seconds
(EngineCore_DP0 pid=1925120) INFO 09-29 00:06:03 [gpu_model_runner.py:2730] Model loading took 39.3258 GiB and 32.666466 seconds
(EngineCore_DP0 pid=1925120) INFO 09-29 00:06:05 [gpu_worker.py:298] Available KV cache memory: 80.76 GiB
(EngineCore_DP0 pid=1925120) INFO 09-29 00:06:06 [kv_cache_utils.py:1087] GPU KV cache size: 881,824 tokens
(EngineCore_DP0 pid=1925120) INFO 09-29 00:06:06 [kv_cache_utils.py:1091] Maximum concurrency for 4,096 tokens per request: 589.64x
(EngineCore_DP0 pid=1925120) WARNING 09-29 00:06:06 [cudagraph_dispatcher.py:105] cudagraph dispatching keys are not initialized. No cudagraph will be used.
(EngineCore_DP0 pid=1925120) INFO 09-29 00:06:06 [core.py:211] init engine (profile, create kv cache, warmup model) took 2.61 seconds
(EngineCore_DP0 pid=1925120) INFO 09-29 00:06:06 [__init__.py:382] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=1925120) INFO 09-29 00:06:06 [gc_utils.py:41] GC Debug Config. enabled:False,top_objects:-1
INFO 09-29 00:06:06 [loggers.py:147] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 6486
INFO 09-29 00:06:06 [llm.py:306] Supported_tasks: ['generate']
WARNING 09-29 00:06:06 [model.py:1394] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
Adding requests: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 452.24it/s]
Processed prompts:   0%|                                                                                 | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

@JartX (Contributor) commented Sep 28, 2025

@Isotr0py You're right, it works, and AutoRound doesn't declare those projections in modules_in_block_to_quantize. Does that mean it used to work "by chance"? I'm going to open an issue in AutoRound.

@Isotr0py (Member, Author)

> Does that mean it used to work "by chance"?

That's right. modules_in_block_to_quantize was never used in vLLM before this PR, so all modules in GPTQ models were treated as quantized, and we had to exclude unquantized modules manually with _maybe_ignore_quant_config.
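
In rough terms, the decision now looks something like this (an illustrative sketch only, with made-up names, not vLLM's actual code):

    def is_module_quantized(prefix: str,
                            modules_in_block_to_quantize: list[str] | None) -> bool:
        """Decide whether a module should expect GPTQ-quantized weights."""
        if modules_in_block_to_quantize is None:
            # Old behaviour: every linear module was assumed to be quantized,
            # with exceptions hard-coded per model via _maybe_ignore_quant_config.
            return True
        # New behaviour: only modules declared in the config are quantized.
        return any(prefix.endswith(name) for name in modules_in_block_to_quantize)


    # e.g. the MoE gate stays unquantized unless explicitly listed:
    print(is_module_quantized("model.layers.0.mlp.gate", ["q_proj", "k_proj"]))  # False
    print(is_module_quantized("model.layers.0.self_attn.q_proj", ["q_proj"]))    # True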
