
Conversation

@Isotr0py
Member

@Isotr0py Isotr0py commented Mar 13, 2025

FIX #14753 #15480 (link existing issues this PR will resolve)
Continuing #12186 here as well, because it's a bother to rebase it with similar modifications.

Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@Isotr0py Isotr0py requested a review from mgoin March 13, 2025 15:42
Signed-off-by: Isotr0py <2037008807@qq.com>
@mergify mergify bot added the ci/build label Mar 13, 2025
Signed-off-by: Isotr0py <2037008807@qq.com>
@Isotr0py Isotr0py changed the title [Quantization] Add Gemma2 and Gemma3 GGUF support [Quantization] Add Gemma2 and Gemma3 text model GGUF support Mar 13, 2025
Signed-off-by: Isotr0py <2037008807@qq.com>
@Isotr0py
Member Author

Isotr0py commented Mar 14, 2025

Evaluation Result

gemma-2-2b-it

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.470 | ± 0.0354 |
| | | strict-match | 5 | exact_match | 0.465 | ± 0.0354 |

gemma-2-2b-it-Q4_K_M-Result

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.42 | ± 0.035 |
| | | strict-match | 5 | exact_match | 0.42 | ± 0.035 |

gemma3-1b-it-unquantized

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| arc_easy | 1 | none | 0 | acc | 0.64 | ± 0.0482 |
| | | none | 0 | acc_norm | 0.68 | ± 0.0469 |

gemma3-1b-it-Q4_K_M

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| arc_easy | 1 | none | 0 | acc | 0.67 | ± 0.0473 |
| | | none | 0 | acc_norm | 0.65 | ± 0.0479 |

Not sure if it's because of the accuracy issue on the xformers backend, but the gsm8k scores for the unquantized and quantized gemma-3-1b-it models are both 0 locally on my side, so I switched to arc_easy for evaluation.
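
For anyone who wants to reproduce: an lm-evaluation-harness command roughly like the one below matches the arc_easy setup above (the model_args and limit here are placeholders rather than the exact settings I used; for the GGUF rows, pretrained would point at the local .gguf file instead):

lm_eval --model vllm \
    --model_args pretrained=google/gemma-3-1b-it,dtype=auto \
    --tasks arc_easy --num_fewshot 0 --limit 100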

@rhajou

rhajou commented Mar 19, 2025

hello, any update on this ?

@mpetruc

mpetruc commented Mar 21, 2025

I'm not sure if the error below is due to user error or because gemma-3 GGUF is not supported in 0.8.1:

vllm serve gemma-3-27b-it-Q6_K.gguf --port $5000 --trust-remote-code --device cuda --gpu-memory-utilization 0.95 --enable-chunked-prefill --swap-space 1 --max_model_len 44592

INFO 03-20 19:11:29 [__init__.py:256] Automatically detected platform cuda.
INFO 03-20 19:11:31 [api_server.py:977] vLLM API server version 0.8.1
INFO 03-20 19:11:31 [api_server.py:978] args: Namespace(subparser='serve', model_tag='/home/petrucm/data/models/gemma3/gemma-3-27b-it-Q6_K.gguf', config='', [...]

Traceback (most recent call last):
  File "/mnt/pixstor/data/petrucm/.vllm/bin/vllm", line 10, in <module>
    sys.exit(main())
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 75, in main
    args.dispatch_function(args)
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 33, in cmd
    uvloop.run(run_server(args))
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
  File "/home/petrucm/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
  File "/home/petrucm/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1012, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/home/petrucm/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 141, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/home/petrucm/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 161, in build_async_engine_client_from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context=usage_context)
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1206, in create_engine_config
    model_config = self.create_model_config()
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1121, in create_model_config
    return ModelConfig(
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/vllm/config.py", line 333, in __init__
    hf_config = get_config(self.hf_config_path or self.model,
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/vllm/transformers_utils/config.py", line 280, in get_config
    config_dict, _ = PretrainedConfig.get_config_dict(
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/transformers/configuration_utils.py", line 594, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/transformers/configuration_utils.py", line 685, in _get_config_dict
    config_dict = load_gguf_checkpoint(resolved_config_file, return_tensors=False)["config"]
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/transformers/modeling_gguf_pytorch_utils.py", line 399, in load_gguf_checkpoint
    raise ValueError(f"GGUF model with architecture {architecture} is not supported yet.")
ValueError: GGUF model with architecture gemma3 is not supported yet.
Same error with and without export VLLM_USE_V1=1

edit: same error with vLLM API server version 0.8.2.dev31+g2b22290c

@Isotr0py
Member Author

Isotr0py commented Mar 21, 2025

@mpetruc You need to add --hf-config-path google/gemma-3-27b-it to use the original model config, because transformers doesn't support config conversion from GGUF models for gemma3 yet.
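For example, with a locally downloaded GGUF file (the file name below is just a placeholder):

vllm serve ./gemma-3-27b-it-Q6_K.gguf --hf-config-path google/gemma-3-27b-it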

@leslliesayrus

leslliesayrus commented Mar 22, 2025

@mpetruc You need to add --hf-config-path google/gemma-3-27b-it to use the original model config, because transformers doesn't support config conversion from GGUF models for gemma3 yet.

I tried the same thing with the --hf-config-path added, but I still got the same error. Do you know what could be the issue?

Command used:

vllm serve "gemma-3-27b-it-Q4_K_M.gguf" --trust-remote-code --hf-config-path google/gemma-3-27b-it

My environment:
Transformers: 4.50.0
vLLM: 0.8.1 (I got the same error with 0.8.2)

Error:

Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 8, in <module>
    sys.exit(main())

ValueError: GGUF model with architecture gemma3 is not supported yet.

@Isotr0py
Member Author

Isotr0py commented Mar 22, 2025

@leslliesayrus Oh, you also need to add --tokenizer google/gemma-3-27b-it, otherwise it will try to convert the tokenizer from the GGUF model through transformers as well.

Besides this, you need to install gguf from source as well, because its Gemma3 support hasn't been released yet: https://github.com/ggml-org/llama.cpp/tree/master/gguf-py#development.
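A minimal way to do that (assuming pip can reach GitHub directly) is to install the gguf-py package straight from the llama.cpp repository; cloning llama.cpp and running pip install ./gguf-py inside the clone works as well:

pip install "git+https://github.com/ggml-org/llama.cpp.git#subdirectory=gguf-py"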

You can try using this command for serving:

vllm serve "gemma-3-27b-it-Q4_K_M.gguf" --trust-remote-code --hf-config-path unsloth/gemma-3-27b-it -tp 2 --max-model-len 4096 --tokenizer unsloth/gemma-3-27b-it --hf-overrides '{"architectures": ["Gemma3ForCausalLM"]}'

Let me update this PR with more user-friendly error messages...

@Isotr0py Isotr0py marked this pull request as draft March 22, 2025 16:58
@anunknowperson

Hey, can I use gemma3 GGUF with vision? Or is this text-only?

@Isotr0py
Member Author

@anunknowperson This is text-only.

@skytect

skytect commented Mar 24, 2025

I am getting AttributeError: 'Gemma3Config' object has no attribute 'num_hidden_layers'

  File ".../vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 1301, in _get_gguf_weights_map
    num_layers = config.num_hidden_layers
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../vllm/lib/python3.12/site-packages/transformers/configuration_utils.py", line 214, in __getattribute__
    return super().__getattribute__(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Gemma3Config' object has no attribute 'num_hidden_layers'

Same command you mentioned:
vllm serve /home/user/models/gemma-3-27b-GGUF/gemma-3-27b-it-Q4_K_M.gguf --trust-remote-code --hf-config-path unsloth/gemma-3-27b-it -tp 2 --max-model-len 4096 --tokenizer unsloth/gemma-3-27b-it --hf-overrides '{"architectures": ["Gemma3ForCausalLM"]}'

transformers: 4.50.0
gguf: built from source https://github.com/ggml-org/llama.cpp/tree/77f9c6bbe55fccd9ea567794024cb80943947901
vllm: built from source https://github.com/Isotr0py/vllm/tree/6f9adf613da104ba70bdae956869171717eda0e9

weights used: https://huggingface.co/unsloth/gemma-3-27b-it-GGUF/blob/main/gemma-3-27b-it-Q4_K_M.gguf
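
In case it helps with debugging, my guess (not verified) is that the multimodal Gemma3Config keeps the text-model fields in a nested text_config, so num_hidden_layers only exists on the sub-config, not on the top-level object. A quick check with transformers 4.50:

from transformers import AutoConfig

# gemma-3-27b-it resolves to the multimodal Gemma3Config
config = AutoConfig.from_pretrained("unsloth/gemma-3-27b-it")
print(hasattr(config, "num_hidden_layers"))    # False, matching the AttributeError above
print(config.text_config.num_hidden_layers)    # the layer count lives on the nested text config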

@mergify

mergify bot commented Apr 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Isotr0py.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 1, 2025
@zc142365

zc142365 commented Apr 9, 2025

What version of vllm will this pull be merged in?

@JohnConnor123

What version of vllm will this pull be merged in?

Also interested

@Isotr0py
Member Author

What version of vllm will this pull be merged in?

Hmmm, I would like to wait for a transformers release that includes huggingface/transformers#37424, so that we don't need to pass --hf-config-path, especially since Gemma3 has VLM variants...

Installing transformers from the main branch should also work with this PR.
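
A quick way to check whether your transformers build already includes that change (the file name below is a placeholder for your local GGUF) is to load the config directly from the GGUF file:

from transformers import AutoConfig

# Raises "GGUF model with architecture gemma3 is not supported yet." on releases
# that predate huggingface/transformers#37424; succeeds on a recent enough main branch.
config = AutoConfig.from_pretrained(".", gguf_file="gemma-3-27b-it-Q4_K_M.gguf")
print(type(config).__name__)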

@akash-agr

akash-agr commented Apr 19, 2025

Hi @Isotr0py, I believe the https://github.com/huggingface/transformers/pull/37424 pull request has been merged. When can we expect this PR to be merged? Thanks a lot for your efforts.

@surak

surak commented Apr 22, 2025

LGTM

@jayyang-zigbang

jayyang-zigbang commented Apr 23, 2025

@Isotr0py first of all, thank you very much for your efforts. I finally got Gemma3-4b-it GGUF running through vllm without error, but there is still an issue when I run a gemma3 GGUF model.
I get empty output when using google/gemma-3-4b-it-qat-q4_0-gguf, but google/gemma-3-1b-it-qat-q4_0-gguf is fine.
I know the original gemma-3-4b-it model has a BF16 problem on T4 GPUs that produces empty output, but I thought gemma-3-4b-it-qat-q4_0-gguf fixed the F16 problem, so it should work.
Any idea?

@akash-agr

Hi @DarkLight1337 @Isotr0py, I am still facing an error when running the Gemma 3 GGUF models. Can you please help us, as we intend to use these models in production and can't do it without the parallel-processing support vllm provides? Thank you so much for your efforts.

Here's the code with the error:

from vllm import LLM, SamplingParams
import os

# Set Hugging Face token for authentication
os.environ["HUGGING_FACE_HUB_TOKEN"] = "hf_test"

llm = LLM(model="google/gemma-3-27b-it-qat-q4_0-gguf")  ## "unsloth/gemma-3-27b-it-GGUF"

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

prompts = [
    "Hello, my name is",
    "The president of the United States is"
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

INFO 04-29 08:17:24 [__init__.py:239] Automatically detected platform cuda.
Traceback (most recent call last):
File "/root/vllm/vllm/transformers_utils/config.py", line 287, in get_config
raise ValueError(
ValueError: Could not detect config format for no config file found. Ensure your model has either config.json (HF format) or params.json (Mistral format).

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/root/vllm_test/gemma3_tes.py", line 7, in
llm = LLM(model="google/gemma-3-27b-it-qat-q4_0-gguf") ## "unsloth/gemma-3-27b-it-GGUF"
File "/root/vllm/vllm/utils.py", line 1161, in inner
return fn(*args, **kwargs)
File "/root/vllm/vllm/entrypoints/llm.py", line 247, in init
self.llm_engine = LLMEngine.from_engine_args(
File "/root/vllm/vllm/engine/llm_engine.py", line 503, in from_engine_args
vllm_config = engine_args.create_engine_config(usage_context)
File "/root/vllm/vllm/engine/arg_utils.py", line 1091, in create_engine_config
model_config = self.create_model_config()
File "/root/vllm/vllm/engine/arg_utils.py", line 979, in create_model_config
return ModelConfig(
File "/root/vllm/vllm/config.py", line 450, in init
hf_config = get_config(self.hf_config_path or self.model,
File "/root/vllm/vllm/transformers_utils/config.py", line 304, in get_config
raise ValueError(error_message) from e
ValueError: Invalid repository ID or local directory specified: 'google/gemma-3-27b-it-qat-q4_0-gguf'.
Please verify the following requirements:

  1. Provide a valid Hugging Face repository ID.
  2. Specify a local directory that contains a recognized configuration file.
    • For Hugging Face models: ensure the presence of a 'config.json'.
    • For Mistral models: ensure the presence of a 'params.json'.

@Isotr0py
Member Author

@akash-agr You should use the model path of your local gguf checkpoint.
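
Something along these lines should work (the model and tokenizer names below are just an example, and I'm passing hf_config_path through the LLM kwargs, which should be forwarded to the engine args; it's only needed until the transformers release mentioned above):

from huggingface_hub import hf_hub_download
from vllm import LLM

# Download the GGUF file and hand its local path to vLLM
model_path = hf_hub_download("google/gemma-3-4b-it-qat-q4_0-gguf", filename="gemma-3-4b-it-q4_0.gguf")
llm = LLM(
    model=model_path,
    tokenizer="google/gemma-3-4b-it",       # use the original tokenizer, not the GGUF one
    hf_config_path="google/gemma-3-4b-it",  # use the original config until transformers handles gemma3 GGUF
)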

@akash-agr

akash-agr commented Apr 29, 2025

Sorry to bother you @Isotr0py

I have downloaded the model and am using the command below, but I am getting the following error:

vllm serve --model=~/gemma-3-4b-it-q4_0.gguf --gpu_memory_utilization=0.95 --max-model-len 4096

INFO 04-29 17:21:56 [__init__.py:239] Automatically detected platform cuda.
INFO 04-29 17:22:03 [api_server.py:1043] vLLM API server version 0.8.5.dev337+gd3cf61b89
INFO 04-29 17:22:03 [api_server.py:1044] args: Namespace(subparser='serve', model_tag=None, config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/gemma-3-4b-it-q4_0.gguf', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=4096, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.95, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd 
at 0x7b9f31734310>)
Traceback (most recent call last):
File "/root/vllm/vllm/transformers_utils/config.py", line 279, in get_config
if is_gguf or file_or_path_exists(
File "/root/vllm/vllm/transformers_utils/config.py", line 173, in file_or_path_exists
cached_filepath = try_to_load_from_cache(repo_id=model,
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
validate_repo_id(arg_value)
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/validators.py", line 160, in validate_repo_id
raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars or '-', '
', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: '
/gemma-3-4b-it-q4_0.gguf'.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/bin/vllm", line 8, in
sys.exit(main())
File "/root/vllm/vllm/entrypoints/cli/main.py", line 53, in main
args.dispatch_function(args)
File "/root/vllm/vllm/entrypoints/cli/serve.py", line 27, in cmd
uvloop.run(run_server(args))
File "/usr/local/lib/python3.10/dist-packages/uvloop/init.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.10/dist-packages/uvloop/init.py", line 61, in wrapper
return await main
File "/root/vllm/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.10/contextlib.py", line 199, in aenter
return await anext(self.gen)
File "/root/vllm/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.10/contextlib.py", line 199, in aenter
return await anext(self.gen)
File "/root/vllm/vllm/entrypoints/openai/api_server.py", line 166, in build_async_engine_client_from_engine_args
vllm_config = engine_args.create_engine_config(usage_context=usage_context)
File "/root/vllm/vllm/engine/arg_utils.py", line 1100, in create_engine_config
model_config = self.create_model_config()
File "/root/vllm/vllm/engine/arg_utils.py", line 988, in create_model_config
return ModelConfig(
File "/root/vllm/vllm/config.py", line 460, in init
hf_config = get_config(self.hf_config_path or self.model,
File "/root/vllm/vllm/transformers_utils/config.py", line 304, in get_config
raise ValueError(error_message) from e
ValueError: Invalid repository ID or local directory specified: '~/gemma-3-4b-it-q4_0.gguf'.
Please verify the following requirements:

  1. Provide a valid Hugging Face repository ID.
  2. Specify a local directory that contains a recognized configuration file.
    • For Hugging Face models: ensure the presence of a 'config.json'.
    • For Mistral models: ensure the presence of a 'params.json'.

I also tried the script below with the tokenizer specified, but I am getting the "KeyError: 'general.name'" error shown underneath.

from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams
import os

# Set Hugging Face token for authentication
os.environ["HUGGING_FACE_HUB_TOKEN"] = "hf_test"

def run_gguf_inference(model_path, tokenizer):
    # Sample prompts.
    prompts = [
        "How many helicopters can a human eat in one sitting?",
        "What's the future of AI?",
    ]
    prompts = [[{"role": "user", "content": prompt}] for prompt in prompts]
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0, max_tokens=128)

    # Create an LLM.
    llm = LLM(model=model_path, tokenizer=tokenizer)

    outputs = llm.chat(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    repo_id = "google/gemma-3-4b-it-qat-q4_0-gguf"
    filename = "gemma-3-4b-it-q4_0.gguf"
    tokenizer = "google/gemma-3-4b-it"
    model = hf_hub_download(repo_id, filename=filename)
    print('model', model)
    run_gguf_inference(model, tokenizer)

Error

INFO 04-29 17:17:47 [__init__.py:239] Automatically detected platform cuda.
gemma-3-4b-it-q4_0.gguf: 100%|███████████████████| 3.16G/3.16G [00:28<00:00, 109MB/s]
model /root/.cache/huggingface/hub/models--google--gemma-3-4b-it-qat-q4_0-gguf/snapshots/15f73f5eee9c28f53afefef5723e29680c2fc78a/gemma-3-4b-it-q4_0.gguf
Traceback (most recent call last):
File "/root/vllm_test/gemma3_gguf.py", line 35, in
run_gguf_inference(model, tokenizer)
File "/root/vllm_test/gemma3_gguf.py", line 19, in run_gguf_inference
llm = LLM(model=model_path, tokenizer=tokenizer)
File "/root/vllm/vllm/utils.py", line 1161, in inner
return fn(*args, **kwargs)
File "/root/vllm/vllm/entrypoints/llm.py", line 247, in init
self.llm_engine = LLMEngine.from_engine_args(
File "/root/vllm/vllm/engine/llm_engine.py", line 503, in from_engine_args
vllm_config = engine_args.create_engine_config(usage_context)
File "/root/vllm/vllm/engine/arg_utils.py", line 1100, in create_engine_config
model_config = self.create_model_config()
File "/root/vllm/vllm/engine/arg_utils.py", line 988, in create_model_config
return ModelConfig(
File "/root/vllm/vllm/config.py", line 460, in init
hf_config = get_config(self.hf_config_path or self.model,
File "/root/vllm/vllm/transformers_utils/config.py", line 307, in get_config
config_dict, _ = PretrainedConfig.get_config_dict(
File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 590, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 681, in _get_config_dict
config_dict = load_gguf_checkpoint(resolved_config_file, return_tensors=False)["config"]
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_gguf_pytorch_utils.py", line 369, in load_gguf_checkpoint
model_name = read_field(reader, "general.name")
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_gguf_pytorch_utils.py", line 260, in read_field
value = reader.fields[field]
KeyError: 'general.name'

I have cloned the latest code from the main branch of vllm. Please let me know if I need to update the transformers version somehow. Since this PR is merged, I am not sure if I am using a compatible version. Kindly let me know if I am making some other mistake altogether.

Thanks a lot for your efforts. It will help me in taking my app to production 🙏

@FremyCompany

FYI, with recent triton builds one needs to modify vllm-gemma3/vllm/triton_utils/custom_cache_manager.py, as the default folders are no longer stored in torch.runtime.cache. For now, I modified the triton cache.py file to get this to run. See this commit for details of the breaking change in triton: triton-lang/triton@8505252#diff-1e6a363bbde516739874d46d8ee06c60a7a76f194275b0f4d46638a0b06af8acL6-R9.

FWIW, even after fixing this I didn't manage to get things to work, and I ran out of time, so I will put this on the back burner for now. Hopefully this PR gets merged into the main branch at some point :)

@acsezen

acsezen commented Sep 7, 2025

@Isotr0py first of all, thank you very much for your efforts. I finally got Gemma3-4b-it GGUF running through vllm without error, but there is still an issue when I run a gemma3 GGUF model. I get empty output when using google/gemma-3-4b-it-qat-q4_0-gguf, but google/gemma-3-1b-it-qat-q4_0-gguf is fine. I know the original gemma-3-4b-it model has a BF16 problem on T4 GPUs that produces empty output, but I thought gemma-3-4b-it-qat-q4_0-gguf fixed the F16 problem, so it should work. Any idea?

Hello @jayyang-zigbang,

How did you manage to run it? What parameters did you use?

Thanks!

@jayyang-zigbang

@Isotr0py first of all, thank you very much for your efforts. I finally got Gemma3-4b-it GGUF running through vllm without error, but there is still an issue when I run a gemma3 GGUF model. I get empty output when using google/gemma-3-4b-it-qat-q4_0-gguf, but google/gemma-3-1b-it-qat-q4_0-gguf is fine. I know the original gemma-3-4b-it model has a BF16 problem on T4 GPUs that produces empty output, but I thought gemma-3-4b-it-qat-q4_0-gguf fixed the F16 problem, so it should work. Any idea?

Hello @jayyang-zigbang,

How did you manage to run it? What parameters did you use?

Thanks!

@acsezen Unfortunately, I tried it a long time ago and don't remember what parameters I used, but I think going from BF16 to F16 is not easy. Any suggestions?

@qazi0

qazi0 commented Oct 8, 2025

@Isotr0py are you still working on this?

@Isotr0py
Member Author

Isotr0py commented Oct 9, 2025

Superseded by #26189

@Isotr0py Isotr0py closed this Oct 9, 2025
@Isotr0py Isotr0py deleted the gemma3-gguf branch October 9, 2025 08:15
Development

Successfully merging this pull request may close these issues.

[Feature]: Support Gemma3 GGUF