
Conversation

@Isotr0py
Member

@Isotr0py Isotr0py commented Mar 13, 2025

FIX #14753 #15480 (link existing issues this PR will resolve)
Continuing #12186 here as well, because it's a bother to rebase it with similar modifications.

Signed-off-by: Isotr0py <2037008807@qq.com>
Signed-off-by: Isotr0py <2037008807@qq.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@Isotr0py Isotr0py requested a review from mgoin March 13, 2025 15:42
Signed-off-by: Isotr0py <2037008807@qq.com>
@mergify mergify bot added the ci/build label Mar 13, 2025
Signed-off-by: Isotr0py <2037008807@qq.com>
@Isotr0py Isotr0py changed the title [Quantization] Add Gemma2 and Gemma3 GGUF support [Quantization] Add Gemma2 and Gemma3 text model GGUF support Mar 13, 2025
Signed-off-by: Isotr0py <2037008807@qq.com>
@Isotr0py
Member Author

Isotr0py commented Mar 14, 2025

Evaluation Result

gemma-2-2b-it

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.470 | ± 0.0354 |
| | | strict-match | 5 | exact_match | 0.465 | ± 0.0354 |

gemma-2-2b-it-Q4_K_M-Result

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.42 | ± 0.035 |
| | | strict-match | 5 | exact_match | 0.42 | ± 0.035 |

gemma3-1b-it-unquantized

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| arc_easy | 1 | none | 0 | acc | 0.64 | ± 0.0482 |
| | | none | 0 | acc_norm | 0.68 | ± 0.0469 |

gemma3-1b-it-Q4_K_M

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| arc_easy | 1 | none | 0 | acc | 0.67 | ± 0.0473 |
| | | none | 0 | acc_norm | 0.65 | ± 0.0479 |

Not sure if it's because of the accuracy issue on the xformers backend, but the gsm8k scores for the unquantized and quantized gemma-3-1b-it models are both 0 locally on my side, so I switched to arc_easy for evaluation.
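
For anyone who wants to reproduce: an lm-evaluation-harness command roughly like the one below matches the arc_easy setup above (the model_args and limit here are placeholders rather than the exact settings I used; for the GGUF rows, pretrained would point at the local .gguf file instead):

lm_eval --model vllm \
    --model_args pretrained=google/gemma-3-1b-it,dtype=auto \
    --tasks arc_easy --num_fewshot 0 --limit 100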

@rhajou

rhajou commented Mar 19, 2025

hello, any update on this ?

@mpetruc

mpetruc commented Mar 21, 2025

I'm not sure if the error below is due to user error or because gemma-3 GGUF is not supported in 0.8.1:

vllm serve gemma-3-27b-it-Q6_K.gguf --port $5000 --trust-remote-code --device cuda --gpu-memory-utilization 0.95 --enable-chunked-prefill --swap-space 1 --max_model_len 44592

INFO 03-20 19:11:29 [__init__.py:256] Automatically detected platform cuda.
INFO 03-20 19:11:31 [api_server.py:977] vLLM API server version 0.8.1
INFO 03-20 19:11:31 [api_server.py:978] args: Namespace(subparser='serve', model_tag='/home/petrucm/data/models/gemma3/gemma-3-27b-it-Q6_K.gguf', config='', [...]

Traceback (most recent call last):
  File "/mnt/pixstor/data/petrucm/.vllm/bin/vllm", line 10, in <module>
    sys.exit(main())
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 75, in main
    args.dispatch_function(args)
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 33, in cmd
    uvloop.run(run_server(args))
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
  File "/home/petrucm/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
  File "/home/petrucm/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1012, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/home/petrucm/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 141, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/home/petrucm/.local/share/uv/python/cpython-3.12.9-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 161, in build_async_engine_client_from_engine_args
    vllm_config = engine_args.create_engine_config(usage_context=usage_context)
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1206, in create_engine_config
    model_config = self.create_model_config()
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/vllm/engine/arg_utils.py", line 1121, in create_model_config
    return ModelConfig(
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/vllm/config.py", line 333, in __init__
    hf_config = get_config(self.hf_config_path or self.model,
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/vllm/transformers_utils/config.py", line 280, in get_config
    config_dict, _ = PretrainedConfig.get_config_dict(
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/transformers/configuration_utils.py", line 594, in get_config_dict
    config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/transformers/configuration_utils.py", line 685, in _get_config_dict
    config_dict = load_gguf_checkpoint(resolved_config_file, return_tensors=False)["config"]
  File "/mnt/pixstor/data/petrucm/.vllm/lib/python3.12/site-packages/transformers/modeling_gguf_pytorch_utils.py", line 399, in load_gguf_checkpoint
    raise ValueError(f"GGUF model with architecture {architecture} is not supported yet.")
ValueError: GGUF model with architecture gemma3 is not supported yet.
Same error with and without export VLLM_USE_V1=1

edit: same error with vLLM API server version 0.8.2.dev31+g2b22290c

@Isotr0py
Member Author

Isotr0py commented Mar 21, 2025

@mpetruc You need to add --hf-config-path google/gemma-3-27b-it to use the original model config, because transformers doesn't support config conversion from GGUF models for gemma3 yet.
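For example, with a locally downloaded GGUF file (the file name below is just a placeholder):

vllm serve ./gemma-3-27b-it-Q6_K.gguf --hf-config-path google/gemma-3-27b-it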

@leslliesayrus

leslliesayrus commented Mar 22, 2025

@mpetruc You need to add --hf-config-path google/gemma-3-27b-it to use the original model config, because transformers doesn't support config conversion from GGUF models for gemma3 yet.

I tried the same thing with the --hf-config-path added, but I still got the same error. Do you know what could be the issue?

Command used:

vllm serve "gemma-3-27b-it-Q4_K_M.gguf" --trust-remote-code --hf-config-path google/gemma-3-27b-it

My environment:
Transformers: 4.50.0
vLLM: 0.8.1 (I got the same error with 0.8.2)

Error:

Traceback (most recent call last):
  File "/usr/local/bin/vllm", line 8, in <module>
    sys.exit(main())

ValueError: GGUF model with architecture gemma3 is not supported yet.

@Isotr0py
Member Author

Isotr0py commented Mar 22, 2025

@leslliesayrus Oh, you also need to add --tokenizer google/gemma-3-27b-it, otherwise it will try to convert the tokenizer from the GGUF model through transformers as well.

Besides this, you need to install gguf from source as well, because its Gemma3 support hasn't been released yet: https://github.com/ggml-org/llama.cpp/tree/master/gguf-py#development.
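A minimal way to do that (assuming pip can reach GitHub directly) is to install the gguf-py package straight from the llama.cpp repository; cloning llama.cpp and running pip install ./gguf-py inside the clone works as well:

pip install "git+https://github.com/ggml-org/llama.cpp.git#subdirectory=gguf-py"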

You can try using this command for serving:

vllm serve "gemma-3-27b-it-Q4_K_M.gguf" --trust-remote-code --hf-config-path unsloth/gemma-3-27b-it -tp 2 --max-model-len 4096 --tokenizer unsloth/gemma-3-27b-it --hf-overrides '{"architectures": ["Gemma3ForCausalLM"]}'

Let me update this PR with more user-friendly error messages...

@Isotr0py Isotr0py marked this pull request as draft March 22, 2025 16:58
@anunknowperson

Hey, can I use gemma3 GGUF with vision? Or is this text-only?

@Isotr0py
Member Author

@anunknowperson This is text-only.

@skytect

skytect commented Mar 24, 2025

I am getting AttributeError: 'Gemma3Config' object has no attribute 'num_hidden_layers'

  File ".../vllm/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 1301, in _get_gguf_weights_map
    num_layers = config.num_hidden_layers
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../vllm/lib/python3.12/site-packages/transformers/configuration_utils.py", line 214, in __getattribute__
    return super().__getattribute__(key)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Gemma3Config' object has no attribute 'num_hidden_layers'

Same command you mentioned:
vllm serve /home/user/models/gemma-3-27b-GGUF/gemma-3-27b-it-Q4_K_M.gguf --trust-remote-code --hf-config-path unsloth/gemma-3-27b-it -tp 2 --max-model-len 4096 --tokenizer unsloth/gemma-3-27b-it --hf-overrides '{"architectures": ["Gemma3ForCausalLM"]}'

transformers: 4.50.0
gguf: built from source https://github.com/ggml-org/llama.cpp/tree/77f9c6bbe55fccd9ea567794024cb80943947901
vllm: built from source https://github.com/Isotr0py/vllm/tree/6f9adf613da104ba70bdae956869171717eda0e9

weights used: https://huggingface.co/unsloth/gemma-3-27b-it-GGUF/blob/main/gemma-3-27b-it-Q4_K_M.gguf
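
In case it helps with debugging, my guess (not verified) is that the multimodal Gemma3Config keeps the text-model fields in a nested text_config, so num_hidden_layers only exists on the sub-config, not on the top-level object. A quick check with transformers 4.50:

from transformers import AutoConfig

# gemma-3-27b-it resolves to the multimodal Gemma3Config
config = AutoConfig.from_pretrained("unsloth/gemma-3-27b-it")
print(hasattr(config, "num_hidden_layers"))    # False, matching the AttributeError above
print(config.text_config.num_hidden_layers)    # the layer count lives on the nested text config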

@mergify

mergify bot commented Apr 1, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Isotr0py.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Apr 1, 2025
@zc142365

zc142365 commented Apr 9, 2025

What version of vllm will this pull be merged in?

@JohnConnor123

What version of vllm will this pull be merged in?

Also interested

@Isotr0py
Member Author

What version of vllm will this pull be merged in?

Hmmm, I would like to wait for a transformers release that includes huggingface/transformers#37424, so that we don't need to pass --hf-config-path, especially since Gemma3 has VLM variants...

Installing transformers from the main branch should also work with this PR.
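
A quick way to check whether your transformers build already includes that change (the file name below is a placeholder for your local GGUF) is to load the config directly from the GGUF file:

from transformers import AutoConfig

# Raises "GGUF model with architecture gemma3 is not supported yet." on releases
# that predate huggingface/transformers#37424; succeeds on a recent enough main branch.
config = AutoConfig.from_pretrained(".", gguf_file="gemma-3-27b-it-Q4_K_M.gguf")
print(type(config).__name__)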

@akash-agr

akash-agr commented Apr 19, 2025

Hi @Isotr0py, I believe the https://github.com/huggingface/transformers/pull/37424 pull request has been merged. When can we expect this PR to be merged? Thanks a lot for your efforts.

@surak

surak commented Apr 22, 2025

LGTM

@jayyang-zigbang

jayyang-zigbang commented Apr 23, 2025

@Isotr0py first of all, thank you very much for your efforts. I finally got Gemma3-4b-it GGUF running through vllm without error, but there is still an issue when I run a gemma3 GGUF model.
I get empty output when using google/gemma-3-4b-it-qat-q4_0-gguf, but google/gemma-3-1b-it-qat-q4_0-gguf is fine.
I know the original gemma-3-4b-it model has a BF16 problem on T4 GPUs that produces empty output, but I thought gemma-3-4b-it-qat-q4_0-gguf fixed the F16 problem, so it should work.
Any idea?

@akash-agr

Hi @DarkLight1337 @Isotr0py, I am still facing an error when running the Gemma 3 GGUF models. Can you please help us, as we intend to use these models in production and can't do it without the parallel-processing support vllm provides? Thank you so much for your efforts.

Here's the code with the error:

from vllm import LLM, SamplingParams
import os

# Set Hugging Face token for authentication
os.environ["HUGGING_FACE_HUB_TOKEN"] = "hf_test"

llm = LLM(model="google/gemma-3-27b-it-qat-q4_0-gguf")  ## "unsloth/gemma-3-27b-it-GGUF"

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

prompts = [
    "Hello, my name is",
    "The president of the United States is"
]

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

INFO 04-29 08:17:24 [__init__.py:239] Automatically detected platform cuda.
Traceback (most recent call last):
File "/root/vllm/vllm/transformers_utils/config.py", line 287, in get_config
raise ValueError(
ValueError: Could not detect config format for no config file found. Ensure your model has either config.json (HF format) or params.json (Mistral format).

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/root/vllm_test/gemma3_tes.py", line 7, in
llm = LLM(model="google/gemma-3-27b-it-qat-q4_0-gguf") ## "unsloth/gemma-3-27b-it-GGUF"
File "/root/vllm/vllm/utils.py", line 1161, in inner
return fn(*args, **kwargs)
File "/root/vllm/vllm/entrypoints/llm.py", line 247, in init
self.llm_engine = LLMEngine.from_engine_args(
File "/root/vllm/vllm/engine/llm_engine.py", line 503, in from_engine_args
vllm_config = engine_args.create_engine_config(usage_context)
File "/root/vllm/vllm/engine/arg_utils.py", line 1091, in create_engine_config
model_config = self.create_model_config()
File "/root/vllm/vllm/engine/arg_utils.py", line 979, in create_model_config
return ModelConfig(
File "/root/vllm/vllm/config.py", line 450, in init
hf_config = get_config(self.hf_config_path or self.model,
File "/root/vllm/vllm/transformers_utils/config.py", line 304, in get_config
raise ValueError(error_message) from e
ValueError: Invalid repository ID or local directory specified: 'google/gemma-3-27b-it-qat-q4_0-gguf'.
Please verify the following requirements:

  1. Provide a valid Hugging Face repository ID.
  2. Specify a local directory that contains a recognized configuration file.
    • For Hugging Face models: ensure the presence of a 'config.json'.
    • For Mistral models: ensure the presence of a 'params.json'.

@Isotr0py
Member Author

@akash-agr You should use the model path of your local gguf checkpoint.
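
Something along these lines should work (the model and tokenizer names below are just an example, and I'm passing hf_config_path through the LLM kwargs, which should be forwarded to the engine args; it's only needed until the transformers release mentioned above):

from huggingface_hub import hf_hub_download
from vllm import LLM

# Download the GGUF file and hand its local path to vLLM
model_path = hf_hub_download("google/gemma-3-4b-it-qat-q4_0-gguf", filename="gemma-3-4b-it-q4_0.gguf")
llm = LLM(
    model=model_path,
    tokenizer="google/gemma-3-4b-it",       # use the original tokenizer, not the GGUF one
    hf_config_path="google/gemma-3-4b-it",  # use the original config until transformers handles gemma3 GGUF
)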

@akash-agr

akash-agr commented Apr 29, 2025

Sorry to bother you @Isotr0py

I have downloaded the model and am using the command below, but I am getting the following error:

vllm serve --model=~/gemma-3-4b-it-q4_0.gguf --gpu_memory_utilization=0.95 --max-model-len 4096

INFO 04-29 17:21:56 [__init__.py:239] Automatically detected platform cuda.
INFO 04-29 17:22:03 [api_server.py:1043] vLLM API server version 0.8.5.dev337+gd3cf61b89
INFO 04-29 17:22:03 [api_server.py:1044] args: Namespace(subparser='serve', model_tag=None, config='', host=None, port=8000, uvicorn_log_level='info', disable_uvicorn_access_log=False, allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, enable_ssl_refresh=False, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/gemma-3-4b-it-q4_0.gguf', task='auto', tokenizer=None, hf_config_path=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, allowed_local_media_path=None, load_format='auto', download_dir=None, model_loader_extra_config={}, use_tqdm_on_load=True, config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', max_model_len=4096, guided_decoding_backend='auto', reasoning_parser=None, logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=1, tensor_parallel_size=1, data_parallel_size=1, enable_expert_parallel=False, max_parallel_loading_workers=None, ray_workers_use_nsight=False, disable_custom_all_reduce=False, block_size=None, gpu_memory_utilization=0.95, swap_space=4, kv_cache_dtype='auto', num_gpu_blocks_override=None, enable_prefix_caching=None, prefix_caching_hash_algo='builtin', cpu_offload_gb=0, calculate_kv_scales=False, disable_sliding_window=False, use_v2_block_manager=True, seed=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_token=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config={}, limit_mm_per_prompt={}, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=None, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=None, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', speculative_config=None, ignore_patterns=[], served_model_name=None, qlora_adapter_name_or_path=None, show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, max_num_batched_tokens=None, max_num_seqs=None, max_num_partial_prefills=1, max_long_partial_prefills=1, long_prefill_token_threshold=0, num_lookahead_slots=0, scheduler_delay_factor=0.0, preemption_mode=None, num_scheduler_steps=1, multi_step_stream_outputs=True, scheduling_policy='fcfs', enable_chunked_prefill=None, disable_chunked_mm_input=False, scheduler_cls='vllm.core.scheduler.Scheduler', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', worker_extension_cls='', generation_config='auto', override_generation_config=None, enable_sleep_mode=False, additional_config=None, enable_reasoning=False, disable_cascade_attn=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False, enable_server_load_tracking=False, dispatch_function=<function ServeSubcommand.cmd 
at 0x7b9f31734310>)
Traceback (most recent call last):
File "/root/vllm/vllm/transformers_utils/config.py", line 279, in get_config
if is_gguf or file_or_path_exists(
File "/root/vllm/vllm/transformers_utils/config.py", line 173, in file_or_path_exists
cached_filepath = try_to_load_from_cache(repo_id=model,
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_validators.py", line 106, in _inner_fn
validate_repo_id(arg_value)
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/validators.py", line 160, in validate_repo_id
raise HFValidationError(
huggingface_hub.errors.HFValidationError: Repo id must use alphanumeric chars or '-', '
', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: '
/gemma-3-4b-it-q4_0.gguf'.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/usr/local/bin/vllm", line 8, in
sys.exit(main())
File "/root/vllm/vllm/entrypoints/cli/main.py", line 53, in main
args.dispatch_function(args)
File "/root/vllm/vllm/entrypoints/cli/serve.py", line 27, in cmd
uvloop.run(run_server(args))
File "/usr/local/lib/python3.10/dist-packages/uvloop/init.py", line 82, in run
return loop.run_until_complete(wrapper())
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.10/dist-packages/uvloop/init.py", line 61, in wrapper
return await main
File "/root/vllm/vllm/entrypoints/openai/api_server.py", line 1078, in run_server
async with build_async_engine_client(args) as engine_client:
File "/usr/lib/python3.10/contextlib.py", line 199, in aenter
return await anext(self.gen)
File "/root/vllm/vllm/entrypoints/openai/api_server.py", line 146, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/usr/lib/python3.10/contextlib.py", line 199, in aenter
return await anext(self.gen)
File "/root/vllm/vllm/entrypoints/openai/api_server.py", line 166, in build_async_engine_client_from_engine_args
vllm_config = engine_args.create_engine_config(usage_context=usage_context)
File "/root/vllm/vllm/engine/arg_utils.py", line 1100, in create_engine_config
model_config = self.create_model_config()
File "/root/vllm/vllm/engine/arg_utils.py", line 988, in create_model_config
return ModelConfig(
File "/root/vllm/vllm/config.py", line 460, in init
hf_config = get_config(self.hf_config_path or self.model,
File "/root/vllm/vllm/transformers_utils/config.py", line 304, in get_config
raise ValueError(error_message) from e
ValueError: Invalid repository ID or local directory specified: '~/gemma-3-4b-it-q4_0.gguf'.
Please verify the following requirements:

  1. Provide a valid Hugging Face repository ID.
  2. Specify a local directory that contains a recognized configuration file.
    • For Hugging Face models: ensure the presence of a 'config.json'.
    • For Mistral models: ensure the presence of a 'params.json'.

I also tried the script below with the tokenizer specified, but I am getting the "KeyError: 'general.name'" error shown underneath.

from huggingface_hub import hf_hub_download
from vllm import LLM, SamplingParams
import os

# Set Hugging Face token for authentication
os.environ["HUGGING_FACE_HUB_TOKEN"] = "hf_test"

def run_gguf_inference(model_path, tokenizer):
    # Sample prompts.
    prompts = [
        "How many helicopters can a human eat in one sitting?",
        "What's the future of AI?",
    ]
    prompts = [[{"role": "user", "content": prompt}] for prompt in prompts]
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0, max_tokens=128)

    # Create an LLM.
    llm = LLM(model=model_path, tokenizer=tokenizer)

    outputs = llm.chat(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    repo_id = "google/gemma-3-4b-it-qat-q4_0-gguf"
    filename = "gemma-3-4b-it-q4_0.gguf"
    tokenizer = "google/gemma-3-4b-it"
    model = hf_hub_download(repo_id, filename=filename)
    print('model', model)
    run_gguf_inference(model, tokenizer)

Error

INFO 04-29 17:17:47 [__init__.py:239] Automatically detected platform cuda.
gemma-3-4b-it-q4_0.gguf: 100%|███████████████████| 3.16G/3.16G [00:28<00:00, 109MB/s]
model /root/.cache/huggingface/hub/models--google--gemma-3-4b-it-qat-q4_0-gguf/snapshots/15f73f5eee9c28f53afefef5723e29680c2fc78a/gemma-3-4b-it-q4_0.gguf
Traceback (most recent call last):
File "/root/vllm_test/gemma3_gguf.py", line 35, in
run_gguf_inference(model, tokenizer)
File "/root/vllm_test/gemma3_gguf.py", line 19, in run_gguf_inference
llm = LLM(model=model_path, tokenizer=tokenizer)
File "/root/vllm/vllm/utils.py", line 1161, in inner
return fn(*args, **kwargs)
File "/root/vllm/vllm/entrypoints/llm.py", line 247, in init
self.llm_engine = LLMEngine.from_engine_args(
File "/root/vllm/vllm/engine/llm_engine.py", line 503, in from_engine_args
vllm_config = engine_args.create_engine_config(usage_context)
File "/root/vllm/vllm/engine/arg_utils.py", line 1100, in create_engine_config
model_config = self.create_model_config()
File "/root/vllm/vllm/engine/arg_utils.py", line 988, in create_model_config
return ModelConfig(
File "/root/vllm/vllm/config.py", line 460, in init
hf_config = get_config(self.hf_config_path or self.model,
File "/root/vllm/vllm/transformers_utils/config.py", line 307, in get_config
config_dict, _ = PretrainedConfig.get_config_dict(
File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 590, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py", line 681, in _get_config_dict
config_dict = load_gguf_checkpoint(resolved_config_file, return_tensors=False)["config"]
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_gguf_pytorch_utils.py", line 369, in load_gguf_checkpoint
model_name = read_field(reader, "general.name")
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_gguf_pytorch_utils.py", line 260, in read_field
value = reader.fields[field]
KeyError: 'general.name'

I have cloned the latest code from the main branch of vllm. Please let me know if I need to update the transformers version somehow. Since this PR is merged, I am not sure if I am using a compatible version. Kindly let me know if I am making some other mistake altogether.

Thanks a lot for your efforts. It will help me in taking my app to production 🙏

@FremyCompany

FYI, with recent triton builds one needs to modify vllm-gemma3/vllm/triton_utils/custom_cache_manager.py, as the default folders are no longer stored in torch.runtime.cache. For now, I modified the triton cache.py file to get this to run. See this commit for details of the breaking change in triton: triton-lang/triton@8505252#diff-1e6a363bbde516739874d46d8ee06c60a7a76f194275b0f4d46638a0b06af8acL6-R9.

FWIW, even after fixing this I didn't manage to get things to work, and I ran out of time, so I will put this on the back burner for now. Hopefully this PR gets merged into the main branch at some point :)

@acsezen

acsezen commented Sep 7, 2025

@Isotr0py first of all, thank you very much for your efforts. I finally got Gemma3-4b-it GGUF running through vllm without error, but there is still an issue when I run a gemma3 GGUF model. I get empty output when using google/gemma-3-4b-it-qat-q4_0-gguf, but google/gemma-3-1b-it-qat-q4_0-gguf is fine. I know the original gemma-3-4b-it model has a BF16 problem on T4 GPUs that produces empty output, but I thought gemma-3-4b-it-qat-q4_0-gguf fixed the F16 problem, so it should work. Any idea?

Hello @jayyang-zigbang,

How did you manage to run it? What parameters did you use?

Thanks!

@jayyang-zigbang

@Isotr0py first of all, thank you very much for your efforts. I finally got Gemma3-4b-it GGUF running through vllm without error, but there is still an issue when I run a gemma3 GGUF model. I get empty output when using google/gemma-3-4b-it-qat-q4_0-gguf, but google/gemma-3-1b-it-qat-q4_0-gguf is fine. I know the original gemma-3-4b-it model has a BF16 problem on T4 GPUs that produces empty output, but I thought gemma-3-4b-it-qat-q4_0-gguf fixed the F16 problem, so it should work. Any idea?

Hello @jayyang-zigbang,

How did you manage to run it? What parameters did you use?

Thanks!

@acsezen Unfortunately, I tried it a long time ago and don't remember what parameters I used, but I think going from BF16 to F16 is not easy. Any suggestions?

@qazi0

qazi0 commented Oct 8, 2025

@Isotr0py are you still working on this?

@Isotr0py
Member Author

Isotr0py commented Oct 9, 2025

Superseded by #26189

@Isotr0py Isotr0py closed this Oct 9, 2025
@Isotr0py Isotr0py deleted the gemma3-gguf branch October 9, 2025 08:15
Development

Successfully merging this pull request may close these issues.

[Feature]: Support Gemma3 GGUF