Description
Your current environment
(The output of `python collect_env.py` was not provided.)
🐛 Describe the bug
I'm running the google/gemma-3-27b-it model with vLLM's OpenAI-compatible API server, started as follows:
CUDA_VISIBLE_DEVICES=0 VLLM_USE_V1=1 python /opt/VLLM/vllm/vllm/entrypoints/openai/api_server.py \
--model /opt/MODELS/gemma-3-27b-it/ \
--max-model-len 32000 \
--host 10.12.112.168 \
--port 9005 \
--tensor-parallel-size 1 \
--gpu_memory_utilization 0.9

Then, I send a standard request to the /v1/chat/completions endpoint using Python:
import requests
import json

url = "http://10.12.112.168:9005/v1/chat/completions"
data = {
    "model": "/opt/MODELS/gemma-3-27b-it/",
    "messages": [
        {"role": "user", "content": "hello"}
    ],
    "temperature": 0.1,
    "max_tokens": 500,
    "enable_thinking": False
}
headers = {
    "Content-Type": "application/json"
}

response = requests.post(url, headers=headers, data=json.dumps(data))
result = response.json()
print(result['choices'][0]['message']['content'])

The request is processed, but the model fails to produce a meaningful response. It either:
- outputs nothing,
- or keeps repeating certain tokens or parts of the input (e.g., repeating “selamlar brom”).
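For context, here is a small debugging sketch (illustrative only, not from my original test; it assumes the same endpoint and model path as above) that prints the finish_reason and token usage alongside the content, which helps tell an empty completion apart from one that was truncated at max_tokens:

import requests

url = "http://10.12.112.168:9005/v1/chat/completions"  # same server as above
payload = {
    "model": "/opt/MODELS/gemma-3-27b-it/",
    "messages": [{"role": "user", "content": "hello"}],
    "temperature": 0.1,
    "max_tokens": 500,
}

resp = requests.post(url, json=payload, timeout=60)
resp.raise_for_status()
body = resp.json()

choice = body["choices"][0]
print("finish_reason:", choice.get("finish_reason"))  # "stop" vs. "length"
print("usage:", body.get("usage"))                    # prompt/completion token counts
print("content:", repr(choice["message"]["content"])) # repr() exposes empty/whitespace output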
This issue only happens with Gemma 3 IT models. I tested the exact same code and server setup with:
- Qwen models
- Mistral models

...and they work perfectly: no repetition, and the responses are coherent and aligned with the prompt.
So this looks like a Gemma-specific compatibility issue with /v1/chat/completions, possibly due to missing or misapplied prompt formatting (e.g., an incompatible or missing chat template?).
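For what it's worth, one way to check the chat-template theory locally (a sketch only, assuming the transformers library is installed and reading from the same model directory the server uses) is to render the prompt with the tokenizer's own template and inspect it:

from transformers import AutoTokenizer

# Load the tokenizer from the same local path passed to the server.
tokenizer = AutoTokenizer.from_pretrained("/opt/MODELS/gemma-3-27b-it/")

messages = [{"role": "user", "content": "hello"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,  # append the assistant-turn header
)
print(prompt)
# A Gemma-style template should wrap turns in
# <start_of_turn>user ... <end_of_turn> / <start_of_turn>model markers;
# if no chat template is defined, recent transformers versions raise an error here.

If the rendered prompt looks correct, the problem is presumably elsewhere in the serving path rather than in the template itself.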
Let me know if there’s a known workaround or proper configuration required for Gemma models.
Thanks!
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.