[Bug]: Gemma-3-12B-it model getting stuck in repetitive output loops #15752

@nbarr07

Description


The output of `python collect_env.py`
Collecting environment information...
PyTorch version: 2.7.0a0+git295f2ed
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.3.42133-1b9c17779

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 18.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-6.3.1 24491 1e0fda770a2079fbd71e4b70974d74f62fd3af10)
CMake version: version 3.31.6
Libc version: glibc-2.35

Python version: 3.12.9 (main, Feb  5 2025, 08:49:00) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-125-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI250X/MI250 (gfx90a:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.3.42133
MIOpen runtime version: 3.3.0
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        48 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               128
On-line CPU(s) list:                  0-127
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 7713 64-Core Processor
CPU family:                           25
Model:                                1
Thread(s) per core:                   1
Core(s) per socket:                   64
Socket(s):                            2
Stepping:                             1
Frequency boost:                      enabled
CPU max MHz:                          3720.7029
CPU min MHz:                          1500.0000
BogoMIPS:                             3992.58
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
Virtualization:                       AMD-V
L1d cache:                            4 MiB (128 instances)
L1i cache:                            4 MiB (128 instances)
L2 cache:                             64 MiB (128 instances)
L3 cache:                             512 MiB (16 instances)
NUMA node(s):                         8
NUMA node0 CPU(s):                    0-15
NUMA node1 CPU(s):                    16-31
NUMA node2 CPU(s):                    32-47
NUMA node3 CPU(s):                    48-63
NUMA node4 CPU(s):                    64-79
NUMA node5 CPU(s):                    80-95
NUMA node6 CPU(s):                    96-111
NUMA node7 CPU(s):                    112-127
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==26.3.0
[pip3] torch==2.7.0a0+git295f2ed
[pip3] torchvision==0.21.0+7af6987
[pip3] transformers==4.51.0.dev0
[pip3] triton==3.2.0+gite5be006a
[conda] Could not collect
ROCM Version: 6.3.42133-1b9c17779
Neuron SDK Version: N/A
vLLM Version: 0.8.2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
       GPU0         GPU1         
GPU0   0            40           
GPU1   40           0            

================================= Hops between two GPUs ==================================
       GPU0         GPU1         
GPU0   0            2            
GPU1   2            0            

=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         
GPU0   0            PCIE         
GPU1   PCIE         0            

======================================= Numa Nodes =======================================
GPU[0]          : (Topology) Numa Node: 6
GPU[0]          : (Topology) Numa Affinity: 6
GPU[1]          : (Topology) Numa Node: 6
GPU[1]          : (Topology) Numa Affinity: 6
================================== End of ROCm SMI Log ===================================

LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:
PYTORCH_ROCM_ARCH=gfx90a;gfx942;gfx1100;gfx1101;gfx1200;gfx1201
NCCL_LAUNCH_MODE=PARALLEL
OMP_NUM_THREADS=16
NCCL_MIN_NCHANNELS=112
NCCL_CUMEM_ENABLE=0
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

Description
When running Gemma 3 12B-it through vLLM, the model occasionally gets stuck in infinite repetitive loops during text generation.

Example prompt that failed:
I need help writing a description on a change control ticket. the title is "upgrade the teleport server" and have put "We need to upgrade the teleport server to meet security standards" in the description but need to be at least 3 lines

Example output pattern:

  • What level of detail is required?
  • Teleport your organization
    • Teleport version of your team' your Teleport installation setup with details regarding your configuration, for assistance on more
      to address the response a
      I can' description for request, please of assistance with that and
      description of service.
      description, can to to the change request information on assistance with information request, and description of to and get-request to to.
      -description of service andgetrequest andis_description of. request of a todescription andrequest. request. description of torequest, of description, andrequest.
      I information ofis-request
      description, andrequest. of information, andrequest,with- the service information and the and-service
      request, and, the description of, and request of, ofinstallation,and request, of installation andinstallation, installation assistance request, installation and installation of, installation and installation,installation of description of description,request, description ofinstallation,installation installation, of installation,request,request, installationimplementationrequestinstallation request, ofinstallation requeststinstallation, ofservice,installationrequest
      II' request of, of,installation,installation
      installation,requestinstallation,installation and, installation,installation,installation ofthat, request

Environment

  • vLLM version: rocm/vllm-dev:nightly_main_20250326 (tried multiple builds)
  • Model: google/gemma-3-12b-it
  • GPU: AMD MI250 (ROCm) and MI300
  • UI: Open WebUI and a custom-made chat tool
  • Deployment: cloud

Configuration

Example configuration used (one of several tried):

{
  "temperature": 1.0,
  "top_p": 0.95,
  "top_k": 124,
  "max_tokens": 4000,
  "frequency_penalty": 0.5
}
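For reference, the same settings expressed as a request body for vLLM's OpenAI-compatible `/v1/chat/completions` endpoint (a sketch only; the prompt text here is illustrative, and `top_k` is a vLLM extension to the OpenAI schema):

```python
import json

# Build the chat-completions payload matching the sampling settings above.
# The prompt is illustrative, not the exact one from the report.
payload = {
    "model": "google/gemma-3-12b-it",
    "messages": [
        {"role": "user",
         "content": "Help me write a change control ticket description."},
    ],
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 124,            # vLLM-specific extension to the OpenAI schema
    "max_tokens": 4000,
    "frequency_penalty": 0.5,
}

# Serialized body as it would be POSTed to the server
body = json.dumps(payload)
```
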

Steps to Reproduce

  1. Set up a vLLM server with the Gemma-3-12B-it model
  2. Send chat completion requests
  3. After some interactions, the model may enter a repetitive loop
  4. The loop continues until hitting max_tokens or manual interruption

Additional Notes

  • The issue seems more prevalent with longer conversations or complex queries.
  • Once a loop starts, the model seems unable to break out of it and may continue repeating on the next turn.
  • The issue did not appear with Gemma-3-1b-it or Gemma-3-4b-it.
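As a rough way to detect when a generation has entered such a loop, a simple n-gram frequency check works (a sketch only, not part of vLLM; the 0.3 threshold is an arbitrary assumption):

```python
from collections import Counter

def looks_repetitive(text: str, n: int = 4, threshold: float = 0.3) -> bool:
    """Flag text where a single word n-gram accounts for more than
    `threshold` of all n-grams -- a crude signature of a generation loop."""
    words = text.split()
    if len(words) < n * 2:
        return False
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    top_count = Counter(ngrams).most_common(1)[0][1]
    return top_count / len(ngrams) > threshold

print(looks_repetitive("installation, request, " * 50))  # → True
print(looks_repetitive(
    "We need to upgrade the teleport server to meet security standards."))  # → False
```

A check like this could run on streamed output and abort the request early instead of waiting for max_tokens.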

Questions

  1. Is this a known issue with this model?
  2. Are there recommended adjustments that might help prevent these loops?

Let me know if you need any additional information or specific examples to help investigate this issue.

Update: Tested on both an H200 and an MI300; the same behaviour occurred on the MI300, but there was no issue on the H200.

Metadata

Labels: bug (Something isn't working), stale (Over 90 days of inactivity)