Your current environment
The output of python collect_env.py
I don't currently have access to the machine, but this is reproducible across all vLLM versions from 0.7 to the current 0.11. The most recent test was on 0.11 on an 8xH100 node.
🐛 Describe the bug
While serving microsoft/Phi-3.5-mini-instruct (or any other model using LongRoPE), concurrent requests produce random token outputs once generation passes the LongRoPE transition point from the short factor to the long factor, in this case 4K tokens. When a prompt starts below the ~4K boundary and generation continues past it, the model switches from the short to the long scaling factors mid-sequence. The existing KV cache was encoded with the short-factor RoPE parameters, so it becomes invalid under the long-factor scales, and the output degenerates into random tokens beyond that point. Synchronous requests aren't affected because they use consistent scaling without shared cache state.
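To make the failure mode concrete, here is a minimal, self-contained sketch of LongRoPE-style factor selection (illustrative only, not vLLM's code; the head dim, base, and factor values are made up). It shows that rotary angles for the same positions differ once the factor set flips, so keys cached under the short factors no longer line up with queries computed under the long factors:

```python
# Minimal sketch of LongRoPE-style factor selection; names, dimensions, and
# factor values are illustrative, not vLLM's actual implementation.
import numpy as np

ORIGINAL_MAX_POS = 4096   # short/long transition point for Phi-3.5-mini
HEAD_DIM = 96
BASE = 10000.0

def inv_freq(rescale_factors: np.ndarray) -> np.ndarray:
    # Per-dimension inverse frequencies, rescaled by the LongRoPE factors.
    exponents = np.arange(0, HEAD_DIM, 2) / HEAD_DIM
    return 1.0 / (rescale_factors * BASE ** exponents)

def rotary_angles(positions: np.ndarray, short_factor, long_factor) -> np.ndarray:
    # The factor set is chosen from the *current* sequence length, so a request
    # that starts below ORIGINAL_MAX_POS and generates past it flips from
    # short_factor to long_factor mid-sequence.
    seq_len = positions.max() + 1
    factors = np.asarray(short_factor if seq_len <= ORIGINAL_MAX_POS else long_factor)
    return np.outer(positions, inv_freq(factors))

# Keys cached while seq_len <= 4096 were rotated with short-factor angles.
short_angles = rotary_angles(np.arange(4000), short_factor=[1.0] * 48, long_factor=[2.0] * 48)
# Once the sequence crosses 4096, new queries use long-factor angles, so the
# angles for the *same* cached positions no longer match.
long_angles = rotary_angles(np.arange(4200), short_factor=[1.0] * 48, long_factor=[2.0] * 48)
print(np.allclose(short_angles, long_angles[:4000]))  # False: cached keys are stale
```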
The only current workaround is to modify the model’s config.json to either set the long factor scales equal to the short factor or set the short factor equal to the long factor. This removes the changing scales at the transition point, ensuring consistency throughout the generation.
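For reference, the fields involved can be checked directly from the downloaded config; a small sketch assuming a Phi-3.5-style config.json under ./phi3.5/ (the path used in the workaround commands below):

```python
# Quick check of the LongRoPE-related fields; field names follow Phi-3.5-style
# configs and the ./phi3.5/ path matches the workaround commands below.
import json

with open("./phi3.5/config.json") as f:
    cfg = json.load(f)

rope = cfg["rope_scaling"]
print("rope scaling type:   ", rope.get("type", rope.get("rope_type")))
print("transition point:    ", cfg.get("original_max_position_embeddings"))
print("max positions:       ", cfg.get("max_position_embeddings"))
# After applying the workaround this should print True.
print("short == long factor:", rope["short_factor"] == rope["long_factor"])
```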
Related issues:
Reproduction:
Reproducible with any Phi-3.5 variant using the following workload designed to cross the 4K boundary:
```bash
vllm serve microsoft/Phi-3.5-mini-instruct --port 8080
```

```bash
guidellm benchmark \
--target "http://localhost:8080" \
--rate-type concurrent \
--rate 16 \
--data "prompt_tokens=3900,output_tokens=500" \
--max-requests 100
```

In the request outputs stored in benchmark.json, every request starts producing random tokens after roughly the 4000th token.
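If guidellm isn't available, a rough equivalent can be scripted against the OpenAI-compatible endpoint. The sketch below assumes `pip install openai`, a server on port 8080, and an approximate ~3900-token filler prompt; none of these specifics come from the original report:

```python
# Rough reproduction sketch: fire concurrent completions whose prompt sits just
# below the 4K transition and whose generation crosses it, then inspect the tails.
# Assumes `pip install openai` and a vLLM server on localhost:8080.
from concurrent.futures import ThreadPoolExecutor

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
MODEL = "microsoft/Phi-3.5-mini-instruct"
# Roughly 3900 tokens of filler so generation crosses the 4096 boundary.
PROMPT = "The quick brown fox jumps over the lazy dog. " * 390

def one_request(i: int) -> str:
    resp = client.completions.create(
        model=MODEL,
        prompt=PROMPT,
        max_tokens=500,
        temperature=0.0,
    )
    return resp.choices[0].text

with ThreadPoolExecutor(max_workers=16) as pool:
    outputs = list(pool.map(one_request, range(16)))

# With the bug present, the later portion of each output degenerates into
# unrelated tokens once the sequence crosses the short/long factor boundary.
for i, text in enumerate(outputs):
    print(f"--- request {i}: last 200 chars ---")
    print(text[-200:])
```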
Current Solution/Workaround:
Download the model and set the long_factor and short_factor scales equal to each other, then rerun:
```bash
huggingface-cli download microsoft/Phi-3.5-mini-instruct --local-dir ./phi3.5/
jq '.rope_scaling.long_factor = .rope_scaling.short_factor' ./phi3.5/config.json > config.tmp && mv config.tmp ./phi3.5/config.json
vllm serve ./phi3.5 --port 8080
```
```bash
guidellm benchmark \
--target "http://localhost:8080" \
--rate-type concurrent \
--rate 16 \
--data "prompt_tokens=3900,output_tokens=500" \
--max-requests 100
```

The request outputs will now be generated without randomness after the 4K token threshold.
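For environments without jq, the same config patch can be applied with a short Python sketch (assumes the ./phi3.5/ download path from above):

```python
# Python equivalent of the jq one-liner above: copy short_factor over
# long_factor so the RoPE scales no longer change at the 4K transition.
import json
from pathlib import Path

config_path = Path("./phi3.5/config.json")
cfg = json.loads(config_path.read_text())
cfg["rope_scaling"]["long_factor"] = cfg["rope_scaling"]["short_factor"]
config_path.write_text(json.dumps(cfg, indent=2))
print("long_factor now equals short_factor:",
      cfg["rope_scaling"]["long_factor"] == cfg["rope_scaling"]["short_factor"])
```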