
Conversation

@rahul-tuli (Contributor) commented on Sep 16, 2025

This PR enables users to combine engine-level arguments (like --tensor-parallel-size, --seed, --max-model-len) with speculators models using simplified command syntax.

Problem

Previously, users had to use verbose commands with explicit speculative configuration:

  VLLM_USE_V1=1 vllm serve "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8-dynamic" \
      --seed 42 \
      --tensor-parallel-size 4 \
      --speculative-config '{"model": "nm-testing/SpeculatorLlama3-1-8B-Eagle3-converted-0717-quantized", "num_speculative_tokens": 3, "method": "eagle3"}'

Solution

Now users can use the simplified syntax:

  vllm serve --seed 42 --tensor-parallel-size 4 "nm-testing/SpeculatorLlama3-1-8B-Eagle3-converted-0717-quantized"

- Detects speculators models by checking for an embedded `speculators_config`
- Extracts the embedded speculative configuration and converts it to vLLM format (see the sketch below)
- CLI precedence: engine-level CLI arguments take precedence over embedded settings
- Maintains compatibility with regular models and existing workflows
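
A minimal sketch of what this detection-and-conversion step could look like, assuming the checkpoint's `config.json` embeds a `speculators_config` section. The inner key names (`verifier`, `speculators_model_type`, and so on) are assumptions about the speculators format, not confirmed vLLM internals:

```python
# Illustrative sketch only, not the actual vLLM implementation.
# Assumes config.json carries a "speculators_config" section; the
# inner field names are guesses at the speculators format.
import json
from pathlib import Path
from typing import Optional


def extract_speculative_config(model_path: str) -> Optional[dict]:
    """Return a vLLM-style speculative config dict, or None for a regular model."""
    config_file = Path(model_path) / "config.json"
    if not config_file.is_file():
        return None

    config = json.loads(config_file.read_text())
    spec = config.get("speculators_config")
    if spec is None:
        return None  # regular model: nothing to convert

    return {
        # The speculators checkpoint itself serves as the draft model.
        "draft_model": model_path,
        # Hypothetical key layout for locating the target (verifier) model.
        "target_model": spec.get("verifier", {}).get("name_or_path"),
        "method": spec.get("speculators_model_type", "eagle3"),
        "num_speculative_tokens": spec.get("num_speculative_tokens", 3),
    }
```

In the real implementation this would presumably route through vLLM's existing config-loading helpers rather than re-reading `config.json` directly, which is also what the review feedback below suggests (reusing something like `get_config`).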

Testing

Serve Command:

  export CUDA_VISIBLE_DEVICES=0,1
  export VLLM_USE_V1=1
  vllm serve \
      --host 127.0.0.1 \
      --port 8000 \
      --tensor-parallel-size 2 \
      --seed 42 \
      --max-model-len 4096 \
      "nm-testing/SpeculatorLlama3-1-8B-Eagle3-converted-0717-quantized"

Test Request:

  curl -s \
      -H "Content-Type: application/json" \
      -d '{
          "model": "nm-testing/SpeculatorLlama3-1-8B-Eagle3-converted-0717-quantized",
          "prompt": "The capital of France is",
          "max_tokens": 10,
          "temperature": 0.7
      }' \
      "http://127.0.0.1:8000/v1/completions"

mergify bot added the frontend label on Sep 16, 2025
rahul-tuli force-pushed the feat/allow-server-args-with-speculators-model branch from 58952c4 to be17980 on September 16, 2025 at 12:20
rahul-tuli marked this pull request as ready for review on September 16, 2025 at 12:21
A contributor commented:

We have several other functions that are very similar to this. Instead of adding this method, I'd look into reusing something like `get_config`, for example.

rahul-tuli force-pushed the feat/allow-server-args-with-speculators-model branch from f55db17 to 89435cc on September 17, 2025 at 13:48
rahul-tuli marked this pull request as draft on September 17, 2025 at 13:49
rahul-tuli force-pushed the feat/allow-server-args-with-speculators-model branch from 89435cc to e1d1ac6 on September 17, 2025 at 13:52
This commit implements enhanced engine layer detection for speculators models,
allowing users to apply engine arguments directly using simplified syntax:

```bash
vllm serve --seed 42 --tensor-parallel-size 4 "speculators-model"
```

Instead of verbose explicit configuration:

```bash
vllm serve --seed 42 --tensor-parallel-size 4 "target-model" \
  --speculative-config '{"model": "speculators-model", "method": "eagle3", ...}'
```

## Key Changes

### Enhanced Engine Layer (`vllm/engine/arg_utils.py`)
- Modified `create_speculative_config()` to return a tuple of `(ModelConfig, SpeculativeConfig)`
- Added automatic speculators model detection at model-creation time
- Implemented proper model resolution: speculators model → target model
- Engine arguments are now correctly applied to the target model instead of the speculators model (see the sketch below)
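
A rough illustration of the tuple-return pattern described above. The classes here are simplified stand-ins, not vLLM's actual `ModelConfig` and `SpeculativeConfig`, and the parameter layout is assumed:

```python
# Simplified stand-ins; the real vLLM classes carry many more fields.
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class ModelConfig:
    model: str
    cli_overrides: dict = field(default_factory=dict)  # --seed, --tensor-parallel-size, ...


@dataclass
class SpeculativeConfig:
    model: str  # the draft model
    method: str
    num_speculative_tokens: int


def create_speculative_config(
    model: str,
    cli_overrides: dict,
    spec: Optional[dict],  # output of the detection step sketched earlier
) -> Tuple[ModelConfig, Optional[SpeculativeConfig]]:
    if spec is None:
        # Regular model: serve it directly, no speculative decoding.
        return ModelConfig(model, cli_overrides), None

    # Speculators model: resolve to the target model, so that engine
    # arguments apply to the target rather than the draft checkpoint.
    target = spec["target_model"]
    return ModelConfig(target, cli_overrides), SpeculativeConfig(
        model=model,
        method=spec["method"],
        num_speculative_tokens=spec["num_speculative_tokens"],
    )
```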

### Complete Algorithm Processing (`vllm/transformers_utils/configs/speculators/base.py`)
- Added `get_vllm_config()` method with full algorithm-specific processing (sketched below)
- Includes Eagle3-specific fields such as `draft_vocab_size` and `target_hidden_size`
- Leverages existing validation and transformation infrastructure
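
A sketch of the shape such a method might take. Only the Eagle3 field names (`draft_vocab_size`, `target_hidden_size`) come from this PR; the class layout is an assumed simplification of the speculators base config class:

```python
# Assumed simplification of the speculators base config class; only
# the Eagle3 field names are taken from this PR's description.
class SpeculatorsConfig:
    def __init__(self, raw: dict):
        self.raw = raw
        self.method = raw.get("speculators_model_type", "eagle3")

    def get_vllm_config(self) -> dict:
        """Translate the embedded speculators config into vLLM's format."""
        vllm_config = {
            "method": self.method,
            "num_speculative_tokens": self.raw.get("num_speculative_tokens", 3),
        }
        if self.method == "eagle3":
            # Algorithm-specific fields needed by Eagle3 draft models.
            vllm_config["draft_vocab_size"] = self.raw.get("draft_vocab_size")
            vllm_config["target_hidden_size"] = self.raw.get("target_hidden_size")
        return vllm_config
```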

## Benefits
- ✅ Proper architectural layering (engine layer handles model configuration)
- ✅ Complete algorithm-specific field processing
- ✅ Backward compatibility (existing workflows unchanged)
- ✅ Simplified user experience
- ✅ Single source of truth for speculative-model logic

## Testing
- ✅ Speculators model: Auto-detection and target model resolution
- ✅ Regular model: No regression, normal serving unaffected
- ✅ Engine arguments correctly applied in both cases

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
rahul-tuli force-pushed the feat/allow-server-args-with-speculators-model branch from e1d1ac6 to 50d2ca6 on September 17, 2025 at 13:57
@rahul-tuli (Contributor, Author) commented:

Closed in favor of #25250

rahul-tuli closed this on Sep 19, 2025