
Conversation

@rahul-tuli (Contributor) commented on Sep 16, 2025

This PR enables users to combine engine-level arguments (like --tensor-parallel-size, --seed, --max-model-len) with speculators models using simplified command syntax.

Problem

Previously, users had to use verbose commands with explicit speculative configuration:

  VLLM_USE_V1=1 vllm serve "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8-dynamic" \
      --seed 42 \
      --tensor-parallel-size 4 \
      --speculative-config '{"model": "nm-testing/SpeculatorLlama3-1-8B-Eagle3-converted-0717-quantized", "num_speculative_tokens": 3, "method": "eagle3"}'

Solution

Now users can use the simplified syntax:

  vllm serve --seed 42 --tensor-parallel-size 4 "nm-testing/SpeculatorLlama3-1-8B-Eagle3-converted-0717-quantized"

- Detects speculators models by checking for an embedded `speculators_config`
- Extracts the embedded speculative configuration and converts it to vLLM format (see the sketch below)
- CLI precedence: engine-level CLI arguments take precedence over embedded settings
- Maintains compatibility with regular models and existing workflows
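
A minimal sketch of what this detection-and-conversion step could look like, assuming the checkpoint's `config.json` embeds a `speculators_config` section. The inner key names (`verifier`, `speculators_model_type`, and so on) are assumptions about the speculators format, not confirmed vLLM internals:

```python
# Illustrative sketch only, not the actual vLLM implementation.
# Assumes config.json carries a "speculators_config" section; the
# inner field names are guesses at the speculators format.
import json
from pathlib import Path
from typing import Optional


def extract_speculative_config(model_path: str) -> Optional[dict]:
    """Return a vLLM-style speculative config dict, or None for a regular model."""
    config_file = Path(model_path) / "config.json"
    if not config_file.is_file():
        return None

    config = json.loads(config_file.read_text())
    spec = config.get("speculators_config")
    if spec is None:
        return None  # regular model: nothing to convert

    return {
        # The speculators checkpoint itself serves as the draft model.
        "draft_model": model_path,
        # Hypothetical key layout for locating the target (verifier) model.
        "target_model": spec.get("verifier", {}).get("name_or_path"),
        "method": spec.get("speculators_model_type", "eagle3"),
        "num_speculative_tokens": spec.get("num_speculative_tokens", 3),
    }
```

In the real implementation this would presumably route through vLLM's existing config-loading helpers rather than re-reading `config.json` directly, which is also what the review feedback below suggests (reusing something like `get_config`).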

Testing

Serve Command:

  export CUDA_VISIBLE_DEVICES=0,1
  export VLLM_USE_V1=1
  vllm serve \
      --host 127.0.0.1 \
      --port 8000 \
      --tensor-parallel-size 2 \
      --seed 42 \
      --max-model-len 4096 \
      "nm-testing/SpeculatorLlama3-1-8B-Eagle3-converted-0717-quantized"

Test Request:

  curl -s \
      -H "Content-Type: application/json" \
      -d '{
          "model": "nm-testing/SpeculatorLlama3-1-8B-Eagle3-converted-0717-quantized",
          "prompt": "The capital of France is",
          "max_tokens": 10,
          "temperature": 0.7
      }' \
      "http://127.0.0.1:8000/v1/completions"

mergify bot added the frontend label on Sep 16, 2025
rahul-tuli force-pushed the feat/allow-server-args-with-speculators-model branch from 58952c4 to be17980 on September 16, 2025 at 12:20
rahul-tuli marked this pull request as ready for review on September 16, 2025 at 12:21
A contributor commented:

We have several other functions that are very similar to this. Instead of adding this method, I'd look into reusing something like `get_config`, for example.

rahul-tuli force-pushed the feat/allow-server-args-with-speculators-model branch from f55db17 to 89435cc on September 17, 2025 at 13:48
rahul-tuli marked this pull request as draft on September 17, 2025 at 13:49
rahul-tuli force-pushed the feat/allow-server-args-with-speculators-model branch from 89435cc to e1d1ac6 on September 17, 2025 at 13:52
This commit implements enhanced engine layer detection for speculators models,
allowing users to apply engine arguments directly using simplified syntax:

```bash
vllm serve --seed 42 --tensor-parallel-size 4 "speculators-model"
```

Instead of verbose explicit configuration:

```bash
vllm serve --seed 42 --tensor-parallel-size 4 "target-model" \
  --speculative-config '{"model": "speculators-model", "method": "eagle3", ...}'
```

## Key Changes

### Enhanced Engine Layer (`vllm/engine/arg_utils.py`)
- Modified `create_speculative_config()` to return a tuple of `(ModelConfig, SpeculativeConfig)`
- Added automatic speculators model detection at model-creation time
- Implemented proper model resolution: speculators model → target model
- Engine arguments are now correctly applied to the target model instead of the speculators model (see the sketch below)
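
A rough illustration of the tuple-return pattern described above. The classes here are simplified stand-ins, not vLLM's actual `ModelConfig` and `SpeculativeConfig`, and the parameter layout is assumed:

```python
# Simplified stand-ins; the real vLLM classes carry many more fields.
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass
class ModelConfig:
    model: str
    cli_overrides: dict = field(default_factory=dict)  # --seed, --tensor-parallel-size, ...


@dataclass
class SpeculativeConfig:
    model: str  # the draft model
    method: str
    num_speculative_tokens: int


def create_speculative_config(
    model: str,
    cli_overrides: dict,
    spec: Optional[dict],  # output of the detection step sketched earlier
) -> Tuple[ModelConfig, Optional[SpeculativeConfig]]:
    if spec is None:
        # Regular model: serve it directly, no speculative decoding.
        return ModelConfig(model, cli_overrides), None

    # Speculators model: resolve to the target model, so that engine
    # arguments apply to the target rather than the draft checkpoint.
    target = spec["target_model"]
    return ModelConfig(target, cli_overrides), SpeculativeConfig(
        model=model,
        method=spec["method"],
        num_speculative_tokens=spec["num_speculative_tokens"],
    )
```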

### Complete Algorithm Processing (`vllm/transformers_utils/configs/speculators/base.py`)
- Added `get_vllm_config()` method with full algorithm-specific processing (sketched below)
- Includes Eagle3-specific fields such as `draft_vocab_size` and `target_hidden_size`
- Leverages existing validation and transformation infrastructure
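
A sketch of the shape such a method might take. Only the Eagle3 field names (`draft_vocab_size`, `target_hidden_size`) come from this PR; the class layout is an assumed simplification of the speculators base config class:

```python
# Assumed simplification of the speculators base config class; only
# the Eagle3 field names are taken from this PR's description.
class SpeculatorsConfig:
    def __init__(self, raw: dict):
        self.raw = raw
        self.method = raw.get("speculators_model_type", "eagle3")

    def get_vllm_config(self) -> dict:
        """Translate the embedded speculators config into vLLM's format."""
        vllm_config = {
            "method": self.method,
            "num_speculative_tokens": self.raw.get("num_speculative_tokens", 3),
        }
        if self.method == "eagle3":
            # Algorithm-specific fields needed by Eagle3 draft models.
            vllm_config["draft_vocab_size"] = self.raw.get("draft_vocab_size")
            vllm_config["target_hidden_size"] = self.raw.get("target_hidden_size")
        return vllm_config
```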

## Benefits
- ✅ Proper architectural layering (engine layer handles model configuration)
- ✅ Complete algorithm-specific field processing
- ✅ Backward compatibility (existing workflows unchanged)
- ✅ Simplified user experience
- ✅ Single source of truth for speculative-model logic

## Testing
- ✅ Speculators model: Auto-detection and target model resolution
- ✅ Regular model: No regression, normal serving unaffected
- ✅ Engine arguments correctly applied in both cases

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

Signed-off-by: Rahul Tuli <rtuli@redhat.com>
rahul-tuli force-pushed the feat/allow-server-args-with-speculators-model branch from e1d1ac6 to 50d2ca6 on September 17, 2025 at 13:57
@rahul-tuli (Contributor, Author) commented:

Closed in favor of #25250

rahul-tuli closed this on Sep 19, 2025