
Conversation

@qandrew (Contributor) commented Sep 2, 2025

Purpose

While working with vLLM's speculative decoding code, we noticed that throughput is a more useful metric to read than the total number of tokens, so I worked with @Jialin to devise better metrics.

Test Plan

Send a curl request to a vLLM server running speculative decoding.

server

vllm serve facebook/opt-125m \
  --swap-space 16 \
  --disable-log-requests \
  --host :: \
  --dtype float16 \
  --speculative_config \
    "{\"method\":\"ngram\",\"num_speculative_tokens\":5,\"prompt_lookup_min\":5,\"prompt_lookup_max\":10}" \
  2>&1 | tee /data/users/$USER/logs/vllm_serving.$(date +%Y%m%d_%H%M%S).log

Test Result

metrics

[axia@devvm30969.cln0 ~/uv_env/gpt_oss_edit/bin]$ curl http://localhost:8000/v1/completions   -H "Content-Type: application/json"   -d '{
    "model": "facebook/opt-125m",
    "prompt": "Write a short story about plants.",
    "max_tokens": 1000,
    "temperature": 0.7
  }'


(APIServer pid=2236552) INFO 09-09 11:51:35 [metrics.py:96] SpecDecoding metrics: Draft acceptance rate: 88.8%, Mean acceptance length: 5.44, Accepted throughput: 4.37 tokens/s, Drafted throughput: 4.92 tokens/s, Accepted: 262 tokens, Drafted: 295 tokens, Per-position acceptance rate: 0.983, 0.949, 0.915, 0.881, 0.712
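
As a quick sanity check on these numbers (the roughly 60 s interval below is inferred from the counters, not stated in the log):

accepted, drafted = 262, 295
print(accepted / drafted)  # 0.888... -> the reported 88.8% draft acceptance rate
print(accepted / 4.37)     # ~60 s: elapsed time implied by 4.37 accepted tokens/s
print(drafted / 4.92)      # ~60 s again, so the two throughput figures are consistent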

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as a before/after comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@github-actions bot commented Sep 2, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@qandrew qandrew marked this pull request as draft September 2, 2025 22:32
@gemini-code-assist bot left a comment

Code Review

This pull request updates the speculative decoding metrics to include throughput for drafted and accepted tokens. The changes correctly use time.monotonic() to measure the elapsed time for accurate throughput calculation. The implementation is sound, adding last_log_time to SpecDecodingLogging, calculating throughput in the log method, and updating the log message accordingly. The code handles potential division-by-zero errors. Overall, the changes are a good addition for better performance monitoring, and I have no high or critical severity feedback.
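
For readers following along, here is a minimal sketch of the change as the review describes it; SpecDecodingLogging, last_log_time, time.monotonic(), and the log method are named above, while the counter names and the print-style output are assumptions, not vLLM's actual code:

import time

class SpecDecodingLogging:
    """Sketch only; the real class lives in vLLM's metrics code."""

    def __init__(self):
        self.last_log_time = time.monotonic()  # monotonic clock is immune to wall-clock jumps
        self.num_accepted_tokens = 0           # assumed counter name
        self.num_drafted_tokens = 0            # assumed counter name

    def log(self):
        now = time.monotonic()
        elapsed = now - self.last_log_time
        # Guard the division so back-to-back log() calls cannot divide by zero.
        accepted_tps = self.num_accepted_tokens / elapsed if elapsed > 0 else 0.0
        drafted_tps = self.num_drafted_tokens / elapsed if elapsed > 0 else 0.0
        print(f"Accepted throughput: {accepted_tps:.2f} tokens/s, "
              f"Drafted throughput: {drafted_tps:.2f} tokens/s")
        self.last_log_time = now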

@qandrew qandrew force-pushed the andrew/jialin-spec-logging branch from 75cef51 to 710a8fa Compare September 2, 2025 23:20
@Jialin (Collaborator) left a comment

Looks good to me. Thanks for porting the change to OSS. Please add some screenshots to the test plan.

Signed-off-by: Andrew Xia <axia@meta.com>
@qandrew qandrew force-pushed the andrew/jialin-spec-logging branch from 710a8fa to e2fbcce Compare September 5, 2025 18:26
@qandrew qandrew marked this pull request as ready for review September 5, 2025 18:29
@Jialin (Collaborator) commented Sep 5, 2025

CC @yeqcharlotte @houseroad

@qandrew qandrew requested a review from benchislett as a code owner September 9, 2025 18:53
Signed-off-by: Andrew Xia <axia@meta.com>
@qandrew qandrew requested a review from benchislett September 9, 2025 22:30
@luccafong (Collaborator) left a comment

lgtm!

@benchislett (Collaborator) left a comment

LGTM

@benchislett benchislett enabled auto-merge (squash) September 10, 2025 19:11
@github-actions github-actions bot added the ready label Sep 10, 2025
@benchislett benchislett merged commit 79ac59f into vllm-project:main Sep 11, 2025
38 checks passed
skyloevil pushed a commit to skyloevil/vllm that referenced this pull request Sep 13, 2025
dsxsteven pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Sep 15, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
…ughput (vllm-project#24127)

Signed-off-by: Andrew Xia <axia@meta.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
…ughput (vllm-project#24127)

Signed-off-by: Andrew Xia <axia@meta.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Labels

ready, speculative-decoding, v1
