[V1][Metrics] Initial speculative decoding metrics #15151
Conversation
Force-pushed from cd3ecad to 4cac140
            
          
Comment on vllm/v1/core/scheduler.py (Outdated)
I'm a bit uncertain whether to include this in the count ... it matches what I was getting at the rejection sampler level, but ... I guess it depends on what we want the system efficiency metric to mean.
Will dig into it, but thoughts welcome!
See commit 586f1233797b3401ce2d25cb2d9983d2dd9d6943
    [V1][Speculative Decoding] Properly account for ngram drafter
    
    When the ngram drafter proposes zero tokens, we should treat
    this as a proposal of num_spec_tokens which were all rejected.
    This will allow us to compare fairly across ngram vs draft
    models.
    
    Encode a zero-token proposal in ModelRunnerOutput as an
    empty list, whereas None means no speculation was done. This
    also requires the scheduler to interpret an empty
    Request.spec_token_ids list the same way.
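The convention in the commit message above can be sketched as follows. This is illustrative only, not vLLM's actual scheduler code; `count_draft_tokens`, `spec_token_ids`, and `NUM_SPEC_TOKENS` are hypothetical names standing in for the internals it describes.

```python
from typing import Optional

NUM_SPEC_TOKENS = 5  # assumed speculation length (k); configurable in practice

def count_draft_tokens(spec_token_ids: Optional[list]) -> int:
    """How many draft tokens to charge against this step's stats.

    None       -> no speculation was attempted; charge nothing.
    []         -> the drafter proposed zero tokens; treat it as a full
                  proposal of NUM_SPEC_TOKENS tokens, all rejected, so
                  ngram and draft-model acceptance rates stay comparable.
    non-empty  -> charge the actual number of proposed tokens.
    """
    if spec_token_ids is None:
        return 0
    if len(spec_token_ids) == 0:
        return NUM_SPEC_TOKENS
    return len(spec_token_ids)
```

The point of the `[]`-vs-`None` distinction is that a drafter which keeps failing to propose should look worse in the metrics than one that never ran at all.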
Force-pushed from 7f4ecd7 to 4cac140
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 4cac140 to 5bb7953
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 5bb7953 to 478238d
I've rebased and pulled out a bunch of stuff from this PR to try to simplify the discussion:

On the topic of properly tracking

This PR will calculate
I've removed
Note also that, at @WoosukKwon's request, we retain the simplest behavior for ngram prompt lookup - for
For reference, here is a mini design doc we've been using to discuss the above.
Fixes vllm-project#13990, part of vllm-project#10582

Omitting system efficiency for now.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Now just num_accepted_tokens, num_draft_tokens, and acceptance rate.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
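As described in the commit message above, the metric reduces to two counters and a ratio. A minimal sketch, with hypothetical names (the class and method names below are illustrative, not vLLM's actual metrics API):

```python
class SpecDecodingStats:
    """Accumulates speculative decoding counters across engine steps."""

    def __init__(self) -> None:
        self.num_draft_tokens = 0     # tokens proposed by the drafter
        self.num_accepted_tokens = 0  # proposals accepted by the target model

    def observe(self, num_draft: int, num_accepted: int) -> None:
        """Record one scheduling step's draft/accept counts."""
        self.num_draft_tokens += num_draft
        self.num_accepted_tokens += num_accepted

    @property
    def acceptance_rate(self) -> float:
        """Fraction of draft tokens the target model accepted."""
        if self.num_draft_tokens == 0:
            return 0.0
        return self.num_accepted_tokens / self.num_draft_tokens
```

Keeping only the raw counters and deriving the rate at read time is the usual choice for Prometheus-style metrics, since counters aggregate cleanly across workers and time windows.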
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Force-pushed from 19b10b8 to 840f4ce
Rebased to pull in CI fix from #15757
Looks generally good to me. However, as we discussed offline, I think the draft & accepted number of tokens per position (0, 1, ...,
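The per-position idea raised above could be tracked with one pair of counters per draft position. A hedged sketch under assumed semantics (accepted tokens form a prefix of the drafted ones, as in standard rejection sampling; all names are illustrative):

```python
K = 5  # assumed max speculation length

def update_per_position(counts_drafted: list, counts_accepted: list,
                        num_drafted: int, num_accepted: int) -> None:
    """Record one step: num_accepted of num_drafted tokens accepted,
    where acceptance always stops at the first rejection (prefix rule)."""
    for pos in range(num_drafted):
        counts_drafted[pos] += 1
        if pos < num_accepted:
            counts_accepted[pos] += 1

drafted = [0] * K
accepted = [0] * K
update_per_position(drafted, accepted, 5, 3)  # 3 of 5 accepted
update_per_position(drafted, accepted, 4, 1)  # 1 of 4 accepted

# Per-position acceptance probability; later positions are typically lower.
rates = [a / d if d else 0.0 for a, d in zip(accepted, drafted)]
```

A per-position breakdown like this would show where acceptance falls off, which the single aggregate acceptance rate cannot.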
Absolutely. I think there are a lot more ideas to explore. I'd like to take this incrementally, though, and start by adding back the acceptance rate metric that we had in V0. This lays the groundwork for adding future metrics also.
Unless we think the acceptance rate metric (i.e. num_draft and num_accepted) is itself useless and we want to deprecate those too! I don't believe so, though. I think we will continue to want a single metric to compare performance over time and between models.
@markmc Got it. Looks good to me as the first step.
Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: xinyuxiao <xinyuxiao2024@gmail.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
Fixes #13990, part of #10582
cc @WoosukKwon @LiuXiaoxuanPKU @sroy745