
Conversation

@zhuohan123 (Member) commented Feb 28, 2023

TODOs:

  • Parallel embedding and softmax.
  • Merge with the main branch.
  • Modify README.
  • Remove unused code.
  • Fix the bug where the model weights are downloaded twice.
  • Test with larger models.

In another PR:

  • Merge the Q, K, and V projections into one (see the sketch below).

@zhuohan123 changed the title from "[WIP] Support tensor parallel" to "Support tensor parallel" on Mar 9, 2023
@zhuohan123 requested a review from @WoosukKwon on March 19, 2023
@WoosukKwon (Collaborator) left a comment

Fantastic! Left minor comments.

BTW, the sampling results were different when using TP:

  • Current master (python server.py --model facebook/opt-13b)
# GPU blocks: 1826, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would look at it."
Seq 5: 'UC Berkeley is about to get some more tree-hugging support from the University of Washington'
Seq 6: "UC Berkeley is the university of utah\nNot even close\nYeah I'd say it's"
Seq 7: 'The future of cloud computing is React\n\n6 Avril, 2016 | By Maxime Boklan\n\n'
  • 4-way TP (python server.py --model facebook/opt-13b --tensor-parallel-size 4)
# GPU blocks: 4970, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would've been too much"
Seq 5: 'UC Berkeley is about to get some more visiting team fans in the seats, as the school�'
Seq 6: 'UC Berkeley is the university of weed.\n*school of vape\nNot everyone who vapes'
Seq 7: "The future of cloud computing is blazing bright\nIf there's a consensus in the tech world today, it's"
  • 8-way TP (python server.py --model facebook/opt-13b --tensor-parallel-size 8)
# GPU blocks: 5464, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would put a limit."
Seq 5: 'UC Berkeley is about to get some more visiting team fans in the stands, as the school is'
Seq 6: 'UC Berkeley is the university of weed.\n*school of anarchy\nAll respect to the academics'
Seq 7: "The future of cloud computing is blazing bright\nIf there's a consensus in the tech world today, it's"

@zhuohan123 (Member, Author) commented

@WoosukKwon Thanks again for the review! All comments are resolved. Regarding the different sampling results: I think it is very hard to get identical sampling results across different tensor-parallel configurations. Adding more GPUs changes how the model is partitioned and how each GPU executes, which perturbs the numerics and, in turn, the random sampling process here and there. I don't think it is feasible, or necessary, to keep the sampling results identical. (A small illustration of why the numerics drift is sketched below.)

@WoosukKwon (Collaborator) left a comment

Thanks a lot @zhuohan123 for your huge effort! This is fantastic!

yuz207 referenced this pull request in IluvatarLabs/vllm Sep 27, 2025
Bug #1 (CRITICAL): Add missing begin() and stage() methods to KVWriteRouter
- Flash attention backend calls router.begin() and router.stage()
- KVWriteRouter only had write() and commit() methods
- Added begin() to store slot_mapping and initialize shadow buffer
- Added stage() to extract per-timestep slot and stage KV pairs
- Without these, no tokens were being staged → 0% acceptance rate

Bug #2 (MODERATE): Fix bonus token counting in accepted_lens
- valid_sampled_token_ids includes [accepted_draft_tokens..., bonus_token]
- Previous: len([bonus]) = 1, incorrectly counted as 1 accepted draft token
- Fixed: Use max(0, len(seq) - 1) to exclude bonus token from count
- Now correctly reports 0 accepted when only bonus token is present

Files modified:
- vllm/v1/kv_cache/write_router.py: Added begin() and stage() methods
- vllm/v1/worker/gpu_model_runner.py: Fixed accepted_lens calculation
yuz207 referenced this pull request in IluvatarLabs/vllm Sep 27, 2025
Bug #1: EAGLE tree proposal returned zeros for draft_logprobs
- Root cause: When using topk for tree branching, code set draft_logp_list=None,
  then created zeros tensor as fallback (lines 850-851)
- Fix: Compute actual log-probs from logits using log_softmax + gather
- Applied at 2 locations: root level (lines 698-704) and tree levels (lines 839-846)

Bug #2: Added diagnostic logging in rejection sampler
- Log draft_p (nonzero) min/med/max to detect zeros
- Log p_target min/med/max to detect degenerate softmax
- Helps identify if target logits are masked/filtered before sampling

Expected results after fix:
- draft_logp: -3.2/-1.6/-0.0 (real log-probs, all ≤ 0) instead of 0/0/0
- p_target: 1e-6/1e-3/0.7 (realistic distribution) instead of 1/1/1
- Acceptance rate: 30-70% instead of 0%

Files changed:
- vllm/v1/spec_decode/eagle.py: Fix draft_logp computation
- vllm/v1/sample/rejection_sampler.py: Add sanity logging
yuz207 referenced this pull request in IluvatarLabs/vllm Sep 30, 2025
Bug #4 fix: Change the nucleus top_p fallback from 1.0 to 0.95 and add
[NUCLEUS_DEBUG] diagnostic logging. This ensures nucleus filtering runs even
if the config attribute is missing, preventing 32000 survivors (the full vocabulary).

Bug #5 fix: Add [SMOOTH_DEBUG] diagnostic logging for smoothing lambda.

These fixes were accidentally removed during the bug #2 draft-anchored
rewrite (commit 595a371). Restoring them does not affect bug #2's
core algorithm - they only improve fallback behavior and diagnostics.
yuz207 referenced this pull request in IluvatarLabs/vllm Sep 30, 2025
ROOT CAUSE: draft_q_soft_temp=0.50 was SHARPENING the distribution
instead of softening it (dividing logits by tau=0.5 doubles their magnitude).
This caused nucleus to collapse to 1-2 survivors → q≈1.0 → acceptance
stuck at ~0.7038 (the average p_target).

FIXES:

1. Config defaults (config.py, arg_utils.py):
   - draft_q_temp_offset: 0.15 → 0.25 (better dynamic range)
   - draft_q_soft_temp: 0.50 → 2.0 (SOFTENS instead of sharpens)

   At draft_temp=0.05:
   - Before: tau_q = max(0.05+0.15, 0.50) = 0.50 (2x sharper!)
   - After:  tau_q = max(0.05+0.25, 2.0)  = 2.0  (2x softer)

2. Force min_keep=2 in nucleus (eagle.py line 271):
   - Added keep_sorted[..., :2] = True
   - Prevents survivors=1 by construction (defensive programming)

3. Fix smoothing to uniform over kept set (eagle.py lines 275-287):
   - Before: Mixed with untempered baseline (wrong approach)
   - After:  Uniform distribution over survivors only (correct)
   - Prevents q from reaching exactly 1.0 in corner cases

4. Remove dead code (eagle.py line 322):
   - Deleted unused self._current_sampling_metadata assignment
   - No longer needed with draft-anchored approach (bug #2 fix)

Expected results:
- tau_q ≥ 2.0 at ultracold temps → softer distribution
- NUC_DEBUG: survivors = hundreds/thousands (not 1-2)
- Q_DEBUG: q ∈ [0.5, 0.8] (not 0.98-1.0)
- Accept rate: dynamic range restored across temp sweep
dcmaddix referenced this pull request in dcmaddix/vllm Oct 5, 2025
vllm-bot pushed a commit that referenced this pull request Oct 9, 2025
Signed-off-by: Nick Hill <nhill@redhat.com>
zhangsicheng5 pushed a commit to zhangsicheng5/vllm that referenced this pull request Oct 9, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
vllm-project#26445)

Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
vllm-project#26445)

Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
IwakuraRein pushed a commit to IwakuraRein/vllm that referenced this pull request Oct 21, 2025
Fixes for support_materials/2-tilelang/
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
vllm-project#26445)

Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
vllm-project#26445)

Signed-off-by: Nick Hill <nhill@redhat.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
wangln19 pushed a commit to wangln19/vllm that referenced this pull request Oct 27, 2025
Bounty-hunter pushed a commit to Bounty-hunter/vllm that referenced this pull request Nov 4, 2025
* # This is a combination of 6 commits, all with the same message:

mooncake store connector

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

* mooncake store connector (repeated across several follow-up commits)

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>

fix comments

* Update vllm/distributed/ec_transfer/utils/tensor_memory_pool.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update vllm/distributed/ec_transfer/ec_lookup_buffer/mooncake_store.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update vllm/distributed/ec_transfer/ec_connector/mooncake_storage_connector.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Apply suggestion from @wuhang2014

line length format

* Apply suggestion from @wuhang2014

remove extra empty line

---------

Signed-off-by: CHEN <116010019@link.cuhk.edu.cn>
Co-authored-by: wuhang <whlbx@hotmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
access2rohit pushed a commit to access2rohit/vllm that referenced this pull request Nov 11, 2025