[Speculative Decoding] Support draft model on different tensor-parallel size than target model #5414
Can we have a test where we disable some speculation? This will verify that the control-flow logic behaves correctly even when draft TP == 1 or draft TP == 2.
See this test for example:
vllm/tests/spec_decode/e2e/test_multistep_correctness.py
Lines 525 to 537 in 8065a7e
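The referenced test lines are not reproduced in this dump. As a rough sketch only, a test along these lines would combine a smaller draft TP with speculation disabled above a batch-size threshold; the model names and the speculative_* keyword names below are assumptions based on this PR's discussion, not the exact e2e test code (which also runs each configuration in a separate process).

```python
import pytest
from vllm import LLM, SamplingParams

PROMPTS = ["The future of AI is", "San Francisco is a"]


def greedy_outputs(**llm_kwargs):
    # Build an engine with the given kwargs and return greedy completions.
    llm = LLM(**llm_kwargs)
    params = SamplingParams(temperature=0.0, max_tokens=32)
    return [out.outputs[0].text for out in llm.generate(PROMPTS, params)]


@pytest.mark.parametrize("draft_tp", [1, 2])
def test_draft_tp_with_speculation_sometimes_disabled(draft_tp):
    base = dict(
        model="meta-llama/Llama-2-7b-hf",  # placeholder target model
        tensor_parallel_size=2,
    )
    spec = dict(
        base,
        speculative_model="JackFram/llama-68m",  # placeholder draft model
        num_speculative_tokens=5,
        # Keyword names assumed from this PR's discussion; adjust to the
        # merged argument names.
        speculative_tensor_parallel_size=draft_tp,
        # Disable speculation once the running batch exceeds this size, so
        # both the speculating and non-speculating control paths execute.
        speculative_disable_by_batch_size=2,
    )
    # Greedy outputs should be identical with and without speculation.
    assert greedy_outputs(**base) == greedy_outputs(**spec)
```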
nit: if the user provides --speculative-tensor-parallel-size 0, this branch causes unexpected behavior. Can we explicitly guard against this?
Thanks for the catch!
To prevent spec_tp from being set to target_tp when the given spec_tp is 0, I've changed the code as below.
In addition, to prevent the tp value from being 0 in the first place, I think we need a separate PR that adds a check in ParallelConfig._verify_args(), since the same issue exists for --tensor-parallel-size 0. What do you think?
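The snippet referred to above is not reproduced in this thread. Roughly, the guard being discussed could look like the sketch below; the function and argument names are assumptions, not the actual vLLM config code.

```python
def resolve_draft_tp(spec_tp, target_tp):
    # Sketch only: fall back to the target TP size when no draft TP is given,
    # but reject an explicit 0 (or a negative value) instead of silently
    # treating it as "use the target TP size".
    if spec_tp is None:
        return target_tp
    if spec_tp < 1:
        raise ValueError(
            f"speculative tensor parallel size must be >= 1, got {spec_tp}")
    return spec_tp
```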
I'll look at your new changes; no need to completely fix this (just a nit).
Will this global variable patching potentially create problems? For example, is it possible that other workers will use this context unknowingly?
In the current design of speculative decoding, the draft and target workers execute sequentially, so there is no chance of the target worker using the patched/overridden context.
But if the draft and target workers ever execute concurrently in the future, the code would need to be redesigned to prevent their states from being mixed.
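For illustration, the sequential-use assumption can be expressed as a patch/restore context manager like the sketch below, with stand-in module globals rather than vLLM's actual parallel_state internals: the groups are restored on exit, so the target worker never observes the draft worker's groups as long as the two run one after the other.

```python
from contextlib import contextmanager

# Stand-ins for the module-level group objects kept by the distributed layer.
_TP_GROUP = None
_WORLD_GROUP = None


@contextmanager
def patch_tensor_parallel_group(draft_tp_group, draft_world_group):
    global _TP_GROUP, _WORLD_GROUP
    saved = (_TP_GROUP, _WORLD_GROUP)
    _TP_GROUP, _WORLD_GROUP = draft_tp_group, draft_world_group
    try:
        yield
    finally:
        # Restore before the target worker runs so it only ever sees the
        # original (larger) groups.
        _TP_GROUP, _WORLD_GROUP = saved
```

The draft worker would wrap its forward pass in this context; if draft and target ever ran concurrently, this global swap would have to become per-worker state instead.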
How about only overriding tp_group and not world_group here? Would that work?
I saw that get_world_group() is only used once in the codebase (very early in the initialization stage).
Yes, it is used only during initialization.
If we do not override world_group, initialization will fail due to an assertion check (link).
That check verifies world_group size consistency between workers when multiple workers are spawned in the same process (or the same Ray worker): it asserts that the current world_group has the same size as the world_group being initialized.
Note that this check was added by #5293 and is slightly modified in this PR to support the smaller draft-tp case.
I'm not sure in which scenarios this check matters, but I thought it would be safer to keep it.
What do you think?
On a slightly different note: when I opened this PR, things were a little different.
get_world_group() was also used after initialization by broadcast_tensor_dict(), which the driver uses to control workers.
But that use case went away after refactoring in other PRs.
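Roughly, the consistency check being described has the shape of the sketch below; this is illustrative only, not the actual assertion added by #5293.

```python
def ensure_world_group_consistent(existing_world_group, new_world_size: int):
    # Sketch: when another worker in the same process (or Ray actor) has
    # already initialized a world group, a later initialization must agree
    # on its size.
    if existing_world_group is None:
        return  # first initialization in this process, nothing to compare
    assert existing_world_group.world_size == new_world_size, (
        "world group was already initialized with a different size")
```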
Thanks for the explanation.
How about adding a comment about world_group, since this function is named patch_tensor_parallel_group? Or maybe rename the function to patch_distributed_group?
Good suggestion :)
I've added a comment in the docstring of patch_tensor_parallel_group().
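For illustration, the added docstring note might read something like the following; the wording is assumed, not quoted from the PR.

```python
def patch_tensor_parallel_group(tp_group):
    """Patch the tensor-parallel group with the draft worker's group.

    Note: the world group is patched as well, despite the function name.
    get_world_group() is only read during initialization, where a size
    assertion runs, so it must also reflect the draft worker's smaller
    world group. Both groups are restored when the context exits, before
    the target worker runs.
    """
```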