[WIP] [Speculative Decoding] Support draft model on different tensor-parallel size than target model (Extended) #5856
base: main
Conversation
Force-pushed 26db458 to 2338d52
The timeout seems to be Ray-related:
I'm trying to get e2e tests running using
These e2e spec decode tests seem to run on the Ray backend, as they use AsyncLLM.
What is the latest status? Is there a hang in CI? BTW, I suspect it's due to the non-driver workers differing from the driver worker (but I haven't looked).
Sorry, I've been busy with other work. Regarding the CI question, I tried to debug the problem by running the distributed tests in CI, but I found that they take a long time (4 hrs) to launch, likely because of resource queueing (even though I removed all other tests to get results faster). I'll work on it some more and share my findings. Thanks! 👍
Force-pushed 2338d52 to e584fcd
Force-pushed e584fcd to e62cf05
No problem. You should also follow PR #6032; it adds an SPMD worker, which removes the need to communicate from rank 0 to the other ranks.
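As a rough illustration of the point above (not vLLM's actual code), the benefit of an SPMD worker is that every tensor-parallel rank runs the same program and derives its step inputs locally, so the driver no longer broadcasts metadata to the other ranks each step. The function name below is hypothetical:

```python
def control_messages_per_step(world_size: int, spmd: bool) -> int:
    """Hypothetical count of rank-0 -> worker control-plane messages
    for one engine step.

    In driver-worker mode, rank 0 must ship per-step metadata to every
    non-driver rank; in SPMD mode each rank already has the inputs.
    """
    if spmd:
        return 0  # no rank-0 broadcast needed
    return world_size - 1  # one message per non-driver worker

# With TP=4: driver-worker mode sends 3 messages per step, SPMD sends none.
print(control_messages_per_step(4, spmd=False))  # 3
print(control_messages_per_step(4, spmd=True))   # 0
```

The hang suspected above (non-driver workers diverging from the driver) is exactly the failure mode SPMD execution sidesteps, since all ranks execute identical control flow.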
Force-pushed e62cf05 to 7dab316
Take a look at #6556 as well, @wooyeonlee0.
This PR supports `spec_draft_tp` larger than 1; `spec_draft_tp=1` was enabled in #5414.
Note: to test the code in CI, I temporarily removed unrelated test cases.
What I've done in this PR:
Resolves #4632 further
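To sketch the idea behind `spec_draft_tp` (this is an illustration, not the PR's implementation): when the draft model uses a smaller tensor-parallel size than the target, only a subset of the target's TP ranks need to execute draft steps. The helper name and the placement assumption (draft model on the first `draft_tp` ranks, with `draft_tp` dividing the target TP size) are mine:

```python
def draft_tp_ranks(target_tp: int, draft_tp: int) -> list[int]:
    """Hypothetical helper: which target-TP ranks run the draft model.

    Assumes the draft model is placed on the first `draft_tp` ranks and
    that `draft_tp` evenly divides `target_tp`.
    """
    if draft_tp < 1 or target_tp % draft_tp != 0:
        raise ValueError("draft_tp must be a positive divisor of target_tp")
    return list(range(draft_tp))

# With target TP=4 and draft TP=1 (the #5414 case), only rank 0 proposes:
print(draft_tp_ranks(4, 1))  # [0]
# With draft TP=2 (the spec_draft_tp > 1 case this PR enables):
print(draft_tp_ranks(4, 2))  # [0, 1]
```

Under this scheme the remaining ranks sit idle during draft proposal and rejoin for target-model scoring, which is why the rank-0-to-worker communication pattern discussed above matters.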