
[WIP] [Speculative Decoding] Support draft model on different tensor-parallel size than target model (Extended) #5856

Draft · wants to merge 3 commits into main

Conversation

wooyeonlee0 (Contributor)

This PR adds support for spec_draft_tp values larger than 1; spec_draft_tp=1 was enabled in #5414.

Note: to exercise this code in CI, I have temporarily removed unrelated test cases.

What I've done in this PR:

  • Changed config.py to allow draft tp values larger than 1, but not larger than the target model's tp value (see the sketch below).
  • Added a test case with spec_draft_tp=2 to test_integration_dist_tp4.py.

This further resolves #4632.
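
For reference, here is a minimal sketch of the relaxed check; the function and argument names are illustrative, not the exact ones in config.py:

    def verify_draft_tp(draft_tp: int, target_tp: int) -> None:
        # Hypothetical stand-in for the validation this PR changes in config.py.
        if draft_tp < 1:
            raise ValueError(f"spec_draft_tp must be >= 1, got {draft_tp}")
        # Previously only draft_tp == 1 (or a value equal to the target tp)
        # was accepted; this PR allows any value up to the target model's tp.
        if draft_tp > target_tp:
            raise ValueError(
                f"spec_draft_tp ({draft_tp}) cannot exceed the target "
                f"model's tensor_parallel_size ({target_tp})")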

@zifeitong (Contributor) commented Jun 27, 2024

The timeout seems to be Ray-related:

[2024-06-26T07:36:39Z]                 if self.engine_use_ray:
[2024-06-26T07:36:39Z] >                   await self.engine.add_request.remote(  # type: ignore
[2024-06-26T07:36:39Z]                         **new_request)
[2024-06-26T07:36:39Z] E                       asyncio.exceptions.CancelledError

I'm trying to get the e2e tests running on the mp backend, but it's not so straightforward.
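
For context, vLLM selects the executor through the distributed_executor_backend engine argument, so an mp-backend run of a speculative-decoding setup might look roughly like this (the models and sizes here are illustrative, not the ones used by the tests):

    from vllm import LLM, SamplingParams

    # Illustrative models and sizes; the relevant part is
    # distributed_executor_backend="mp", which picks the multiprocessing
    # executor instead of Ray.
    llm = LLM(
        model="meta-llama/Llama-2-13b-hf",
        tensor_parallel_size=4,
        speculative_model="JackFram/llama-68m",
        num_speculative_tokens=5,
        speculative_draft_tensor_parallel_size=2,
        distributed_executor_backend="mp",
    )
    outputs = llm.generate(["Hello, my name is"],
                           SamplingParams(max_tokens=16))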

@wooyeonlee0 (Contributor, Author)

> The timeout seems to be Ray-related: [traceback quoted above]
> I'm trying to get the e2e tests running on the mp backend, but it's not so straightforward.

These e2e spec decode tests seem to run on the Ray backend, since they use AsyncLLM.
I had thought they used the mp backend because it's the default option.
Please let me know when you have more information. Thank you!

@cadedaniel (Collaborator)

What is the latest status? Is there a hang in CI? BTW, I suspect it's due to the non-driver workers being different from the driver worker (but I haven't looked).

@wooyeonlee0 (Contributor, Author)

> What is the latest status? Is there a hang in CI? BTW, I suspect it's due to the non-driver workers being different from the driver worker (but I haven't looked).

Sorry, I've been busy with other work.
I can now take another look at this issue; I'll try to handle it this week or next.

Regarding the CI question: I tried to debug the problem by running the distributed tests in CI, but found that it takes a long time (about 4 hours) for the distributed tests to launch, presumably because of resource queueing (even though I removed all other tests to get results faster).

I'll work on it some more and share my findings.

Thanks! 👍

@cadedaniel (Collaborator)

No problem. You should also follow PR #6032: it adds an SPMD worker, which removes the need to communicate from rank 0 to the other ranks.
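
For intuition, here is a hypothetical sketch of the two worker styles; none of the names below are vLLM's actual classes or functions:

    def prepare_inputs(scheduler_outputs):
        # Stand-in for building model inputs from the scheduler's output.
        return {"batch": scheduler_outputs}

    def run_model(inputs):
        # Stand-in for one forward pass.
        return f"ran {inputs['batch']}"

    def broadcast(obj, src):
        # Stand-in for a collective such as torch.distributed.broadcast_object_list.
        pass

    def receive_broadcast(src):
        # Stand-in for the receiving side of that broadcast.
        return {"batch": None}

    class DriverWorker:
        """Rank 0 builds the inputs and broadcasts them to the other ranks
        on every step: an extra control-plane hop per iteration."""
        def execute_step(self, rank, scheduler_outputs):
            if rank == 0:
                inputs = prepare_inputs(scheduler_outputs)
                broadcast(inputs, src=0)
            else:
                inputs = receive_broadcast(src=0)
            return run_model(inputs)

    class SPMDWorker:
        """Every rank runs the same program and builds its own inputs,
        so the rank-0 broadcast disappears (the idea behind the SPMD worker)."""
        def execute_step(self, rank, scheduler_outputs):
            inputs = prepare_inputs(scheduler_outputs)
            return run_model(inputs)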

@cadedaniel (Collaborator)

@wooyeonlee0, take a look at this as well: #6556

Successfully merging this pull request may close this issue:
[Performance] [Speculative decoding]: Support draft model on different tensor-parallel size than target model