
[Core] Eliminate parallel worker per-step task scheduling overhead #4894

Merged: 7 commits merged into vllm-project:main on May 22, 2024

Conversation

@njhill (Member) commented May 18, 2024

This PR replaces #3763.

Common logic is handled in the DistributedGPUExecutor superclass and used by both the Ray and Multiprocessing executors.
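
For context, here is a minimal sketch, with illustrative class names only (not the actual vLLM code), of the structure described above: the shared per-step logic lives in a DistributedGPUExecutor-style superclass, while each concrete executor supplies only the transport-specific pieces.

```python
# Illustrative sketch only; names and signatures are placeholders, not vLLM's API.
from abc import ABC, abstractmethod
from typing import Any, Optional


class DistributedGPUExecutorSketch(ABC):
    """Logic shared by the Ray and multiprocessing executors."""

    def __init__(self) -> None:
        self.parallel_worker_tasks: Optional[Any] = None

    def execute_model(self, req: Any) -> Any:
        # On the first call, put the non-driver workers into a long-running
        # execution loop, instead of scheduling a task for every step.
        if self.parallel_worker_tasks is None:
            self.parallel_worker_tasks = self._start_worker_execution_loop()
        # On subsequent steps, only the in-process driver is invoked directly.
        return self._driver_execute_model(req)

    def stop_remote_worker_execution_loop(self) -> None:
        if self.parallel_worker_tasks is None:
            return
        self._wait_for_tasks_completion(self.parallel_worker_tasks)
        self.parallel_worker_tasks = None

    @abstractmethod
    def _driver_execute_model(self, req: Any) -> Any:
        """Run one step on the driver worker."""

    @abstractmethod
    def _start_worker_execution_loop(self) -> Any:
        """Kick off the loop on the remote workers and return task handles."""

    @abstractmethod
    def _wait_for_tasks_completion(self, parallel_worker_tasks: Any) -> None:
        """Block until the remote worker loops have exited."""


class RayGPUExecutorSketch(DistributedGPUExecutorSketch):
    # A multiprocessing-based executor would subclass in the same way.
    def _driver_execute_model(self, req: Any) -> Any:
        raise NotImplementedError  # would call the driver worker in-process

    def _start_worker_execution_loop(self) -> Any:
        raise NotImplementedError  # would dispatch the loop to the Ray actors

    def _wait_for_tasks_completion(self, parallel_worker_tasks: Any) -> None:
        raise NotImplementedError  # would wait on the returned Ray futures
```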

@njhill (Member, Author) commented May 18, 2024

This is failing the distributed spec decoding test. TP spec decoding was added after the original PR, so I need to look more closely at its flow and how best to integrate these changes with it.

@rkooo567 (Collaborator):

We found that NCCL broadcasting overhead is now very large for high TP, and this PR is very important for reducing that gap. I will take a look at the PR tomorrow!

@njhill (Member, Author) commented May 20, 2024

Thanks @rkooo567! This PR actually doesn't reduce any NCCL broadcast overhead; it just eliminates most of the non-torch.distributed RPCs, which turn out to be much more significant (see the measurements in the original PR).

We can and should additionally reduce the amount of NCCL broadcasting done; #4844 is a first step, and I'm working on the next steps now.

I'll try to address the above-mentioned spec decoding TP issue with this PR today.
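
To illustrate the kind of overhead being removed, here is a generic sketch with made-up names (not vLLM code): the baseline schedules a task on every worker for every step, while the approach in this PR starts each worker's loop once, after which only the driver is called per step.

```python
# Generic sketch of per-step scheduling vs. a long-running worker loop.
import queue
import threading


class Worker:
    """Stand-in for a worker; the real one runs a model forward pass."""

    def execute_model(self, req):
        return f"output for {req}"


def per_step_scheduling(worker, requests):
    # Baseline: an RPC-like call is scheduled for every single step,
    # so the task-scheduling overhead is paid on every step.
    return [worker.execute_model(r) for r in requests]


def start_execution_loop(worker, inbox: queue.Queue, outbox: queue.Queue):
    # Alternative: start the worker's loop once; afterwards work flows
    # through the channel until a sentinel (None) stops the loop.
    def loop():
        while True:
            req = inbox.get()
            if req is None:
                break
            outbox.put(worker.execute_model(req))

    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t


if __name__ == "__main__":
    w = Worker()
    inbox, outbox = queue.Queue(), queue.Queue()
    t = start_execution_loop(w, inbox, outbox)
    for step in range(3):
        inbox.put(f"step-{step}")
        print(outbox.get())
    inbox.put(None)  # analogous to stopping the remote worker execution loop
    t.join()
```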

@rkooo567 (Collaborator) left a review:


Hmm, I wonder if there's a cleaner way to stop the execution loop... (it seems like too many implementation details are leaked). Unfortunately, I couldn't come up with a better idea, so for now it is probably best to document things more verbosely.

On vllm/executor/distributed_gpu_executor.py:
return await self._driver_execute_model_async(execute_model_req)

async def stop_remote_worker_execution_loop_async(self) -> None:
if self.parallel_worker_tasks is None:

maybe assert instead? If this is None, doesn't that mean the state is kind of screwed up?
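
A small sketch of the two options being discussed here, with placeholder names (not the merged implementation): silently returning when no loop is running versus asserting on it.

```python
# Placeholder sketch; not the actual vLLM implementation.
from typing import Any, Optional


class ExecutorSketch:
    def __init__(self) -> None:
        self.parallel_worker_tasks: Optional[Any] = None

    async def stop_remote_worker_execution_loop_async(self) -> None:
        # Option 1 (as in the snippet above): treat "no loop running" as a no-op.
        if self.parallel_worker_tasks is None:
            return
        await self._finish_tasks()
        self.parallel_worker_tasks = None

    async def stop_with_assert_async(self) -> None:
        # Option 2 (reviewer suggestion): assert, on the theory that a missing
        # handle indicates corrupted state rather than a benign no-op.
        assert self.parallel_worker_tasks is not None, (
            "stop requested but no worker execution loop is running")
        await self._finish_tasks()
        self.parallel_worker_tasks = None

    async def _finish_tasks(self) -> None:
        # Placeholder for waiting on the outstanding worker tasks.
        pass
```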

@rkooo567 (Collaborator) commented May 21, 2024

Btw, I also talked with @cadedaniel. I think this optimization makes spec decoding pretty complicated (and I feel like PP as well). @njhill, I am curious if there's a clean way to disable this in the short term when spec decoding is enabled?

@andoorve (Collaborator):

> Btw, I also talked with @cadedaniel. I think this optimization makes spec decoding pretty complicated (and I feel like PP as well). @njhill, I am curious if there's a clean way to disable this in the short term when spec decoding is enabled?

+1 for PP @rkooo567

@njhill (Member, Author) commented May 21, 2024

Thanks for the great review comments @rkooo567!

> Btw, I also talked with @cadedaniel. I think this optimization makes spec decoding pretty complicated (and I feel like PP as well). @njhill, I am curious if there's a clean way to disable this in the short term when spec decoding is enabled?

@rkooo567 @andoorve this would definitely be an option, but it would also come at the cost of at least some additional complexity. I haven't yet dug in enough, but I'm hopeful about adapting it to work with spec decoding too. That could be misplaced optimism, and I'm less sure about PP. I'll first look at those, and if the complexity is too great then changing it to work only in the non-spec-decoding / non-PP case could be the backup plan.

I was actually hoping this could be a stepping stone to eliminating the "control plane" RPCs altogether. The workers could remain in a permanent torch.distributed broadcast loop, with keepalives when idle if timeouts are an issue.

@andoorve (Collaborator):

> I was actually hoping this could be a stepping stone to eliminating the "control plane" RPCs altogether. The workers could remain in a permanent torch.distributed broadcast loop, with keepalives when idle if timeouts are an issue.

@njhill This is a great point. I also committed a change to PyTorch in anticipation of these kinds of optimizations (replace control-plane RPC with torch distributed) for PP in the future: pytorch/pytorch@b96b1e8

It's a real pain to get multiple sends/recvs on a single rank though, so this is definitely more of a future item once we get the basic functionality of PP working.
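
A rough sketch of the permanent broadcast-loop idea mentioned above, assuming a torch.distributed process group has already been initialized; the message protocol and function names here are invented for illustration.

```python
# Hypothetical sketch; assumes torch.distributed is initialized across ranks.
import torch.distributed as dist

STOP = "STOP"
KEEPALIVE = "KEEPALIVE"


def worker_broadcast_loop(execute_step):
    """Non-driver ranks stay in this loop, waiting on broadcasts from rank 0."""
    while True:
        payload = [None]
        # Every rank participates in the collective; rank 0 supplies the object.
        dist.broadcast_object_list(payload, src=0)
        msg = payload[0]
        if msg == STOP:
            break
        if msg == KEEPALIVE:
            continue  # sent while idle so the collective doesn't time out
        execute_step(msg)


def driver_send(msg):
    """Rank 0 pushes work (or a keepalive / stop message) to all workers."""
    dist.broadcast_object_list([msg], src=0)
```

With something like this, the per-step control messages would travel over the same torch.distributed channel already used for tensor broadcasts, removing the separate control-plane RPCs entirely.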

@rkooo567 (Collaborator):

@njhill btw, this is the major bottleneck for us right now, so let me know if there's any way I can help accelerate the PR merge!!

@njhill (Member, Author) commented May 21, 2024

@rkooo567 about to update now, will push within the next couple of hours hopefully!

@rkooo567 (Collaborator):

@njhill let me know when it is ready to review again!

@njhill (Member, Author) commented May 22, 2024

Thanks @rkooo567, I made some updates but had to deal with other things for the remainder of the day. If I don't get a chance to finish debugging tonight, I'll do it first thing in the morning (Pacific time).

@cadedaniel (Collaborator):

Approving spec decode changes

@rkooo567 (Collaborator) left a review:


LGTM if tests pass!

raise NotImplementedError

@abstractmethod
def _wait_for_tasks_completion(self, parallel_worker_tasks: Any) -> None:
Comment from @rkooo567 on the snippet above:

QQ: this can hang forever if the loop doesn't finish properly. Should we allow a timeout here and kill the workers with an exception if the timeout is reached? (Theoretically, if it takes > 30s, I think something is wrong.)

Reply from @njhill (Member, Author):

@rkooo567 I will check, but I'm not sure this is necessary. At least in the async case there's already an overall per-step timeout that I think would cover this.
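
For illustration, a hedged sketch of the timeout idea using asyncio as a stand-in (the real executors wait on Ray futures or multiprocessing result handles; the 30-second value simply echoes the comment above).

```python
# Illustrative sketch only; not the vLLM implementation.
import asyncio
from typing import Iterable

WORKER_STOP_TIMEOUT_S = 30.0  # hypothetical bound taken from the review comment


async def wait_for_tasks_completion(tasks: Iterable[asyncio.Task],
                                    timeout: float = WORKER_STOP_TIMEOUT_S) -> None:
    try:
        await asyncio.wait_for(asyncio.gather(*tasks), timeout=timeout)
    except asyncio.TimeoutError:
        # This is where workers could be killed or the engine torn down.
        raise RuntimeError(
            f"Worker execution loop did not stop within {timeout}s")
```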

@andoorve (Collaborator):

Hey @njhill, did you get a chance to look at this? I feel like this would cause quite a few changes to PP in its current form.

> @rkooo567 @andoorve this would definitely be an option, but it would also come at the cost of at least some additional complexity. I haven't yet dug in enough, but I'm hopeful about adapting it to work with spec decoding too. That could be misplaced optimism, and I'm less sure about PP. I'll first look at those, and if the complexity is too great then changing it to work only in the non-spec-decoding / non-PP case could be the backup plan.

@njhill (Member, Author) commented May 22, 2024

@rkooo567 OK the spec decoding thing was a tiny fix. And I have added some more detail to the code comments per your latest review.

Thanks @cadedaniel for reviewing too, I was going to ping you to ask for review of that part once the tests were passing :)

@andoorve I can help to look at how we can adapt it to work with PP but it sounds like there is some urgency to get this PR merged first (for the upcoming release?)

@andoorve (Collaborator):

Hi @njhill,

Got it, I don't want to delay this PR, but I just wanted to see if there was any way to loosen the assumption that only the rank 0 worker sends over RPC and receives results (via some mechanism or a fallback to per-step task scheduling).

The main idea of having the other (non-driver) workers in a task loop is thankfully compatible with PP; it's just the assumption that there's only one parallel group that could cause quite a bit of friction. Happy to chat offline as well.

@rkooo567 merged commit eb6d3c2 into vllm-project:main on May 22, 2024 (63 checks passed).
@rkooo567 (Collaborator):

@njhill thank you so much for the quick fix!

@njhill deleted the streamline-tp-new branch on May 22, 2024 at 21:19.
@njhill (Member, Author) commented May 22, 2024

@andoorve I will look more closely at the PP PR tomorrow (and we could chat too if you're free then). Based on what you said, and a high-level understanding/assumption about how that works, I'm fairly confident we can solve it without too much effort.
