
[RFC]: Single Program Multiple Data (SPMD) Worker Control Plane #6556

Open · Tracked by #6801
ruisearch42 opened this issue Jul 19, 2024 · 6 comments

@ruisearch42 (Contributor)

Motivation.

TL;DR: Introduce SPMD-style control plane to improve control plane architecture and optimize performance.

For distributed inference, vLLM currently uses a “driver-worker” alongside the other workers. As shown in the diagram below, this driver-worker lives in the same process as the driver. It prepares the arguments, then broadcasts them to all other workers to execute the sharded model, using NCCL as the control plane.

[Diagram: current architecture. The driver-worker prepares arguments and broadcasts them to the other workers over NCCL.]

This architecture has a few drawbacks. First, the driver-worker needs to participate in the NCCL group and execute the model. Since NCCL broadcast is a synchronous operation, it interferes with other driver functionality such as scheduling, which hurts performance.
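For illustration, here is a minimal sketch of this pattern. It is not vLLM's actual code: `prepare_model_input` is a hypothetical stand-in, and `torch.distributed.broadcast_object_list` stands in for vLLM's metadata broadcast. The point is that rank 0 blocks on the broadcast inside every step.

```python
# Illustrative sketch only (not vLLM's actual code): rank 0, the
# driver-worker, prepares arguments and broadcasts them to all other ranks
# on every step. The broadcast is synchronous, so it cannot be overlapped
# with driver work such as scheduling.
import torch.distributed as dist

def prepare_model_input(scheduler_output):
    # Hypothetical stand-in for vLLM's input preparation.
    return {"tokens": scheduler_output}

def driver_step(model, scheduler_output):
    payload = [prepare_model_input(scheduler_output)]
    dist.broadcast_object_list(payload, src=0)  # blocks the driver process
    return model.execute(payload[0])            # driver also runs its shard

def worker_loop(model):
    while True:
        payload = [None]
        dist.broadcast_object_list(payload, src=0)  # wait for the driver
        model.execute(payload[0])
```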

Moreover, this architecture makes it difficult to support speculative decoding. Specifically:

  1. The speculative decoding framework may skip running the draft model when Dynamic Speculative Decoding (DSD) or another policy is enabled. In that case, the decision of whether to run the draft model must be communicated to the other ranks, so DSD cannot work with TP>1 unless there is additional communication (which incurs latency overhead).
  2. Pipeline parallelism can be composed within the speculative decoding framework. However, the speculative tokens must be sent to all workers, potentially cross-node. With SPMD, all PP ranks have access to the same information, so no communication is needed on top of normal PP. This is important for latency.

Proposed Change.

We propose an architecture change to support SPMD-style control plane, as shown in the diagram below.

[Diagram: proposed SPMD architecture. The LLMEngine sends inputs to all SPMD workers over Ray DAG channels and collects results the same way.]

Specifically, we remove the argument-preparation and model-execution functionality from the driver and make all workers SPMD-style: the LLMEngine/driver passes the input to all SPMD workers via a Ray DAG channel (shared memory), and each worker prepares its own arguments and executes its model shard. The results are passed back to the driver via a Ray DAG channel as well.
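As a rough sketch of what this dispatch could look like with Ray's compiled DAGs (the API is experimental and may differ across Ray versions; `Worker` here is an illustrative stand-in for vLLM's worker class):

```python
# Rough sketch of SPMD-style dispatch via Ray's (experimental) compiled DAG
# API; `Worker` is an illustrative stand-in for vLLM's worker class.
import ray
from ray.dag import InputNode, MultiOutputNode

@ray.remote
class Worker:
    def __init__(self, rank: int):
        self.rank = rank

    def execute_model(self, inputs):
        # Every rank receives identical `inputs`, prepares its own
        # arguments, and executes its shard of the model.
        return f"rank {self.rank} executed step {inputs['step']}"

workers = [Worker.remote(rank) for rank in range(4)]

# Fan the same engine input out to all workers; gather all their outputs.
with InputNode() as engine_input:
    dag = MultiOutputNode([w.execute_model.bind(engine_input) for w in workers])

dag = dag.experimental_compile()  # uses shared-memory channels under the hood
outputs = ray.get(dag.execute({"step": 0}))
```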

Roadmap

SPMD functionality and optimizations:

Features to build on top of SPMD:

  • Pipeline parallelism with Ray accelerated DAG
  • Speculative decoding

After comprehensive benchmarking and optimizations, SPMD will become the default, and the NCCL-based control plane code path will be cleaned up.

Feedback Period.

No response

CC List.

@youkaichao @stephanie-wang @rkooo567 @cadedaniel

Any Other Things.

No response

@cadedaniel (Collaborator)

cc @LiuXiaoxuanPKU for DSD

@njhill (Member) commented Jul 31, 2024

@ruisearch42 sorry for the delay but I have a few questions. Coming from TGI, which takes an SPMD approach, I actually saw the way the driver worker participates in the collective communications as an advantage in terms of reduced data movement.

> 1. The speculative decoding framework may skip running the draft model when Dynamic Speculative Decoding (DSD) or another policy is enabled. In that case, the decision of whether to run the draft model must be communicated to the other ranks, so DSD cannot work with TP>1 unless there is additional communication (which incurs latency overhead).

To clarify, we're talking specifically about TP>1 for the draft model here right? not for draft tp=1 and target tp>1?

Could you elaborate on how SPMD in particular solves this? If the decision is per top-level spec-decoding step, why can't that be included in the existing metadata that's broadcast from the driver?

> 2. Pipeline parallelism can be composed within the speculative decoding framework. However, the speculative tokens must be sent to all workers, potentially cross-node. With SPMD, all PP ranks have access to the same information, so no communication is needed on top of normal PP. This is important for latency.

Is the thinking that the layers of the draft model itself would also be distributed between nodes with PP? Could you explain why the PP ranks don't have access to the same information with the current implementation?

I am probably missing some obvious things here so apologies in advance!

@cadedaniel (Collaborator)

> > 1. The speculative decoding framework may skip running the draft model when Dynamic Speculative Decoding (DSD) or another policy is enabled. In that case, the decision of whether to run the draft model must be communicated to the other ranks, so DSD cannot work with TP>1 unless there is additional communication (which incurs latency overhead).
>
> To clarify, we're talking specifically about TP>1 for the draft model here right? not for draft tp=1 and target tp>1?
>
> Could you elaborate on how SPMD in particular solves this? If the decision is per top-level spec-decoding step, why can't that be included in the existing metadata that's broadcast from the driver?

First, to lay out the problem clearly:

  1. When draft_tp>1 and target_tp>1, the non-zero ranks do not know whether they should run the draft model for a given input, because a policy that turns off speculation may be visible only to rank 0 (currently either DSD, or the case where all sequences are too long for the proposal model).
  2. When draft_tp=1 and target_tp>1, the non-zero ranks do not know what the proposal tokens are, because only rank 0 runs the draft model. These must be communicated in some way to the other ranks so that they can form a batch for scoring with the target model.

Succinctly, we need to (1) communicate the result of any dynamic speculation policy from rank0 to nonzero ranks, and (2) communicate the proposal tokens from rank0 to nonzero ranks.

To implement this, there are two different options I can think of:

  • Communicate the same input to all workers, SPMD-style, so that every worker has perfect information. No subsequent control-flow communication is required, since dynamic speculation policies can be deterministic, and all ranks will have access to the sampled tokens because the sampler already does an allgather to get logits. (A sketch of this option follows the list.)
  • Add explicit control-flow communication after the dynamic speculation policy runs, and another after the draft model runs to communicate the proposal tokens. All ranks then have the policy decision and the proposal tokens.
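Here is a toy sketch of the first option (names are illustrative, not vLLM's): because the policy is a deterministic function of metadata that every rank receives identically, each rank reaches the same decision with zero extra communication.

```python
# Toy sketch of the SPMD option (illustrative names, not vLLM's code).
from dataclasses import dataclass

@dataclass
class StepMetadata:
    max_seq_len: int        # longest sequence in the batch
    num_running_seqs: int   # batch size this step

def should_speculate(meta: StepMetadata, max_proposal_len: int = 2048) -> bool:
    # Deterministic function of the broadcast metadata only: any rank
    # holding the same StepMetadata reaches the same decision.
    return meta.max_seq_len <= max_proposal_len and meta.num_running_seqs < 32

def spmd_worker_step(rank: int, meta: StepMetadata) -> str:
    if should_speculate(meta):
        return "run draft model, then score with target"
    return "run target model only"

# All ranks take the same branch without any control-flow communication.
meta = StepMetadata(max_seq_len=512, num_running_seqs=8)
assert len({spmd_worker_step(r, meta) for r in range(4)}) == 1
```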

SPMD is better than the alternative for two reasons:

  • Latency: we only need to write the input metadata once to shared memory; all TP processes can then read it and execute their control flow without any further communication. You cannot do better than this in terms of latency, and building low-latency communication on the GPU instead would require more complex software engineering.
  • Separation of concerns: once the input metadata has been communicated, the worker logic can execute unobstructed, without further control-plane communication. This simplifies the software (no control-flow communication deadlocks to reason about!) and allows composing workers within workers for more advanced algorithms such as staged speculative decoding.

The downside of SPMD is that we waste energy on the machine: the draft model's FLOPs are spent on every GPU instead of only one. This is an acceptable tradeoff given current requirements.

> > 2. Pipeline parallelism can be composed within the speculative decoding framework. However, the speculative tokens must be sent to all workers, potentially cross-node. With SPMD, all PP ranks have access to the same information, so no communication is needed on top of normal PP. This is important for latency.
>
> Is the thinking that the layers of the draft model itself would also be distributed between nodes with PP? Could you explain why the PP ranks don't have access to the same information with the current implementation?
>
> I am probably missing some obvious things here so apologies in advance!

The primary concern here is separation of concerns, so that the proposal method can use whatever deployment configuration best fits what the user wants. If they are using PP for the target model and PP is also a good fit for the draft model, they should be free to do so without refactoring the framework.

I am not sure if this will be a popular configuration or one we support, but one could imagine such a scenario when the user is trying to balance the speculative workload across PP ranks. We could add explicit control-flow communication for requirements (1) and (2) above in PP, or we could simply communicate the initial state to the world and have each rank perform work according to its identity.

@njhill (Member) commented Aug 2, 2024

Thanks @cadedaniel for taking the time to explain this in so much detail.

Still trying to wrap my head around it fully. I guess I still don't see why the driver can't participate, with a single metadata broadcast per top-level step (as already happens for non-spec-decode), and within each top-level step the actions can still be mirrored (and can include proposer + target, or a staged chain, etc.), including in the driver.

@cadedaniel (Collaborator) commented Aug 4, 2024

> Still trying to wrap my head around it fully. I guess I still don't see why the driver can't participate, with a single metadata broadcast per top-level step (as already happens for non-spec-decode), and within each top-level step the actions can still be mirrored (and can include proposer + target, or a staged chain, etc.), including in the driver.

Yeah, what you describe is exactly the goal. This is "SPMD" since each rank runs the same program. The question is where that metadata broadcast happens: for spec decode we need it to happen above the worker entrypoint, so that workers may assume all peers have the same information.
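A tiny sketch of that distinction (hypothetical structure, not vLLM's code): the engine hands identical metadata to every rank before calling the entrypoint, so the entrypoint is a pure function of its input.

```python
# Hypothetical sketch: the metadata hand-off sits above the worker
# entrypoint, so the entrypoint may assume all peers hold identical inputs.

def make_worker(rank: int):
    def entrypoint(metadata: dict) -> str:
        # Pure function of its input: any decision derived from `metadata`
        # (e.g. whether to speculate) is consistent across ranks for free.
        return f"rank {rank} executed step {metadata['step']}"
    return entrypoint

def engine_step(workers, metadata: dict):
    # The engine delivers the same metadata to every rank (in the proposal,
    # via Ray DAG shared-memory channels) before any worker code runs.
    return [worker(metadata) for worker in workers]

workers = [make_worker(r) for r in range(4)]
print(engine_step(workers, {"step": 0}))
```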

@rkooo567 (Collaborator) commented Aug 6, 2024

The first PR is out #7109
