[Roadmap] veRL Development Roadmap #22

PeterSH6 · 2024-11-22T09:47:53Z

Themes

We categorized our roadmap into 8 themes: Broad Model Support, Regular Update, More RL Algorithms support, Dataset Coverage, Plugin Support, Scaling Up RL, More LLM Infrastructure Support, Wide Hardware Coverage

Broad Model Support

To add a new model in veRL, the model should satisfy the following requirements:

The models are supported in vLLM and huggingface transformers. Then you can directly use dummy_hf load format to run the new model
[Optional for DTensor] For FSDP Backend, implement the dtensor_weight_loader for the model to transfer actor weights from FSDP checkpoint to vLLM model. See FSDP Document for more information
For Megatron Backend, users need to implement the ParallelModel similar to modeling_llama_megatron.py , implement some corresponding checkpoint_utils to load checkpoints from the huggingface, and implement the megatron_weight_loader to transfer actor weights from ParallelModel directly to the vLLM model. See Megatron-LM Document for more information

Regular Update

Use postition_idsto support remove padding in transformers models (transformers >= v4.45) [misc] feat: spport rmpad/data-packing in FSDP with transformers #91
Upgrade the vLLM version to the latest -> Integrate SPMD-version of vLLM
Ray upgrade to latest version (test multiple resource_pool colocate) [misc] fix: weak reference of WorkerDict in RayTrainer #65
- An Megatron Example for multiple WorkerGroup on same resource_pool.
Megatron-LM/MCore Upgrade and GPTModel Support [RFC] Megatron-LM and MCore maintaining issues for veRL #15

More RL Algorithms Support

Make sure the algorithms can converge on some math datasets (e.g., GSM8k)

GRPO
Online DPO
Safe-RLHF (Multiple rm)
ReMax

Dataset Coverage

Plugin Support

Integrate SandBox and its corresponding datasets for Code Generation tasks

Scaling up RL

Context Parallel
- Deepspeed Ulyssess [misc][Long Context] feat: support ulysses for long context training #109
- Ring Attention
Integrate Ray Compiled Graphs (aDAGs) to speedup data transfer
Support FSDP HybridShard
Aggressive offload techniques for all models
Support vLLM Rollout utilizes larger TP size than Actor model
Support Pipeline parallelism in rollout generation (in vllm or other LLM serving infra)

More LLM Infrastructure Support

LLM Training Infrastructure

Support TorchTitan for TP + PP parallelism
Support VeScale for Auto-Parallelism training

LLM Serving Infrastructure

At present, our project supports vLLM using the SPMD execution paradigm. This means we've eliminated the need for a standalone single-controller process (known as LLMEngine) by integrating its functionality directly into the multiple worker processes, making the system SPMD.

Basic Tutorial: Basic Tutorial: Adding a New LLM Inference/Serving Backend #21
Support SGLang (offline + SPMD) for rollout generation. Reference: [Feature] several features for veRL integration sgl-project/sglang#2736
Support vLLM-SPMD version: [testing][rollout] feat: support integration of vllm>=0.7.0 (spmd-version) #209
Support TensorRT-LLM for rollout generation

Wide Hardware Coverage

Supporting a new hardware type in our project involves the following requirements:

Ray compatibility: The hardware type must be supported by the Ray framework, allowing it to be recognized and managed through the ray.utils.placement_group functionality.
LLM infra and transformers support: To leverage the new hardware effectively, it is crucial that both LLM infra (e.g., vLLM, torch, Megatron-LM and others) and the transformers library provide native support for the hardware type.
CUDA kernel replacement: We need to replace the CUDA kernels currently used in FSDP and Megatron-LM with the corresponding kernels specific to the new hardware.

Support Ascend NPUs
- vLLM Ascend Support [Feature]: vllm support for Ascend NPU vllm-project/vllm#6728
- Megatron-LM -> MindSpeed
Low-end NVIDIA GPUs (e.g., Volta, Tesla series)
- For Megatron-LM, implement no-rmpad and no flash-attention version of ParallelModel Is non-RmPad version model and RmPad verison mdoel interchangeable? #20

The text was updated successfully, but these errors were encountered:

leifeng666 · 2025-02-01T19:12:28Z

Hi @PeterSH6 @eric-haibin-lin, I am exploring this project recently and I am happy to contribute to it. Are there any recommended good first issues that I can work on to get myself familiar with it? I am interested in issues related to model parallelism and serving.

eric-haibin-lin · 2025-02-05T19:47:04Z

Hi @PeterSH6 @eric-haibin-lin, I am exploring this project recently and I am happy to contribute to it. Are there any recommended good first issues that I can work on to get myself familiar with it? I am interested in issues related to model parallelism and serving.

Hi @leifeng666 thanks for your interest. One good issue is to support more models with model parallelism / Megatron. Currently we have an example of Megatron deepseek-llm. I think a good path towards more model support would be:

meta-llama/Meta-Llama-3-8B-Instruct, since its config is quite similar to that of deepseek-llm
meta-llama/Llama-3.2, by support features such as rope

You can find reference Megatron commands and logs in #210 . Other ways to verify the correctness would be just compare Megatron tensor parallel runs with FSDP runs.
You're welcome to join wechat/slack channel to further discuss and contribute!

https://verl.readthedocs.io/en/latest/advance/megatron_extension.html

zhe-thoughts · 2025-02-15T01:38:21Z

Support TorchTitan for TP + PP parallelism

Is there a timeline for this? Thanks

#22 . WIP, will add more details tomorrow :) --------- Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>

volcengine#22 . WIP, will add more details tomorrow :) --------- Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>

PeterSH6 pinned this issue Nov 22, 2024

PeterSH6 self-assigned this Jan 19, 2025

ocss884 mentioned this issue Mar 5, 2025

[rollout] feat: add SGLang as rollout engine to verl #490

Merged

vermouth1992 pushed a commit that referenced this issue Mar 17, 2025

[rollout] feat: add SGLang as rollout engine to verl (#490)

333e6d6

#22 . WIP, will add more details tomorrow :) --------- Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>

mansicer pushed a commit to mansicer/verl that referenced this issue Mar 18, 2025

[rollout] feat: add SGLang as rollout engine to verl (volcengine#490)

2a4c23f

volcengine#22 . WIP, will add more details tomorrow :) --------- Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>

eric-haibin-lin mentioned this issue Mar 21, 2025

[roadmap] verl development Q2 #710

Open

31 tasks

eric-haibin-lin unpinned this issue Mar 21, 2025

zpqiu pushed a commit to zpqiu/verl that referenced this issue Mar 25, 2025

[rollout] feat: add SGLang as rollout engine to verl (volcengine#490)

4db6a7e

volcengine#22 . WIP, will add more details tomorrow :) --------- Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>

shenofusc mentioned this issue Mar 31, 2025

Ascend 910B quickstart error WorkerDict pid=904548) [E331 20:50:12.688969421 socket.cpp:1011] [c10d] The client socket has timed out after 1800000ms while trying to connect to (10.123.x.x, 38321) #849

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Roadmap] veRL Development Roadmap #22

[Roadmap] veRL Development Roadmap #22

PeterSH6 commented Nov 22, 2024 •

edited by vermouth1992

Loading

leifeng666 commented Feb 1, 2025

eric-haibin-lin commented Feb 5, 2025 •

edited

Loading

zhe-thoughts commented Feb 15, 2025

[Roadmap] veRL Development Roadmap #22

[Roadmap] veRL Development Roadmap #22

Comments

PeterSH6 commented Nov 22, 2024 • edited by vermouth1992 Loading

Themes

Broad Model Support

Regular Update

More RL Algorithms Support

Dataset Coverage

Plugin Support

Scaling up RL

More LLM Infrastructure Support

LLM Training Infrastructure

LLM Serving Infrastructure

Wide Hardware Coverage

leifeng666 commented Feb 1, 2025

eric-haibin-lin commented Feb 5, 2025 • edited Loading

zhe-thoughts commented Feb 15, 2025

PeterSH6 commented Nov 22, 2024 •

edited by vermouth1992

Loading

eric-haibin-lin commented Feb 5, 2025 •

edited

Loading