Development Roadmap (2024 Q4) #1487

Ying1123 · 2024-09-21T22:38:00Z

Here is the development roadmap for 2024 Q4. Contributions and feedback are welcome (join the bi-weekly development meeting). The previous 2024 Q3 roadmap can be found in #634.

Performance

Hide CPU overhead with overlapped scheduler (Faster overlap mode scheduler #1738, Enable overlap by default #2067)
Support speculative decoding
- Eagle (Eagle speculative decoding part 4: Add EAGLE2 worker #2150)
- Reference-based (Reference speculative decoding #270)
- Medusa head ([Feature] plan to support medusa? #859)
- Draft model based
Sparse attention (Support double sparsity #1459)
Faster grammar parsing library for constrained decoding ([Performance] Support both xgrammar and outlines for constrained decoding #1752)
Multi-layer radix cache across GPU/CPU/disk (Hierarchical Caching for SGLang #2693) @xiezhq-hermann
Improve the performance of mixed chunked prefill. See the draft (Rewrite mixed chunked prefill #1383).
Integrate cuDNN paged attention kernels

Parallelism

Support sequence parallelism ([Feature] Add initial support for sequence parallelism #1436). Related paper
Support pipeline parallelism.
Support expert parallelism + data parallelism for DeepSeek/MoE models. @ispobock
- Data parallelism (Support DP MLA #1970)
- Expert parallelism ([Feature] Expert parallelism support #1435)
Implement a better cache-aware load balancer for data parallelism ([router] cache-aware load-balancing router v1 #2114, [Feature] Cache-aware Data Parallel Router #1732). @ByronHsu @yichuan520030910320
Overlap communication in tensor parallelism. @ZhuohaoL
Support disaggregated serving to separate prefill and decoding.

Hardware Coverage

AMD optimizations. cc @HaiShaw
- CK kernels
- Setup CI (accuracy/performance) for AMD
Intel XPU support.
- [Feature, Hardware] Enable SGLang on XPU GPUs via PyTorch #1480
- Add initial support for intel Gaudi accelerators #2121

Model Coverage

Multi-modal models
- Llama 3.2 Vision (Llama3.2 vision model support #1551)
- Qwen2-VL (Support qwen2 vl model #1546)
- DeepSeek VL2 ([Feature] Support DeepSeek VL 2 #2653)
- mistralai/Pixtral ([Feature] Support mistralai/Pixtral #2351)
- GLM 4V (Add GLM-4v Multimodal Model support for SGLang #1641)
- VILA https://arxiv.org/abs/2412.04468
- InternVL
- Phi-vision
- FishSpeech audio model support
- Ultravox
Language models
- Mamba models @rahulbatra85 @HaiShaw
- xLSTM
Reward models
- [Feature] Support reward model LxzGordon/URM-LLaMa-3.1-8B #1525
- Gemma2 reward model support #1954

New features

Integrate with LMCache https://github.com/LMCache/LMCache
A padded batch mode to make results more deterministic (see sglang/docs/references/faq.md, line 3 in 8912b76: "The results are not deterministic, even with a temperature of 0")
Performance optimizations for multi-LoRA serving ([LoRA, Performance] Add gemm expand triton kernel for multi-LoRA #1728)

Quantization

@HaiShaw @zhyncs @ispobock

Torchao integration (Add llama implementation with no tensor parallel linears #1561)
Turbomind operators integration
More CUTLASS mixed precision gemm integration
KV cache quantization (more formats + scaling factor)

Server API

Support taking embeddings directly as inputs ([Feature] Generation Inputs: input_embeds #745)
Add APIs for using the inference engine in a single script without launching a separate server. See also examples.
- Provide an offline engine API #1567
Support endpoints other than OpenAI (Anthropic, Mistral) in the language frontend.
Better APIs to support RL trainers, including https://github.com/huggingface/trl and https://github.com/OpenRLHF/OpenRLHF @zhaochenyang20
Support generalized reward API (adding linear layers to any Causal LM to get the reward) https://github.com/OpenRLHF/OpenRLHF @zhaochenyang20

Observability

Integrate Grafana / Prometheus
- support prometheus metrics #1853, [WIP] Prometheus Metrics #1461

Others

Notebook-style interactive tutorials. @zhaochenyang20
Compiler mode optimizations for the language (e.g. support sending a full serialized SGL program to the server). @hnyls2002
Memory pool refactor to better support mixing different attention layers (e.g., interleaved window attention). @Ying1123
Make vLLM an optional dependency. @zhyncs @ByronHsu @yizhang2077 [Feature] Make vLLM optional in model code #1673

fengyang95 · 2024-09-22T02:02:41Z

Are there any plans to optimize long context latency?

lumiere-ml · 2024-10-17T02:24:33Z

Hi, can I help with the multi-layer radix cache (GPU/CPU/Disk)? I'm really interested in that.

tanzelin430 · 2024-10-17T11:58:58Z

> Are there any plans to optimize long context latency?

I am interested in contributing to the P-D split inference architecture, and I have machines I can use to develop it. If you have any related development plans, please let me know. Thank you @Ying1123 @zhyncs @fengyang95

merrymercy · 2024-10-19T13:58:47Z

@lumiere-ml @tanzelin430 Are you in the slack channel? We can follow up on that.

zhyncs · 2024-10-20T06:01:03Z

> @lumiere-ml @tanzelin430 Are you in the slack channel? We can follow up on that.

@lumiere-ml @tanzelin430 Welcome to join our slack channel https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw

tanzelin430 · 2024-10-20T06:14:54Z

> > @lumiere-ml @tanzelin430 Are you in the slack channel? We can follow up on that.

> @lumiere-ml @tanzelin430 Welcome to join our slack channel https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw

Thanks for the invitation, I am in Slack now. Looking forward to collaborating with you.

lumiere-ml · 2024-10-20T09:01:30Z

> > @lumiere-ml @tanzelin430 Are you in the slack channel? We can follow up on that.

> @lumiere-ml @tanzelin430 Welcome to join our slack channel https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw

Thanks for your invitation!

Edenzzzz · 2024-11-11T03:30:14Z

> > > @lumiere-ml @tanzelin430 Are you in the slack channel? We can follow up on that.

> > @lumiere-ml @tanzelin430 Welcome to join our slack channel https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw

> Thanks for your invitation!

@lumiere-ml @zhyncs I'm also very interested, could you share which channel you're using to discuss?
Perhaps we can combine radix tree prefix matching with P-D disaggregation similar to Mooncake?

mfdj2002 · 2024-11-21T07:40:18Z

If no one is actively working on supporting pipeline parallelism, I'm down to help

Edenzzzz · 2024-11-25T17:20:24Z

@mfdj2002 I think @CalvinXKY has expressed interest on Slack; you can chat with him there.

merrymercy · 2024-11-26T00:25:30Z

No one is working on pipeline parallelism. Feel free to contribute one.

m0g1cian · 2024-12-03T07:51:23Z

I recently completed a reward model implementation for RMs trained by LlamaFactory. Everything worked well, but I've noticed a relatively small difference in the last hidden states between my SGLang implementation and the counterpart in TRL (resulting in a ROC loss of ~0.3%).

Regardless, I think I can help with the task "Support generalized reward API (adding linear layers to any Causal LM to get the reward)"

kuangdao · 2024-12-04T06:32:48Z

I am interested in sequence parallelism. Will it use the method from "Context Parallelism for Scalable Million-Token Inference"? Thanks.

zhaochenyang20 · 2024-12-04T20:38:26Z

> I recently completed a reward model implementation for RMs trained by LlamaFactory. Everything worked well, but I've noticed a relatively small difference in the last hidden states between my SGLang implementation and the counterpart in TRL (resulting in a ROC loss of ~0.3%).
>
> Regardless, I think I can help with the task "Support generalized reward API (adding linear layers to any Causal LM to get the reward)"

Amazing! Could you please send an email with your WeChat or other contact details to zhaochenyang20@gmail.com?

We can also discuss this on our Slack. Please find zhaochenyang20@gmail.com on the SGLang Slack!

@m0g1cian

trh11111 · 2024-12-11T02:50:50Z

I am also very interested in the scenario of PD disaggregation, and I hope to combine radix tree with PD disaggregation for some experiments. I saw that someone mentioned this in October. May I ask how the current development plan is progressing?

zhaochenyang20 · 2024-12-11T03:36:24Z

> I am also very interested in the scenario of PD disaggregation, and I hope to combine radix tree with PD disaggregation for some experiments. I saw that someone mentioned this in October. May I ask how the current development plan is progressing?

@trh11111 Yes. New members have joined our team to work on this, and PD disaggregation is the first priority in our development roadmap for the next quarter.

tanzelin430 · 2024-12-11T09:14:57Z

> I am also very interested in the scenario of PD disaggregation, and I hope to combine radix tree with PD disaggregation for some experiments. I saw that someone mentioned this in October. May I ask how the current development plan is progressing?

Hi, I have just finished my graduation recruitment season and am working on my ATC paper. I'll be looking into the development soon.

zhaochenyang20 · 2024-12-11T23:08:57Z

> > I am also very interested in the scenario of PD disaggregation, and I hope to combine radix tree with PD disaggregation for some experiments. I saw that someone mentioned this in October. May I ask how the current development plan is progressing?

> Hi, I have just finished my graduation recruitment season and am working on my ATC paper. I'll be looking into the development soon.

@trh11111 If you are interested in this part, you can reach out to us on Slack.

mpjlu · 2024-12-18T02:40:16Z

> > @lumiere-ml @tanzelin430 Are you in the slack channel? We can follow up on that.

> @lumiere-ml @tanzelin430 Welcome to join our slack channel https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw

How do I join this Slack channel?

zhyncs · 2024-12-20T18:06:47Z

Hi @mpjlu https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2rtikx2pv-DUfPrhx2SaNAq~47YtV1XQ

mpjlu · 2024-12-22T01:04:47Z

> Hi @mpjlu https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2rtikx2pv-DUfPrhx2SaNAq~47YtV1XQ

Thanks

jjjjohnson · 2025-01-08T10:07:47Z

@Ying1123 Reference-based speculative decoding #2790

zhyncs · 2025-02-03T21:34:38Z

This roadmap will be updated with Q1 2025 soon. Please stay tuned.

GGKOP · 2025-02-06T11:52:45Z

More CUTLASS mixed precision GEMM integration
KV cache quantization (more formats + scaling factor)
Can I help with these two parts? I need more requirements and details.

zhyncs · 2025-02-06T11:55:19Z

> More CUTLASS mixed precision GEMM integration
> KV cache quantization (more formats + scaling factor)
> Can I help with these two parts? I need more requirements and details.

Yeah https://github.com/sgl-project/sglang/tree/main/sgl-kernel/src/sgl-kernel/csrc
We currently support W8A8 Int8 and W8A8 FP8 using CUTLASS.

GGKOP · 2025-02-10T10:39:47Z

> > More CUTLASS mixed precision GEMM integration
> > KV cache quantization (more formats + scaling factor)
> > Can I help with these two parts? I need more requirements and details.

> Yeah https://github.com/sgl-project/sglang/tree/main/sgl-kernel/src/sgl-kernel/csrc
> We currently support W8A8 Int8 and W8A8 FP8 using CUTLASS.

Hi, I spent some time figuring out parts of the code for FP8 GEMM and INT8 GEMM, but I didn't find any examples of applying W8A8. Did I miss something?

zhyncs · 2025-03-04T00:34:48Z

Hi all

The SGLang team has released the development roadmap for 2025 H1. Feel free to reach out if you want to collaborate or discuss.

Our main focus is on:

Large-scale deployment focused on throughput, like the DeepSeek inference system
Optimizations for long contexts
Speculative decoding with low latency
Integration of reinforcement learning training framework
Kernel optimizations

For more details, please refer to #4042. Cheers!

Ying1123 changed the title ~~[WIP] Development Roadmap (2024 Q4)~~ Development Roadmap (2024 Q4) Sep 22, 2024

zhyncs pinned this issue Sep 22, 2024

zhyncs mentioned this issue Sep 22, 2024

[Feature] Are there plans to implement a prefill-decode split inference architecture? #1080

Closed

ByronHsu mentioned this issue Oct 4, 2024

Provide an offline engine API #1567

Merged

3 tasks

ByronHsu mentioned this issue Oct 15, 2024

Support vLLM-style rope flashinfer-ai/flashinfer#530

Closed

zhaochenyang20 mentioned this issue Oct 20, 2024

Add documentations for Installation #1733

Closed

3 tasks

zhyncs mentioned this issue Nov 1, 2024

Development Roadmap (2024 Q3) #634

Closed

29 tasks

liangzelang mentioned this issue Nov 15, 2024

[Feature] Expert parallelism support #1435

Closed

2 tasks

zhaochenyang20 mentioned this issue Dec 10, 2024

[Feature] Support General Reward Model #2427

Open

3 tasks

kerthcet mentioned this issue Feb 9, 2025

Add an example of multi-host inference using SGLang kubernetes-sigs/lws#371

Closed

3 tasks

zhaochenyang20 mentioned this issue Feb 24, 2025

How to contribute an optimized R1 operator in SGlang? #3816

Closed

zhaochenyang20 mentioned this issue Mar 3, 2025

Development Roadmap (2025 H1) #4035

Closed

22 tasks

zhaochenyang20 closed this as completed Mar 3, 2025

zhaochenyang20 unpinned this issue Mar 3, 2025

zhyncs mentioned this issue Mar 4, 2025

Development Roadmap (2025 H1) #4042

Open

58 tasks
