Support tensor parallel #2

zhuohan123 · 2023-02-28T08:40:38Z

TODOs:

In another PR:

Merge QKV into one.

WoosukKwon

Fantastic! Left minor comments.

BTW, the sampling results were different when using TP:

Current master (python server.py --model facebook/opt-13b)

# GPU blocks: 1826, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would look at it."
Seq 5: 'UC Berkeley is about to get some more tree-hugging support from the University of Washington'
Seq 6: "UC Berkeley is the university of utah\nNot even close\nYeah I'd say it's"
Seq 7: 'The future of cloud computing is React\n\n6 Avril, 2016 | By Maxime Boklan\n\n'

4-way TP (python server.py --model facebook/opt-13b --tensor-parallel-size 4)

# GPU blocks: 4970, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would've been too much"
Seq 5: 'UC Berkeley is about to get some more visiting team fans in the seats, as the school�'
Seq 6: 'UC Berkeley is the university of weed.\n*school of vape\nNot everyone who vapes'
Seq 7: "The future of cloud computing is blazing bright\nIf there's a consensus in the tech world today, it's"

8-way TP (python server.py --model facebook/opt-13b --tensor-parallel-size 8)

# GPU blocks: 5464, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would put a limit."
Seq 5: 'UC Berkeley is about to get some more visiting team fans in the stands, as the school is'
Seq 6: 'UC Berkeley is the university of weed.\n*school of anarchy\nAll respect to the academics'
Seq 7: "The future of cloud computing is blazing bright\nIf there's a consensus in the tech world today, it's"

cacheflow/utils.py

cacheflow/models/memory_analyzer.py

cacheflow/models/model_utils.py

cacheflow/models/opt.py

server.py

cacheflow/models/opt.py

cacheflow/worker/controller.py

cacheflow/worker/worker.py

zhuohan123 · 2023-03-21T09:36:06Z

@WoosukKwon Thanks again for the review! All comments resolved. Regarding the different sampling results, I think it's too hard to get the same sampling results for different tensor parallel configs. Adding more GPUs changes the model and the execution flow on each GPU, and thus it can change the random process here and there. I cannot and don't think it's necessary to keep their sampling results to be the same.

WoosukKwon

Thanks a lot @zhuohan123 for your huge effort! This is fantastic!

cacheflow/models/model_utils.py

Sync release with main

[QuickFix] Vision Attention using SPDA when platform is out of tree

Use torchax in pallas.py

fix: detect neuron platform using torch_neuronx

…tch-1 Get num_experts from layer

Refactor function to cut down on the repeated code

Signed-off-by: qingjun <qingjun@minimaxi.com>

v0.8.2 vLLM + SwissLM

zhuohan123 added 9 commits February 28, 2023 01:30

copy code from fairseq

e8d661c

remove files from fairscale

827f85f

copy files from megatron

76ed019

[WIP] add distributed init

55e5d86

Parallelize the Transformer layers

7100db2

Load weight on a single GPU

1e86393

support multi-gpu tensor parallelism

90970e1

support tensor parallelism on multiple gpus

88960f7

fix correctness

900eace

zhuohan123 changed the title ~~[WIP] Support tensor parallel~~ Support tensor parallel Mar 9, 2023

zhuohan123 added 6 commits March 17, 2023 14:05

Merge branch 'main' into tensor_parallel

6a6f7cc

fix merging errors

d5a70ab

add filelock

60bf11e

support parallel decoding

a7be5b8

update readme

538d067

remove unused files

893d4b3

zhuohan123 requested a review from WoosukKwon March 19, 2023 02:51

fix loading for large models

e0f9f48

WoosukKwon reviewed Mar 21, 2023

View reviewed changes

zhuohan123 added 5 commits March 21, 2023 03:24

Fix some smaller issues raised by Woosuk first.

6ef5111

Fix more review issues

6727083

remove duplicate set_seed

ddc1ab0

Support the case where embedding_size != hidden_size

1d532c5

Resolve comments on weight loading and device id comments.

64e3950

WoosukKwon approved these changes Mar 21, 2023

View reviewed changes

cacheflow/models/model_utils.py Show resolved Hide resolved

WoosukKwon merged commit 2f49f15 into main Mar 21, 2023

zhuohan123 deleted the tensor_parallel branch June 18, 2023 07:22

TheBloke mentioned this pull request Jul 20, 2023

Can't launch OpenAI API server on newly installed vLLM in Docker - fastchat not found #537

Closed

Danielkinz mentioned this pull request Aug 15, 2023

[Feature | CI] Added a github action to build wheels #746

Merged

xxzhang0927 mentioned this pull request Oct 30, 2024

[Bug]: Engine iteration timed out. This should never happen! #9839

Open

1 task

prashantgupta24 mentioned this pull request Nov 15, 2024

[Core] Reduce TTFT with concurrent partial prefills #10235

Merged

russellb mentioned this pull request Nov 25, 2024

[misc] do not read HOST_IP #10644

Merged

Xaenalt pushed a commit to Xaenalt/vllm that referenced this pull request Dec 9, 2024

Merge pull request vllm-project#2 from opendatahub-io/main

4f9d7e5

Sync release with main

G1017 mentioned this pull request Dec 24, 2024

[Usage]: Trying to add codeshell 7b model, but got an error #11451

Closed

1 task

MengqingCao pushed a commit to MengqingCao/vllm that referenced this pull request Jan 20, 2025

Merge pull request vllm-project#2 from maoxx241/qwen2vl_quick_fix

b3899b3

[QuickFix] Vision Attention using SPDA when platform is out of tree

lsy323 pushed a commit to lsy323/vllm that referenced this pull request Feb 5, 2025

Merge pull request vllm-project#2 from lsy323/lsiyuan/torchax-on-v1

d121759

Use torchax in pallas.py

liangfu pushed a commit to liangfu/vllm that referenced this pull request Feb 13, 2025

Merge pull request vllm-project#2 from sunnyszy/zhenyu_dev

31e7393

fix: detect neuron platform using torch_neuronx

HelenaSak mentioned this pull request Feb 19, 2025

[Bug]: watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered #8177

Open

1 task

mengniwang95 pushed a commit to mengniwang95/vllm that referenced this pull request Feb 23, 2025

Merge pull request vllm-project#2 from yiliu30/dev/yi/ds_inc_quant-pa…

98c5340

…tch-1 Get num_experts from layer

AntonBalagaev mentioned this pull request Feb 25, 2025

[Doc]: guided grammar example lack parameter guided_decoding_backend #13847

Closed

russellb mentioned this pull request Mar 3, 2025

[V1][Core] Support for Structured Outputs #12388

Merged

ctdavi mentioned this pull request Mar 25, 2025

[Doc]: Steps to run vLLM on your RTX5080 or 5090! #14452

Open

1 task

alokkrsahu mentioned this pull request Apr 9, 2025

[Bug]: meta-llama/Llama-4-Scout-17B-16E-Instruct compatibility #16330

Closed

1 task

jifa513 mentioned this pull request Apr 10, 2025

[Bug]: Error: Failed to initialize the TMA descriptor 700 #13961

Open

1 task

qiuhaining mentioned this pull request Apr 10, 2025

[Bug]: corrupted double-linked list (not small) Aborted #16412

Open

1 task

southfreebird mentioned this pull request Apr 11, 2025

[Bug]: Medusa speculation hangs when tp > 1 #16477

Open

1 task

QuanhuiGuan mentioned this pull request Apr 14, 2025

[Bug]: CUDA error: an illegal memory access was encountered #16398

Open

1 task

maxdebayser referenced this pull request in maxdebayser/vllm Apr 14, 2025

Merge pull request #2 from maxdebayser/truncation-control

50da440

Refactor function to cut down on the repeated code

qscqesze added a commit to qscqesze/vllm that referenced this pull request Apr 16, 2025

fix import bug vllm-project#2

f98c8e2

Signed-off-by: qingjun <qingjun@minimaxi.com>

yyu6969 pushed a commit to yyu6969/vllm that referenced this pull request Apr 24, 2025

vllm-project#2

679b759

vanbasten23 mentioned this pull request Apr 28, 2025

[Misc][Tools][Benchmark] Publish script to auto tune server parameters #17207

Merged

AllenHaoHuang added a commit to AllenHaoHuang/vllm that referenced this pull request May 7, 2025

Merge pull request vllm-project#2 from EduardDurech/v0.8.2

88fc1a5

v0.8.2 vLLM + SwissLM

hao-cold mentioned this pull request May 13, 2025

[Bug]: CUDA error: an illegal instruction was encountered #18045

Open

1 task

markmc mentioned this pull request May 21, 2025

[Bug][Failing Test]: Distributed Comm Ops - distributed/test_shm_broadcast.py #18492

Closed

1 task

yarongmu-google mentioned this pull request Jun 11, 2025

[RFC]: A Strategic Framework for Extensibility and Innovation in vLLM #19376

Open

1 task

cyc00518 mentioned this pull request Jun 12, 2025

[Bug] Mistral Tool-Call via Jinja Template: Missing parallel_tool_prompt Injection and Incorrect tool_response Handling #19545

Closed

1 task

zerosurplus mentioned this pull request Jun 16, 2025

[Bug]: torch.distributed.DistNetworkError: The client socket has timed out after 600000ms while trying to connect to (172.17.0.9, 46229). #19670

Open

1 task

xiaocode337317439 mentioned this pull request Jun 27, 2025

[Bug]:RuntimeError: CUDA error: an illegal memory access was encountered #20170

Open

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Support tensor parallel #2

Support tensor parallel #2

Uh oh!

zhuohan123 commented Feb 28, 2023 •

edited

Loading

Uh oh!

WoosukKwon left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zhuohan123 commented Mar 21, 2023

Uh oh!

WoosukKwon left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Support tensor parallel #2

Support tensor parallel #2

Uh oh!

Conversation

zhuohan123 commented Feb 28, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

WoosukKwon left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

zhuohan123 commented Mar 21, 2023

Uh oh!

WoosukKwon left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

zhuohan123 commented Feb 28, 2023 •

edited

Loading