
Support tensor parallel #2

Merged 21 commits on Mar 21, 2023
Conversation

@zhuohan123 (Member) commented on Feb 28, 2023

TODOs:

  • Parallel embedding and softmax (see the embedding sketch after this list).
  • Merge with the main branch.
  • Modify README.
  • Remove unused code.
  • Fix the bug that downloads the weights twice.
  • Test with larger models.
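
For the "parallel embedding" item above, here is a minimal sketch of a Megatron-style vocabulary-parallel embedding; the class name and details are illustrative assumptions, not the code added in this PR. Each rank stores a contiguous slice of the vocabulary, masks out ids it does not own, and the partial lookups are summed across ranks with an all-reduce:

import torch
import torch.distributed as dist

class VocabParallelEmbeddingSketch(torch.nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int, rank: int, world_size: int):
        super().__init__()
        assert vocab_size % world_size == 0
        self.shard_size = vocab_size // world_size
        self.start = rank * self.shard_size          # first token id owned by this rank
        self.weight = torch.nn.Parameter(torch.empty(self.shard_size, hidden_size))
        torch.nn.init.normal_(self.weight, std=0.02)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        local_ids = token_ids - self.start
        off_shard = (local_ids < 0) | (local_ids >= self.shard_size)
        local_ids = local_ids.clamp(0, self.shard_size - 1)
        out = torch.nn.functional.embedding(local_ids, self.weight)
        out[off_shard] = 0.0                         # zero the rows owned by other ranks
        if dist.is_initialized():
            dist.all_reduce(out)                     # sum partial embeddings across ranks
        return out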

In another PR:

  • Merge QKV into one.
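
The "merge QKV into one" item refers to replacing the three separate query/key/value projections with a single fused linear layer, so the attention input needs only one GEMM. A minimal sketch under that assumption (illustrative shapes, not this repository's code):

import torch

hidden_size, num_heads = 1024, 16
head_dim = hidden_size // num_heads

qkv_proj = torch.nn.Linear(hidden_size, 3 * hidden_size)   # one weight matrix for Q, K, V

x = torch.randn(2, 8, hidden_size)                          # (batch, seq_len, hidden)
q, k, v = qkv_proj(x).chunk(3, dim=-1)                      # single GEMM, then split
q = q.view(2, 8, num_heads, head_dim)                       # reshape per head as usual
k = k.view(2, 8, num_heads, head_dim)
v = v.view(2, 8, num_heads, head_dim)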

@zhuohan123 changed the title from [WIP] Support tensor parallel to Support tensor parallel on Mar 9, 2023
@zhuohan123 requested a review from WoosukKwon on March 19, 2023 at 02:51
@WoosukKwon (Collaborator) left a comment


Fantastic! Left minor comments.

BTW, the sampling results were different when using TP:

  • Current master (python server.py --model facebook/opt-13b)
# GPU blocks: 1826, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would look at it."
Seq 5: 'UC Berkeley is about to get some more tree-hugging support from the University of Washington'
Seq 6: "UC Berkeley is the university of utah\nNot even close\nYeah I'd say it's"
Seq 7: 'The future of cloud computing is React\n\n6 Avril, 2016 | By Maxime Boklan\n\n'
  • 4-way TP (python server.py --model facebook/opt-13b --tensor-parallel-size 4)
# GPU blocks: 4970, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would've been too much"
Seq 5: 'UC Berkeley is about to get some more visiting team fans in the seats, as the school�'
Seq 6: 'UC Berkeley is the university of weed.\n*school of vape\nNot everyone who vapes'
Seq 7: "The future of cloud computing is blazing bright\nIf there's a consensus in the tech world today, it's"
  • 8-way TP (python server.py --model facebook/opt-13b --tensor-parallel-size 8)
# GPU blocks: 5464, # CPU blocks: 3276
Seq 0: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of the'
Seq 1: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of several'
Seq 2: 'Ion Stoica is a professor of philosophy at the University of Bucharest, Romania. He is the author'
Seq 3: 'Ion Stoica is a professor of philosophy at the University of Bucharest. He is the author of The'
Seq 4: "UC Berkeley is a very liberal school, but I don't think they would put a limit."
Seq 5: 'UC Berkeley is about to get some more visiting team fans in the stands, as the school is'
Seq 6: 'UC Berkeley is the university of weed.\n*school of anarchy\nAll respect to the academics'
Seq 7: "The future of cloud computing is blazing bright\nIf there's a consensus in the tech world today, it's"

Review threads:
  • cacheflow/utils.py
  • cacheflow/models/memory_analyzer.py (3 threads, outdated)
  • cacheflow/models/model_utils.py
  • cacheflow/models/opt.py (2 threads)
  • server.py (outdated)
  • cacheflow/worker/controller.py
  • cacheflow/worker/worker.py (outdated)
@zhuohan123 (Member, Author) commented

@WoosukKwon Thanks again for the review! All comments resolved. Regarding the different sampling results: it is very hard to get identical samples across different tensor-parallel configurations. Adding more GPUs changes how the model is partitioned and how execution flows on each GPU, which can perturb the random process here and there. I don't think it is feasible, or necessary, to keep the sampling results identical.
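
A small illustration of the point above (plain NumPy, not vLLM code): partitioning a reduction across tensor-parallel ranks changes the floating-point summation order, so logits can differ by a few ULPs, and when two candidate tokens are nearly tied that is enough to change which one gets sampled.

import numpy as np

rng = np.random.default_rng(0)
contrib = rng.standard_normal(8192).astype(np.float32)      # per-element contributions to one logit

single_gpu = contrib.sum()                                   # one reduction order
four_way_tp = np.float32(sum(c.sum() for c in np.split(contrib, 4)))  # per-"rank" partial sums, then combined

print(single_gpu, four_way_tp)        # equal mathematically, but the last bits can differ
print(single_gpu - four_way_tp)       # a tiny nonzero residual is expected in float32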

@WoosukKwon (Collaborator) left a comment


Thanks a lot @zhuohan123 for your huge effort! This is fantastic!

Review thread: cacheflow/models/model_utils.py
Bellk17 added a commit to Bellk17/vllm that referenced this pull request May 10, 2024
Starmys pushed a commit to Starmys/vllm that referenced this pull request May 20, 2024
@alixiaodi alixiaodi mentioned this pull request Aug 2, 2024
PanJason pushed a commit to PanJason/vllm that referenced this pull request Sep 21, 2024
- Add initial support for context caching:
    1. Support the endpoint.
    2. Introduce another sequence type, is_fixed; an is_fixed sequence is also treated as is_finished.
    3. Note that the cached context always resides in HBM because its blocks are marked as allocated, and by default allocated blocks are not swapped to any secondary storage.
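
A rough sketch of the behaviour point 3 describes (all names here are hypothetical, not vLLM's actual block manager): blocks backing a fixed, cached context stay marked as allocated, and a swap-out pass that only considers unallocated blocks therefore leaves the cached context resident in HBM.

from dataclasses import dataclass

@dataclass
class Block:
    block_id: int
    allocated: bool = False       # allocated blocks are pinned in HBM

class BlockManagerSketch:
    def __init__(self, num_blocks: int):
        self.blocks = [Block(i) for i in range(num_blocks)]

    def allocate_fixed_context(self, block_ids):
        for i in block_ids:
            self.blocks[i].allocated = True    # context-cache blocks stay allocated

    def swap_out_candidates(self):
        # Only unallocated blocks may be moved to secondary storage, so
        # fixed-context blocks remain in HBM by default.
        return [b.block_id for b in self.blocks if not b.allocated]

manager = BlockManagerSketch(4)
manager.allocate_fixed_context([0, 1])
print(manager.swap_out_candidates())   # -> [2, 3]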
Xaenalt pushed a commit to Xaenalt/vllm that referenced this pull request Dec 9, 2024