Support beam search & parallel generation #7

Merged on Mar 10, 2023 (43 commits)

@WoosukKwon WoosukKwon commented Mar 9, 2023

This PR adds support for beam search and parallel generation (i.e., n > 1).

NOTE: Correctness has been checked only for beam search, not for the random sampling methods.
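
For readers unfamiliar with the technique, the following is a minimal, framework-free sketch of beam search. It is not the code from this PR: `beam_search`, `toy_model`, and the transition table are hypothetical names and values chosen for illustration, and the sketch omits end-of-sequence handling and length penalties.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[int, ...]


def beam_search(
    next_log_probs: Callable[[Sequence], Dict[int, float]],
    prompt: Sequence,
    beam_width: int,
    max_new_tokens: int,
) -> List[Tuple[Sequence, float]]:
    """Return the `beam_width` best continuations with their cumulative log-probabilities."""
    beams: List[Tuple[Sequence, float]] = [(prompt, 0.0)]
    for _ in range(max_new_tokens):
        candidates: List[Tuple[Sequence, float]] = []
        # Expand every live beam by every possible next token.
        for tokens, score in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((tokens + (token,), score + logp))
        # Keep only the highest-scoring candidates for the next step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_model(tokens: Sequence) -> Dict[int, float]:
    """Hypothetical stand-in for a language model: the next-token
    distribution depends only on the last token."""
    table = {
        0: {1: 0.6, 2: 0.4},
        1: {1: 0.1, 2: 0.9},
        2: {1: 0.5, 2: 0.5},
    }
    return {tok: math.log(p) for tok, p in table[tokens[-1]].items()}


results = beam_search(toy_model, prompt=(0,), beam_width=2, max_new_tokens=2)
best_tokens, best_score = results[0]
```

Parallel generation with n > 1 differs in that the n sequences are sampled independently from the same prompt instead of being pruned against each other at every step.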

Tested models:

  • OPT-125M
  • OPT-350M
  • OPT-1.3B
  • OPT-2.7B
  • OPT-6.7B
  • OPT-13B

Tested GPUs:

  • A100

@WoosukKwon WoosukKwon requested a review from zhuohan123 March 10, 2023 00:03
@zhuohan123 zhuohan123 left a comment

LGTM in general. Left some small comments.

cacheflow/models/sample.py
probs: torch.Tensor,
p: torch.Tensor,
) -> torch.Tensor:
# TODO(woosuk): Optimize.

zhuohan123 (Member):

Maybe it's faster to simply mask out the tokens whose cumulative probability is smaller than top_p (example code)?
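
The masking idea can be sketched roughly as follows. This is neither the linked example code nor the PR's implementation: `top_p_filter` is a hypothetical helper written in plain Python, and it assumes the common convention that tokens are visited in descending probability order and kept until the mass already kept reaches `top_p`.

```python
from typing import List


def top_p_filter(probs: List[float], top_p: float) -> List[float]:
    """Zero out tokens outside the top-p nucleus and renormalize.

    `probs` is assumed to be a valid probability distribution.
    """
    # Token indices ordered from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = [0.0] * len(probs)
    cumulative = 0.0
    for i in order:
        # Stop once the mass kept so far already covers top_p.
        # The most probable token is always kept because cumulative starts at 0.
        if cumulative >= top_p:
            break
        kept[i] = probs[i]
        cumulative += probs[i]
    total = sum(kept)
    return [p / total for p in kept]


filtered = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.7)
```

In a tensor library the same logic is usually written with a sort, a cumulative sum, and a boolean mask so that it runs over a whole batch at once.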

WoosukKwon (Collaborator, Author):

Thanks for pointing me to the code. I don't think that implementation would be noticeably more efficient than ours, because it performs two softmax operations rather than one.
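
To make the cost comparison concrete, here is a small sketch contrasting the two variants. It assumes the alternative masks logits with negative infinity and applies a second softmax, while the other variant masks the probabilities and renormalizes by their sum. The logits and the `keep` mask are made-up values, and neither variant is taken from the PR or the linked code.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, -1.0]
keep = [True, True, False, False]  # hypothetical result of the top-p step

# Variant A: one softmax is needed to build the mask, then a second
# softmax is applied to the masked logits.
probs_for_mask = softmax(logits)
masked_logits = [x if k else float("-inf") for x, k in zip(logits, keep)]
two_softmax = softmax(masked_logits)

# Variant B: a single softmax, then mask the probabilities and
# renormalize with a plain sum and division.
probs = softmax(logits)
masked_probs = [p if k else 0.0 for p, k in zip(probs, keep)]
total = sum(masked_probs)
one_softmax = [p / total for p in masked_probs]
```

Both variants produce the same distribution over the kept tokens, so the difference is only in how much work is done per step.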

@WoosukKwon WoosukKwon merged commit 1a7eb7d into main Mar 10, 2023
@WoosukKwon WoosukKwon deleted the parallel-generation branch March 10, 2023 17:58