
Add LLAMA #1446

Closed

wants to merge 28 commits into from

Conversation

@msaroufim (Member) commented Mar 3, 2023

Fixes #1443

Notable things I had to do (@ezyang, hopefully these are OK):

  • Since LLAMA requires special permission to download the weights/checkpoints and the tokenizer, I went ahead with a random checkpoint and a random tokenizer - not sure CI qualifies as a valid research endeavour
  • I removed the dependency on fairscale, so I had to make a few adjustments like turning ParallelLinear into Linear and ParallelEmbedding into Embedding, and things mostly seem to work fine (a rough sketch of the substitution is shown after this list). An added bonus is that you can run the example on a single machine
  • The original code runs inference under torch.inference_mode(); I removed it since it has a weird interaction with torch.compile
  • The open source LLAMA repo is inference only, so there is no training support in this script
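
A rough sketch of the substitution mentioned in the second bullet. The fairscale layer names (ColumnParallelLinear, RowParallelLinear, ParallelEmbedding) and the dimensions here are illustrative assumptions about the upstream code, not the exact diff:

import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        head_dim = dim // n_heads
        # was: fairscale ColumnParallelLinear(dim, n_heads * head_dim, bias=False, ...)
        self.wq = nn.Linear(dim, n_heads * head_dim, bias=False)
        self.wk = nn.Linear(dim, n_heads * head_dim, bias=False)
        self.wv = nn.Linear(dim, n_heads * head_dim, bias=False)
        # was: fairscale RowParallelLinear(n_heads * head_dim, dim, bias=False, ...)
        self.wo = nn.Linear(n_heads * head_dim, dim, bias=False)

# was: fairscale ParallelEmbedding(vocab_size, dim, ...)
tok_embeddings = nn.Embedding(32, 512)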

Some other things I can improve in another PR:

  • Better configuration, including sequence length and batching
  • Re-enabling distributed support with fairscale

I can run the code now:

(bench) ubuntu@ip-172-31-39-186:~/benchmark$ python run.py llama -d cuda
Running eval method from llama on cuda in eager mode with input batch size 32.
GPU Time:             10.006 milliseconds
CPU Total Wall Time:  10.045 milliseconds

@msaroufim marked this pull request as draft March 3, 2023 01:47
@xuzhao9 (Contributor) commented Mar 6, 2023

I get the following error when running the model locally with run.py:

$ python run.py llama -d cpu
Warning: Could not find dependent module llama for Model llama, skip it.
 No module named 'llama'
Unable to find model matching llama.

Can you check whether you can reproduce this?

@msaroufim changed the title from Add LLAMA to Add Preliminary support for LLAMA Mar 6, 2023
@msaroufim marked this pull request as ready for review March 6, 2023 22:21
@msaroufim requested review from wconstab and xuzhao9 March 6, 2023 22:32
@ezyang (Contributor) commented Mar 7, 2023

The limitations look fine; I prefer not downloading the checkpoint.

@msaroufim (Member, Author) commented

OK, this should work:

(bench) ubuntu@ip-172-31-38-220:~/benchmark$ python run.py llama -d cpu
Running eval method from llama on cpu in eager mode with input batch size 32.
CPU Total Wall Time:  10.136 milliseconds
CPU Peak Memory:                2.4590 GB
(bench) ubuntu@ip-172-31-38-220:~/benchmark$ python run.py llama -d cuda
Running eval method from llama on cuda in eager mode with input batch size 32.
GPU Time:              8.040 milliseconds
CPU Total Wall Time:   8.067 milliseconds
GPU 0 Peak Memory:              1.9117 GB
CPU Peak Memory:                2.0645 GB

@msaroufim changed the title from Add Preliminary support for LLAMA back to Add LLAMA Mar 9, 2023
torchbenchmark/models/llama/__init__.py (3 outdated review comments, resolved)
@facebook-github-bot (Contributor) commented

@xuzhao9 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@msaroufim requested a review from xuzhao9 March 10, 2023 02:45
@msaroufim requested a review from xuzhao9 March 10, 2023 17:46
@xuzhao9 (Contributor) left a comment

LGTM!

@facebook-github-bot (Contributor) commented

@xuzhao9 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Co-authored-by: Xu Zhao <xzhao9@fb.com>


if device == "cuda":
    torch.set_default_device("cuda")
Contributor commented:

I am wondering why torch.set_default_device('cuda') is needed?

@msaroufim (Member, Author) commented Mar 10, 2023

It's convenient in general to make sure everything runs on the same device; that way you don't need to add .to(device) calls for inputs and models. It works for any device, though, so I've made that clearer now.
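
For illustration, a minimal sketch of what the default-device switch buys (assumes a PyTorch version that provides torch.set_default_device, i.e. 2.0 or later; the module and tensor below are placeholders, not the benchmark code):

import torch
import torch.nn as nn

torch.set_default_device("cuda")  # works with any device string

# Module parameters and factory-created tensors now land on the default
# device, so no explicit .to(device) calls are needed.
model = nn.Linear(8, 8)
x = torch.randn(2, 8)
print(model.weight.device, x.device)  # cuda:0 cuda:0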

Contributor replied:

Thanks!

torchbenchmark/models/llama/test.py (outdated review comment, resolved)
torchbenchmark/models/llama/tokenizer.py (2 outdated review comments, resolved)
@facebook-github-bot (Contributor) commented

@xuzhao9 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


self.model_args = ModelArgs(vocab_size=32)
self.model = Transformer(self.model_args)

torch.set_default_device(device)
Contributor commented:

I am curious why we don't need to explicitly move self.model and self.example_inputs to the device here? For example:

self.model = Transformer(self.model_args).to(self.device)
self.example_inputs = (torch.tensor([[1, 1], [1,1]], dtype=torch.int).to(self.device), 1)

@msaroufim (Member, Author) replied:

Fixed
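
Presumably the fix amounts to setting the default device before the model and example inputs are constructed, or moving them explicitly as suggested above; a sketch of both options, not the exact diff:

# Option 1: set the default device *before* building the model and inputs,
# so their parameters/tensors are allocated there directly.
torch.set_default_device(device)
self.model_args = ModelArgs(vocab_size=32)
self.model = Transformer(self.model_args)
self.example_inputs = (torch.tensor([[1, 1], [1, 1]], dtype=torch.int), 1)

# Option 2: keep the default device and move things explicitly.
# self.model = Transformer(self.model_args).to(self.device)
# self.example_inputs = (torch.tensor([[1, 1], [1, 1]], dtype=torch.int).to(self.device), 1)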

@facebook-github-bot (Contributor) commented

@xuzhao9 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.


@facebook-github-bot (Contributor) commented

@xuzhao9 merged this pull request in c78f1f3.

@msaroufim (Member, Author) commented Mar 13, 2023

This weekend the internet was buzzing with news of this C++ LLAMA implementation ggerganov/llama.cpp#33 (comment). It seems like it might not be too tricky to also validate correctness for this model by loading in the real checkpoints and tokenizer, despite minor differences in the model.

I might need to do a round-2 pass on this PR soon.
