-
Notifications
You must be signed in to change notification settings - Fork 867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
llama2 70b chat accelerate example #2494
Conversation
Codecov Report
@@ Coverage Diff @@
## master #2494 +/- ##
=======================================
Coverage 72.64% 72.64%
=======================================
Files 79 79
Lines 3733 3733
Branches 58 58
=======================================
Hits 2712 2712
Misses 1017 1017
Partials 4 4 📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
device_map="balanced", | ||
low_cpu_mem_usage=True, | ||
torch_dtype=torch.float16, | ||
load_in_4bit=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should run some tests for 4bit vs 8bit quantization. For real world deployment 8bit might give better results
load_in_4bit=True, | ||
trust_remote_code=True) | ||
self.tokenizer = AutoTokenizer.from_pretrained(model_name) | ||
self.tokenizer.pad_token = self.tokenizer.eos_token |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see the HF docs for llama2 model for padding token handling:
https://huggingface.co/docs/transformers/main/model_doc/llama2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
### Step 1: Download model Permission | ||
|
||
Follow [this instruction](https://huggingface.co/meta-llama/Llama-2-70b-hf) to get permission |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest switching to chat model instead, so that we can showcase the processing for prompts for the chat model
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lxning Thanks for getting this started. Please see the llama-recipes for prompt handling part and update accordingly
logger.info("Model %s loaded successfully", ctx.model_name) | ||
self.initialized = True | ||
|
||
def preprocess(self, requests): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check the processing logic in llama-recipes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@HamidShojanazeri Please verify the preprocessing logic is aligned with the llama model processing
examples/large_models/Huggingface_accelerate/llama2/custom_handler.py
Outdated
Show resolved
Hide resolved
@@ -0,0 +1 @@ | |||
How are you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For chat example use sample similar to chats.json
load_in_4bit=True, | ||
trust_remote_code=True) | ||
self.tokenizer = AutoTokenizer.from_pretrained(model_name) | ||
self.tokenizer.pad_token = self.tokenizer.eos_token |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
examples/large_models/Huggingface_accelerate/llama2/custom_handler.py
Outdated
Show resolved
Hide resolved
inferences = self.tokenizer.batch_decode( | ||
outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lxning shall we add response streaming as well, its sounds like a good place to show it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can update this after response streaming is merged
examples/large_models/Huggingface_accelerate/llama2/custom_handler.py
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
verified to be working
torch_dtype=torch.float16, | ||
load_in_8bit=True, | ||
trust_remote_code=True) | ||
self.tokenizer = AutoTokenizer.from_pretrained(model_name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lxning can you pls add accelerated/Bt support as well, as shown here. https://github.com/facebookresearch/llama-recipes/blob/main/inference/inference.py#L68-L69
Proceeding to unblock branch cut. Suggested changes will be addressed in a subsequent PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving, please address the outstanding issues in a follow-up PR
Regression test failing because of not being able to download from FAIR gan zoo |
Description
Please read our CONTRIBUTING.md prior to creating your first pull request.
Please include a summary of the feature or issue being fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.
Fixes #(issue)
Type of change
Please delete options that are not relevant.
Feature/Issue validation/testing
Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.