Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

llama2 70b chat accelerate example #2494

Merged
merged 16 commits into from
Aug 28, 2023
Merged

llama2 70b chat accelerate example #2494

merged 16 commits into from
Aug 28, 2023

Conversation

lxning
Copy link
Collaborator

@lxning lxning commented Jul 24, 2023

Description

Please read our CONTRIBUTING.md prior to creating your first pull request.

Please include a summary of the feature or issue being fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

Fixes #(issue)

Type of change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

Feature/Issue validation/testing

Please describe the Unit or Integration tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A
torchserve --ncs --start --model-store model_store/ --models llama2-70b-chat --ts-config config.properties

curl http://localhost:8080/predictions/llama2-70b-chat -T sample.txt
how areyou
I'm fine,


- [ ] Test B
Logs for Test B


## Checklist:

- [ ] Did you have fun?
- [ ] Have you added tests that prove your fix is effective or that this feature works?
- [ ] Has code been commented, particularly in hard-to-understand areas?
- [ ] Have you made corresponding changes to the documentation?

@codecov
Copy link

codecov bot commented Jul 24, 2023

Codecov Report

Merging #2494 (d67037d) into master (683608b) will not change coverage.
The diff coverage is n/a.

❗ Current head d67037d differs from pull request most recent head 116b1a4. Consider uploading reports for the commit 116b1a4 to get more accurate results

@@           Coverage Diff           @@
##           master    #2494   +/-   ##
=======================================
  Coverage   72.64%   72.64%           
=======================================
  Files          79       79           
  Lines        3733     3733           
  Branches       58       58           
=======================================
  Hits         2712     2712           
  Misses       1017     1017           
  Partials        4        4           

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@lxning lxning self-assigned this Jul 24, 2023
@lxning lxning added documentation Improvements or additions to documentation example labels Jul 24, 2023
@lxning lxning added this to the v0.9.0 milestone Jul 24, 2023
@lxning lxning changed the title [wip]llam2 accelerate example llama2 70b chat accelerate example Jul 24, 2023
device_map="balanced",
low_cpu_mem_usage=True,
torch_dtype=torch.float16,
load_in_4bit=True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should run some tests for 4bit vs 8bit quantization. For real world deployment 8bit might give better results

load_in_4bit=True,
trust_remote_code=True)
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.tokenizer.pad_token = self.tokenizer.eos_token
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please see the HF docs for llama2 model for padding token handling:

https://huggingface.co/docs/transformers/main/model_doc/llama2

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lxning you can also follow this example for the pad token settings, you would need to resize the token embeddings as well.


### Step 1: Download model Permission

Follow [this instruction](https://huggingface.co/meta-llama/Llama-2-70b-hf) to get permission
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest switching to chat model instead, so that we can showcase the processing for prompts for the chat model

Copy link
Contributor

@chauhang chauhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lxning Thanks for getting this started. Please see the llama-recipes for prompt handling part and update accordingly

logger.info("Model %s loaded successfully", ctx.model_name)
self.initialized = True

def preprocess(self, requests):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check the processing logic in llama-recipes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@HamidShojanazeri Please verify the preprocessing logic is aligned with the llama model processing

@@ -0,0 +1 @@
How are you
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For chat example use sample similar to chats.json

load_in_4bit=True,
trust_remote_code=True)
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.tokenizer.pad_token = self.tokenizer.eos_token
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lxning you can also follow this example for the pad token settings, you would need to resize the token embeddings as well.

inferences = self.tokenizer.batch_decode(
outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lxning shall we add response streaming as well, its sounds like a good place to show it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can update this after response streaming is merged

Copy link
Collaborator

@agunapal agunapal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

verified to be working

torch_dtype=torch.float16,
load_in_8bit=True,
trust_remote_code=True)
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@agunapal agunapal dismissed chauhang’s stale review August 28, 2023 19:01

Proceeding to unblock branch cut. Suggested changes will be addressed in a subsequent PR

Copy link
Contributor

@chauhang chauhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving, please address the outstanding issues in a follow-up PR

@agunapal
Copy link
Collaborator

Regression test failing because of not being able to download from FAIR gan zoo

@agunapal agunapal merged commit 04e0b37 into master Aug 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation example
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants