llama2 70B causes OOM #83
Comments
@congdamaS We're planning to test 70B models soon, so we'll get back to you after that. Thanks!
+1 on this. I'm evaluating an unquantized 7B model (stabilityai/japanese-stablelm-instruct-ja_vocab-beta-7b).
Not sure which branch you're on, but `python main.py --model hf-causal-experimental --model_args pretrained=meta-llama/Llama-2-70b-chat-hf,dtype=float16,use_accelerate=True --no_cache --num_fewshot=25 --tasks arc_challenge` works just fine with 70B-parameter models on an A40 node.
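For reference, `dtype=float16,use_accelerate=True` presumably amounts to roughly the following transformers/accelerate load; this is a sketch rather than the harness's actual code path, reusing the `meta-llama/Llama-2-70b-chat-hf` checkpoint from the command above:

```python
import torch
from transformers import AutoModelForCausalLM

# Sketch of the sharded fp16 load that the harness arguments request
# (not the harness's exact implementation).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-chat-hf",
    torch_dtype=torch.float16,   # ~2 bytes/param instead of 4 in the fp32 default
    device_map="auto",           # accelerate shards layers across all visible GPUs
    low_cpu_mem_usage=True,
)
print(model.hf_device_map)       # shows which layers landed on which device
```

With fp16 weights, a 70B model needs on the order of 140 GB just for parameters, so it fits across a multi-GPU A40 node (48 GB per card) but not on any single card.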
Hi @dakotamahan-stability, and thanks for the reply! Sorry for the lack of information. I'm on the … branch, running this notebook on a Colab Pro+ VM. The eval script throws an OOM error when run on a V100 GPU (16.0 GB of VRAM).
What's strange is that the code below uses only around 14.3 GB of VRAM on the exact same machine:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Set up the model and tokenizer
model_name = "stabilityai/japanese-stablelm-instruct-ja_vocab-beta-7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto",
)

def format_prompt(input_text):
    # System prompt (Japanese): "You are a helpful assistant." /
    # "Please answer the user's question."
    prompt_template = """<s>[INST] <<SYS>>\nあなたは役立つアシスタントです。\n<<SYS>>\n\nユーザの質問に答えてください。\n\n{input}[/INST]"""
    return prompt_template.format(input=input_text)

def generate_text(input_text):
    formatted_prompt = format_prompt(input_text)
    input_ids = tokenizer.encode(
        formatted_prompt,
        add_special_tokens=False,
        return_tensors="pt"
    )

    # Set seed for reproducibility
    seed = 23
    torch.manual_seed(seed)

    tokens = model.generate(
        input_ids.to(device=model.device),
        max_new_tokens=1024,
        temperature=0.99,
        top_p=0.95,
        do_sample=True,
    )

    # Remove the input tokens from the generated tokens before decoding
    output_tokens = tokens[0][len(input_ids[0]):]
    return tokenizer.decode(output_tokens, skip_special_tokens=True)

# Prompt (Japanese): "It's winter already. Lately my bedroom is so cold
# that I can't sleep. What should I do?"
prompt = "もう冬ですね。最近は寝室が寒くて寝られません。どうすればいいですか?"
generated_text = generate_text(prompt)
print(generated_text)
```

I'm wondering whether I've misconfigured the eval script, or whether the script is prefetching/preloading the dataset onto the GPU (which would make sense, given how short the prompt in the snippet is).
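One way to check the "short prompt vs. long eval context" hunch is to measure peak VRAM directly. Below is a minimal sketch that reuses the `generate_text()` helper from the snippet above; the long dummy prompt is hypothetical filler, standing in for a 25-shot `arc_challenge` context:

```python
import torch

def peak_vram_gb(prompt_text):
    """Run one generation and return the peak GPU memory used, in GB."""
    torch.cuda.reset_peak_memory_stats()
    _ = generate_text(prompt_text)  # generate_text() is defined in the snippet above
    return torch.cuda.max_memory_allocated() / 1e9

short_prompt = "もう冬ですね。最近は寝室が寒くて寝られません。どうすればいいですか?"
# Stand-in for a long few-shot eval context (hypothetical filler text).
long_prompt = "質問と答えの例です。" * 200

print(f"short prompt peak: {peak_vram_gb(short_prompt):.1f} GB")
print(f"long prompt peak:  {peak_vram_gb(long_prompt):.1f} GB")
```

With ~7B parameters in fp16, the weights alone are about 14 GB, so a 16 GB V100 leaves only a couple of GB of headroom for the KV cache and activations; if the long prompt pushes past that here, the harness's few-shot contexts (plus any batching, unless the run is pinned to `--batch_size 1`, assuming the fork supports that flag) likely will too.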
When testing with llama2 70B, the required memory is too large (>250 GB).
This issue does not occur with the original (English) lm-evaluation-harness.
How should I configure the harness to test llama2 70B?
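For what it's worth, the >250 GB figure is about what you'd expect if the 70B weights end up in full fp32 precision; a weights-only back-of-envelope sketch (ignoring KV cache and activations):

```python
# Weights-only memory estimate for a 70B-parameter model.
n_params = 70e9

fp32_gb = n_params * 4 / 1e9  # ~280 GB -- consistent with the ">250 GB" observed
fp16_gb = n_params * 2 / 1e9  # ~140 GB -- what dtype=float16 requests

print(f"fp32 weights: ~{fp32_gb:.0f} GB")
print(f"fp16 weights: ~{fp16_gb:.0f} GB")
```

So if the model is being loaded without an explicit `dtype`, that alone could explain the >250 GB requirement; passing `dtype=float16` together with `use_accelerate=True` (as in the command earlier in the thread) should bring the weights down to roughly 140 GB, spread across the available GPUs.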