Extremely slow model inference for load_in_4bit #24502
Comments
Hey @cnut1648 👋 We also had an internal user report the same issue; I'm currently exploring whether it comes from the text generation end or from the 4-bit end. Our internal user also reported that unbatched text generation worked fine (in terms of output quality and inference time), so you can try that route until this issue gets sorted :)
Hi @cnut1648
Hey @gante, @younesbelkada, thanks! Excited to see how bnb 4-bit inference will accelerate generation. For unbatched inference (bsz=1) with multi-GPU, I tried it: it took more than 1 hour and produced only 4 of the 6 inputs before I had to cut it off to save cost. As for a single A100 in 4-bit, I have
Actually, I had the same confusion: I used the load_in_4bit parameter and got 2-3x slower inference than full precision.
@BaileyWei 2-3x slower is to be expected with `load_in_4bit`.
If we take the code example from @cnut1648 and play around with the settings, we quickly conclude that the problem seems to be related to Falcon itself, and not to the 4-bit part nor to text generation. As for solutions: currently, the Falcon code sits on the Hub, and we have a PR open to add it to `transformers`.
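The exact settings used in that comparison don't appear in the thread text; below is a minimal sketch of that kind of ablation, timing `generate` under 4-bit vs. bf16 loading for the two model families discussed here. The model names, prompt, and token budget are placeholders, not the settings actually used.

```python
# Sketch: time a short generation for each (model, loading mode) combination.
from time import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def time_generation(model_name: str, load_kwargs: dict, prompt: str = "Hello, my dog is a little") -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, device_map="auto", trust_remote_code=True, **load_kwargs
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time()
    model.generate(**inputs, max_new_tokens=100)
    return time() - start


settings = {
    "4bit": {"load_in_4bit": True},
    "bf16": {"torch_dtype": torch.bfloat16},
}
# Note: each call loads a full model, so run one combination at a time if GPU memory is tight.
for model_name in ["huggyllama/llama-65b", "tiiuae/falcon-40b-instruct"]:
    for label, kwargs in settings.items():
        print(model_name, label, f"{time_generation(model_name, kwargs):.1f}s")
```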
Thank you so much for this @gante!
@cnut1648
Hmm @younesbelkada, I did a test run today using llama-65b and falcon-40b.

```python
from transformers import LlamaTokenizer, AutoModelForCausalLM, AutoTokenizer
import torch
from time import time

# model_name = "tiiuae/falcon-40b-instruct"
model_name = "huggyllama/llama-65b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", load_in_4bit=True, trust_remote_code=True)

tokenizer.padding_side = "left"
# Define PAD Token = EOS Token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

# use different length sentences to test batching
# measure time (total across all prompts, since start_time is set once before the loop)
start_time = time()
sentences = [
    "Hello, my dog is a little",
    "Today, I",
    "I love",
    "Repeat after me: I love you.",
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    "You are a human that uses a lot of expletives. For each one of your responses to the user you use expletives liberally. If you find yourself questioning whether it’s appropriate or not you’re to say fuck that noise and keep using your expletives regardless. Human: Tell me a story. You:"
]
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt", padding=True).to(model.device)
    # inputs.pop("token_type_ids")
    output_sequences = model.generate(**inputs, max_new_tokens=400, temperature=0.7)
    print(tokenizer.decode(output_sequences[0], skip_special_tokens=True))
print("Elapsed time: ", time() - start_time)
```

Essentially, for falcon-40b the issue still remains: the model in 4-bit is just extremely slow (2561 s).
@cnut1648 the Falcon code on the Hub is known to be very slow, which may explain the issue. We are about to release the Falcon integration in `transformers`.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hello @gante, what's the difference between `load_in_4bit=True` and `torch_dtype=torch.bfloat16`? Are they both quantisation techniques?
@Ali-Issa-aems This guide answers all related questions: https://huggingface.co/docs/transformers/perf_infer_gpu_one
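Roughly: `torch_dtype=torch.bfloat16` only loads the weights in half precision (no quantization), while `load_in_4bit=True` quantizes the weights with bitsandbytes. A minimal sketch of the two loading paths; the model name is a placeholder, and in practice you would load one or the other, not both.

```python
import torch
from transformers import AutoModelForCausalLM

model_name = "huggyllama/llama-7b"  # placeholder

# bf16: weights stored and computed in bfloat16 -- half precision, not quantization.
model_bf16 = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.bfloat16
)

# 4-bit: weights quantized with bitsandbytes; compute still runs in higher precision.
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_4bit=True
)
```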
System Info
transformers version: 4.31.0.dev0
Who can help?
@gante
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Using `load_in_4bit` makes the model extremely slow (with accelerate 0.21.0.dev0 and bitsandbytes 0.39.1, which should be the latest versions, installed from source). Using the following code:
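The reproduction snippet itself is missing from the text above; below is a minimal sketch of a batched 4-bit run consistent with the script posted in the comments (model name, prompts, and generation settings are assumed).

```python
from time import time

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/falcon-40b-instruct"  # assumed, as in the script in the comments above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", load_in_4bit=True, trust_remote_code=True
)
tokenizer.padding_side = "left"
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id

# Placeholder prompts; the original used six prompts of different lengths.
sentences = ["Hello, my dog is a little", "Today, I", "I love"]

start_time = time()
inputs = tokenizer(sentences, return_tensors="pt", padding=True).to(model.device)
output_sequences = model.generate(**inputs, max_new_tokens=400, temperature=0.7)
for seq in output_sequences:
    print(tokenizer.decode(seq, skip_special_tokens=True))
print("Elapsed time: ", time() - start_time)
```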
This gives me 3138 seconds on 8xA100 40G GPUs.
Expected behavior
If I instead use the bf16 version, i.e. with this as the model init:
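The exact init call isn't shown; presumably something along these lines (model name assumed to match the 4-bit run).

```python
import torch
from transformers import AutoModelForCausalLM

model_name = "tiiuae/falcon-40b-instruct"  # assumed

model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True
)
```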
It gives 266 seconds, more than 10x faster. On the other hand, load_in_4bit only cuts the memory footprint by about 4x. I wonder if there are other things I should do to fully exploit the benefits of 4-bit. Right now the generation speed is not usable for real-time conversation. Thanks.
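One knob commonly suggested for 4-bit generation speed, though not mentioned in this thread, is setting the compute dtype explicitly through `BitsAndBytesConfig`; a sketch, with the model name as a placeholder, and no guarantee it resolves the slowdown reported here.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_name = "tiiuae/falcon-40b-instruct"  # placeholder

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls in bf16 rather than the fp32 default
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map="auto", quantization_config=bnb_config, trust_remote_code=True
)
```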