Support 4bit on CPU backend #1206
Conversation
out_dq = torch.empty(out_uint8.shape).to(quant_state.dtype)
for i in range(len(quant_state.code)):
    out_dq[out_uint8 == i] = quant_state.code[i]
Using index select will be faster: out_dq = quant_state.code[out_uint8.to(torch.int32)]
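As a minimal sketch of this suggestion (using a hypothetical 16-entry code table in place of quant_state.code), the fancy-indexing lookup produces the same result as the masked-assignment loop:

import torch

# Hypothetical stand-ins for quant_state.code and the packed 4-bit indices.
code = torch.linspace(-1.0, 1.0, 16)
out_uint8 = torch.randint(0, 16, (8,), dtype=torch.uint8)

# Original approach: one boolean-mask assignment per code value.
out_loop = torch.empty(out_uint8.shape, dtype=code.dtype)
for i in range(len(code)):
    out_loop[out_uint8 == i] = code[i]

# Suggested approach: a single lookup-table gather (index-select style).
out_index = code[out_uint8.to(torch.int32)]

assert torch.equal(out_loop, out_index)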
It looks like the torch.compile'd version of this code gives wrong results, and removing torch.compile results in lower performance. Let's keep this implementation for now.
A bug in torch.compile? Can you submit a bug to PyTorch? I will try to fix it.
However, I cannot reproduce the issue with the script below. May need more investigation.
import torch

NF4_DEQUANT_TABLE = torch.Tensor([
    -1.0,
    -0.6961928009986877,
    -0.5250730514526367,
    -0.39491748809814453,
    -0.28444138169288635,
    -0.18477343022823334,
    -0.09105003625154495,
    0.0,
    0.07958029955625534,
    0.16093020141124725,
    0.24611230194568634,
    0.33791524171829224,
    0.44070982933044434,
    0.5626170039176941,
    0.7229568362236023,
    1.0,
])

@torch.compile
def dequant_nf4_compile(t_in: torch.Tensor, out_dtype):
    return NF4_DEQUANT_TABLE[t_in.to(torch.int)].to(out_dtype)

def dequant_nf4_eager(t_in: torch.Tensor, out_dtype):
    return NF4_DEQUANT_TABLE[t_in.to(torch.int)].to(out_dtype)

x = torch.randint(0, 16, (1024, 1024), dtype=torch.uint8)
y1 = dequant_nf4_compile(x, torch.bfloat16)  # first call triggers compilation
y1 = dequant_nf4_compile(x, torch.bfloat16)
y2 = dequant_nf4_eager(x, torch.bfloat16)
print(torch.equal(y1, y2))
print("max diff =", torch.abs(y1 - y2).max())
Hi @Titus-von-Koeller. Here are the test results of this PR on an Intel 4th Gen Xeon CPU. The big difference between NF4 and FP4 is that we can use fused ops for NF4, but they are not available for FP4 yet. FP4 will also get fused ops and is expected to reach the same performance as NF4, perhaps in the next IPEX release. Would you please review it? Thanks!

Test script:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import time

MAX_NEW_TOKENS = 64
model_id = "meta-llama/Llama-2-7b-chat-hf"
text = 'I am happy because'
tokenizer = AutoTokenizer.from_pretrained(model_id)
input_ids = tokenizer(text, return_tensors="pt").input_ids

print('Loading model {}...'.format(model_id))
quantization_config = BitsAndBytesConfig(load_in_4bit=True,
                                         bnb_4bit_quant_type="fp4",
                                         bnb_4bit_use_double_quant=False,
                                         bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config, torch_dtype=torch.bfloat16)
# model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
print('model dtype = {}'.format(model.dtype))

with torch.no_grad():
    # warmup
    model.generate(input_ids, max_length=MAX_NEW_TOKENS)
    model.generate(input_ids, max_length=MAX_NEW_TOKENS)
    print("warm-up complete")

    t0 = time.time()
    generated_ids = model.generate(input_ids, max_length=MAX_NEW_TOKENS, do_sample=False, num_beams=1)
    latency = time.time() - t0

print(input_ids.shape)
print(generated_ids.shape)

result = "| latency: " + str(round(latency * 1000, 3)) + " ms |"
print('+' + '-' * (len(result) - 2) + '+')
print(result)
print('+' + '-' * (len(result) - 2) + '+')

output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(f"output: {output}")
Dear @Xia-Weiwen et al,

Unfortunately we're (mostly me alone) quite resource constrained and humbled by the workload associated with the multi-backend refactor. We both took a look at this PR and the one from AMD and think that at first glance everything looks really good.

At this time, both me and Younes are not in a position to give detailed feedback, and I need to focus on concretizing the path forward on how to integrate with the PyTorch dispatcher (tensor-driven dispatch, as requested) through the torch.library Python-level APIs. After extensive research and yesterday's consultation with 3 PyTorch devs at Meta who are experts on the topic, I need to focus on making this new input concrete.

However, for the purpose of iterative progress (as agreed in our prior conversations), we've decided to already go ahead and merge both the open Intel and AMD branches into multi-backend-refactor. Once we've made some progress on the refactor, we'll be in a better position to give detailed feedback.

Among other things, there's also been extensive ongoing work in the background on things like moving BNB to a new independent/non-profit Github org, but under the umbrella of Hugging Face and with the support of their infra team for managing the complexities of the CI/CD backend and runners. Also, we're working to make Github runners for the different hardware platforms a reality (thanks for your help on that!).

Thanks again for the good work and active collaboration! ❤️ 🚀
Merged commit 701c5aa into bitsandbytes-foundation:multi-backend-refactor
P.S. Also see this: "README: asking for help from volunteer alpha testers". Let us know if you have further thoughts on this and how you think it's best to communicate about it.
Hi @Titus-von-Koeller, thanks a lot for your help on this. We are glad to provide feedback on the adoption of torch.library.
Hi @Titus-von-Koeller, may I learn more details about how you are going to refactor things via torch.library? Meanwhile, it would be beneficial to also allow the flexibility of backend integration without adding native code explicitly to bitsandbytes, like optimizing via torch.compile.
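For illustration only, here is a minimal sketch of what tensor-driven dispatch through the torch.library Python-level APIs could look like, assuming a recent PyTorch (2.4+) where torch.library.custom_op is available; the bnb_demo namespace, op name, and kernel are hypothetical and not the planned bitsandbytes design:

import torch

# Define one op; each backend can register its own kernel, and dispatch is
# driven by the device of the input tensors rather than by explicit branching.
@torch.library.custom_op("bnb_demo::dequantize_4bit", mutates_args=())
def dequantize_4bit(packed: torch.Tensor, code: torch.Tensor) -> torch.Tensor:
    # Default implementation: simple lookup-table dequantization.
    return code[packed.to(torch.int32)]

@dequantize_4bit.register_fake
def _(packed: torch.Tensor, code: torch.Tensor) -> torch.Tensor:
    # Shape/dtype propagation for meta tensors and torch.compile.
    return torch.empty(packed.shape, dtype=code.dtype, device=packed.device)

code = torch.linspace(-1.0, 1.0, 16)
packed = torch.randint(0, 16, (4, 4), dtype=torch.uint8)
out = torch.ops.bnb_demo.dequantize_4bit(packed, code)
print(out.shape, out.dtype)

A device-specific kernel (e.g. for CUDA or XPU) could then be attached to the same op with dequantize_4bit.register_kernel("cuda"), so callers never branch on the backend themselves.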
Adds implementations for the following ops on the CPU backend:

Limitations:
- quant_storage must be torch.uint8
- compress_statistics is not supported yet (bnb_4bit_use_double_quant must be False)
- fp4 is slow currently because there is no fused kernel yet

Difference from CUDA implementation:
- The op is named gemv_4bit, but on the CPU backend it's actually a GEMM (see the sketch below).

Here is a code snippet of an example that runs HuggingFace models with 4bit on the CPU backend: https://gist.github.com/Xia-Weiwen/592d6e24e03f904a18692b3e27794c53. You will have to bypass the CUDA checks in transformers to run it.
cc @jiqing-feng @jgong5 @jianan-gu