[BUG][IPEX/XPU] init_ipex_linear taking a very long time, >10 minutes, with a small 1B model on XPU #977
Comments
@notsyncing First run is slower due to model loading from disk, but 10 minutes vs. 4 seconds is not normal. Let's load directly using gptqmodel internal code without the HF integration. Use the code below and re-test; do not manually move the model to a device.

```python
from datetime import datetime

from transformers import AutoTokenizer, pipeline
from gptqmodel import GPTQModel

model_4bit = GPTQModel.load(
    "Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4",
    device="xpu"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4")
generator = pipeline("text-generation", model=model_4bit, tokenizer=tokenizer)

print(f"{datetime.now()}: Generating...")
print(generator("def helloWorld() {"))

print(f"{datetime.now()}: Generate again...")
print(generator("Hello!"))

print(f"{datetime.now()}: End!")
```
Also test without the HF pipeline. Not sure if pipeline is doing extra torch.compile steps.

```python
model = GPTQModel.load(
    "Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4",
    device="xpu"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4")
result = model.generate(
    **tokenizer(
        "def helloWorld() {", return_tensors="pt"
    ).to("xpu")
)[0]
```

I don't trust any API that wraps too many layers deep. I have not looked at
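Note that `model.generate` returns token IDs, so printing `result` directly shows a tensor rather than text. A minimal, hedged addition to make the output readable (assuming the `tokenizer` and `result` variables from the snippet above):

```python
# Decode the generated token IDs back into a string before printing.
print(tokenizer.decode(result, skip_special_tokens=True))
```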
I tested again with your code, and it first complains:

Then I manually created that file with the

Still the same generation time, with the same CPU and GPU usage. Full code:

```python
from datetime import datetime

import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoTokenizer
from gptqmodel import GPTQModel
import gptqmodel.integration

# Print environment details and the available XPU devices.
print(torch.__version__)
print(ipex.__version__)
[print(f'[{i}]: {torch.xpu.get_device_properties(i)}') for i in range(torch.xpu.device_count())]

# Patch the HF integration before loading the model.
gptqmodel.integration.patch_hf()

model = GPTQModel.load(
    "Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4",
    device="xpu"
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4")

print(f"{datetime.now()}: Generating...")
result = model.generate(
    **tokenizer(
        "def helloWorld() {", return_tensors="pt"
    ).to("xpu")
)[0]
print(result)

print(f"{datetime.now()}: Generating again...")
result = model.generate(
    **tokenizer(
        "Hello!", return_tensors="pt"
    ).to("xpu")
)[0]
print(result)

print(f"{datetime.now()}: End!")
```

btw, I forgot to mention my disk: an NVMe 4 TB SSD on a PCIe 3.0 x4 slot, so disk loading cannot be the bottleneck. If I interrupt the first generation with Ctrl+C at about the 5th minute, it breaks at:
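To see where the first generation stalls without relying on Ctrl+C, the slow call can be wrapped in a profiler. This is a sketch, not part of the original report; it assumes `model` and `tokenizer` are already loaded as in the script above, and it will only attribute time spent in Python-visible frames:

```python
import cProfile
import pstats

# Profile only the first (slow) generation to see which calls dominate.
inputs = tokenizer("def helloWorld() {", return_tensors="pt").to("xpu")
profiler = cProfile.Profile()
profiler.enable()
model.generate(**inputs)
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```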
@notsyncing We will check this on our B580 test device on Monday and report back whether this is normal for ipex/xpu or specific to gptqmodel.
@jiqing-feng Can you check this? We isolated the issue to `init_ipex_linear` taking a very long time, >10 minutes, with a small 1B model on XPU.
Does only the 1B model take this long, or do larger models like 3B and 7B as well?
@jiqing-feng Tested the same script with
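A minimal sketch of the comparison being discussed, timing the first generation across model sizes; the 3B and 7B repo IDs below are assumptions for illustration, not names confirmed in this thread:

```python
from datetime import datetime

from transformers import AutoTokenizer
from gptqmodel import GPTQModel

# Hypothetical GPTQ-Int4 repos of increasing size; substitute real ones.
model_ids = [
    "Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4",
    "Qwen/Qwen2.5-Coder-3B-Instruct-GPTQ-Int4",  # assumed repo name
    "Qwen/Qwen2.5-Coder-7B-Instruct-GPTQ-Int4",  # assumed repo name
]

for model_id in model_ids:
    model = GPTQModel.load(model_id, device="xpu")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inputs = tokenizer("def helloWorld() {", return_tensors="pt").to("xpu")
    start = datetime.now()
    model.generate(**inputs)
    print(f"{model_id}: first generation took {datetime.now() - start}")
```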
@notsyncing With help from @jiqing-feng, we have tracked down the issue to the following:

In light of this, I will open an issue with the IPEX packaging team so they can compile kernels for the B580 arch in their next release.
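If the root cause is first-run JIT compilation of kernels for a GPU architecture that ships without ahead-of-time binaries, the SYCL runtime's persistent kernel cache may at least make subsequent runs fast. This is a general DPC++ runtime setting offered as a hedged workaround, not something verified against this issue; the cache directory path is just an example:

```python
import os

# Persist JIT-compiled SYCL kernels to disk so later runs can reuse them
# instead of recompiling. Must be set before the XPU runtime initializes.
os.environ["SYCL_CACHE_PERSISTENT"] = "1"
os.environ.setdefault("SYCL_CACHE_DIR", os.path.expanduser("~/.cache/sycl"))

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401
```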
Tracking IPEX issue: intel/intel-extension-for-pytorch#767
Describe the bug
Hello, I'm trying out gptqmodel on an Intel A770 16G with the Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4 model using the following script:

And it takes almost 10 minutes after "Generating..." to get the first generation output, with 100% CPU usage (one core) and about 20% GPU usage. The second generation takes about 4 seconds. Is this expected, or is something wrong?

The CPU is an Intel Core i9-10940X.
Full output:
GPU Info
Show output of:
Software Info
Fedora 41, running distrobox from intel/oneapi-basekit:2025.0.1-0-devel-ubuntu24.04, Python 3.12.3
Show output of:
If you are reporting an inference bug of a post-quantized model, please post the content of config.json and quantize_config.json.

To Reproduce
Run the script above.
Expected behavior
First generation takes less than 1 minute.
Model/Datasets
Qwen/Qwen2.5-Coder-1.5B-Instruct-GPTQ-Int4