Performance degradation on dGPU Arc770 after loading more than one LLM model #12660
import os

WHISPER_SAMPLING_RATE = 16000

def test_chatglm(llm_model, llm_tokenizer, report, is_report):
    ...

def test_sd(sd_model, report, is_report):
    ...

def test_minicpm(model, tokenizer, report, is_report):
    ...

def test_whisper(whisper_processor, whisper_model, report, is_report):
    ...

if __name__ == '__main__':
    ...
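Below is a minimal sketch of how the per-model latencies reported later in this issue could be measured on the XPU. It assumes the ipex-llm environment exposes torch.xpu (via intel_extension_for_pytorch or PyTorch's built-in XPU backend); the timed_infer helper name is illustrative and not taken from the attached script:

    import time
    import torch

    def timed_infer(label, fn, report, is_report):
        # Drain any queued GPU work so the timer covers only this call.
        if torch.xpu.is_available():
            torch.xpu.synchronize()
        start = time.perf_counter()
        result = fn()
        if torch.xpu.is_available():
            torch.xpu.synchronize()
        elapsed = time.perf_counter() - start
        if is_report:
            report.append(f"{label} infer {elapsed:.2f} s")
        return result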
Please use the values below to run the test cases (set a case's flag to True to run it):
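For example, the gating presumably looks something like this at the top of the script (the flag names here are illustrative, not necessarily those in the attached test.txt):

    # Illustrative flags; set a case to True to run it.
    RUN_CHATGLM = True
    RUN_SD = False
    RUN_MINICPM = False
    RUN_WHISPER = True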
The code you provided has confusing indentation. Could you provide a properly formatted version?
Please rename the attached test.txt to test.py.
https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/Quickstart/install_linux_gpu.md
https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU/HuggingFace/LLM/chatglm3
If more than one model is loaded, the inference latency increases:
llm infer 1.22 s
wsp infer 1.01 s
llm infer 2.07 s
cpm infer 2.97 s
sd infer 0.74 s
wsp infer 1.93 s
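For reference, here is a minimal sketch of loading the first model onto the Arc770 with ipex-llm, roughly following the linked chatglm3 GPU example. The checkpoint path, the exact from_pretrained arguments, and how the second model (Whisper, MiniCPM or Stable Diffusion) is loaded are assumptions, not taken from the attached test script:

    from transformers import AutoTokenizer
    from ipex_llm.transformers import AutoModel

    chatglm_path = "THUDM/chatglm3-6b"  # placeholder; use the checkpoint from test.py

    # Load ChatGLM3 with 4-bit weights and move it to the Arc770 (XPU device),
    # as in the linked chatglm3 example.
    llm_model = AutoModel.from_pretrained(chatglm_path,
                                          load_in_4bit=True,
                                          trust_remote_code=True)
    llm_model = llm_model.half().to("xpu")
    llm_tokenizer = AutoTokenizer.from_pretrained(chatglm_path, trust_remote_code=True)

    # The slowdown above is reported once a second model (e.g. Whisper or MiniCPM)
    # is loaded onto the same XPU device before running inference again.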