
intelanalytics/ipex-llm-inference-cpp-xpu:2.2.0 docker image causes memory issue with intel arc a380 #11993

Open
bobsdacool opened this issue Sep 2, 2024 · 6 comments

@bobsdacool

Hey. Not a computer scientist here, but I thought you'd like to know that the latest pushed container image is causing issues with GPU inference for me.

System specs
CPU: AMD Ryzen 3600
GPU: Intel Arc A380
RAM: 16 GB DDR4 ECC unregistered, 3200 MHz, single channel
OS: Debian 12
Kernel: 6.7.12+bpo-amd64
Docker: version 27.2.0, build 3ab4256

Logs attached:
Logs_Latest.txt
Logs_2.1.0.txt

@hzjane
Contributor

hzjane commented Sep 3, 2024

Your log shows Native API failed. Native API returns: -6 (PI_ERROR_OUT_OF_HOST_MEMORY). This looks like an OOM error. You can try a smaller model such as dolphin-phi:latest.
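
For example, from inside the running container (the model tag here is just one option from the Ollama library; any model well under the A380's 6 GB of VRAM should do):

ollama pull dolphin-phi:latest
ollama run dolphin-phi:latest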

@bobsdacool
Author

bobsdacool commented Sep 3, 2024

Hi, yes, a smaller model (~0.3 GB) does work for me on the latest container. I still think there is an issue, though: version 2.1.0 lets me use models that fit the system's VRAM (~6 GB), while on the new container the error persists even with all other Docker containers shut down and ~14 GB of free system memory. It's possible this is a problem with SYCL device detection, since the latest container does not pick up the CPU either. On 2.1.0 I see high CPU core usage in htop during inference, and I can also confirm hardware acceleration is being used by watching GPU usage with intel_gpu_top. I'm not sure how much this tells you; it worked in the previous container, but I can't get it to work in 2.2.0+, so I'm sticking with 2.1.0 for the time being.
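
One way to compare how the two images enumerate SYCL devices (assuming sycl-ls ships in the image; the setvars.sh path below is the standard oneAPI location and may differ here):

# inside (or via docker exec into) each running container
source /opt/intel/oneapi/setvars.sh   # only needed if sycl-ls is not already on PATH
sycl-ls                               # lists the OpenCL/Level Zero devices SYCL can see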

@hzjane
Contributor

hzjane commented Sep 4, 2024

I'm not sure I understand the problem. Do you mean this issue exists in the latest 2.2.0 version while 2.1.0 works normally? The docker image is basically unchanged between 2.1.0 and 2.2.0. I have tested 2.2.0-snapshot on an Arc A770 and did not hit any OOM problem. Maybe it's caused by the VRAM difference between the A380 (6 GB) and the A770 (16 GB)?

@bobsdacool
Author

Hi, yes. While I can run LLMs of around 5 GB in size on 2.1.0, I can't run them on 2.2.0 with the exact same Docker setup. I can run much smaller LLMs on 2.2.0, so the Ollama functionality isn't completely broken, but there does seem to be a memory issue.

I'm not sure where the issue lies, though. Please let me know if there's any other system information you'd like me to collect to help get to the bottom of this.

@hzjane
Contributor

hzjane commented Sep 5, 2024

Thanks for the report. There was indeed a llama.cpp/Ollama upgrade between the 2.1.0 and 2.2.0 images, which may be the root cause. We will confirm the issue on our side. In the meantime, you can keep running with 2.1.0.
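
For anyone else hitting this, a minimal sketch of pinning the older tag (the flags other than the tag are illustrative and should match whatever your existing setup uses; the container name is arbitrary):

docker pull intelanalytics/ipex-llm-inference-cpp-xpu:2.1.0
docker run -itd --net=host --device=/dev/dri --shm-size=16g \
    --name=ipex-llm-ollama \
    intelanalytics/ipex-llm-inference-cpp-xpu:2.1.0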

@JinheTang
Contributor

JinheTang commented Sep 5, 2024

Hi @bobsdacool, your log shows n_ctx = 8192. This is because the latest Ollama upstream defaults to OLLAMA_NUM_PARALLEL=4, which sets the total space allocated for context (n_ctx) to 4*2048, where 2048 is the model's default context size. Try running export OLLAMA_NUM_PARALLEL=1 before you start ollama serve (a minimal sketch is included after the commands below). If the problem persists, you can manually create a Modelfile and set the model's num_ctx smaller, e.g.

FROM llama2
PARAMETER num_ctx 512

then load the model with:

ollama create llama2:latest-nctx512 -f Modelfile
ollama run llama2:latest-nctx512
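
For completeness, a minimal sketch of applying the parallelism setting (the -e form assumes you launch Ollama through the container; adjust to your own startup command):

# on the host, when starting the container
docker run ... -e OLLAMA_NUM_PARALLEL=1 intelanalytics/ipex-llm-inference-cpp-xpu:2.2.0

# or inside the container, before starting the server
export OLLAMA_NUM_PARALLEL=1
ollama serve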
