(Optimization of LLM inference) Does Intel OpenVINO support offloading LLM models, allowing some layers to remain on the SSD while loading the main layers into RAM during inference computation?
#2533
Open
hsulin0806 opened this issue
Nov 19, 2024
· 2 comments
Does HETERO mode allow model data that would normally be held in RAM to be cached on an SSD, so that RAM usage is reduced? If this functionality is not available, are there any development plans to support caching on an SSD?
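For reference, a minimal sketch of how HETERO mode is normally requested in the OpenVINO Python API. As far as the documentation describes it, HETERO splits a model across compute devices (for example GPU and CPU); nothing in this sketch moves data from RAM to an SSD. The helper names `hetero_device` and `compile_hetero` are hypothetical and only for illustration, and the model path is whatever IR file you already have.

```python
def hetero_device(*devices):
    """Build a HETERO device string, highest-priority device first."""
    if not devices:
        raise ValueError("at least one device is required")
    return "HETERO:" + ",".join(devices)


def compile_hetero(model_path, *devices):
    """Compile a model for heterogeneous execution (needs the openvino package)."""
    # Imported lazily so hetero_device() can be used without OpenVINO installed.
    import openvino as ov

    core = ov.Core()
    model = core.read_model(model_path)
    # Example device string: "HETERO:GPU,CPU"
    return core.compile_model(model, hetero_device(*devices))
```

Calling `compile_hetero("model.xml", "GPU", "CPU")` would ask OpenVINO to place the operations it can on the GPU and fall back to the CPU for the rest; the file name here is a placeholder.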
Functional discussion for this project.
notebooks/llm-chatbot
Intel's official documentation (https://www.intel.com.tw/content/www/tw/zh/content-details/826081/running-ollama-with-open-webui-on-intel-hardware-platform.html) confirms support for Ollama.
Ollama's GitHub documentation (https://github.com/ollama/ollama/blob/main/docs/faq.md) describes the following ways a loaded model can be placed:
100% GPU: The model is fully loaded into the GPU.
100% CPU: The model is fully loaded into system memory.
48%/52% CPU/GPU: The model is split between the GPU and system memory.
Ollama is powered by llama.cpp, which supports the --gpu-layers parameter to distribute model layers between VRAM and RAM, reducing GPU memory pressure.
However, when the CPU handles inference, the model is entirely loaded into RAM. Would it be possible for OpenVINO to introduce a parameter or functionality to support offloading model layers to SSD storage as temporary storage? This would reduce RAM usage, offering a more efficient way to handle resource-limited scenarios.
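To make the request more concrete, here is a rough sketch of the idea using only the Python standard library. It is not OpenVINO functionality and not how llama.cpp or Ollama work internally; it only illustrates keeping layer weights in a file on disk and pulling one layer at a time into memory through a memory-mapped file. All function names and the toy "layer" computation (a dot product) are assumptions made for the example.

```python
import mmap
import os
import tempfile
from array import array

FLOAT_SIZE = array("f").itemsize  # bytes per stored float


def write_layers(path, layers):
    """Write each layer's weights back to back; return (offset, count) per layer."""
    index = []
    offset = 0
    with open(path, "wb") as f:
        for weights in layers:
            data = array("f", weights).tobytes()
            f.write(data)
            index.append((offset, len(weights)))
            offset += len(data)
    return index


def run_layers_from_disk(path, index, inputs):
    """Process layers one by one, reading each from the mapped file on demand."""
    results = []
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for offset, count in index:
            # Only this layer's bytes are copied into process memory here.
            raw = mm[offset:offset + count * FLOAT_SIZE]
            weights = array("f")
            weights.frombytes(raw)
            # Toy stand-in for a layer: dot product of weights and inputs.
            results.append(sum(w * x for w, x in zip(weights, inputs)))
    return results


def demo():
    layers = [[1.0, 2.0], [0.5, 0.25]]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "weights.bin")
        index = write_layers(path, layers)
        return run_layers_from_disk(path, index, [2.0, 4.0])
```

In this sketch the operating system decides which pages of the file stay in RAM, so peak memory depends on the largest layer rather than the whole model, at the cost of SSD reads during inference. Whether something comparable could be exposed as an OpenVINO parameter is the question of this issue.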