Run local LLMs on iGPU, APU and CPU (AMD, Intel, and Qualcomm (Coming Soon)). The easiest way to launch an OpenAI API Compatible Server on Windows, Linux and macOS.
| Support matrix | Supported now | Under Development | On the roadmap |
|---|---|---|---|
| Model architectures | Gemma, Llama\*, Mistral+, Phi | | |
| Platform | Linux, Windows | | |
| Architecture | x86, x64 | Arm64 | |
| Hardware Acceleration | CUDA, DirectML, IpexLLM | QNN, ROCm | OpenVINO |
* The Llama model architecture supports similar model families such as CodeLlama, Vicuna, Yi, and more.
+ The Mistral model architecture supports similar model families such as Zephyr.
## Latest News

- [2024/06] Support Phi-3 (mini, small, medium), Phi-3-Vision-Mini, Llama-2, Llama-3, Gemma (v1), Mistral v0.3, Starling-LM, Yi-1.5.
- [2024/06] Support vision/chat inference on iGPU, APU, CPU and CUDA.
## Table of Contents

- Supported Models
- Getting Started
- Compile OpenAI-API Compatible Server into Windows Executable
- Prebuilt Binary (Alpha)
- Acknowledgements
## Supported Models

| Models | Parameters | Context Length | Link |
|---|---|---|---|
| Gemma-2b-Instruct v1 | 2B | 8192 | EmbeddedLLM/gemma-2b-it-onnx |
| Llama-2-7b-chat | 7B | 4096 | EmbeddedLLM/llama-2-7b-chat-int4-onnx-directml |
| Llama-2-13b-chat | 13B | 4096 | EmbeddedLLM/llama-2-13b-chat-int4-onnx-directml |
| Llama-3-8b-chat | 8B | 8192 | EmbeddedLLM/mistral-7b-instruct-v0.3-onnx |
| Mistral-7b-v0.3-instruct | 7B | 32768 | EmbeddedLLM/mistral-7b-instruct-v0.3-onnx |
| Phi-3-mini-4k-instruct-062024 | 3.8B | 4096 | EmbeddedLLM/Phi-3-mini-4k-instruct-062024-onnx |
| Phi3-mini-4k-instruct | 3.8B | 4096 | microsoft/Phi-3-mini-4k-instruct-onnx |
| Phi3-mini-128k-instruct | 3.8B | 128k | microsoft/Phi-3-mini-128k-instruct-onnx |
| Phi3-medium-4k-instruct | 14B | 4096 | microsoft/Phi-3-medium-4k-instruct-onnx-directml |
| Phi3-medium-128k-instruct | 14B | 128k | microsoft/Phi-3-medium-128k-instruct-onnx-directml |
| Openchat-3.6-8b | 8B | 8192 | EmbeddedLLM/openchat-3.6-8b-20240522-onnx |
| Yi-1.5-6b-chat | 6B | 32k | EmbeddedLLM/01-ai_Yi-1.5-6B-Chat-onnx |
| Phi-3-vision-128k-instruct | 4.2B | 128k | EmbeddedLLM/Phi-3-vision-128k-instruct-onnx |
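The Link column lists Hugging Face repository IDs. If you want to fetch the ONNX weights locally first, a minimal sketch using `huggingface_hub` (an assumption; this package is not part of `embeddedllm` itself) could look like this:

```python
# Minimal sketch (assumption): pre-download one of the ONNX models from the table
# with huggingface_hub, then point ellm_server at the resulting local folder.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="EmbeddedLLM/Phi-3-mini-4k-instruct-062024-onnx",  # any repo ID from the table
    local_dir="./Phi-3-mini-4k-instruct-062024-onnx",
)
print(f"Model downloaded to: {local_dir}")
# Then launch, e.g.: ellm_server --model_path ./Phi-3-mini-4k-instruct-062024-onnx
```

Depending on the repository layout, the weights for a specific precision or backend may live in a subfolder (for example an `onnx/directml/...` path), in which case `--model_path` should point at that subfolder.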
## Getting Started

### Installation (from source)

- **Windows**

  1. Custom Setup:
     - **IPEX (XPU)**: Requires an Anaconda environment: `conda create -n ellm python=3.10 libuv; conda activate ellm`.
     - **DirectML**: If you are using a Conda environment, install the additional dependency: `conda install conda-forge::vs2015_runtime`.
  2. Install the `embeddedllm` package: `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .`. Note: `cpu`, `directml` and `cuda` are currently supported.
     - DirectML: `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .[directml]`
     - CPU: `$env:ELLM_TARGET_DEVICE='cpu'; pip install -e .[cpu]`
     - CUDA: `$env:ELLM_TARGET_DEVICE='cuda'; pip install -e .[cuda]`
     - IPEX: `$env:ELLM_TARGET_DEVICE='ipex'; python setup.py develop`
     - OpenVINO: `$env:ELLM_TARGET_DEVICE='openvino'; pip install -e .[openvino]`
     - With Web UI:
       - DirectML: `$env:ELLM_TARGET_DEVICE='directml'; pip install -e .[directml,webui]`
       - CPU: `$env:ELLM_TARGET_DEVICE='cpu'; pip install -e .[cpu,webui]`
       - CUDA: `$env:ELLM_TARGET_DEVICE='cuda'; pip install -e .[cuda,webui]`
       - IPEX: `$env:ELLM_TARGET_DEVICE='ipex'; python setup.py develop; pip install -r requirements-webui.txt`
       - OpenVINO: `$env:ELLM_TARGET_DEVICE='openvino'; pip install -e .[openvino,webui]`
- **Linux**

  1. Custom Setup:
     - **IPEX (XPU)**: Requires an Anaconda environment: `conda create -n ellm python=3.10 libuv; conda activate ellm`.
     - **DirectML**: If you are using a Conda environment, install the additional dependency: `conda install conda-forge::vs2015_runtime`.
  2. Install the `embeddedllm` package: `ELLM_TARGET_DEVICE='directml' pip install -e .`. Note: `cpu`, `directml` and `cuda` are currently supported.
     - DirectML: `ELLM_TARGET_DEVICE='directml' pip install -e .[directml]`
     - CPU: `ELLM_TARGET_DEVICE='cpu' pip install -e .[cpu]`
     - CUDA: `ELLM_TARGET_DEVICE='cuda' pip install -e .[cuda]`
     - IPEX: `ELLM_TARGET_DEVICE='ipex' python setup.py develop`
     - OpenVINO: `ELLM_TARGET_DEVICE='openvino' pip install -e .[openvino]`
     - With Web UI:
       - DirectML: `ELLM_TARGET_DEVICE='directml' pip install -e .[directml,webui]`
       - CPU: `ELLM_TARGET_DEVICE='cpu' pip install -e .[cpu,webui]`
       - CUDA: `ELLM_TARGET_DEVICE='cuda' pip install -e .[cuda,webui]`
       - IPEX: `ELLM_TARGET_DEVICE='ipex' python setup.py develop; pip install -r requirements-webui.txt`
       - OpenVINO: `ELLM_TARGET_DEVICE='openvino' pip install -e .[openvino,webui]`
### Launch OpenAI API Compatible Server

1. Custom Setup:
   - **IPEX**
     - For Intel iGPU:

       ```
       set SYCL_CACHE_PERSISTENT=1
       set BIGDL_LLM_XMX_DISABLED=1
       ```

     - For Intel Arc™ A-Series Graphics:

       ```
       set SYCL_CACHE_PERSISTENT=1
       ```

2. Launch the server: `ellm_server --model_path <path/to/model/weight>`.
3. Example code to connect to the API server can be found in `scripts/python`; a minimal sketch is also shown below. Note: run `ellm_server --help` to see all supported arguments.
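As a rough illustration of such a client (a hedged sketch, not the actual contents of `scripts/python`), the OpenAI-compatible endpoint can be called with the `openai` Python package. The port and model name below are assumptions; use whatever `--port` and `--served_model_name` you launched the server with.

```python
# Hedged sketch (assumption): query a locally running ellm_server through its
# OpenAI-compatible API using the `openai` Python package.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5555/v1",    # assumes the server was started with --port 5555
    api_key="not-needed-for-local-server",  # assumption: no API key is enforced locally
)

response = client.chat.completions.create(
    model="phi3-mini-int4",  # hypothetical name; use your --served_model_name value
    messages=[{"role": "user", "content": "Summarize ONNX Runtime in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```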
### Launch Chatbot Web UI

`ellm_chatbot --port 7788 --host localhost --server_port <ellm_server_port> --server_host localhost`. Note: run `ellm_chatbot --help` to see all supported arguments.
### Launch Model Management UI

The Model Management UI is an interface that lets you download models and deploy the OpenAI API compatible server. The UI also shows the disk space required to download each model.

`ellm_modelui --port 6678`. Note: run `ellm_modelui --help` to see all supported arguments.
NOTE: The OpenVINO packaging currently uses `torch==2.4.0`, which will not run out of the box because of a missing dependency, `libomp`. Make sure to install `libomp` and add the `libomp-xxxxxxx.dll` to `C:\Windows\System32`.
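The exact DLL name and location depend on how `libomp` was installed. As a rough, hedged sketch (assuming the DLL was installed into the active Conda environment's `Library\bin` folder, which may not match your setup), the copy step could look like this when run from an elevated prompt:

```python
# Hedged sketch (assumption): copy a libomp DLL from the active Conda environment
# into C:\Windows\System32. The search path and file pattern are assumptions;
# adjust them to wherever your libomp build actually placed the DLL.
# Run from an elevated (administrator) shell.
import glob
import os
import shutil
import sys

conda_prefix = os.environ.get("CONDA_PREFIX", sys.prefix)
pattern = os.path.join(conda_prefix, "Library", "bin", "libomp*.dll")
dlls = glob.glob(pattern)

if not dlls:
    raise SystemExit(f"No libomp DLL found under {pattern}; install libomp first.")

for dll in dlls:
    shutil.copy(dll, r"C:\Windows\System32")
    print(f"Copied {dll} -> C:\\Windows\\System32")
```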
## Compile OpenAI-API Compatible Server into Windows Executable

1. Install `embeddedllm`.
2. Install PyInstaller: `pip install pyinstaller==6.9.0`.
3. Compile the Windows executable: `pyinstaller .\ellm_api_server.spec`.
4. You can find the executable in `dist\ellm_api_server`.
5. Use it like `ellm_server`: `.\ellm_api_server.exe --model_path <path/to/model/weight>`.

Powershell/Terminal Usage:
```powershell
ellm_server --model_path <path/to/model/weight>

# DirectML
ellm_server --model_path 'EmbeddedLLM/Phi-3-mini-4k-instruct-onnx-directml' --port 5555

# IPEX-LLM
ellm_server --model_path '.\meta-llama_Meta-Llama-3.1-8B-Instruct\' --backend 'ipex' --device 'xpu' --port 5555 --served_model_name 'meta-llama_Meta/Llama-3.1-8B-Instruct'

# OpenVINO
ellm_server --model_path '.\meta-llama_Meta-Llama-3.1-8B-Instruct\' --backend 'openvino' --device 'gpu' --port 5555 --served_model_name 'meta-llama_Meta/Llama-3.1-8B-Instruct'
```
## Prebuilt Binary (Alpha)

You can find the prebuilt OpenAI API Compatible Windows Executable on the Release page.

Powershell/Terminal Usage (use it like `ellm_server`):

```powershell
.\ellm_api_server.exe --model_path <path/to/model/weight>

# DirectML
.\ellm_api_server.exe --model_path 'EmbeddedLLM_Phi-3-mini-4k-instruct-062024-onnx\onnx\directml\Phi-3-mini-4k-instruct-062024-int4' --port 5555

# IPEX-LLM
.\ellm_api_server.exe --model_path '.\meta-llama_Meta-Llama-3.1-8B-Instruct\' --backend 'ipex' --device 'xpu' --port 5555 --served_model_name 'meta-llama_Meta/Llama-3.1-8B-Instruct'

# OpenVINO
.\ellm_api_server.exe --model_path '.\meta-llama_Meta-Llama-3.1-8B-Instruct\' --backend 'openvino' --device 'gpu' --port 5555 --served_model_name 'meta-llama_Meta/Llama-3.1-8B-Instruct'
```
## Acknowledgements

- Excellent open-source projects: vLLM, onnxruntime-genai, Ipex-LLM and many others.