
feat: add OpenVINO Model Server as a Backend #1722

Closed
@fakezeta

Description

Is your feature request related to a problem? Please describe.
In my benchmarks, OpenVINO on an iGPU is roughly 5 to 8 times faster than the llama.cpp SYCL implementation for Mistral-based 7B models.

With SYCL on an iGPU (UHD 770) I can serve Starling and OpenChat at 2 to 4 tokens/s, while with OpenVINO and INT8 quantization I easily reach 15-16 tokens/s.
I don't know what the performance is like on Arc or NPU, since I don't have the hardware to test.

This could be an effective solution for computers with an iGPU.

I've uploaded an OpenVINO version of openchat-3.5-0106 to HF for testing: https://huggingface.co/fakezeta/openchat-3.5-0106-openvino-int8/

The backend should be compatible with the torch, ONNX, and OpenVINO model formats.
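To illustrate the Optimum-Intel route (proposed below), here is a minimal sketch of loading the uploaded model; it assumes the `optimum` package with OpenVINO extras is installed, and that `device="GPU"` routes inference to the iGPU:

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "fakezeta/openchat-3.5-0106-openvino-int8"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# The repo already contains OpenVINO IR with INT8 weights, so no export
# step is needed; device="GPU" asks OpenVINO to run on the iGPU.
model = OVModelForCausalLM.from_pretrained(model_id, device="GPU")

inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```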

Describe the solution you'd like

This could be implemented with the Optimum-Intel library or with the gRPC OpenVINO Model Server.
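For the OVMS route, here is a minimal sketch of a single forward pass over gRPC with the `ovmsclient` package. The endpoint, the served model name "openchat", and the input/output tensor layout are assumptions about the serving config, not confirmed details; a real backend would run the full token-generation loop client-side:

```python
import numpy as np
from ovmsclient import make_grpc_client
from transformers import AutoTokenizer

# Assumed serving config: OVMS listening on localhost:9000, exposing the
# model as "openchat" with input_ids/attention_mask inputs and logits output.
client = make_grpc_client("localhost:9000")
tokenizer = AutoTokenizer.from_pretrained("fakezeta/openchat-3.5-0106-openvino-int8")

ids = tokenizer("Hello, how are you?", return_tensors="np")
logits = client.predict(
    inputs={"input_ids": ids["input_ids"], "attention_mask": ids["attention_mask"]},
    model_name="openchat",
)

# One forward pass: greedy-pick the next token from the last position's logits.
next_token = int(np.argmax(logits[0, -1]))
print(tokenizer.decode([next_token]))
```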
