
OpenVINO™ Model Server 2023.0

@rasapala released this 01 Jun 15:35 · commit 301b794

2023.0 is a major release with numerous improvements and changes.

New Features

  • Added an option to submit inference requests in the form of strings and to read the response as a string. This can currently be utilized via custom nodes and OpenVINO models with a CPU extension handling string data:
    • Using a custom node in a DAG pipeline which performs string tokenization before passing the input to the OpenVINO model - beneficial for models without a tokenization layer, as that preprocessing is fully delegated to the model server.
    • Using a custom node in a DAG pipeline which performs string detokenization of the model response to convert it to a string format - beneficial for models without a detokenization layer, as that postprocessing is fully delegated to the model server.
    • Both options above are demonstrated in the GPT model text generation demo.
    • For models with a tokenization layer, like universal-sentence-encoder, a CPU extension implementing the sentencepiece_tokenization layer has been added. Users can pass a string to the model, and it is automatically converted to the format needed by the CPU extension.
    • That option is demonstrated in the universal-sentence-encoder model usage demo.
  • Added support for string input and output in the ovmsclient library. The ovmsclient library can be used to send string data to the model server. Check the code snippets and the short example after this list.
  • Preview version of OVMS with the MediaPipe framework - it is now possible to make calls to OpenVINO Model Server to perform MediaPipe graph processing. There are calculators performing OpenVINO inference via C-API calls from OpenVINO Model Server, as well as calculators converting the ov::Tensor input format to the MediaPipe image format. This creates a foundation for building arbitrary graphs. Check the model server integration with MediaPipe documentation.
  • Extended the C-API interface with ApiVersion and Metadata calls; the C-API version is now 0.3.
  • Added support for the saved_model format. Check how to create a models repository; a sketch of the expected layout follows this list. An example of such a use case is in the universal-sentence-encoder demo.
  • Added an option to build the model server with the NVIDIA plugin on the UBI8 base image.
  • Virtual plugins AUTO, HETERO and MULTI are now supported with the NVIDIA plugin.
  • With the DEBUG log_level, a message about the actual execution device is now logged for each inference request when the AUTO target_device is used. Learn more about the AUTO plugin.
  • Support for relative paths to the model files. The paths can now be relative to the config.json location, which simplifies deployments when the config.json is distributed together with the models repository (see the config example after this list).
  • Updated OpenCL drivers for the GPU device to version 23.13 (in the image based on Ubuntu 22.04).
  • Added an option to build OVMS on the Ubuntu 22.04 base OS. This is an addition to the already supported base OSes Ubuntu 20.04 and UBI 8.7.
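
A minimal sketch of a string request with ovmsclient is shown below; the endpoint address, model name "usem" and input name "inputs" are illustrative assumptions, not part of this release note:

from ovmsclient import make_grpc_client

# connect to the model server gRPC endpoint (example address)
client = make_grpc_client("localhost:9000")
# strings can be passed directly; the server converts them to the format expected by the model
response = client.predict(inputs={"inputs": ["OpenVINO Model Server accepts strings"]}, model_name="usem")
print(response)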
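
For the saved_model support, the repository is expected to follow the usual OVMS layout with numbered version directories; a sketch, assuming an example model named "usem":

models/
└── usem/
    └── 1/
        ├── saved_model.pb
        ├── variables/
        └── assets/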
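
A config.json sketch using a base_path relative to the configuration file location (the model name and directory are example values):

{
  "model_config_list": [
    {
      "config": {
        "name": "usem",
        "base_path": "models/usem"
      }
    }
  ]
}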

Breaking changes

  • KServe API unification with the Triton implementation for handling string and encoded image formats: every string or encoded image placed in the binary extension (REST) or raw_input_contents (gRPC) now needs to be preceded by 4 bytes (little endian) containing its size. The code snippets and samples have been updated; a packing sketch follows this list.
  • Changed the default performance hint from THROUGHPUT to LATENCY. With the new default settings, the model server is tuned for optimal execution and minimal latency at low concurrency. The default setting also minimizes memory consumption. When serving a model under high concurrency, it is recommended to adjust NUM_STREAMS or set the performance hint to THROUGHPUT explicitly (see the example command after this list). Read more in the performance tuning guide.
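
A sketch of packing strings for the updated binary format; the helper name is hypothetical, and only the buffer construction is shown (the resulting buffer would then go into the REST binary extension or the gRPC raw_input_contents field):

import struct

def pack_kserve_binary(items):
    # every element is preceded by its byte length encoded as 4 bytes, little endian
    payload = b""
    for item in items:
        data = item.encode("utf-8") if isinstance(item, str) else item
        payload += struct.pack("<I", len(data)) + data
    return payload

buffer = pack_kserve_binary(["first text", "second text"])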
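
An example of explicitly restoring the THROUGHPUT hint at startup; the model name, path and port are placeholders, and --plugin_config takes a JSON string with OpenVINO properties:

docker run -d -v /opt/models:/models -p 9000:9000 openvino/model_server:2023.0 \
    --model_name my_model --model_path /models/my_model --port 9000 \
    --plugin_config '{"PERFORMANCE_HINT": "THROUGHPUT"}'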

Bug fixes

  • The AUTO plugin starts serving models on CPU and switches to the GPU device after the model is compiled – this reduces the startup time for the model.
  • Fixed an image building error on macOS and Ubuntu 22.04.
  • The ovmsclient Python library is now compatible with tensorflow installed in the same environment – ovmsclient was created to avoid the requirement of installing the tensorflow package and to keep the Python environment small. The tensorflow package no longer conflicts with it, so it remains fully optional.
  • Improved memory handling after unloading models – the model server now forces releasing the memory after models are unloaded. Memory consumption reported by the model server process will be lower in use cases where models are changed frequently.

You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
docker pull openvino/model_server:2023.0 - CPU device support with the image based on Ubuntu 20.04
docker pull openvino/model_server:2023.0-gpu - GPU and CPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the Red Hat Ecosystem Catalog.