The 2024.4 release brings official support for OpenAI API text generation. It is now recommended for production usage. It comes with a set of added features and improvements.
Changes and improvements
-
Significant performance improvements for multinomial sampling algorithm
-
finish_reason
in the response correctly determines reaching the max_tokens (length) and completed the sequence (stop) -
Added automatic cancelling of text generation for disconnected clients
-
Included prefix caching feature which speeds up text generation by caching the prompt evaluation
-
Option to compress the KV Cache to lower precision – it reduces the memory consumption with minimal impact on accuracy
-
Added support for
stop
sampling parameters. It can define a sequence which stops text generation. -
Added support for
logprobs
sampling parameter. It returns the probabilities of generated tokens. -
Included generic metrics related to execution of MediaPipe graph. Metric
ovms_current_graphs
can be used for autoscaling based on current load and the level of concurrency. Counters likeovms_requests_accepted
andovms_responses
can track the activity of the server. -
Included demo of text generation horizontal scalability
-
Configurable handling of non-UTF-8 responses from the model – detokenizer can now automatically change then to Unicode replacement character
-
Included support for Llama3.1 models
-
Text generation is supported both on CPU and GPU -check the demo
Breaking changes
No breaking changes.
Bug fixes
-
Security and stability improvements
-
Fixed handling of model templates without bos_token
You can use an OpenVINO Model Server public Docker images based on Ubuntu via the following command:
docker pull openvino/model_server:2024.4
- CPU device support with the image based on Ubuntu22.04
docker pull openvino/model_server:2024.4-gpu
- CPU, GPU and NPU device support with the image based on Ubuntu22.04
or use provided binary packages.
The prebuilt image is available also on RedHat Ecosystem Catalog