[Feature] Serving embedding and reranking models using vLLM #1203

Open
lvliang-intel opened this issue Nov 28, 2024 · 1 comment
Comments

@lvliang-intel
Collaborator

Priority: P1-Stopper
OS type: Ubuntu
Hardware type: Xeon-GNR
Running nodes: Single Node

Description

Feature: Serving Embedding and Reranking Models Using vLLM on Xeon and Gaudi
Description:
Integrate vLLM as a serving framework to enhance the performance and scalability of embedding and reranking models. This feature involves:

- Leveraging vLLM's high-throughput serving capabilities to handle embedding and reranking requests efficiently.
- Integrating with the ChatQnA pipeline.
- Optimizing the vLLM configuration for embedding and reranking use cases, ensuring lower latency and better resource utilization.
- Comparing vLLM's performance against the current TEI (Text Embeddings Inference) backend to determine the best setup for production; a minimal serving sketch is included below the expected outcomes.

Expected Outcome:

- An alternative serving framework applied to the embedding and reranking models, with better performance expected on Gaudi.
- Improved throughput for embedding and reranking services.
- Enhanced flexibility to switch between serving frameworks based on specific requirements.
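
For illustration, here is a minimal sketch of what serving an embedding model with vLLM's OpenAI-compatible server and querying it could look like. The model name (BAAI/bge-base-en-v1.5), port, and the `--task embed` flag are assumptions based on recent vLLM releases, not prescribed by this issue; exact CLI flags and defaults vary by vLLM version.

```python
# Minimal sketch, assuming a recent vLLM release.
# Server side (run in a shell; flags are assumptions and vary by version):
#   vllm serve BAAI/bge-base-en-v1.5 --task embed --port 8080
#
# Client side: query the OpenAI-compatible /v1/embeddings endpoint.
import requests

VLLM_ENDPOINT = "http://localhost:8080"  # assumed host/port

def embed(texts):
    """Request embeddings for a list of texts from the vLLM server."""
    resp = requests.post(
        f"{VLLM_ENDPOINT}/v1/embeddings",
        json={"model": "BAAI/bge-base-en-v1.5", "input": texts},
        timeout=60,
    )
    resp.raise_for_status()
    return [item["embedding"] for item in resp.json()["data"]]

if __name__ == "__main__":
    vectors = embed(["What is OPEA?", "vLLM serves embedding models."])
    print(len(vectors), "embeddings of dimension", len(vectors[0]))
```

Reranking models could be served in a similar way; a hedged sketch of a rerank request is included after the follow-up comment below.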

@lvliang-intel lvliang-intel added this to the v1.2 milestone Nov 28, 2024
@lvliang-intel lvliang-intel added the feature New feature or request label Nov 28, 2024
@joshuayao joshuayao moved this to In progress in OPEA Nov 28, 2024
@joshuayao joshuayao removed this from OPEA Dec 2, 2024
@lvliang-intel
Collaborator Author

The microservice itself will be implemented under the GenAIComps feature opea-project/GenAIComps#956. If the embedding/reranking serving performance with vLLM is better than TEI, we will update the ChatQnA example.
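
For the reranking side of the comparison, a request could look like the sketch below. It assumes a recent vLLM release that exposes a Jina-style rerank route; the model name, port, route, and response fields are assumptions and may differ by version.

```python
# Sketch of a reranking request, assuming the server was started with something like:
#   vllm serve BAAI/bge-reranker-base --task score --port 8081
# and that this vLLM version exposes a Jina-style /v1/rerank route (an assumption).
import requests

RERANK_ENDPOINT = "http://localhost:8081/v1/rerank"  # assumed host/port/route

payload = {
    "model": "BAAI/bge-reranker-base",
    "query": "What is OPEA?",
    "documents": [
        "OPEA is the Open Platform for Enterprise AI.",
        "vLLM is a high-throughput LLM serving engine.",
    ],
}
resp = requests.post(RERANK_ENDPOINT, json=payload, timeout=60)
resp.raise_for_status()
# Each result carries the original document index and a relevance score.
for result in resp.json()["results"]:
    print(result["index"], result["relevance_score"])
```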
