[Feature] Serving embedding and reranking models using vLLM #1203

Open
lvliang-intel opened this issue Nov 28, 2024 · 1 comment
Comments

@lvliang-intel
Collaborator

Priority: P1-Stopper
OS type: Ubuntu
Hardware type: Xeon-GNR
Running nodes: Single Node

Description

Feature: Serving Embedding and Reranking Models Using vLLM on Xeon and Gaudi
Description:
Integrate vLLM as a serving framework to enhance the performance and scalability of embedding and reranking models. This feature involves:

- Leveraging vLLM's high-throughput serving capabilities to handle embedding and reranking requests efficiently.
- Integrating with the ChatQnA pipeline.
- Optimizing the vLLM configuration for embedding and reranking use cases, ensuring lower latency and better resource utilization.
- Comparing vLLM's performance against the current TEI (Text Embeddings Inference) backend to determine the best setup for production; a minimal serving sketch is included below the expected outcomes.

Expected Outcome:

- An alternative serving framework applied to the embedding and reranking models, with better performance expected on Gaudi.
- Improved throughput for embedding and reranking services.
- Enhanced flexibility to switch between serving frameworks based on specific requirements.
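
For illustration, here is a minimal sketch of what serving an embedding model with vLLM's OpenAI-compatible server and querying it could look like. The model name (BAAI/bge-base-en-v1.5), port, and the `--task embed` flag are assumptions based on recent vLLM releases, not prescribed by this issue; exact CLI flags and defaults vary by vLLM version.

```python
# Minimal sketch, assuming a recent vLLM release.
# Server side (run in a shell; flags are assumptions and vary by version):
#   vllm serve BAAI/bge-base-en-v1.5 --task embed --port 8080
#
# Client side: query the OpenAI-compatible /v1/embeddings endpoint.
import requests

VLLM_ENDPOINT = "http://localhost:8080"  # assumed host/port

def embed(texts):
    """Request embeddings for a list of texts from the vLLM server."""
    resp = requests.post(
        f"{VLLM_ENDPOINT}/v1/embeddings",
        json={"model": "BAAI/bge-base-en-v1.5", "input": texts},
        timeout=60,
    )
    resp.raise_for_status()
    return [item["embedding"] for item in resp.json()["data"]]

if __name__ == "__main__":
    vectors = embed(["What is OPEA?", "vLLM serves embedding models."])
    print(len(vectors), "embeddings of dimension", len(vectors[0]))
```

Reranking models could be served in a similar way; a hedged sketch of a rerank request is included after the follow-up comment below.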

@lvliang-intel lvliang-intel added this to the v1.2 milestone Nov 28, 2024
@lvliang-intel lvliang-intel added the feature New feature or request label Nov 28, 2024
@joshuayao joshuayao moved this to In progress in OPEA Nov 28, 2024
@joshuayao joshuayao removed this from OPEA Dec 2, 2024
@lvliang-intel
Collaborator Author

The microservice itself will be implemented under the GenAIComps feature opea-project/GenAIComps#956. If the embedding/reranking serving performance with vLLM is better than TEI, we will update the ChatQnA example.
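
For the reranking side of the comparison, a request could look like the sketch below. It assumes a recent vLLM release that exposes a Jina-style rerank route; the model name, port, route, and response fields are assumptions and may differ by version.

```python
# Sketch of a reranking request, assuming the server was started with something like:
#   vllm serve BAAI/bge-reranker-base --task score --port 8081
# and that this vLLM version exposes a Jina-style /v1/rerank route (an assumption).
import requests

RERANK_ENDPOINT = "http://localhost:8081/v1/rerank"  # assumed host/port/route

payload = {
    "model": "BAAI/bge-reranker-base",
    "query": "What is OPEA?",
    "documents": [
        "OPEA is the Open Platform for Enterprise AI.",
        "vLLM is a high-throughput LLM serving engine.",
    ],
}
resp = requests.post(RERANK_ENDPOINT, json=payload, timeout=60)
resp.raise_for_status()
# Each result carries the original document index and a relevance score.
for result in resp.json()["results"]:
    print(result["index"], result["relevance_score"])
```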
