Is this a new feature, an improvement, or a change to existing functionality?
New Feature
How would you describe the priority of this feature request
High
Please provide a clear description of the problem this feature solves
As part of the Sherlock work, an example showing how to use Morpheus to upload documents to a Vector Database (VDB) is needed.
Describe your ideal solution
Purpose
The purpose of this example is to illustrate how a user could build a pipeline that takes a set of documents, splits those documents into chunks, calculates the embedding vector for each chunk, and uploads the chunks along with their embeddings to a VDB.
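The chunking step described above can be sketched in plain Python. The fixed character-based window and overlap values here are illustrative assumptions, not the parameters the example actually uses:

```python
def chunk_text(text, chunk_size=512, overlap=64):
    """Split text into overlapping chunks.

    Overlap between consecutive chunks helps preserve context that
    would otherwise be cut at a chunk boundary. Sizes are in
    characters here for simplicity; a real pipeline would typically
    chunk by tokens.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by `overlap` so adjacent chunks share context
        start = end - overlap
    return chunks
```

Each resulting chunk would then be embedded and uploaded individually, with its source document tracked in the metadata.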
Scenario
This example shows a single implementation, but the pipeline and its components could be used in many scenarios with different requirements. At a high level, the following illustrates the customization points for this pipeline and the specific choices made for this example:
Source documents
This pipeline could support any type of document that can be converted into text, including PDFs, web pages, structured documents, and even images (with OCR).
For this example, we will use RSS feeds and a web scraper as the source of our documents. This was chosen because it simulates a real-world cyber scenario (cybersecurity RSS feeds could be used to build a repository of knowledge for a security chatbot) and does not require any dataset or API keys to function.
Embedding model
This pipeline can support any type of embedding model that can convert text into a vector of floats.
We have tested this pipeline with several different models available on Hugging Face, including `paraphrase-multilingual-mpnet-base-v2`, `e5-large-v2`, and `all-mpnet-base-v2`.
For the example we will use `all-MiniLM-L6-v2`, since it is a small, fast model with a small embedding dimension of 384.
Vector DB Service
Any vector database can be used to store the resulting embeddings and corresponding metadata.
It would be trivial to update the example to use Chroma or FAISS if needed.
For the example, we will be using Milvus, since we have been working closely with them on GPU-accelerated indices.
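To make concrete what the vector DB provides, here is a minimal in-memory sketch of storing embeddings alongside metadata and querying by cosine similarity. The class and method names are hypothetical for illustration only; they are not the Milvus (or any other vector DB) API:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Toy stand-in for a vector DB: brute-force nearest-neighbor search.

    Real vector DBs replace the linear scan with approximate indices
    (IVF, HNSW, etc.), which is where GPU acceleration pays off.
    """

    def __init__(self):
        self._records = []  # list of (embedding, metadata) pairs

    def upsert(self, embedding, metadata):
        self._records.append((embedding, metadata))

    def search(self, query, top_k=3):
        scored = sorted(
            self._records,
            key=lambda rec: cosine_similarity(query, rec[0]),
            reverse=True,
        )
        return [meta for _, meta in scored[:top_k]]
```

The pipeline's final stage performs the `upsert` half of this; the `search` half is what a downstream chatbot would use for retrieval.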
Implementation
This example will be composed of three components, each set up as a separate Click command.
Export model component
This command exports the embedding model into a Triton model repository so it can be loaded by Triton. Any BERT-based model hosted on Hugging Face can be exported. The command works by downloading the model, appending layers for average pooling and normalization, and exporting the result using the PyTorch -> ONNX exporter. The exported model can then be imported by Triton and optimized with its built-in ONNX -> TRT converter.
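Conceptually, the pooling and normalization layers appended by the export command compute a masked mean over the token embeddings followed by L2 normalization. A plain-Python sketch of that math (the actual command applies these as PyTorch layers so they are baked into the ONNX graph):

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, counting only real (non-padding) tokens.

    token_embeddings: list of per-token vectors
    attention_mask:   1 for real tokens, 0 for padding
    """
    dim = len(token_embeddings[0])
    pooled = [0.0] * dim
    count = 0
    for vec, mask in zip(token_embeddings, attention_mask):
        if mask:
            count += 1
            for i, v in enumerate(vec):
                pooled[i] += v
    return [v / count for v in pooled]


def l2_normalize(vec):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]
```

Normalizing at export time means the vector DB can use a plain inner-product index and still rank by cosine similarity.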
By default, the pipelines will use the `all-MiniLM-L6-v2` model, which has already been exported and saved into the repo using Git LFS. This model is preferred because it is small (only 90 MB when exported) and fast.
Morpheus pipeline
The Morpheus pipeline is built using the following components:
- Ingest the RSS documents using our `RSSSourceStage`.
- Convert the URLs into text using a custom `WebScraperStage`.
  - This stage downloads the HTML, then uses the BeautifulSoup library to extract the text. Other options exist but are significantly slower.
- Calculate the embeddings using stages from the SID workflow:
  - The `PreprocessNLPStage` calculates the tokens for each chunk.
  - The `TritonInferenceStage` computes the embedding using the `all-MiniLM-L6-v2` model.
- Finally, upload the embeddings and documents to the vector DB using the `WriteToVectorDBStage`.
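The scraping stage above relies on BeautifulSoup. As a rough standard-library approximation of what that extraction step does (illustrative only, not the `WebScraperStage` implementation), HTML can be reduced to text like this:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping script/style content."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-whitespace text outside skipped tags
        if self._skip_depth == 0 and data.strip():
            self._parts.append(data.strip())

    def text(self):
        return " ".join(self._parts)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()
```

BeautifulSoup's `get_text()` performs a similar traversal with far better handling of malformed real-world HTML, which is why the stage uses it instead.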
LangChain pipeline (Optional)
To provide an apples-to-apples performance comparison, we should also provide an equivalent pipeline implemented using only LangChain. A few notes about the existing LangChain command currently in the prototype:
- The LangChain library has an `RSSLoader`, but it is not available in the 0.0.190 release. That release is the latest we can use from Conda, because the next release requires Pandas 2.0+, which conflicts with the requirements of cuDF.
- Out of the box, the `RSSLoader` uses a much more involved web scraper, which is much slower. To perform a true apples-to-apples comparison, it would need to use the BeautifulSoup parser.
- The LangChain pipelines can be very slow, so getting reliable performance metrics can be difficult. When using a `ConfluenceLoader`, we were able to see ~17x performance improvements over LangChain.
Completion Criteria
The following items need to be satisfied to consider this issue complete:
- A `README.md` containing information on the following:
  - Background information about the problem at hand
  - Information about the specific implementation and the reasoning behind design decisions (use content from this issue)
  - Step-by-step instructions for running the following:
    - How to export a different model from Hugging Face (use `e5-large-v2`)
    - How to run the Morpheus pipeline
      - Including instructions on starting a Milvus service
      - Including instructions on starting the Triton service
    - How to run the LangChain pipeline (optional)
  - The `README.md` should be linked in the developer docs
- A functioning export model command which satisfies the following:
  - Should run without error using all default arguments
  - Should export a Triton model which can be loaded without modification by Triton
  - Should work for the following models: `paraphrase-multilingual-mpnet-base-v2`, `e5-large-v2`, and `all-mpnet-base-v2`
  - Should have logging which can be increased to provide debugging details
- A functioning Morpheus pipeline command which satisfies the following:
  - Should run without error using all default arguments
  - Should correctly calculate embeddings for the supplied documents
  - Should provide information about the success or failure of the pipeline, including the number of uploaded documents, throughput, and total runtime
- (Optional) A functioning LangChain pipeline command which satisfies the following:
  - Should run without error using all default arguments
  - Should provide similar results to the Morpheus pipeline
- Tests should be added which include the following:
  - Test successfully exporting a model
  - Test successfully running the Morpheus pipeline
  - (Optional) Test successfully running the LangChain pipeline
Dependent Issues
The following issues should be resolved before this can be completed:
Tasks
- `VectorDBService` and `WriteToVectorDBStage` to handle similarity search #1272
- `WebScraperStage` #1283
- `RSSSourceStage` for Sherlock workflows #1274

Additional context
No response