[FEA]: Create Sherlock example for VDB Upload #1298

Closed · 4 tasks done
mdemoret-nv opened this issue Oct 22, 2023 · 1 comment

Labels: feature request (New feature or request)

@mdemoret-nv (Contributor) commented Oct 22, 2023

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request?

High

Please provide a clear description of the problem this feature solves

As part of the Sherlock work, an example showing how to use Morpheus to upload documents to a Vector Database (VDB) is needed.

Describe your ideal solution

Purpose

The purpose of this example is to illustrate how a user could build a pipeline that takes a set of documents, splits them into chunks, calculates an embedding vector for each chunk, and uploads the chunks with their embeddings to a VDB.

Scenario

This example shows a single implementation, but the pipeline and components could be used in many scenarios with different requirements. At a high level, the following illustrates the customization points for this pipeline and the specific choices made for this example:

  • Source documents
    • This pipeline could support any type of document which can be converted into text. This includes PDFs, web pages, structured documents, and even images (with OCR).
    • For this example, we will use RSS feeds and a web scraper as the source for our documents. This was chosen because it simulates a real-world cyber scenario (cyber security RSS feeds could be used to build a repository of knowledge for a security chatbot) and does not require any dataset or API keys to function.
  • Embedding model
    • This pipeline can support any embedding model that converts text into a vector of floats.
    • We have tested this pipeline with several models available on Hugging Face, including paraphrase-multilingual-mpnet-base-v2, e5-large-v2, and all-mpnet-base-v2.
    • For the example we will use all-MiniLM-L6-v2 since it is a small, fast model with an embedding dimension of only 384 (see the sketch after this list).
  • Vector DB Service
    • Any vector database can be used to store the resulting embeddings and corresponding metadata.
    • It would be trivial to update the example to use Chroma or FAISS if needed.
    • For the example, we will use Milvus since we have been working closely with them on GPU-accelerated indices.
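
To make the embedding model choice concrete, the following minimal sketch computes an embedding locally with the sentence-transformers library (an illustration only; in the actual pipeline the model is served through Triton) and shows the 384-dimensional output of all-MiniLM-L6-v2:

```python
# Minimal sketch (not the example's code): compute an embedding locally with
# sentence-transformers to show the 384-dim output of all-MiniLM-L6-v2.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode(["A cyber security RSS article chunk."])

print(embedding.shape)  # (1, 384) -- matches the dimension noted above
```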

Implementation

This example will be composed of three components, each set up as a separate click command.

Export model component

This command exports the embedding model into a Triton model repository so it can be loaded by Triton. Any BERT-based model hosted on Hugging Face can be exported. The command works by downloading the model, appending layers for average pooling and normalization, and exporting the result using the PyTorch -> ONNX exporter. The exported model can then be imported by Triton and optimized with the built-in ONNX -> TRT converter.

By default, the pipelines will use the all-MiniLM-L6-v2 model, which has already been exported and saved into the repo using Git LFS. This model is preferred because it is small (only 90 MB when exported) and fast.
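
The following is a rough sketch of that export flow using transformers and torch. The model name, sequence length, and output path are illustrative, and the shipped command may differ in details such as opset and layer implementation:

```python
# Sketch of the export flow: wrap a BERT-style Hugging Face model with mean
# pooling + L2 normalization, then export to ONNX for Triton to load.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"  # illustrative default


class EmbeddingModel(torch.nn.Module):
    def __init__(self, name: str):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(name)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Average pooling over the non-padding tokens.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
        # L2-normalize so cosine similarity reduces to a dot product.
        return F.normalize(pooled, p=2, dim=1)


tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = EmbeddingModel(MODEL_NAME).eval()

sample = tokenizer(["example text"], return_tensors="pt",
                   padding="max_length", max_length=128, truncation=True)

torch.onnx.export(
    model,
    (sample["input_ids"], sample["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["output"],
    dynamic_axes={"input_ids": {0: "batch"}, "attention_mask": {0: "batch"},
                  "output": {0: "batch"}},
)
```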

Morpheus pipeline

The Morpheus pipeline is built from the following components (a sketch of the wiring follows the list):

  1. Ingest the RSS documents using our RSSSourceStage
  2. Convert the URLs into text using a custom WebScraperStage
    1. This stage downloads the HTML, then uses the BeautifulSoup library to extract the text. Other options exist but are significantly slower.
  3. The embedding is calculated using stages from the SID workflow
    1. The PreprocessNLPStage calculates the tokens for each chunk
    2. The TritonInferenceStage computes the embedding using the all-MiniLM-L6-v2 model
  4. Finally, the embeddings and documents are uploaded to the vector DB using the WriteToVectorDBStage
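
Wiring those stages together might look roughly like the following. This is a hedged sketch: the stage names come from this issue, but the import paths, constructor arguments, feed URL, and server addresses are assumptions, so consult the shipped example for the exact code.

```python
# Hedged sketch of the pipeline wiring described above. Import paths and
# constructor arguments are assumptions and may not match the shipped
# example. Glue stages (e.g. deserialization) are omitted for brevity.
from morpheus.config import Config, PipelineModes
from morpheus.pipeline.linear_pipeline import LinearPipeline
from morpheus.stages.inference.triton_inference_stage import TritonInferenceStage
from morpheus.stages.input.rss_source_stage import RSSSourceStage
from morpheus.stages.output.write_to_vector_db import WriteToVectorDBStage
from morpheus.stages.preprocess.preprocess_nlp_stage import PreprocessNLPStage

from web_scraper_stage import WebScraperStage  # the example's custom stage (hypothetical path)

config = Config()
config.mode = PipelineModes.NLP

pipeline = LinearPipeline(config)

# 1. Ingest documents from one or more RSS feeds.
pipeline.set_source(RSSSourceStage(config, feed_input=["https://example.com/feed.xml"]))

# 2. Download each URL and extract its text with BeautifulSoup.
pipeline.add_stage(WebScraperStage(config))

# 3a. Tokenize each text chunk for the embedding model.
pipeline.add_stage(PreprocessNLPStage(config,
                                      vocab_hash_file="data/bert-base-uncased-hash.txt",
                                      do_lower_case=True,
                                      truncation=True))

# 3b. Compute embeddings with all-MiniLM-L6-v2 served by Triton.
pipeline.add_stage(TritonInferenceStage(config,
                                        model_name="all-MiniLM-L6-v2",
                                        server_url="localhost:8001",
                                        force_convert_inputs=True))

# 4. Upload the embeddings and document metadata to Milvus.
pipeline.add_stage(WriteToVectorDBStage(config, service="milvus", resource_name="RSS"))

pipeline.run()
```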

LangChain pipeline (Optional)

To compare performance, we should provide an equivalent pipeline built only with LangChain so we can make an apples-to-apples comparison (a sketch follows the list). A few notes about the existing LangChain command currently in the prototype:

  • The LangChain library has an RSSLoader, but it is not available in the 0.0.190 release. This release is the latest we can use from Conda because the next release requires Pandas 2.0+, which conflicts with the requirements of cuDF.
  • Out of the box, the RSSLoader uses a much more involved web scraper, which is much slower. To perform a true apples-to-apples comparison, it would need to use the BeautifulSoup parser.
  • The LangChain pipelines can be very slow, so getting reliable performance metrics can be difficult. When using a ConfluenceLoader, we saw roughly 17x performance improvements over LangChain.
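
For reference, a LangChain-only equivalent might look roughly like the following. This is a hedged sketch against LangChain 0.0.x APIs: since the RSSLoader is unavailable in 0.0.190 (noted above), feedparser supplies the article URLs, and the feed URL, chunk sizes, and Milvus address are placeholders.

```python
# Hedged sketch of a LangChain-only equivalent (LangChain 0.0.x APIs).
import feedparser
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Milvus

urls = [entry.link for entry in feedparser.parse("https://example.com/feed.xml").entries]

# WebBaseLoader extracts text with BeautifulSoup, which keeps the comparison
# with the Morpheus WebScraperStage apples-to-apples.
docs = []
for url in urls:
    docs.extend(WebBaseLoader(url).load())

chunks = RecursiveCharacterTextSplitter(chunk_size=512,
                                        chunk_overlap=64).split_documents(docs)

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
Milvus.from_documents(chunks, embeddings,
                      connection_args={"host": "localhost", "port": "19530"})
```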

Completion Criteria

The following items need to be satisfied to consider this issue complete:

  • A README.md containing information on the following:
    • Background information about the problem at hand
    • Information about the specific implementation and the reasoning behind design decisions (use content from this issue)
    • Step-by-step instructions for running the following:
      • How to export a different model from Hugging Face (use e5-large-v2)
      • How to run the Morpheus pipeline
        • Including instructions on starting a Milvus service
        • Including instructions on starting the Triton service
      • How to run the LangChain pipeline (optional)
    • The README.md should be linked in the developer docs
  • A functioning export model command which satisfies the following:
    • Should run without error using all default arguments
    • Export a Triton model which can be loaded without modification by Triton
    • Should work for the following models: paraphrase-multilingual-mpnet-base-v2, e5-large-v2, and all-mpnet-base-v2
    • Have logging which can be increased to provide debugging details
  • A functioning Morpheus pipeline command which satisfies the following:
    • Should run without error using all default arguments
    • Correctly calculate embeddings for the supplied documents
    • Provide information about the success or failure of the pipeline, including the number of uploaded documents, throughput, and total runtime
  • (Optional) A functioning LangChain pipeline command which satisfies the following:
    • Should run without error using all default arguments
    • Provide similar results to the Morpheus pipeline
  • Tests should be added which include the following:
    • Test successfully exporting a model
    • Test successfully running the Morpheus pipeline
    • (Optional) Test successfully running the LangChain pipeline

Dependent Issues

The following issues should be resolved before this can be completed:

Tasks

  1. (labels: feature request, sherlock)
  2. (labels: feature request, sherlock; assignee: bsuryadevara)
  3. (label: sherlock; 0 of 4 subtasks; assignee: cwharris)
  4. (labels: feature request, sherlock; assignee: bsuryadevara)
Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
  • I have searched the open feature requests and have found no duplicates for this feature request
@mdemoret-nv (Contributor, Author) commented:

Closing since it was completed in 23.11
