Adding milvus as a vector database and a search service (RAG) #86

Merged
merged 41 commits into release/next from milvus-integration on Sep 19, 2024

Conversation

SatyamMattoo
Collaborator

SatyamMattoo commented Jul 4, 2024

Short description

Integrated a vector database embedded with the OpenFn docs and a search service for RAG.

Related Issues

Fixes: #71
Closes: #72

Implementation details

Added Milvus as the vector database, embedding the OpenFn docs from the docs repo and storing them in a Zilliz Cloud database. You need to sign in to Zilliz and create a cluster to obtain the connection URI and token.

The database can be seeded with the docs during the Docker build using the command:
docker build --secret id=_env,src=.env -t apollo .

It can also be done by running the Python script directly:
poetry run python services/search/generate_docs_emebeddings.py tmp/repo/docs/jobs collection_name
You will need to replace docs_path with the actual path to the docs stored on your machine, e.g. ./tmp/repo/docs.

Add the OpenAI API key, Zilliz database URI, and token to the .env file. The build command above mounts these environment variables as a secret, which helps keep them out of the image.
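
For reference, a minimal sketch of how a script might read these secrets and connect to the Zilliz cluster, assuming python-dotenv and pymilvus are available (the variable names are illustrative, not necessarily the ones used in this PR):

```python
import os

from dotenv import load_dotenv
from pymilvus import connections

# Load OPENAI_API_KEY, ZILLIZ_URI and ZILLIZ_TOKEN from the local .env file
load_dotenv()

# Connect to the Zilliz Cloud cluster (a managed Milvus deployment)
connections.connect(
    alias="default",
    uri=os.environ["ZILLIZ_URI"],      # e.g. the cluster endpoint from the Zilliz console
    token=os.environ["ZILLIZ_TOKEN"],  # cluster API token
)
```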

The search service currently takes a search string and an OpenAI API key, embeds the search query, searches the vector database for relevant information, and returns the results to the CLI.
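
A rough sketch of that flow, assuming the OpenAI Python client and pymilvus, with an existing connection and a loaded collection (the model, collection and field names are illustrative):

```python
import os

from openai import OpenAI
from pymilvus import Collection


def search_docs(query: str, limit: int = 10) -> list[str]:
    # 1. Embed the search query with a third-party embedding model
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = client.embeddings.create(model="text-embedding-ada-002", input=query)
    query_vector = response.data[0].embedding

    # 2. Search the vector database for the nearest chunks
    collection = Collection("openfn_docs")  # assumes an existing, loaded collection
    results = collection.search(
        data=[query_vector],
        anns_field="embedding",
        param={"metric_type": "L2", "params": {}},
        limit=limit,
        output_fields=["text"],
    )

    # 3. Return the raw chunk text to the caller (e.g. the CLI)
    return [hit.entity.get("text") for hit in results[0]]
```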

Further work

After the integration and embedding of the docs repo into the vector database, we will need to automate the Docker build process, i.e. whenever a change is pushed to the main branch of the docs repo, we will need to rebuild the Docker image of the apollo repository. This might be achieved using GitHub Actions workflows:

  1. Create a repository dispatch event as suggested in this issue.
  2. Listen to the dispatch event in the apollo repo and find a way to handle the env variables required for the docker build. (OpenAI API key, Milvus token and database URI)

Acceptance Criteria

Before we can consider releasing this work, we need to:

(Added by @josephjclark )

  • Settle on the embedding database and models. Should we try and build it all into the image or use a third party? If we use a third party, is Zilliz appropriate?
  • Ensure the returned chunks are reasonable quality and sufficiently related to the query
  • Decide whether to keep the summarization feature
  • Work out whether and how to handle adaptor docs chunking

@taylordowns2000 removed the request for review from josephjclark on July 5, 2024
@josephjclark

This comment was marked as outdated.

@josephjclark
Collaborator

My guidance: I'd really like to avoid using a third party service if we can. Can we build a very small model into the image, rather than all of torch? That was my expectation. If we are going to use third party services, we might as well do ALL of it with 3rd party services and build nothing into the image.

If we really do have to build a lot of large models into the image, I need to see the evidence for it.

@SatyamMattoo

This comment was marked as resolved.

@josephjclark
Collaborator

Let's break the problem into two.

First, we want to take a text corpus and convert it into embeddings. This is an offline process outside the Docker image (or in a builder image). Then in our Dockerfile, we add an embeddings database and load the trained data.

I don't expect the database or the embeddings to need torch. It's just a vector database and a bunch of data.

I don't think we need torch and ml to search the database either.

So we should be able to build this into an image without torch and a relatively small file size.

If either of these assumptions is wrong, I need to know so we can pivot.

Secondly, we need to convert natural language queries into embeddings on the fly. This is the harder bit, and it either needs an embeddings service or a torch-enabled image. But let's deal with this bit after we've conquered step one.

@SatyamMattoo
Collaborator Author

Thank you for your assistance sir. There are some details that I believe we need to discuss. Could we schedule a meeting to go over this and get a clearer idea?

@josephjclark
Collaborator

@SatyamMattoo I am actually on vacation this week. Your mentor should be able to help you work out a solution - have you reached out to them?

If you're really blocked on this, we should focus on using third party embeddings with openai. If we're satisfied that the embeddings really give us better results, we'll come back and work out whether we can responsibly generate embeddings within the image itself.

@SatyamMattoo
Collaborator Author

@josephjclark Yes sir, I have. I will implement that for now and continue to look for a better approach to do it within the image in parallel.

@josephjclark
Collaborator

Some more thoughts on direction here:

PR Description

I think the PR description has fallen out of sync with the latest commits: please keep the PR description up to date so that when someone from the team comes to do a review, it's easy to get up to speed. Changes can be logged in comments (or commits) as you go, but the description should always reflect the current state of the work.

Hosted Embeddings

I was thinking, if we ARE going to use third party services for embeddings, how and when do those embeddings get updated?

The original design asked for the database to be built from scratch each time. But with a third party, when we build, the docs embeddings may well already exist in the third party database.

In fact, ideally the embeddings would be updated whenever the doc site is updated, and then the live server is always querying against the very latest version of the docs.

Also, when the third party database is updated, what happens to the old embeddings? Are they purged or are we paying to store old data?

I don't necessarily need to see an implementation in this PR to automatically update the embeddings, but I would like to see a proposal for a robust, scalable strategy.

Testing

What sort of testing can we use to ensure that the embeddings are working? How do we know the right content is embedded from the doc site? I don't necessarily mean unit testing. Manual tests or logging during the build process could be sufficient. What I'm looking for is an answer to the question "How do I know this is working?"

@SatyamMattoo
Collaborator Author

SatyamMattoo commented Jul 15, 2024

Sure sir, I will update the PR description shortly.

For updating the database, we can set up GitHub Actions workflows in the apollo and docs repositories to rebuild the Docker image each time a change is pushed to the main branch of the docs repository.

Steps:

  1. Trigger: when a change is pushed to the main branch of the docs repo, a GitHub Actions workflow rebuilds the Docker image in the apollo repository.

    • Create a GitHub Actions workflow in the docs repository that triggers a repository dispatch event to the apollo repository whenever there is a push to the main branch.
    • Create a GitHub Actions workflow in the apollo repository that listens for the repository dispatch event and rebuilds the Docker image.
  2. Rebuilding the Docker image:

    • The workflow in the apollo repo will clone the necessary folders from the docs repo.
    • It will then process the markdown files, chunk them appropriately, and embed the text using the embedding model.
  3. Database update (a sketch of this step follows the list):

    • Connect to the vector database and check whether the collection exists. If it does, drop the existing collection to remove the old embeddings.
    • Insert the new embeddings into the database.
  4. Managing embeddings:

    • By dropping the old collection and inserting the new embeddings, we ensure that only the latest data is stored. This keeps storage costs down and the data relevant.
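
For step 3, a minimal sketch of the drop-and-reinsert update with pymilvus (the collection name, field names and the 1536-dimension embedding size are assumptions for illustration, not the exact values used in this PR):

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, utility,
)


def rebuild_collection(name: str, texts: list[str], vectors: list[list[float]]) -> None:
    # Drop the old collection so stale embeddings are not kept (or paid for)
    if utility.has_collection(name):
        utility.drop_collection(name)

    # Recreate the collection with an auto-generated primary key
    schema = CollectionSchema(fields=[
        FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema("text", DataType.VARCHAR, max_length=65535),
        FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=1536),
    ])
    collection = Collection(name, schema)

    # Insert the freshly generated embeddings and index them for search
    collection.insert([texts, vectors])
    collection.create_index("embedding", {"index_type": "AUTOINDEX", "metric_type": "L2"})
    collection.load()
```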

Testing:

For testing manually,

  • I will add some more detailed logs during the Docker build.
  • We can actually see the text embedded into the vector database; here is a screenshot of the docs embedded into Zilliz:

[Screenshot: the OpenFn docs embedded as vectors in the Zilliz collection]

  • Use the search service to query the database and ensure that it returns accurate results based on the latest embeddings. This
    can be done by querying specific known sections of the docs and verifying the results.

EDIT: I had a query regarding the docs repo: I don't think we need to embed all the .md files into the database, so it might be more efficient to pull only the required folders into the image. What do you think about this? For now, I have included the docs and the adaptors folders.

@josephjclark

This comment was marked as resolved.

@SatyamMattoo

This comment was marked as outdated.

@SatyamMattoo changed the title from "Adding milvus as a vector database and a search API" to "Dmp/620: Adding milvus as a vector database, a search service and a job generation service." on Jul 24, 2024
@josephjclark
Collaborator

josephjclark commented Sep 4, 2024

I'm not sure I agree with you on the semantic chunking. I suppose I don't have the old chunks to compare to, but I think the quality of the chunks looks pretty poor.

For one thing they look very uneven: in job-examples.md_sections.md, there are several examples in one big chunk (looks like 5800 characters). Then there's a chunk in the job writing guide that's just a single sentence.

Skimming over the chunks, I also see blocks of text which are clearly (to me) related - like some example code and its caption - which are split over several chunks. I see chunks begin on a sentence which relates back to the previous sentence. I see splits across HTML elements where the content must surely all be closely related.

Here's a random example:

## Section Start: 
------
It cannot be used in place of the `state` variable. It cannot be assigned to, or
be on the left hand side of an assignment, and can only be used inside an
argument to a function

This also means that Lazy State Operator can only be used to READ from state. It
cannot be used to assign to state directly. These examples are all errors:

```js
❌ const url = $.data.url;
get(url);

❌ get(() => $.data.url);

❌ $.data.x = fn();

❌ fn(state => {
  $.data.x = 10;
});
```

details
summary Compilation rules for advanced users summary

How does the Lazy State Operator work?

(I've mangled a bit of HTML there)

If you gave me, as a human, several of these chunks, I wouldn't understand what you were talking about. I don't expect an AI to fare much better.

When chunking by section, I at least know that each chunk has a fair chance of having enough context for it to be useful by itself. I also know that we can re-write docs in such a way that each section is designed to be read standalone (that's just good practice anyway). I'm sure there are some simple heuristics we can use to break up the longer chunks too.
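
To make that concrete, here is one possible shape for such a heuristic: split on markdown headers first, then break any oversized section on paragraph boundaries. This is a sketch of the idea only (the MAX_CHARS value and function name are illustrative), not the chunking code used in this PR:

```python
import re

MAX_CHARS = 2000  # illustrative cap on chunk size


def chunk_markdown(markdown: str) -> list[str]:
    # Split on markdown headers so each chunk starts at a section boundary
    sections = re.split(r"(?m)^(?=#{1,3} )", markdown)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= MAX_CHARS:
            chunks.append(section)
            continue
        # Break long sections on blank lines, packing paragraphs greedily
        current = ""
        for paragraph in section.split("\n\n"):
            if current and len(current) + len(paragraph) > MAX_CHARS:
                chunks.append(current.strip())
                current = ""
            current += paragraph + "\n\n"
        if current.strip():
            chunks.append(current.strip())
    return chunks
```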

@josephjclark
Collaborator

I've run a couple of basic tests and while I'm not sold on the utility of the chunks, it does appear at first glance like the service is actually returning relevant chunks. When I ask about lazy state for example, I get ten chunks of lazy state data.

Can we control the relevance threshold at all? For our purposes, we don't need the top ten results - we need all results that are relevant enough. All results that score above 0.8 or whatever. If the search decides there are no docs worth returning, that's just fine (as would be the case when the user asks "What is Madonna's best album?")

@josephjclark
Collaborator

I am concerned about the summarisation process. This feels like using AI for the sake of AI to me.

Consider the pipeline. When the user asks a question, we must:

  • Encode the question into embeddings (via a third party service)
  • Search for relevant documents (via a third party service)
  • Add the documents to the prompt
  • Send the prompt to the model (via a third party service)

The final model must then parse the input prompt and break it down into its own embeddings.

Each step has a cost attached (environmental, temporal, financial and technical). It's a lot of computation already.

And then the summarizer adds an extra step, calling out to another third party service. Assuming the summarisation is good and accurate (which I don't believe), what is the value? Users will never see it.

The idea of generating text in one model to pass it as input to another model is frankly a bit scary.

As a general point of feedback: experimental fringe features like this should really be developed on different branches. It is harder for us to release and maintain code when it contains features that haven't been properly specced or tested. If the work was done on another branch, it could be examined, tested, discussed and released in isolation. Now it's tied up with everything else and will slow down the release.

I'm always happy to sound out new ideas and try new things. But I'm less happy when the experiments threaten the release of the main feature. When they're done in isolation, I can explore the idea in a safe space without having to worry about how it'll impact the release.

Don't branch the work off now. I'll have a think about it. Maybe you can convince me that it's worth keeping the feature.

@josephjclark
Collaborator

@SatyamMattoo just a quick note that I've added a list of acceptance criteria to the main PR post. These are things we need to work out before we can release this feature. I'll need to decide on the final embedding strategy next week so don't worry about that bit.

@josephjclark
Collaborator

Regarding adaptor docs, you should be able to just run yarn generate-adaptors from the docs repo and it'll build the docs as markdown to adaptors/packages. You'll only need to read the <adaptor>-docs.md files. You should be able to chunk up by section.

@SatyamMattoo
Collaborator Author

Hello @josephjclark,

Yes, it seems that semantic chunking won’t be the best fit for our docs. I’ve reverted to the previous strategy we used, which involves splitting based on headers and recursively splitting larger chunks.

Regarding the search, I’ve implemented a parameter that filters vectors based on a threshold. We can read more about this here. This ensures that irrelevant search queries return no results, and it works as expected.
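
As a rough illustration of that filter, assuming Milvus range-search parameters (radius / range_filter) via pymilvus; the 1.2 radius, collection and field names are placeholders rather than the exact values used here:

```python
from pymilvus import Collection

# Range search: only return hits whose L2 distance is below the threshold.
# The 1.2 value is illustrative; the value actually chosen is discussed below.
SEARCH_PARAMS = {"metric_type": "L2", "params": {"radius": 1.2, "range_filter": 0.0}}


def search_with_threshold(query_vector: list[float], limit: int = 10) -> list[str]:
    collection = Collection("openfn_docs")  # assumes an existing, loaded collection
    results = collection.search(
        data=[query_vector],
        anns_field="embedding",
        param=SEARCH_PARAMS,
        limit=limit,
        output_fields=["text"],
    )
    # An off-topic query (e.g. "What is Madonna's best album?") simply returns no hits
    return [hit.entity.get("text") for hit in results[0]]
```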

As for the adaptor docs, I’ve worked out a solution locally. However, during the Docker build, it becomes a bit cumbersome since we need to install nodejs, npm, corepack, and yarn (with nodejs and npm required for installing yarn). This process adds about two minutes to the build time. Afterward, we still need to build the docs, which would further increase the image size and build time.

Alternatively, we could fetch the docs directly from the GitHub source. There’s likely a way to retrieve functions and examples for each adaptor, create a chunk for each function along with its adaptor name, and generate embeddings from them. I had initially started with this approach before your suggestion.

What are your thoughts on proceeding this way?

@SatyamMattoo
Collaborator Author

@josephjclark Regarding the summarization process, I believe it might be more beneficial to discuss this after the release, as it may not directly impact the job generation service. The summarization essentially condenses the context based on the query, which we can potentially handle by adding specific instructions to the job prompt.

However, this could be useful if the user wants to use the search service directly for complex queries, as those can be answered based on the retrieved documents. I have removed this part for now, but this is something we can consider further down the line.

Also, I’ll make sure to branch off such ideas separately.

Best regards

@josephjclark
Collaborator

Thank you for the update @SatyamMattoo

You must take care when submitting work to test it thoroughly (we've talked about this before). And ideally your test results should be shareable, so that I can check your testing. It's not enough to say "this works great". You must show me it works.

If you cannot prove the quality and robustness of your solution, it will not be merged. That's not my personal policy for OpenFn - that's how software development works worldwide. Different organisations have different standards, admittedly, but no-one wants to merge untrusted code.

I say this because you told me that the semantic chunking method was providing better quality chunks. But you did not provide any proof of this, and now you've had to revert the work.

Experimentation is great - I'm glad we tried semantic chunking (and actually I'm sure we could spend more time on it to find a better solution). But you MUST test well and share your results. Super important.

The search threshold sounds great. What value have you chosen and why? How can you prove to me that it's a reasonable value?

Re adaptors: yes, using docs.json from github is a good solution, let's go with that for now.

@SatyamMattoo
Collaborator Author

Hi @josephjclark,

Thank you for the valuable advice; I will definitely keep these points in mind.

Regarding the threshold value, I’ve conducted tests with multiple queries to determine an appropriate value that returns only valid results (in most cases). For example, when testing with an unrelated query like "What is Langchain?", setting the threshold slightly higher (e.g. 1.4) still returns some results for optional chaining.

For relevant queries, I’ve observed that the ideal score range is between 0.7 and 1.2 for the chunks related to the query. We can already see these scores in the stdout. Additionally, I’ve pushed a file containing 2 sample queries along with their respective scores for further reference.

@SatyamMattoo
Collaborator Author

SatyamMattoo commented Sep 7, 2024

Hi @josephjclark,
I’ve implemented the code to embed the adaptor functions into the database. You can find the chunks of the adaptor documentation in the tmp/split_sections/adaptors.md file. Additionally, I’ve implemented a partition-based embedding and search strategy to improve search efficiency by distinguishing between adaptor data and the regular documentation.
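
As a minimal sketch of the partition-based strategy with pymilvus (the collection, partition and field names are illustrative, not necessarily those used in the PR):

```python
from pymilvus import Collection


def index_and_search_adaptors(
    adaptor_texts: list[str],
    adaptor_vectors: list[list[float]],
    query_vector: list[float],
):
    collection = Collection("openfn_docs")  # assumes an existing collection

    # Keep adaptor chunks in their own partition, separate from the general docs
    if not collection.has_partition("adaptors"):
        collection.create_partition("adaptors")
    collection.insert([adaptor_texts, adaptor_vectors], partition_name="adaptors")
    collection.load()

    # Restrict the search to the adaptor partition only
    return collection.search(
        data=[query_vector],
        anns_field="embedding",
        param={"metric_type": "L2", "params": {}},
        limit=10,
        partition_names=["adaptors"],
        output_fields=["text"],
    )
```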

I’ve also worked on improving the overall code quality by reducing redundancy.

Best regards

@josephjclark changed the base branch from main to release/next on September 19, 2024
@josephjclark
Collaborator

Thank you for your efforts here @SatyamMattoo

I have spun out a couple of issues for further work which we'll be starting in a couple of weeks. In the meantime I'm going to merge and release this to the staging server 🎉

@josephjclark merged commit 013506f into release/next on Sep 19, 2024
@josephjclark deleted the milvus-integration branch on September 19, 2024
@SatyamMattoo
Collaborator Author

Hello @josephjclark,
That’s great news! Thank you so much for your guidance and support throughout—it’s been a great learning experience for me.
