Adding milvus as a vector database and a search service (RAG) #86

Merged
merged 41 commits into release/next from milvus-integration on Sep 19, 2024

Conversation

SatyamMattoo
Collaborator

SatyamMattoo commented Jul 4, 2024

Short description

Integrated a vector database embedded with the OpenFn docs and a search service for RAG.

Related Issues

Fixes: #71
Closes: #72

Implementation details

Added Milvus as the vector database, embedding the OpenFn docs from the docs repo and storing them in a Zilliz Cloud database. You need to sign in to Zilliz and create a cluster to obtain the connection URI and token.

The database can be seeded with the docs during the Docker build using the command:
docker build --secret id=_env,src=.env -t apollo .

It can also be done by running the Python script directly:
poetry run python services/search/generate_docs_emebeddings.py tmp/repo/docs/jobs collection_name
You will need to replace docs_path with the actual path to the docs stored on your machine, e.g. ./tmp/repo/docs.

Add the OpenAI API key, Zilliz database URI, and token to the .env file. The build command above mounts these environment variables as a secret, which helps keep them out of the image.
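
For reference, a minimal sketch of how a script might read these secrets and connect to the Zilliz cluster, assuming python-dotenv and pymilvus are available (the variable names are illustrative, not necessarily the ones used in this PR):

```python
import os

from dotenv import load_dotenv
from pymilvus import connections

# Load OPENAI_API_KEY, ZILLIZ_URI and ZILLIZ_TOKEN from the local .env file
load_dotenv()

# Connect to the Zilliz Cloud cluster (a managed Milvus deployment)
connections.connect(
    alias="default",
    uri=os.environ["ZILLIZ_URI"],      # e.g. the cluster endpoint from the Zilliz console
    token=os.environ["ZILLIZ_TOKEN"],  # cluster API token
)
```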

The search service currently takes a search string and an OpenAI API key, embeds the search query, searches the vector database for relevant information, and returns the results to the CLI.
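
A rough sketch of that flow, assuming the OpenAI Python client and pymilvus, with an existing connection and a loaded collection (the model, collection and field names are illustrative):

```python
import os

from openai import OpenAI
from pymilvus import Collection


def search_docs(query: str, limit: int = 10) -> list[str]:
    # 1. Embed the search query with a third-party embedding model
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    response = client.embeddings.create(model="text-embedding-ada-002", input=query)
    query_vector = response.data[0].embedding

    # 2. Search the vector database for the nearest chunks
    collection = Collection("openfn_docs")  # assumes an existing, loaded collection
    results = collection.search(
        data=[query_vector],
        anns_field="embedding",
        param={"metric_type": "L2", "params": {}},
        limit=limit,
        output_fields=["text"],
    )

    # 3. Return the raw chunk text to the caller (e.g. the CLI)
    return [hit.entity.get("text") for hit in results[0]]
```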

Further work

After the integration and embedding of the docs repo into the vector database, we will need to automate the Docker build process, i.e. whenever a change is pushed to the main branch of the docs repo, we will need to rebuild the Docker image of the apollo repository. This might be achieved using GitHub Actions workflows:

  1. Create a repository dispatch event as suggested in this issue.
  2. Listen to the dispatch event in the apollo repo and find a way to handle the env variables required for the docker build. (OpenAI API key, Milvus token and database URI)

Acceptance Criteria

Before we can consider releasing this work, we need to:

(Added by @josephjclark )

  • Settle on the embedding database and models. Should we try and build it all into the image or use a third party? If we use a third party, is Zilliz appropriate?
  • Ensure the returned chunks are reasonable quality and sufficiently related to the query
  • Decide whether to keep the summarization feature
  • Work out whether and how to handle adaptor docs chunking

@taylordowns2000 removed the request for review from josephjclark on July 5, 2024
@josephjclark

This comment was marked as outdated.

@josephjclark
Collaborator

My guidance: I'd really like to avoid using a third party service if we can. Can we build a very small model into the image, rather than all of torch? That was my expectation. If we are going to use third party services, we might as well do ALL of it with 3rd party services and build nothing into the image.

If we really do have to build a lot of large models into the image, I need to see the evidence for it.

@SatyamMattoo

This comment was marked as resolved.

@josephjclark
Collaborator

Let's break the problem into two.

First, we want to take a text corpus and convert it into embeddings. This is an offline process outside the Docker image (or in a builder image). Then in our Dockerfile, we add an embeddings database and load the trained data.

I don't expect the database or the embeddings to need torch. It's just a vector database and a bunch of data.

I don't think we need torch and ml to search the database either.

So we should be able to build this into an image without torch and a relatively small file size.

If either of these assumptions is wrong, I need to know so we can pivot.

Secondly, we need to convert natural language queries into embeddings on the fly. This is the harder bit, and it either needs an embeddings service or a torch-enabled image. But let's deal with this bit after we've conquered step one.

@SatyamMattoo
Collaborator Author

Thank you for your assistance sir. There are some details that I believe we need to discuss. Could we schedule a meeting to go over this and get a clearer idea?

@josephjclark
Collaborator

@SatyamMattoo I am actually on vacation this week. Your mentor should be able to help you work out a solution - have you reached out to them?

If you're really blocked on this, we should focus on using third party embeddings with openai. If we're satisfied that the embeddings really give us better results, we'll come back and work out whether we can responsibly generate embeddings within the image itself.

@SatyamMattoo
Collaborator Author

@josephjclark Yes sir, I have. I will implement that for now and continue to look for a better approach to do it within the image in parallel.

@josephjclark
Collaborator

Some more thoughts on direction here:

PR Description

I think the PR description has fallen out of sync with the latest commits: please keep the PR description up to date so that when someone from the team comes to do a review, it's easy to get up to speed. Changes can be logged in comments (or commits) as you go, but the description should always reflect the current state of the work.

Hosted Embeddings

I was thinking, if we ARE going to use third party services for embeddings, how and when do those embeddings get updated?

The original design asked for the database to be built from scratch each time. But with a third party, when we build, the docs embeddings may well already exist in the third party database.

In fact, ideally the embeddings would be updated whenever the doc site is updated, and then the live server is always querying against the very latest version of the docs.

Also, when the third party database is updated, what happens to the old embeddings? Are they purged or are we paying to store old data?

I don't necessarily need to see an implementation in this PR to automatically update the embeddings, but I would like to see a proposal for a robust, scalable strategy.

Testing

What sort of testing can we use to ensure that the embeddings are working? How do we know the right content is embedded from the doc site? I don't necessarily mean unit testing. Manual tests or logging during the build process could be sufficient. What I'm looking for is an answer to the question "How do I know this is working?"

@SatyamMattoo
Collaborator Author

SatyamMattoo commented Jul 15, 2024

Sure sir, I will update the PR description shortly.

For updating the database, we can set up GitHub Actions workflows in the apollo and docs repositories to rebuild the Docker image each time a change is pushed to the main branch of the docs repository.

Steps:

  1. Trigger: when a change is pushed to the main branch of the docs repo, a GitHub Actions workflow rebuilds the Docker image in the apollo repository.

    • Create a GitHub Actions workflow in the docs repository that triggers a repository dispatch event to the apollo repository whenever there is a push to the main branch.
    • Create a GitHub Actions workflow in the apollo repository that listens for the repository dispatch event and rebuilds the Docker image.
  2. Rebuilding the Docker image:

    • The workflow in the apollo repo will clone the necessary folders from the docs repo.
    • It will then process the markdown files, chunk them appropriately, and embed the text using the embedding model.
  3. Database update (a sketch of this step follows the list):

    • Connect to the vector database and check whether the collection exists. If it does, drop the existing collection to remove the old embeddings.
    • Insert the new embeddings into the database.
  4. Managing embeddings:

    • By dropping the old collection and inserting the new embeddings, we ensure that only the latest data is stored. This keeps storage costs down and the data relevant.
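
For step 3, a minimal sketch of the drop-and-reinsert update with pymilvus (the collection name, field names and the 1536-dimension embedding size are assumptions for illustration, not the exact values used in this PR):

```python
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, utility,
)


def rebuild_collection(name: str, texts: list[str], vectors: list[list[float]]) -> None:
    # Drop the old collection so stale embeddings are not kept (or paid for)
    if utility.has_collection(name):
        utility.drop_collection(name)

    # Recreate the collection with an auto-generated primary key
    schema = CollectionSchema(fields=[
        FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema("text", DataType.VARCHAR, max_length=65535),
        FieldSchema("embedding", DataType.FLOAT_VECTOR, dim=1536),
    ])
    collection = Collection(name, schema)

    # Insert the freshly generated embeddings and index them for search
    collection.insert([texts, vectors])
    collection.create_index("embedding", {"index_type": "AUTOINDEX", "metric_type": "L2"})
    collection.load()
```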

Testing:

For testing manually,

  • I will add some more detailed logs during the Docker build.
  • We can actually see the text embedded into the vector database; here is a screenshot of the docs embedded into Zilliz:

[Screenshot: the OpenFn docs embedded as vectors in the Zilliz collection]

  • Use the search service to query the database and ensure that it returns accurate results based on the latest embeddings. This
    can be done by querying specific known sections of the docs and verifying the results.

EDIT: I had a query regarding the docs repo: I don't think we need to embed all the .md files into the database, so it might be more efficient to pull only the required folders into the image. What do you think about this? For now, I have included the docs and the adaptors folders.

@josephjclark

This comment was marked as resolved.

@SatyamMattoo

This comment was marked as outdated.

@SatyamMattoo changed the title from "Adding milvus as a vector database and a search API" to "Dmp/620: Adding milvus as a vector database, a search service and a job generation service." on Jul 24, 2024
@josephjclark
Collaborator

josephjclark commented Sep 4, 2024

I'm not sure I agree with you on the semantic chunking. I suppose I don't have the old chunks to compare to, but I think the quality of the chunks looks pretty poor.

For one thing they look very uneven: in job-examples.md_sections.md, there are several examples in one big chunk (looks like 5800 characters). Then there's a chunk in the job writing guide that's just a single sentence.

Skimming over the chunks, I also see blocks of text which are clearly (to me) related - like some example code and its caption - which are split over several chunks. I see chunks begin on a sentence which relates back to the previous sentence. I see splits across HTML elements where the content must surely all be closely related.

Here's a random example:

## Section Start: 
------
It cannot be used in place of the `state` variable. It cannot be assigned to, or
be on the left hand side of an assignment, and can only be used inside an
argument to a function

This also means that Lazy State Operator can only be used to READ from state. It
cannot be used to assign to state directly. These examples are all errors:

```js
❌ const url = $.data.url;
get(url);

❌ get(() => $.data.url);

❌ $.data.x = fn();

❌ fn(state => {
  $.data.x = 10;
});
```

details
summary Compilation rules for advanced users summary

How does the Lazy State Operator work?

(I've mangled a bit of HTML there)

If you gave me, as a human, several of these chunks, I wouldn't understand what you were talking about. I don't expect an AI to fare much better.

When chunking by section, I at least know that each chunk has a fair chance of having enough context for it to be useful by itself. I also know that we can re-write docs in such a way that each section is designed to be read standalone (that's just good practice anyway). I'm sure there are some simple heuristics we can use to break up the longer chunks too.
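
To make that concrete, here is one possible shape for such a heuristic: split on markdown headers first, then break any oversized section on paragraph boundaries. This is a sketch of the idea only (the MAX_CHARS value and function name are illustrative), not the chunking code used in this PR:

```python
import re

MAX_CHARS = 2000  # illustrative cap on chunk size


def chunk_markdown(markdown: str) -> list[str]:
    # Split on markdown headers so each chunk starts at a section boundary
    sections = re.split(r"(?m)^(?=#{1,3} )", markdown)
    chunks: list[str] = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= MAX_CHARS:
            chunks.append(section)
            continue
        # Break long sections on blank lines, packing paragraphs greedily
        current = ""
        for paragraph in section.split("\n\n"):
            if current and len(current) + len(paragraph) > MAX_CHARS:
                chunks.append(current.strip())
                current = ""
            current += paragraph + "\n\n"
        if current.strip():
            chunks.append(current.strip())
    return chunks
```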

@josephjclark
Collaborator

I've run a couple of basic tests and while I'm not sold on the utility of the chunks, it does appear at first glance like the service is actually returning relevant chunks. When I ask about lazy state for example, I get ten chunks of lazy state data.

Can we control the relevance threshold at all? For our purposes, we don't need the top ten results - we need all results that are relevant enough. All results that score above 0.8 or whatever. If the search decides there are no docs worth returning, that's just fine (as would be the case when the user asks "What is Madonna's best album?")

@josephjclark
Collaborator

I am concerned about the summarisation process. This feels like using AI for the sake of AI to me.

Consider the pipeline. When the user asks a question, we must:

  • Encode the question into embeddings (via a third party service)
  • Search for relevant documents (via a third party service)
  • Add the documents to the prompt
  • Send the prompt to the model (via a third party service)

The final model must then parse the input prompt and break it down into its own embeddings.

Each step has a cost attached (environmental, temporal, financial and technical). It's a lot of computation already.

And then the summarizer adds an extra step, calling out to another third party service. Assuming the summarisation is good and accurate (which I don't believe), what is the value? Users will never see it.

The idea of generating text in one model to pass it as input to another model is frankly a bit scary.

As a general point of feedback: experimental fringe features like this should really be developed on different branches. It is harder for us to release and maintain code when it contains features that haven't been properly specced or tested. If the work was done on another branch, it could be examined, tested, discussed and released in isolation. Now it's tied up with everything else and will slow down the release.

I'm always happy to sound out new ideas and try new things. But I'm less happy when the experiments threaten the release of the main feature. When they're done in isolation, I can explore the idea in a safe space without having to worry about how it'll impact the release.

Don't branch the work off now. I'll have a think about it. Maybe you can convince me that it's worth keeping the feature.

@josephjclark
Collaborator

@SatyamMattoo just a quick note that I've added a list of acceptance criteria to the main PR post. These are things we need to work out before we can release this feature. I'll need to decide on the final embedding strategy next week so don't worry about that bit.

@josephjclark
Collaborator

Regarding adaptor docs, you should be able to just run yarn generate-adaptors from the docs repo and it'll build the docs as markdown to adaptors/packages. You'll only need to read the <adaptor>-docs.md files. You should be able to chunk up by section.

@SatyamMattoo
Collaborator Author

Hello @josephjclark,

Yes, it seems that semantic chunking won’t be the best fit for our docs. I’ve reverted to the previous strategy we used, which involves splitting based on headers and recursively splitting larger chunks.

Regarding the search, I’ve implemented a parameter that filters vectors based on a threshold. We can read more about this here. This ensures that irrelevant search queries return no results, and it works as expected.
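
As a rough illustration of that filter, assuming Milvus range-search parameters (radius / range_filter) via pymilvus; the 1.2 radius, collection and field names are placeholders rather than the exact values used here:

```python
from pymilvus import Collection

# Range search: only return hits whose L2 distance is below the threshold.
# The 1.2 value is illustrative; the value actually chosen is discussed below.
SEARCH_PARAMS = {"metric_type": "L2", "params": {"radius": 1.2, "range_filter": 0.0}}


def search_with_threshold(query_vector: list[float], limit: int = 10) -> list[str]:
    collection = Collection("openfn_docs")  # assumes an existing, loaded collection
    results = collection.search(
        data=[query_vector],
        anns_field="embedding",
        param=SEARCH_PARAMS,
        limit=limit,
        output_fields=["text"],
    )
    # An off-topic query (e.g. "What is Madonna's best album?") simply returns no hits
    return [hit.entity.get("text") for hit in results[0]]
```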

As for the adaptor docs, I’ve worked out a solution locally. However, during the Docker build, it becomes a bit cumbersome since we need to install nodejs, npm, corepack, and yarn (with nodejs and npm required for installing yarn). This process adds about two minutes to the build time. Afterward, we still need to build the docs, which would further increase the image size and build time.

Alternatively, we could fetch the docs directly from the GitHub source. There’s likely a way to retrieve functions and examples for each adaptor, create a chunk for each function along with its adaptor name, and generate embeddings from them. I had initially started with this approach before your suggestion.

What are your thoughts on proceeding this way?

@SatyamMattoo
Collaborator Author

@josephjclark Regarding the summarization process, I believe it might be more beneficial to discuss this after the release, as it may not directly impact the job generation service. The summarization essentially condenses the context based on the query, which we can potentially handle by adding specific instructions to the job prompt.

However, this could be useful if the user wants to use the search service directly for complex queries, as those can be answered based on the retrieved documents. I have removed this part for now, but this is something we can consider further down the line.

Also, I’ll make sure to branch off such ideas separately.

Best regards

@josephjclark
Collaborator

Thank you for the update @SatyamMattoo

You must take care when submitting work to test it thoroughly (we've talked about this before). And ideally your test results should be shareable, so that I can check your testing. It's not enough to say "this works great". You must show me it works.

If you cannot prove the quality and robustness of your solution, it will not be merged. That's not my personal policy for OpenFn - that's how software development works worldwide. Different organisations have different standards, admittedly, but no-one wants to merge untrusted code.

I say this because you told me that the semantic chunking method was providing better quality chunks. But you did not provide any proof of this, and now you've had to revert the work.

Experimentation is great - I'm glad we tried semantic chunking (and actually I'm sure we could spend more time on it to find a better solution). But you MUST test well and share your results. Super important.

The search threshold sounds great. What value have you chosen and why? How can you prove to me that it's a reasonable value?

Re adaptors: yes, using docs.json from github is a good solution, let's go with that for now.

@SatyamMattoo
Collaborator Author

Hi @josephjclark,

Thank you for the valuable advice; I will definitely keep these points in mind.

Regarding the threshold value, I’ve conducted tests with multiple queries to determine an appropriate value that returns only valid results (in most cases). For example, when testing with an unrelated query like "What is Langchain?", setting the threshold slightly higher (e.g. 1.4) still returns some results for optional chaining.

For relevant queries, I’ve observed that the ideal score range is between 0.7 and 1.2 for the chunks related to the query. We can already see these scores in the stdout. Additionally, I’ve pushed a file containing 2 sample queries along with their respective scores for further reference.

@SatyamMattoo
Collaborator Author

SatyamMattoo commented Sep 7, 2024

Hi @josephjclark,
I’ve implemented the code to embed the adaptor functions into the database. You can find the chunks of the adaptor documentation in the tmp/split_sections/adaptors.md file. Additionally, I’ve implemented a partition-based embedding and search strategy to improve search efficiency by distinguishing between adaptor data and the regular documentation.
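
As a minimal sketch of the partition-based strategy with pymilvus (the collection, partition and field names are illustrative, not necessarily those used in the PR):

```python
from pymilvus import Collection


def index_and_search_adaptors(
    adaptor_texts: list[str],
    adaptor_vectors: list[list[float]],
    query_vector: list[float],
):
    collection = Collection("openfn_docs")  # assumes an existing collection

    # Keep adaptor chunks in their own partition, separate from the general docs
    if not collection.has_partition("adaptors"):
        collection.create_partition("adaptors")
    collection.insert([adaptor_texts, adaptor_vectors], partition_name="adaptors")
    collection.load()

    # Restrict the search to the adaptor partition only
    return collection.search(
        data=[query_vector],
        anns_field="embedding",
        param={"metric_type": "L2", "params": {}},
        limit=10,
        partition_names=["adaptors"],
        output_fields=["text"],
    )
```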

I’ve also worked on improving the overall code quality by reducing redundancy.

Best regards

@josephjclark changed the base branch from main to release/next on September 19, 2024
@josephjclark
Collaborator

Thank you for your efforts here @SatyamMattoo

I have spun out a couple of issues for further work which we'll be starting in a couple of weeks. In the meantime I'm going to merge and release this to the staging server 🎉

@josephjclark merged commit 013506f into release/next on Sep 19, 2024
@josephjclark deleted the milvus-integration branch on September 19, 2024
@SatyamMattoo
Collaborator Author

Hello @josephjclark,
That’s great news! Thank you so much for your guidance and support throughout—it’s been a great learning experience for me.
