Adding milvus as a vector database and a search service (RAG) #86
Conversation
Initial setup for the job expression service.
My guidance: I'd really like to avoid using a third party service if we can. Can we build a very small model into the image, rather than all of torch? That was my expectation. If we are going to use third party services, we might as well do ALL of it with third party services and build nothing into the image. If we really do have to build a lot of large models into the image, I need to see the evidence for it.
Let's break the problem into two. First, we want to take a text corpus and convert it into embeddings. This is an offline process outside the docker image (or in a builder image). Then, in our Dockerfile, we add an embeddings database and load the trained data. I don't expect the database or the embeddings to need torch: it's just a vector database and a bunch of data. I don't think we need torch and ML to search the database either. So we should be able to build this into an image without torch and with a relatively small file size. If either of these assumptions is wrong, I need to know so we can pivot. Secondly, we need to convert natural language queries into embeddings on the fly. This is the harder bit, and it needs either an embeddings service or a torch-enabled image. But let's deal with that after we've conquered step one.
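To make that split concrete, here is a minimal sketch (not code from this PR) of the second half of step one: loading embeddings that were generated offline into Milvus using only pymilvus, with no torch in the image. The file name, environment variable names, and collection name are placeholders.

```python
# load_embeddings.py - a sketch, assuming the offline step has already written
# records like {"text": ..., "vector": [...]} to embeddings.json.
import json
import os

from pymilvus import MilvusClient  # no torch needed for loading or searching

client = MilvusClient(
    uri=os.environ["ZILLIZ_URI"],      # placeholder env var names
    token=os.environ["ZILLIZ_TOKEN"],
)

COLLECTION = "openfn_docs"             # placeholder collection name

with open("embeddings.json") as f:
    records = json.load(f)

# The dimension must match whatever model produced the vectors offline.
client.create_collection(COLLECTION, dimension=len(records[0]["vector"]))

client.insert(
    collection_name=COLLECTION,
    data=[
        {"id": i, "vector": r["vector"], "text": r["text"]}
        for i, r in enumerate(records)
    ],
)
```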
Thank you for your assistance sir. There are some details that I believe we need to discuss. Could we schedule a meeting to go over this and get a clearer idea?
@SatyamMattoo I am actually on vacation this week. Your mentor should be able to help you work out a solution - have you reached out to them? If you're really blocked on this, we should focus on using third party embeddings with openai. If we're satisfied that the embeddings really give us better results, we'll come back and work out whether we can responsibly generate embeddings within the image itself.
@josephjclark Yes sir, I have. I will implement that for now and continue to look for a better approach to do it within the image in parallel.
Some more thoughts on direction here:

**PR Description**

I think the PR description has fallen out of sync with the latest commits: please keep the PR description up to date so that when someone from the team comes to do a review, it's easy to get up to speed. Changes can be logged in comments (or commits) as you go, but the description should always reflect the current state of the work.

**Hosted Embeddings**

I was thinking, if we ARE going to use third party services for embeddings, how and when do those embeddings get updated? The original design asked for the database to be built from scratch each time. But with a third party, when we build, the docs embeddings may well already exist in the third party database. In fact, ideally the embeddings would be updated whenever the doc site is updated, and then the live server is always querying against the very latest version of the docs. Also, when the third party database is updated, what happens to the old embeddings? Are they purged or are we paying to store old data? I don't necessarily need to see an implementation in this PR to automatically update the embeddings, but I would like to see a proposal for a robust, scalable strategy.

**Testing**

What sort of testing can we use to ensure that the embeddings are working? How do we know the right content is embedded from the doc site? I don't necessarily mean unit testing. Manual tests or logging during the build process could be sufficient. What I'm looking for is an answer to the question "How do I know this is working?"
Sure sir, I will update the PR description shortly. For updating the database, we can set up GitHub Actions workflows inside the apollo and docs repositories to rebuild the docker image each time a change is pushed to the main branch of the docs repository. Steps:
Testing: For testing manually,
EDIT: I had a query regarding the docs repo: I don't think we need to embed all the .md files into the database, so it might be more efficient to just pull the required folders into the image. What do you think about this? For now, I have included the docs and adaptors folders.
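On the manual-testing point, one lightweight option is a smoke-test script run after seeding that logs scores for a few known queries, so a reviewer can eyeball the output. This is only a sketch: the collection name, credentials, and embedding model here are placeholders, not this PR's actual configuration.

```python
# sanity_check.py - a sketch of a manual smoke test after seeding the database.
import os

from openai import OpenAI
from pymilvus import MilvusClient

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
milvus = MilvusClient(uri=os.environ["ZILLIZ_URI"], token=os.environ["ZILLIZ_TOKEN"])

def embed(text: str) -> list[float]:
    # Model name is an example; use whichever model produced the doc embeddings.
    return openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

SAMPLE_QUERIES = ["How do I use lazy state?", "What is Langchain?"]

for query in SAMPLE_QUERIES:
    hits = milvus.search(
        collection_name="openfn_docs",   # placeholder collection name
        data=[embed(query)],
        limit=5,
        output_fields=["text"],
    )[0]
    # Log distances so a reviewer can judge whether the results look relevant.
    print(query)
    for hit in hits:
        print(f"  distance={hit['distance']:.3f}  {hit['entity']['text'][:80]}")
```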
…s to return better search results).
Merging the job generation service for integration with the RAG service.
I'm not sure I agree with you on the semantic chunking. I suppose I don't have the old chunks to compare to, but I think the quality of the chunks looks pretty poor. For one thing, they look very uneven. Skimming over the chunks, I also see blocks of text which are clearly (to me) related - like some example code and its caption - which are split over several chunks. I see chunks begin on a sentence which relates back to the previous sentence. I see splits across HTML elements where the content must surely all be closely related. Here's a random example:
(I've mangled a bit of HTML there) If you gave me, as a human, several of these chunks, I wouldn't understand what you were talking about. I don't expect an AI to fare much better. When chunking by section, I at least know that each chunk has a fair chance of having enough context for it to be useful by itself. I also know that we can re-write docs in such a way that each section is designed to be read standalone (that's just good practice anyway). I'm sure there are some simple heuristics we can use to break up the longer chunks too.
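For what it's worth, a section-based strategy along these lines could be sketched with LangChain's text splitters; whether the service actually uses LangChain, and what chunk sizes are appropriate, are assumptions here.

```python
# A sketch of section-based chunking: split on markdown headers first, then
# recursively break up any section that is still too long. Chunk sizes are
# illustrative, not tuned values from this PR.
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

headers_to_split_on = [("#", "h1"), ("##", "h2"), ("###", "h3")]
header_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
size_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100)

def chunk_markdown(md_text: str):
    sections = header_splitter.split_text(md_text)   # one Document per section
    return size_splitter.split_documents(sections)   # break up oversized sections
```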
I've run a couple of basic tests and while I'm not sold on the utility of the chunks, it does appear at first glance like the service is actually returning relevant chunks. When I ask about lazy state for example, I get ten chunks of lazy state data. Can we control the relevance threshold at all? For our purposes, we don't need the top ten results - we need all results that are relevant enough. All results that score above 0.8 or whatever. If the search decides there are no docs worth returning, that's just fine (as would be the case when the user asks "What is Madonna's best album?")
I am concerned about the summarisation process. This feels like using AI for the sake of AI to me. Consider the pipeline. When the user asks a question, we must:
The final model must then parse the input prompt and break it down into its own embeddings. Each step has a cost attached (environmental, temporal, financial and technical). It's a lot of computation already. And then the summarizer adds an extra step, calling out to another third party service. Assuming the summarisation is good and accurate (which I don't believe), what is the value? Users will never see it. The idea of generating text in one model to pass it as input to another model is frankly a bit scary.

As a general point of feedback: experimental fringe features like this should really be developed on different branches. It is harder for us to release and maintain code when it contains features that haven't been properly specced or tested. If the work was done on another branch, it could be examined, tested, discussed and released in isolation. Now it's tied up with everything else and will slow down the release.

I'm always happy to sound out new ideas and try new things. But I'm less happy when the experiments threaten the release of the main feature. When they're done in isolation, I can explore the idea in a safe space without having to worry about how it'll impact the release.

Don't branch the work off now. I'll have a think about it. Maybe you can convince me that it's worth keeping the feature.
@SatyamMattoo just a quick note that I've added a list of acceptance criteria to the main PR post. These are things we need to work out before we can release this feature. I'll need to decide on the final embedding strategy next week so don't worry about that bit.
Regarding adaptor docs, you should be able to just run
Hello @josephjclark,

Yes, it seems that semantic chunking won't be the best fit for our docs. I've reverted to the previous strategy we used, which involves splitting based on headers and recursively splitting larger chunks.

Regarding the search, I've implemented a parameter that filters vectors based on a threshold. We can read more about this here. This ensures that irrelevant search queries return no results, and it works as expected.

As for the adaptor docs, I've worked out a solution locally. However, during the Docker build it becomes a bit cumbersome, since we need to install nodejs, npm, corepack, and yarn (with nodejs and npm required for installing yarn). This process adds about two minutes to the build time. Afterward, we still need to build the docs, which would further increase the image size and build time.

Alternatively, we could fetch the docs directly from the GitHub source. There's likely a way to retrieve functions and examples for each adaptor, create a chunk for each function along with its adaptor name, and generate embeddings from them. I had initially started with this approach before your suggestion. What are your thoughts on proceeding this way?
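For reference, Milvus exposes this kind of thresholding as range search parameters. A rough sketch, assuming an L2-style metric where lower distances mean closer matches, and with purely illustrative values:

```python
# A sketch of threshold-based retrieval using Milvus range search.
# The radius value, collection name, and credentials are illustrative only.
from pymilvus import MilvusClient

client = MilvusClient(uri="...", token="...")

query_vector = [0.0] * 1536   # placeholder: in practice, the embedded search query

hits = client.search(
    collection_name="openfn_docs",
    data=[query_vector],
    limit=10,
    output_fields=["text"],
    search_params={
        "metric_type": "L2",
        # Keep only hits with distance below ~1.2 (lower = more similar for L2).
        "params": {"radius": 1.2, "range_filter": 0.0},
    },
)[0]
```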
@josephjclark Regarding the summarization process, I believe it might be more beneficial to discuss this after the release, as it may not directly impact the job generation service. The summarization essentially condenses the context based on the query, which we can potentially handle by adding specific instructions to the job prompt. However, this could be useful if the user wants to use the search service directly for complex queries, as those can be answered based on the retrieved documents. I have removed this part for now, but this is something we can consider further down the line. Also, I'll make sure to branch off such ideas separately.

Best regards
Thank you for the update @SatyamMattoo

You must take care when submitting work to test it thoroughly (we've talked about this before). And ideally your test results should be shareable, so that I can check your testing. It's not enough to say "this works great". You must show me it works. If you cannot prove the quality and robustness of your solution, it will not be merged. That's not my personal policy for openfn - that's how software development works worldwide. Different organisations have different standards, admittedly, but no-one wants to merge untrusted code. I say this because you told me that

Experimentation is great - I'm glad we tried semantic chunking (and actually I'm sure we could spend more time on it to find a better solution). But you MUST test well and share your results. Super important.

The search threshold sounds great. What value have you chosen and why? How can you prove to me that it's a reasonable value?

Re adaptors: yes, using docs.json from github is a good solution, let's go with that for now.
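Purely as an illustration of the docs.json idea, a sketch could fetch a JSON doc per adaptor and build one chunk per function. The URL pattern and the field names used below ("name", "description", "examples") are assumptions about the repo layout, not verified details.

```python
# A sketch of building per-function chunks from an adaptor's docs.json.
# NOTE: the URL pattern and field names are assumptions for illustration;
# check the real docs.json schema in the adaptors repo before relying on this.
import requests

def fetch_adaptor_chunks(adaptor: str) -> list[str]:
    url = (
        "https://raw.githubusercontent.com/OpenFn/adaptors/main/"
        f"packages/{adaptor}/docs.json"          # hypothetical path
    )
    docs = requests.get(url, timeout=30).json()
    chunks = []
    for fn in docs:
        # One chunk per function, tagged with the adaptor name for retrieval.
        text = f"{adaptor}.{fn.get('name', '')}\n{fn.get('description', '')}"
        for example in fn.get("examples", []):
            text += f"\nExample: {example}"
        chunks.append(text)
    return chunks
```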
Hi @josephjclark, Thank you for the valuable advice; I will definitely keep these points in mind. Regarding the threshold value, I’ve conducted tests with multiple queries to determine an appropriate value that returns only valid results (in most cases). For example, when testing with an unrelated query like "What is Langchain?", setting the threshold slightly higher (e.g. 1.4) still returns some results for optional chaining. For relevant queries, I’ve observed that the ideal score range is between 0.7 and 1.2 for the chunks related to the query. We can already see these scores in the stdout. Additionally, I’ve pushed a file containing 2 sample queries along with their respective scores for further reference. |
Hi @josephjclark, I’ve also worked on improving the overall code quality by reducing redundancy. Best regards |
Thank you for your efforts here @SatyamMattoo I have spun out a couple of issues for further work which we'll be starting in a couple of weeks. In the meantime I'm going to merge and release this to the staging server 🎉 |
Hello @josephjclark, |
Short description
Integrated a vector database embedded with the OpenFn docs and a search service for RAG.
Related Issues
Fixes: #71
Closes: #72
Implementation details
Added Milvus as the vector database: the OpenFn docs are embedded from the repo and loaded into a Zilliz Cloud database. You need to sign in to Zilliz and create a cluster to obtain the token and connection URI.
The database can be seeded with the docs during the docker build using the command:
docker build --secret id=_env,src=.env -t apollo .
It can also be done by running the Python script directly:
poetry run python services/search/generate_docs_emebeddings.py tmp/repo/docs/jobs collection_name
You will need to replace the docs_path with the actual path to the docs stored on your machine, i.e. ./tmp/repo/docs.
Add the OpenAI API key, Zilliz database URI, and token to the .env file. This mounts the environment variables into the build command and helps protect the secrets as well.
The search service currently takes a search string and an OpenAI API key, embeds the search query, searches the vector database for relevant information, and returns the results to the CLI.
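As a condensed sketch of that flow (not the service's actual code; the model name, collection name, and environment variable names are placeholders):

```python
# search_sketch.py - a rough outline of the search flow described above:
# embed the query with OpenAI, then look up relevant chunks in Milvus.
import os

from openai import OpenAI
from pymilvus import MilvusClient

def search_docs(query: str, limit: int = 10) -> list[str]:
    openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    milvus = MilvusClient(uri=os.environ["ZILLIZ_URI"], token=os.environ["ZILLIZ_TOKEN"])

    # Embed the search string (model name is an example).
    vector = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding

    hits = milvus.search(
        collection_name="openfn_docs",       # placeholder collection name
        data=[vector],
        limit=limit,
        output_fields=["text"],
    )[0]
    return [hit["entity"]["text"] for hit in hits]
```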
Further work
After the integration and embedding of the docs repo into the vector database, we will need to automate the docker build process, i.e. whenever a change is pushed to the main branch of the docs repo, we will need to rebuild the docker image of the apollo repository. This might be achieved using GitHub Actions and workflows.
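One possible shape for this (a sketch, not a committed design): a workflow in the docs repo fires a repository_dispatch event at the apollo repository, and an apollo workflow listens for it and rebuilds the image. The event type name below is made up for illustration, and the token is assumed to be a PAT with access to apollo.

```python
# A sketch of triggering an apollo rebuild when docs change, via GitHub's
# repository_dispatch API. The event type and token source are assumptions.
import os

import requests

resp = requests.post(
    "https://api.github.com/repos/OpenFn/apollo/dispatches",
    headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {os.environ['GH_PAT']}",  # PAT with repo scope
    },
    json={"event_type": "docs-updated"},
    timeout=30,
)
resp.raise_for_status()   # GitHub returns 204 No Content on success
```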
Acceptance Criteria
Before we can consider releasing this work, we need to:
(Added by @josephjclark )