Information changes quickly, and documentation is hard to keep up to date. The people responsible for maintaining it often struggle to identify every place that needs an update. This project is a RAG solution that automatically updates documentation based on natural-language queries.
Imagine there's a new document update:
- Now, archiving queries isn't possible anymore; instead, queries can only be deleted.
With this framework, a person can simply enter the query "We removed the ability to archive queries, and instead added the ability to completely delete them. Update all relevant knowledge." The framework then automatically updates all the relevant information in the documentation.
- Build the Docker image and run the container:
    docker build -t rag_img --rm .
    docker run -it --name rag_app --rm rag_img
- Set the OpenAI and LangChain API keys in secrets.txt.
- Check the existing queries in queries.txt and add more if you like. Each line is one query.
- Run the framework, for example:
    python main.py --emb_model_name all-MiniLM-L6-v2
    python main.py --emb_model_name text-embedding-3-small --retriever parent_document_retriever --search_fetch_k 40
- The updated documents are stored under result/, and the log file rag.log is written to log/.
- For other configuration parameters, check config.py.
Some arguments that strongly affect final performance are listed below. See args.py for more details.
- --retriever: The Threshold Retriever (threshold_retriever) or the ParentDocument Retriever (parent_document_retriever). The Threshold Retriever returns every document whose similarity to the query exceeds a threshold, whereas the ParentDocument Retriever always returns the top-k most similar documents.
- --search_type: For the ParentDocument Retriever, either "mmr", which reranks candidates with maximal marginal relevance (MMR), or "similarity", which uses cosine similarity only. For the Threshold Retriever, the only option is "similarity_score_threshold", which retrieves documents whose similarity exceeds the threshold set by --scorer_threshold.
- --chunk_size: The size of the chunks each document is split into. A smaller chunk size enables finer-grained retrieval.
- --search_fetch_k: The number of documents passed to the MMR reranking step. A larger number increases the diversity of the retrieval results.
- --emb_model_name: The embedding model for documents. Can be an OpenAI model ("text-embedding-3-small", "text-embedding-3-large") or a HuggingFace model ("all-MiniLM-L6-v2").
- --llm_model_name: The OpenAI chat model used for text generation: gpt-3.5-turbo or gpt-4.
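The difference between the retrieval modes can be illustrated with a small, self-contained sketch. This is not the project's implementation: the function names, the toy two-dimensional vectors, and the lambda_mult weight are assumptions made purely for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def threshold_retrieve(query_vec, doc_vecs, threshold):
    """Indices of all documents with similarity >= threshold, best first."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    return [i for s, i in sorted(scored, reverse=True) if s >= threshold]


def top_k_retrieve(query_vec, doc_vecs, k):
    """Indices of the k most similar documents, best first."""
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    return [i for _, i in sorted(scored, reverse=True)[:k]]


def mmr_retrieve(query_vec, doc_vecs, k, fetch_k, lambda_mult=0.5):
    """Maximal marginal relevance: take the fetch_k most similar candidates,
    then greedily pick k of them, trading relevance against redundancy."""
    candidates = top_k_retrieve(query_vec, doc_vecs, fetch_k)
    selected = []
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: documents 0 and 1 are near-duplicates, 2 is related, 3 is unrelated.
query = [1.0, 1.0]
docs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0], [1.0, -1.0]]

print(threshold_retrieve(query, docs, threshold=0.9))  # [0, 1]
print(top_k_retrieve(query, docs, k=3))                # [0, 1, 2]
print(mmr_retrieve(query, docs, k=2, fetch_k=3))       # [0, 2]
print(mmr_retrieve(query, docs, k=2, fetch_k=2))       # [0, 1]
```

In this toy example, the threshold mode returns a variable number of documents, the top-k mode always returns k, and MMR skips the near-duplicate only when fetch_k is large enough to include a more diverse candidate.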
Given the user's natural-language queries and the provided documents, the project implements a Retrieval-Augmented Generation (RAG) framework. The initial framework processes each query with the following steps:
- The framework obtains an embedding for each provided document from an embedding model.
- Given a query, the framework retrieves the top-K most relevant documents.
- The framework treats each retrieved document (each line in the original JSON file) as the context for the query and builds a prompt by filling the query and context into a predefined template.
- The framework sends the prompt to an LLM; the response contains the updated document and the differences from the original.
- The original documents and their updates are saved to the JSON file.
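The prompt-building and saving steps above can be sketched roughly as follows. The template wording, the function names, the output field names, and the fake_llm stand-in are all hypothetical; the project's actual template and LLM client are not shown in this README.

```python
import json

# Hypothetical template; the project's real wording is not known here.
PROMPT_TEMPLATE = (
    "Change request: {query}\n"
    "Document: {context}\n"
    "Return the updated document and list the differences."
)


def build_prompt(query, context):
    """Fill the query and one retrieved document into the template."""
    return PROMPT_TEMPLATE.format(query=query, context=context)


def update_documents(query, retrieved_docs, llm):
    """Send one prompt per retrieved document and collect the responses.

    `llm` is any callable that maps a prompt string to a response string.
    """
    results = []
    for doc in retrieved_docs:
        response = llm(build_prompt(query, doc))
        results.append({"original": doc, "response": response})
    return results


def save_results(results, path):
    """Write the original documents and the LLM responses to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)


def fake_llm(prompt):
    """Stand-in for a real chat-model call so the sketch runs offline."""
    return "[stub response]"


results = update_documents(
    "Queries can no longer be archived, only deleted.",
    ["You can archive a query from the query page."],
    fake_llm,
)
```

Passing the LLM in as a plain callable keeps the loop independent of any particular client library, so a real chat-model call can replace fake_llm without changing the rest of the sketch.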
Based on this initial framework, the following optimizations have been implemented:
- To avoid unnecessary embedding requests, the embeddings and the configuration used to create them are stored locally. On each run, the framework compares the current configuration with the stored one; if they match, it reuses the local embeddings instead of requesting new ones.
- The framework automatically writes a log containing the current configuration, queries, LLM responses, updated documents, and other messages.
- To avoid exceeding OpenAI's requests-per-minute (RPM) limit, the framework automatically throttles its requests.
- The framework supports OpenAI and HuggingFace embeddings, and HuggingFace embeddings can run on GPUs.
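As a rough illustration of the first and third optimizations, the sketch below shows one way to decide whether stored embeddings can be reused and one way to space out requests. The names (save_config, can_reuse_embeddings, RateLimiter), the file layout, and the fixed-interval throttling strategy are assumptions, not the project's actual code.

```python
import json
import os
import time


def save_config(config, path):
    """Store the configuration that was used to build the embeddings."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, sort_keys=True)


def can_reuse_embeddings(current_config, stored_config_path):
    """True only if a stored configuration exists and equals the current one.

    Assumption: the caller passes only settings that affect the embeddings
    (for example the embedding model name and the chunk size).
    """
    if not os.path.exists(stored_config_path):
        return False
    with open(stored_config_path, encoding="utf-8") as f:
        stored = json.load(f)
    return stored == current_config


class RateLimiter:
    """Keep consecutive calls at least 60 / max_rpm seconds apart."""

    def __init__(self, max_rpm, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_rpm
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now
```

A caller would invoke wait() on a shared RateLimiter before each API request. The clock and sleep parameters are injectable so the limiter can be tested without real delays.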
For the experiments, I used three queries:
- Query 1: We removed the ability to archive queries and instead added the ability to completely delete them. Update all relevant documents.
- Query 2: We have increased the default limit of 250,000 datapoints to 400,000. Update all relevant documents.
- Query 3: We do not support TrinoSQL anymore. Update all relevant documents.
I compared different configurations of the framework, varying the arguments listed in the previous section. The logs for all configurations can be found under the directory log/. The following configuration serves as the baseline: retriever = threshold_retriever, search_type = similarity_score_threshold, chunk_size = 400, search_fetch_k = 20, emb_model_name = text-embedding-3-small, llm_model_name = gpt-4.
A manual analysis of the framework's performance under different configurations produced several interesting results:
- For Query 1, most configurations successfully change "archive" to "delete". Two highlights:
  - The framework (for example, with the baseline configuration) also recognizes the word "unarchive" and updates the information correctly.
  - Some code is also updated. For example:
    - Before: POST /api/v1/query/query_id/archive
    - After: POST /api/v1/query/query_id/delete
    - Before: def archive_query(self, query_id: int) -> bool: (more code)
    - After: Method to archive queries has been removed.
- Query 2 is difficult for document retrieval because its keyword, "250,000", is a number, which embedding models struggle to represent. However, after I decreased the chunk size to 200, the framework retrieved the correct document containing this information.
- For Query 3, the framework tends to replace every "TrinoSQL" with "DuneSQL". Although the LLM finds relevant information about "DuneSQL" in the context, the query does not ask for this replacement, so I regard it as a hallucination.
- For the chat model, GPT-4 consistently performed better than GPT-3.5 in these experiments, as GPT-3.5 sometimes just copies words from the prompt into its updates.
- Surprisingly, using HuggingFace embeddings with the OpenAI chat model gives results as good as the baseline. This suggests that retrieval and text generation can be treated as two independent steps that use different models, which leaves more room for optimization.