This application can be used to upload external information, such as proprietary documents, and integrate it with ChatGPT's LLM (Large Language Model), so that the user can ask questions specific to the uploaded documents and get a natural-language response.
Currently, you need to edit main.py (by commenting/uncommenting lines) to switch between uploading new documents (via a URL, Wikipedia, etc.) and using a Pinecone index that already contains previously uploaded documents.
The process of uploading a document and using it for Q&A is as follows:
- Prepare the document (once per document)
- Load the data into LangChain documents
- Split the document into chunks
- This helps optimize the relevance of the content we get back from a vector DB
- Rule of thumb: If chunk text makes sense to a human without relevant context, it will make sense to a language model as well
- Embed the chunks into numeric vectors
- Save the chunks and embeddings to a vector db
- Search (once per query)
- Embed the user's question
- Rank the stored chunk vectors by similarity to the question's embedding
- The nearest vectors correspond to the chunks most similar to the question
- Ask (once per query)
- Insert the question and the most relevant chunks into a message to a GPT model
- Return GPT's answer
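The prepare-and-search steps above can be sketched in plain Python. This is illustrative only, not the app's actual code: the real application uses LangChain's text splitters, OpenAI embeddings, and Pinecone, while the `embed` function below is a toy bag-of-words stand-in so the ranking logic is visible end to end.

```python
import math
from collections import Counter

def split_into_chunks(text, chunk_size=200, overlap=20):
    """Split text into overlapping character chunks."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks

def embed(text):
    """Stand-in embedding: a sparse word-count vector.
    The real app embeds text with an OpenAI embedding model instead."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_chunks(question, chunks, k=2):
    """Embed the question, rank chunks by similarity, return the best k.
    A vector DB like Pinecone performs this nearest-neighbor search for you."""
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine_similarity(q_vec, embed(c)),
                    reverse=True)
    return ranked[:k]

chunks = [
    "Pinecone stores the embeddings of each document chunk.",
    "The cat sat on the mat all afternoon.",
    "OpenAI embeddings turn text into numeric vectors.",
]
best = top_chunks("Which database stores the embeddings?", chunks, k=1)
```

In the final "Ask" step, the top-ranked chunks would be pasted into the prompt sent to a GPT model along with the question.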
You will need to have Python installed and have an account created with both Pinecone and OpenAI.
Once Python is installed, you will need to set up a Python virtual environment, which can be done by running the following command:
python -m venv env
You will then need to activate the virtual environment with this command:
source env/bin/activate
These commands assume a bash-compatible shell. If you are using a different shell, refer to the venv documentation: https://docs.python.org/3/library/venv.html
Once activated, you can now install all the necessary dependencies by running:
pip install -r requirements.txt
Now run the main.py file (`python main.py`) and respond to the prompts in the terminal.
Upon cloning the repository, you need to create a .env file containing the same variable names found in the .env.example file. Here is a list of the variables and a short description of what each represents:
- OPENAI_API_KEY - represents the API key associated with your OpenAI account
- PINECONE_API_KEY - represents the API key associated with your Pinecone account
- PINECONE_ENV - represents the environment associated with the Pinecone index that is holding the data of the document we want to use with the LLM
- PINECONE_INDEX - represents the name of the Pinecone index that is holding the data of the document we want to use with the LLM
- QA_DOCUMENT - the file path or URL of the document we want to upload
- QA_WIKIPEDIA - the search topic used to retrieve relevant documents from Wikipedia so that they can be used with the LLM
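A common failure mode is starting the app with one of these variables unset. As a sketch of how the required variables could be validated at startup (this helper is hypothetical, not part of main.py, and reads the process environment via the standard library rather than parsing the .env file directly):

```python
import os

# The variables the app cannot run without (QA_DOCUMENT / QA_WIKIPEDIA
# are only needed for the upload path, so they are not listed here).
REQUIRED_VARS = ["OPENAI_API_KEY", "PINECONE_API_KEY",
                 "PINECONE_ENV", "PINECONE_INDEX"]

def check_env(required=REQUIRED_VARS):
    """Return the values of the required environment variables,
    raising a helpful error that names any that are missing."""
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise RuntimeError(
            "Missing environment variables: " + ", ".join(missing))
    return {name: os.environ[name] for name in required}
```

A library such as python-dotenv can load the .env file into the process environment before a check like this runs.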