This Flask backend API accepts documents in several formats (.txt, .docx, .pptx, .jpg, .png, .eml, .html, and .pdf) and lets you run semantic search over them in the 100+ languages supported by the Cohere multilingual API. Embeddings are stored in a Qdrant vector database.
The following steps show how to run the application on macOS/Linux.
- Python 3
- Git
- virtualenv
- Homebrew
- Clone the repository
git clone https://github.com/menloparklab/langchain-cohere-qdrant-doc-retrieval docQA
- Change into the directory
cd docQA
- Create and activate a virtual environment
python3 -m venv env
source env/bin/activate
- Install the required packages
pip install -r requirements.txt
Unstructured depends on detectron2, which is installed as follows:
pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2"
- Install Homebrew
Follow the installation guide on the Homebrew website.
- Install the following brew packages
brew install libmagic poppler tesseract libxml2 libxslt
- Create a .env file and set the following environment variables:
cohere_api_key="insert here"
openai_api_key="insert here"
qdrant_url="insert here"
qdrant_api_key="insert here"
Replace the values with your own API keys and Qdrant URL.
Sign up for a free cloud-based Qdrant account and create a new cluster. You will then be able to get the qdrant_url and qdrant_api_key values used in the .env file.
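Before starting the server, it can help to confirm that none of the values is missing or still set to the placeholder. The following is a minimal sketch, not part of the repository: it assumes the four variables have already been loaded into the process environment (for example, exported in the shell), and the helper name `missing_env_vars` is hypothetical. Whether the application requires all four keys at startup is also an assumption.

```python
import os

# Variable names copied from the .env example above.
REQUIRED_VARS = ("cohere_api_key", "openai_api_key", "qdrant_url", "qdrant_api_key")
# Placeholder text used in the .env example above.
PLACEHOLDER = "insert here"


def missing_env_vars(env=None):
    """Return the required variables that are unset, empty, or still the placeholder.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        # Tolerate surrounding whitespace and literal double quotes in the value.
        value = (env.get(name) or "").strip().strip('"')
        if not value or value == PLACEHOLDER:
            missing.append(name)
    return missing


if __name__ == "__main__":
    problems = missing_env_vars()
    if problems:
        print("Missing or placeholder values:", ", ".join(problems))
    else:
        print("All required variables are set.")
```

Passing a plain dictionary instead of relying on `os.environ` makes the check easy to try without touching your real keys.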
- Run the application using the following command:
gunicorn app:app
- Access the API endpoints
The API endpoints will be live at the following routes:
/embed
/retrieve
You have successfully installed and run the DocQA system on your local machine. Feel free to explore the code and make changes to suit your requirements.
The deployed API endpoints, /embed and /retrieve, can now be called from any frontend application. Bubble users can watch this video for detailed instructions.
Include the following header in every API request: "Content-Type": "application/json"
JSON body for /embed:
{ "collection_name": "{collection_name}", "file_url": "{file_url}" }
JSON body for /retrieve:
{ "collection_name": "{collection_name}", "query": "{query}" }
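As a rough illustration of the request shapes above, here is a minimal client sketch using only the Python standard library. Several things in it are assumptions rather than facts from this README: the base URL (gunicorn's default bind address, 127.0.0.1:8000), the use of POST for both routes, and the helper names `build_request`, `embed_request`, `retrieve_request`, and `send`. The response format is not documented here, so the sketch returns the raw response body.

```python
import json
import urllib.request

# Assumption: gunicorn is running locally on its default bind address.
BASE_URL = "http://127.0.0.1:8000"


def build_request(route, payload, base_url=BASE_URL):
    """Build a JSON request for one of the API routes (POST is an assumption)."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + route,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed_request(collection_name, file_url, base_url=BASE_URL):
    """Request that asks /embed to index the document at file_url."""
    payload = {"collection_name": collection_name, "file_url": file_url}
    return build_request("/embed", payload, base_url)


def retrieve_request(collection_name, query, base_url=BASE_URL):
    """Request that asks /retrieve to search a collection."""
    payload = {"collection_name": collection_name, "query": query}
    return build_request("/retrieve", payload, base_url)


def send(request, timeout=120):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the server running, something like `send(embed_request("my_docs", "https://example.com/sample.pdf"))` followed by `send(retrieve_request("my_docs", "What is this document about?"))` would exercise both routes; the collection name and file URL here are placeholders.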
Embed JSON for Bubble:
{ "collection_name": "<collection_name>", "file_url": "<file_url>" }
Retrieve JSON for Bubble:
{ "collection_name": "<collection_name>", "query": "<query>" }
Feel free to reach out on Twitter if you have any questions.