Hermes simplifies document searching within specified directories, offering precise and quick retrieval of PDF documents. By combining a user-friendly API with powerful search capabilities, Hermes highlights relevant documents and their exact locations, facilitating efficient information access.
- API: Facilitates searching within PDFs and setting the directory path for the PDFs.
- Crawler: Automatically updates the database with new or modified PDFs by monitoring specified directories.
- Docker
- Make
- Start Milvus Vector DB: `docker compose -f milvus-docker-compose.yaml up -d`
- Run the API component: `make run-web`
- Configure PDF Directory Path: Navigate to `http://127.0.0.1:8000/docs` and use the `POST /api/dir_path` endpoint to specify the path to your PDFs directory.
- Launch the Crawler: `make run-crawler`
- Query Your PDFs: Utilize the API to perform your document searches.
For each PDF:
- `pdf_extract` extracts each page's content in a loop and returns a list of `PDFPages` -> `pages`
- For each page in `pages`:
  - `normalize(page.content)` normalizes the content of the page -> `normalized`
  - `get_len_safe_embedding(normalized)` -> `embeddings`:
    - Chunks the text using a generator with `yield`; for each chunk:
      - `get_embedding(chunk, model)` makes an API call to OpenAI to get the embedding vector
      - Puts each embedding into a list
    - Returns the list of embeddings
  - For each embedding in `embeddings`:
    - Prepares a record to be inserted into Milvus
    - Puts that record in a list
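The chunk-and-embed steps above can be sketched roughly as follows. This is a simplified sketch, not Hermes' actual code: the chunk size, the model name, and the fake embedding vector are assumptions, and the real `get_embedding` would call the OpenAI embeddings API.

```python
from typing import Iterator, List

CHUNK_SIZE = 1000  # assumed chunk size in characters; the real code may chunk by tokens


def chunk_text(text: str, size: int = CHUNK_SIZE) -> Iterator[str]:
    """Generator that yields fixed-size chunks of the normalized page text."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


def get_embedding(chunk: str, model: str = "text-embedding-ada-002") -> List[float]:
    """Placeholder for the OpenAI call; a real version would use
    client.embeddings.create(input=chunk, model=model)."""
    return [0.0] * 1536  # fake vector; 1536 is a typical embedding dimension


def get_len_safe_embedding(normalized: str) -> List[List[float]]:
    """Embed each chunk of the page and collect the vectors into a list."""
    embeddings = []
    for chunk in chunk_text(normalized):
        embeddings.append(get_embedding(chunk))
    return embeddings
```

The generator keeps memory flat even for long pages: chunks are produced one at a time instead of materializing them all up front.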
There are many improvements to be made to the current approach. The one step that really must stay synchronous is parsing a single PDF file to get its pages. Once we have the list of pages, we can start triggering concurrent/parallel jobs. So let's think about data ingestion in terms of a single page. For each page we need to:
- Normalization
- Split a page into chunks
- For each chunk we need to get an embedding
- For each embedding, prepare a record and insert it into the DB
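The record-preparation step might look like the sketch below. The field names (`source`, `page`, `chunk`, `embedding`) are assumptions for illustration, not Hermes' actual Milvus schema.

```python
from typing import Any, Dict, List


def prepare_records(source: str, page_num: int,
                    embeddings: List[List[float]]) -> List[Dict[str, Any]]:
    """Build one insertable record per embedding vector of a page."""
    records = []
    for chunk_idx, vector in enumerate(embeddings):
        records.append({
            "source": source,      # path of the PDF the chunk came from
            "page": page_num,      # page number within the PDF
            "chunk": chunk_idx,    # chunk index within the page
            "embedding": vector,   # the vector to index in Milvus
        })
    return records
```

A batch of such records can then be handed to the Milvus client in a single insert call rather than one call per embedding.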
Here are the tasks the algorithm performs and their types:
- Parsing pdfs using pdfium: mostly I/O bound due to reads and writes to disk.
- Embedding: purely I/O bound, since it is an API call.
- Insertion to DB: I/O bound.
This tells us we should use either `ThreadPoolExecutor` or `asyncio`. We don't need `ProcessPoolExecutor`, because there are no CPU-bound tasks that would require separate Python interpreters.
Let's try the `asyncio` approach: define a job that operates in terms of a single page, and run that job for each page concurrently.
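A minimal sketch of that per-page job, under stated assumptions: the normalization, the chunk size, and the simulated embedding call are placeholders. The real code would await the `AsyncOpenAI` client's `embeddings.create` instead of sleeping.

```python
import asyncio
from typing import List


async def embed_chunk(chunk: str) -> List[float]:
    """Stand-in for an AsyncOpenAI embeddings call; the sleep simulates
    the network round-trip."""
    await asyncio.sleep(0.01)
    return [0.0] * 1536


async def ingest_page(content: str, chunk_size: int = 1000) -> List[List[float]]:
    """The per-page job: normalize, chunk, then embed all chunks concurrently."""
    normalized = " ".join(content.split())  # placeholder normalization
    chunks = [normalized[i:i + chunk_size]
              for i in range(0, len(normalized), chunk_size)]
    return await asyncio.gather(*(embed_chunk(c) for c in chunks))


async def ingest_pdf(pages: List[str]) -> List[List[List[float]]]:
    """Run the per-page job for every page of the PDF concurrently."""
    return await asyncio.gather(*(ingest_page(p) for p in pages))
```

Because every embedding call is in flight at once rather than one after another, total wall-clock time approaches the latency of the slowest call instead of the sum of all of them.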
After the refactoring, ingesting PDFs into the Vector DB takes almost 10x less time.
Input: 2 pdfs with < 30 pages
- Initial sequential approach: ~79 seconds
- Improved concurrent approach with the `AsyncOpenAI` client: ~10 seconds