This project provides a wrapper around PaperQA, a Retrieval-Augmented Generation (RAG) pipeline centred around academic papers. It aims to make scientific literature more accessible and interactive, enabling researchers to quickly find relevant information in large collections of academic publications.
This wrapper was written specifically to suit my requirements with minimal setup, allowing seamless integration with any LLM served through Ollama, as well as OpenAI's GPT models. The wrapper adds some control over your PDF embeddings, support for reasoning LLMs (like the new DeepSeek R1), and some formatting of the responses. This project includes the PaperQA module as a submodule (forked from the original repository); the fork contains changes for this use case.
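For context, reasoning models such as DeepSeek R1 wrap their chain of thought in `<think>…</think>` tags before the final answer. The snippet below is a minimal, hypothetical illustration of the kind of post-processing this implies; it is not the wrapper's actual implementation.

```python
import re

def strip_reasoning(response: str) -> str:
    # Remove the <think>...</think> block that reasoning models emit
    # before their final answer, then trim surrounding whitespace.
    # Hypothetical helper -- illustrates the idea, not this wrapper's code.
    return re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()

print(strip_reasoning("<think>Weighing the evidence...</think>The answer is 42."))
# -> "The answer is 42."
```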
```bash
git clone --recursive https://github.com/foreverallama/paperqa-wrapper
cd paperqa-wrapper
```

If you've already cloned the repository but didn't initialize the submodule, you can do so by running:

```bash
git submodule update --init --recursive
```
```bash
conda create --name paperqa-env python=3.11
conda activate paperqa-env
pip install -r requirements.txt
```

The submodule is a fork of the original repository with some slight modifications to `paperqa/sources/core.py`. These modifications are used to better handle the LLM responses. To install this version, navigate to the `paperqa` directory and install it with pip:

```bash
cd paperqa
pip install .
```
Alternatively, you can manually copy the modified `core.py` file into your environment's installation directory.
To use PaperQA with OpenAI's models or a custom LLM, you need to set the `OPENAI_API_KEY` environment variable.

On Windows:

```bash
set OPENAI_API_KEY=your_openai_api_key
```

On Linux/macOS:

```bash
export OPENAI_API_KEY=your_openai_api_key
```

For those interested in running a local LLM using Ollama:
- Install Ollama: Download and install [Ollama](https://ollama.com).
- Start the LLM: Use the following command to start the model locally:

```bash
ollama run model_name
```
I wrote this to try out the DeepSeek R1 model, but you can run it with any LLM available on Ollama. The project automatically tries to detect the running Ollama configuration. If you are running multiple models or using a non-default port, you may need to reconfigure this in `settings.py` and `utils.py`.
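As an illustration, such auto-detection can be done by querying Ollama's REST API: the `/api/tags` endpoint on the default port 11434 lists locally available models. This is a minimal sketch with a hypothetical `detect_ollama_models` helper; the actual logic in `utils.py` may differ.

```python
import requests

OLLAMA_URL = "http://localhost:11434"  # Ollama's default host and port

def detect_ollama_models(base_url: str = OLLAMA_URL) -> list[str]:
    # /api/tags returns {"models": [{"name": "deepseek-r1:7b", ...}, ...]}
    response = requests.get(f"{base_url}/api/tags", timeout=5)
    response.raise_for_status()
    return [model["name"] for model in response.json()["models"]]

print(detect_ollama_models())  # e.g. ['deepseek-r1:7b', 'llama3:8b']
```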
The wrapper provides a command-line interface (CLI) for adding documents and querying the database.
Create embeddings from a directory of papers:
```bash
python main.py add --paper_dir /path/to/pdf/folder
```

- `--paper_dir`: Specify the paper directory. Default: `papers/`
- `--file_path`: Specify the path to load/save the indexed Docs object. Default: `paper_index/docs.pkl`
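Conceptually, the `add` command boils down to embedding each PDF and pickling the resulting index. The sketch below assumes the classic synchronous `Docs.add` API from earlier PaperQA releases (the `Docs` object is picklable); the wrapper's actual code may differ, and `build_index` is a hypothetical name.

```python
import pickle
from pathlib import Path

from paperqa import Docs

def build_index(paper_dir: str, file_path: str) -> None:
    docs = Docs()
    for pdf in sorted(Path(paper_dir).glob("*.pdf")):
        docs.add(str(pdf))  # parse, chunk, and embed the paper
    with open(file_path, "wb") as f:
        pickle.dump(docs, f)  # persist so later queries skip re-embedding

build_index("papers/", "paper_index/docs.pkl")
```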
Ask questions based on indexed documents:
```bash
python main.py query "What are some possible research challenges in deep learning?"
```

- `--file_path`: Specify the path to load the indexed Docs object. Default: `paper_index/docs.pkl`
- `--llm [gpt|ollama]`: Specify the LLM configuration to use. Default: `ollama`
- `--verbose [0-3]`: Control the verbosity level
- `--ollama_model "model_name"`: Specify a custom model for Ollama
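Under the hood, a query loads the pickled index and asks it the question. Here is a rough sketch assuming the classic `Docs.query` API, whose answer object carries the response text plus citations; the wrapper layers LLM selection and response formatting on top of this.

```python
import pickle

# Load the index built by `main.py add` instead of re-embedding the PDFs
with open("paper_index/docs.pkl", "rb") as f:
    docs = pickle.load(f)

# Retrieve the most relevant chunks and synthesize a cited answer
answer = docs.query("What are some possible research challenges in deep learning?")
print(answer.formatted_answer)  # answer text followed by its citations
```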
Scientific research involves reviewing and analyzing a large number of papers, which can quickly become overwhelming. While AI tools like ChatGPT are popular, simply uploading PDFs to them does not guarantee reliable, source-backed answers: the model may respond from its general knowledge rather than from your own documents.
PaperQA helps solve this problem by allowing you to ask specific questions directly to your collection of PDFs. Instead of returning broad, general AI-generated responses, PaperQA finds the most relevant sections from your papers, extracts key information, and presents an answer with proper citations. This ensures that every response is backed by sources you provide.

Automatically generated using eraser.io
This project includes a forked version of PaperQA with a slight modification to handle specific edge cases related to LLM response formatting. The fork is added as a submodule within this repository, which lets users get the necessary modifications without manually altering the original code.
Visit the original repository to learn more about its capabilities and further customization options.
[1] Lála, Jakub, et al. "PaperQA: Retrieval-Augmented Generative Agent for Scientific Research." arXiv preprint arXiv:2312.07559, 2023.
[2] Skarlinski, Michael D., et al. "Language Agents Achieve Superhuman Synthesis of Scientific Knowledge." arXiv preprint arXiv:2409.13740, 2024.