This FastAPI service extracts text content from a wide variety of document formats (PDF, DOCX, PPTX, EPUB, HTML, TXT, etc.) using markitdown
. It returns the content as an array of strings, one for each logical page, slide, or section.
- Supports multiple document formats
- Returns page-wise content as a JSON array
- Automatically detects file type via content-type
- FastAPI + Uvicorn app, easy to deploy
- PDF (
application/pdf
) - DOCX / Word
- PPTX / PowerPoint
- EPUB
- HTML
- Markdown
- TXT
- CSV
- In OpenWebUI, configure for Document Extractor external and as url
http://localhost:5000
(you can change this based on your needs) - Locally,
curl -X POST http://localhost:5000/process -H "Content-Type: application/pdf" --data-binary @file.pdf
- PORT=5000
- LLM_TOKEN # You can use a LLM to read images
- LLM_MODEL # The LLM model
- LLM_URL # The URL of an OpenAI compatible provider
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
./main.py