Epithet is a lightweight Python utility that extracts titles and hierarchical headings from PDF files and organizes them into a structured outline. The entire workflow is performed offline, making it fast, secure, and privacy-friendly. It generates a JSON summary of any PDF document’s major sections.
- 📄 Parses PDF text and analyzes font metadata
- 🧠 Automatically detects:
  - Document title
  - H1, H2, and H3 headings based on font size hierarchy
- ⚡ Fully offline with quick processing (~4.5s per document)
- ✅ Outputs a structured and schema-validated JSON outline

Epithet follows these steps:
- Parse PDF: Uses PyMuPDF (fitz) to extract:
  - Text spans
  - Font size, style, and positional data
- Detect Headings:
  - The largest font is treated as the title
  - Larger-than-body fonts are categorized as H1, H2, or H3 based on size thresholds
- Generate Output: Produces a JSON file containing:
  - The title
  - A list of headings with:
    - Hierarchical level
    - Page number

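The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Epithet's actual code: the function names `extract_spans` and `build_outline`, the output keys (`title`, `outline`, `level`, `text`, `page`), the use of the font size carrying the most characters as the body size, and the rule that the next three distinct sizes below the title map to H1, H2, and H3 are all hypothetical. The source only mentions "size thresholds", so a simple size ranking is swapped in here. The PyMuPDF calls used are `fitz.open` and `page.get_text("dict")`.

```python
from collections import Counter


def extract_spans(pdf_path):
    """Collect (text, font_size, page_number) tuples with PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so build_outline works without it

    spans = []
    doc = fitz.open(pdf_path)
    try:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                # Image blocks carry no "lines" key, so default to empty.
                for line in block.get("lines", []):
                    for span in line["spans"]:
                        text = span["text"].strip()
                        if text:
                            # page.number is 0-based; report 1-based pages.
                            spans.append((text, round(span["size"], 1), page.number + 1))
    finally:
        doc.close()
    return spans


def build_outline(spans):
    """Turn (text, size, page) tuples into a title plus H1-H3 headings."""
    if not spans:
        return {"title": "", "outline": []}

    # Assumption: the body font is the size that carries the most characters.
    weight = Counter()
    for text, size, _page in spans:
        weight[size] += len(text)
    body_size = weight.most_common(1)[0][0]

    # The largest font is treated as the title (only if it exceeds body size).
    title_size = max(size for _text, size, _page in spans)
    title = ""
    if title_size > body_size:
        title = " ".join(t for t, s, _p in spans if s == title_size)

    # Assumption: distinct sizes between body and title, largest first,
    # become H1, H2, H3; any further sizes are ignored.
    between = {s for _t, s, _p in spans if body_size < s < title_size}
    heading_sizes = sorted(between, reverse=True)[:3]
    level_for = {s: f"H{i}" for i, s in enumerate(heading_sizes, start=1)}

    outline = [
        {"level": level_for[s], "text": t, "page": p}
        for t, s, p in spans
        if s in level_for
    ]
    return {"title": title, "outline": outline}
```

Under these assumptions, `build_outline(extract_spans("input/your_input_file.pdf"))` would return a dictionary ready to be serialized as JSON. Ranking sizes instead of using fixed point thresholds keeps the sketch independent of any particular document's typography.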
- PyMuPDF (fitz): PDF parsing
- jsonschema: validation of the JSON output
- Standard Python libraries: re, os, json, pathlib, logging

Install with:
pip install pymupdf jsonschema

EPITHET1A/
├── app/
│   ├── process_pdfs.py
│   ├── Dockerfile
│   ├── README.md
│   ├── requirement.txt
│   ├── sample_dataset/
│   ├── input/
│   ├── output/
│   └── schema/
│       └── output_schema.json

- Prepare Input: Place PDFs in the input/ folder.
- Run Script: From the app/ directory, run:
  python process_pdfs.py input/your_input_file.pdf
- View Output: The structured outline will be saved as a JSON file in the output/ folder.

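As a rough illustration of the last step, the sketch below shows one way an outline could be written to the output/ folder. The helper names `output_path_for` and `write_outline` are hypothetical, and naming the JSON file after the input PDF's stem is an assumption; the actual naming used by process_pdfs.py is not stated here.

```python
import json
from pathlib import Path


def output_path_for(pdf_path, output_dir="output"):
    """Map input/report.pdf to output/report.json (naming is an assumption)."""
    return Path(output_dir) / (Path(pdf_path).stem + ".json")


def write_outline(outline, pdf_path, output_dir="output"):
    """Serialize the outline dict as UTF-8 JSON and return the written path."""
    target = output_path_for(pdf_path, output_dir)
    # Create the output folder if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(
        json.dumps(outline, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return target
```

Passing `ensure_ascii=False` keeps non-ASCII heading text readable in the resulting file.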
All output is validated against:
schema/output_schema.json
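A validation step of this kind might look like the sketch below, which uses the `jsonschema` package from the install command. The contents of schema/output_schema.json are not shown in this document, so `EXAMPLE_SCHEMA` and the helper `schema_errors` are purely hypothetical stand-ins; the real schema may use different keys and constraints.

```python
import jsonschema

# Hypothetical schema for illustration only; the real
# schema/output_schema.json is not reproduced in this README.
EXAMPLE_SCHEMA = {
    "type": "object",
    "required": ["title", "outline"],
    "properties": {
        "title": {"type": "string"},
        "outline": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["level", "text", "page"],
                "properties": {
                    "level": {"enum": ["H1", "H2", "H3"]},
                    "text": {"type": "string"},
                    "page": {"type": "integer", "minimum": 1},
                },
            },
        },
    },
}


def schema_errors(outline, schema=EXAMPLE_SCHEMA):
    """Return a list of validation error messages (empty if valid)."""
    validator = jsonschema.Draft7Validator(schema)
    return [error.message for error in validator.iter_errors(outline)]
```

In practice the schema would be loaded from disk with `json.load` rather than defined inline. Collecting all errors with `iter_errors` instead of stopping at the first one makes it easier to see everything wrong with a given output file.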

docker build -t epithet:latest .

docker run \
  -v "/path/to/EPITHET1A/input:/app/input" \
  -v "/path/to/EPITHET1A/output:/app/output" \
  --network none \
  epithet:latest

Replace /path/to/EPITHET1A/ with the actual local path.

- ✅ Offline-only — No external requests
- ⚡ Average Speed — ~4.5 seconds per PDF
- 🔗 Output Format — Clean JSON for downstream use or integration