📘 Epithet: An Offline PDF Outline Extractor

Epithet is a lightweight Python utility that extracts the title and hierarchical headings from PDF files and writes them to a structured JSON outline of the document's major sections. The entire workflow runs offline, which keeps it fast and means no document content leaves your machine.

🔍 Features

📄 Parses PDF text and analyzes font metadata
🧠 Automatically detects:
- Document title
- H1, H2, and H3 headings based on font size hierarchy
⚡ Fully offline with quick processing (~4.5s per document)
✅ Outputs a structured and schema-validated JSON outline

🧠 How It Works

Epithet follows these steps:

1. Parse PDF: Uses PyMuPDF (fitz) to extract:
   - Text spans
   - Font size, style, and positional data
2. Detect Headings:
   - Text in the largest font size is treated as the title
   - Text set in a font larger than the body text is classified as H1, H2, or H3 based on size thresholds
3. Generate Output: Produces a JSON file containing:
   - The title
   - A list of headings, each with:
     - Hierarchical level
     - Page number
📦 Dependencies

PyMuPDF (fitz) – for PDF parsing
jsonschema – for validating the output against the schema
Standard Python libraries:
- re, os, json, pathlib, logging

Install with:

pip install pymupdf jsonschema

📁 Project Structure

EPITHET1A/
├── app/
│   ├── process_pdfs.py
│   ├── Dockerfile
│   ├── README.md
│   ├── requirements.txt
│   ├── sample_dataset/
│   ├── input/
│   ├── output/
│   └── schema/
│       └── output_schema.json

🚀 Usage

🔧 Local Execution

Prepare Input: Place PDFs in the input/ folder.

Run Script

From the app/ directory, run:

python process_pdfs.py input/your_input_file.pdf

View Output: The structured outline is saved as a JSON file in the output/ folder.
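
As a rough sketch of the last step, the snippet below writes an outline dictionary into an output folder. The helper names and the choice to reuse the input file's stem for the JSON file name are assumptions for illustration; the README does not say how process_pdfs.py names its output files.

```python
import json
from pathlib import Path


def output_path_for(pdf_path, output_dir="output"):
    """Map input/report.pdf to output/report.json (assumed naming rule)."""
    return Path(output_dir) / (Path(pdf_path).stem + ".json")


def write_outline(outline, pdf_path, output_dir="output"):
    """Serialize the outline as indented UTF-8 JSON and return the file path."""
    target = output_path_for(pdf_path, output_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(
        json.dumps(outline, indent=2, ensure_ascii=False),
        encoding="utf-8",
    )
    return target
```

Passing ensure_ascii=False keeps non-ASCII heading text readable in the JSON file instead of escaping it.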

✅ Schema Validation

All output is validated against: schema/output_schema.json
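
A validation step along these lines could be written with the jsonschema package. The schema below is a hypothetical stand-in, since the contents of schema/output_schema.json are not shown in this README, and the helper names are made up for the example.

```python
import json
from pathlib import Path

# Hypothetical stand-in; the real schema/output_schema.json may differ.
EXAMPLE_SCHEMA = {
    "type": "object",
    "required": ["title", "outline"],
    "properties": {
        "title": {"type": "string"},
        "outline": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["level", "text", "page"],
                "properties": {
                    "level": {"enum": ["H1", "H2", "H3"]},
                    "text": {"type": "string"},
                    "page": {"type": "integer"},
                },
            },
        },
    },
}


def load_schema(schema_path):
    """Load a JSON Schema document from disk."""
    return json.loads(Path(schema_path).read_text(encoding="utf-8"))


def validation_errors(outline, schema):
    """Return a list of human-readable schema violations (empty if valid)."""
    from jsonschema import Draft7Validator  # pip install jsonschema

    validator = Draft7Validator(schema)
    return [error.message for error in validator.iter_errors(outline)]
```

In practice the schema would come from load_schema("schema/output_schema.json") rather than the inline example. Collecting every error with iter_errors, instead of stopping at the first, makes it easier to see everything that is wrong with an outline in one pass.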

🐳 Docker Usage

Build the Image

docker build -t epithet:latest .

Run the Container

docker run \
  -v "/path/to/EPITHET1A/input:/app/input" \
  -v "/path/to/EPITHET1A/output:/app/output" \
  --network none \
  epithet:latest

Replace /path/to/EPITHET1A/ with the actual local path to the project.

📈 Performance

✅ Offline-only — No external requests
⚡ Average Speed — ~4.5 seconds per PDF
🔗 Output Format — Clean JSON for downstream use or integration

BipashaBi/Epithet

Folders and files

Latest commit

History

Repository files navigation

📘 Epithet: An Offline PDF Outline Extractor

🔍 Features

🧠 How It Works

📦 Dependencies

📁 Project Structure

🚀 Usage

🔧 Local Execution

✅ Schema Validation

🐳 Docker Usage

Build the Image

Run the Container

📈 Performance

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages