End‑to‑end pipeline to:
- Convert energy / invoice PDFs to Markdown (Docling or Marker)
- Scan & parse Swiss QR bill codes (WeChat QR + OpenCV fallback)
- Look up / enrich customer data from Postgres
- Produce structured JSON output via LangChain + OpenAI-compatible model
- Trace runs and sub-steps with LangSmith
| Area | Tools |
|---|---|
| PDF Parsing | Docling, Marker |
| LLM / Structured Output | LangChain, OpenAI-compatible chat model |
| Tracing / Observability | LangSmith |
| QR Detection | WeChat QR (Caffe), OpenCV |
| DB | Postgres (via psycopg) |
| Orchestration | Custom CLI + runners |
Conda:
conda env create -f environment.yml
conda activate invoice-chain-ai
pip install -r requirements.txt
Copy .env.example to .env and adjust:
- `DATABASE_URL`
- `OPENAI_API_KEY`
- `STRUCTURED_OUTPUT_MODEL` (optional; falls back to a default)
- LangSmith keys if tracing enabled
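
As a rough illustration of how these variables might be consumed, the sketch below reads them from the environment and fails early when a required one is missing. It is not the project's actual loader: the `Settings` class, the `load_settings` helper, the `gpt-4o-mini` fallback, the choice of `LANGSMITH_API_KEY` as the tracing key, and the decision to treat `DATABASE_URL` and `OPENAI_API_KEY` as required are all assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Hypothetical fallback model name; the project's real default is not stated.
DEFAULT_MODEL = "gpt-4o-mini"


@dataclass(frozen=True)
class Settings:
    database_url: str
    openai_api_key: str
    structured_output_model: str
    langsmith_api_key: Optional[str]  # None means no LangSmith key was provided


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the .env-style variables; raise if a required one is missing."""
    # Assumption: only these two variables are strictly required.
    missing = [k for k in ("DATABASE_URL", "OPENAI_API_KEY") if not env.get(k)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return Settings(
        database_url=env["DATABASE_URL"],
        openai_api_key=env["OPENAI_API_KEY"],
        structured_output_model=env.get("STRUCTURED_OUTPUT_MODEL") or DEFAULT_MODEL,
        langsmith_api_key=env.get("LANGSMITH_API_KEY") or None,
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in tests.
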
Optional: verify DB
docker compose up -d
python -m invoice_chain_ai.db.seed

(Entry point module: invoice_chain_ai.main)

python -m invoice_chain_ai.main --pdf path\to\file.pdf --parser docling
python -m invoice_chain_ai.main --pdf path\to\file.pdf --parser marker --use-llm
python -m invoice_chain_ai.main --pdf path\to\file.pdf --qr
python -m invoice_chain_ai.main --pdf path\to\file.pdf --parser marker --use-llm --structured-output
python -m invoice_chain_ai.main --run-dir .\invoice_chain_ai\output\some_run --structured-output

python -m invoice_chain_ai.main --pdf .\training_data\sig\10300992.pdf --parser marker --use-llm
python -m invoice_chain_ai.main --pdf .\training_data\sig\10300992.pdf --parser docling
python -m invoice_chain_ai.main --pdf .\training_data\sig\10300992.pdf --qr
python -m invoice_chain_ai.main --run-dir .\invoice_chain_ai\output\sig_10300992 --structured-output

Each run creates:
invoice_chain_ai/output/<basename>_<filename>/
original.pdf
<basename>.marker.md | <basename>.docling.md
qr.json # extracted QR code data
customer.json # customer prompt from DB
structured_output.json # final structured output
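
The layout above can be checked before re-running a step with `--run-dir`. The sketch below is only a guess at the naming rule: it assumes the run folder is `<parent folder>_<file stem>`, inferred from the `sig_10300992` example. The helper names `run_dir_name` and `missing_artifacts` are hypothetical and not part of the package.

```python
from pathlib import Path, PurePath

# Fixed artifact names taken from the run layout above
# (the parser-specific Markdown file is left out).
FIXED_ARTIFACTS = ("original.pdf", "qr.json", "customer.json", "structured_output.json")


def run_dir_name(pdf_path: str) -> str:
    """Guess the run folder name as <parent folder>_<file stem> (assumption)."""
    p = PurePath(pdf_path.replace("\\", "/"))  # accept Windows-style separators
    return f"{p.parent.name}_{p.stem}"


def missing_artifacts(run_dir: Path) -> list:
    """List the fixed artifacts that are not yet present in a run directory."""
    return [name for name in FIXED_ARTIFACTS if not (run_dir / name).is_file()]
```

For example, `run_dir_name(r".\training_data\sig\10300992.pdf")` yields `sig_10300992` under this assumption.
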
- Runners / orchestration: `invoice_chain_ai/runners.py`
- CLI: `invoice_chain_ai/cli.py`
- QR decode + parsing: `invoice_chain_ai/qr.py`
- Structured output (LLM): `invoice_chain_ai/structured_output.py`
- Post-processing: `invoice_chain_ai/postprocess_bz.py`
- IO helpers: `invoice_chain_ai/io_utils.py`
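
For orientation, the flags used in the commands earlier in this README could be declared with `argparse` roughly as follows. This is a guess, not the contents of `cli.py`: the help texts, the `build_parser` name, and the choice to make `--pdf` and `--run-dir` mutually exclusive are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Sketch of a parser accepting the flags shown in the usage commands."""
    parser = argparse.ArgumentParser(prog="invoice_chain_ai.main")
    # Assumption: a run starts either from a PDF or from an existing run directory.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--pdf", help="path to the invoice PDF")
    source.add_argument("--run-dir", help="existing run directory to reuse")
    parser.add_argument("--parser", choices=["docling", "marker"],
                        help="PDF-to-Markdown backend")
    parser.add_argument("--use-llm", action="store_true",
                        help="enable LLM assistance during conversion")
    parser.add_argument("--qr", action="store_true",
                        help="scan Swiss QR bill codes")
    parser.add_argument("--structured-output", action="store_true",
                        help="produce structured_output.json")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```
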
- Structured extraction uses `ChatOpenAI.with_structured_output(...)` for schema-safe JSON.
- Each step (`Scan QR Code`, `Convert PDF to Markdown`, structured output) is decorated with `@traceable`, enabling hierarchical traces in LangSmith.
- Runnable wrapping in runners assigns readable run names.
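
A minimal sketch of how a traced structured-output step might look is given here. It is not the code in `structured_output.py`: the `INVOICE_SCHEMA` fields, the `extract_invoice` name, the run name, and the `gpt-4o-mini` fallback are hypothetical. The no-op `traceable` fallback is an addition so the module still imports when `langsmith` is not installed. Calling the function requires `langchain-openai` and a valid `OPENAI_API_KEY`.

```python
import os

try:
    from langsmith import traceable
except ImportError:
    # Fallback (assumption): a no-op decorator so the sketch works without langsmith.
    def traceable(*_args, **_kwargs):
        def wrap(fn):
            return fn
        return wrap

# Hypothetical schema; the project's real output schema is not shown in this README.
INVOICE_SCHEMA = {
    "title": "Invoice",
    "description": "Key fields extracted from an energy invoice.",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["invoice_number", "total_amount", "currency"],
}


@traceable(name="Structured Output")
def extract_invoice(markdown: str) -> dict:
    """Ask the chat model for JSON matching INVOICE_SCHEMA."""
    from langchain_openai import ChatOpenAI  # imported lazily; optional dependency

    model_name = os.environ.get("STRUCTURED_OUTPUT_MODEL", "gpt-4o-mini")
    llm = ChatOpenAI(model=model_name, temperature=0)
    structured_llm = llm.with_structured_output(INVOICE_SCHEMA)
    return structured_llm.invoke(
        f"Extract the invoice fields from this document:\n\n{markdown}"
    )
```

Passing a JSON Schema dict makes the model return a plain dict; a Pydantic model could be used instead if validated objects are preferred.
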
Pipeline:
- Render each PDF page (PyMuPDF)
- Preprocess (grayscale / contrast)
- Try WeChat QR detector (if model assets present)
- Fall back to OpenCV multi / single detection
- Parse the Swiss Payment Code payload (identified by its leading SPC header)
- Normalize into structured invoice + addresses
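
The last two steps can be sketched as follows. The field positions follow the Swiss QR bill payload layout (one field per line, `SPC` header, `EPD` trailer). The output keys and the `parse_spc` / `_address` helper names are hypothetical, and the real `qr.py` may normalize differently.

```python
from decimal import Decimal
from typing import List, Optional

# Each address block in the payload spans seven consecutive lines.
ADDRESS_KEYS = ("address_type", "name", "line1", "line2", "postal_code", "town", "country")


def _address(lines: List[str], start: int) -> Optional[dict]:
    """Read the 7-line address block at `start`; None when every line is empty."""
    block = dict(zip(ADDRESS_KEYS, lines[start:start + 7]))
    return block if any(block.values()) else None


def parse_spc(payload: str) -> dict:
    """Parse a decoded Swiss QR bill payload into a small dict."""
    lines = [ln.strip() for ln in payload.splitlines()]
    if len(lines) < 31 or lines[0] != "SPC":
        raise ValueError("not a Swiss Payment Code payload")
    if lines[30] != "EPD":
        raise ValueError("missing EPD trailer")
    return {
        "version": lines[1],
        "iban": lines[3],
        "creditor": _address(lines, 4),
        # Lines 11-17 hold the (currently unused) ultimate creditor block.
        "amount": Decimal(lines[18]) if lines[18] else None,
        "currency": lines[19],
        "debtor": _address(lines, 20),
        "reference_type": lines[27],
        "reference": lines[28],
        "message": lines[29],
    }
```

`Decimal` is used for the amount to avoid floating-point rounding on monetary values.
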
Models expected in:
invoice_chain_ai/WeChatQR/
detect.prototxt
detect.caffemodel
sr.prototxt
sr.caffemodel
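
The detector fallback described in the pipeline could look roughly like this. It is a sketch under assumptions, not the code in `qr.py`: `wechat_assets` and `decode_qr` are hypothetical names, and the WeChat detector is only available when `opencv-contrib-python` is installed.

```python
from pathlib import Path
from typing import List, Optional

# Order matches the WeChatQRCode constructor: detector pair, then super-resolution pair.
WECHAT_FILES = ("detect.prototxt", "detect.caffemodel", "sr.prototxt", "sr.caffemodel")


def wechat_assets(model_dir: Path) -> Optional[List[str]]:
    """Return the four model paths, or None if any file is missing."""
    paths = [model_dir / name for name in WECHAT_FILES]
    return [str(p) for p in paths] if all(p.is_file() for p in paths) else None


def decode_qr(image, model_dir: Path = Path("invoice_chain_ai/WeChatQR")) -> List[str]:
    """Try the WeChat detector first, then OpenCV's built-in QRCodeDetector."""
    import cv2  # imported lazily; wechat_qrcode needs the contrib build

    assets = wechat_assets(model_dir)
    if assets is not None and hasattr(cv2, "wechat_qrcode_WeChatQRCode"):
        texts, _points = cv2.wechat_qrcode_WeChatQRCode(*assets).detectAndDecode(image)
        if texts:
            return list(texts)

    detector = cv2.QRCodeDetector()
    # Multi-code detection first, then a single-code attempt.
    ok, texts, _points, _straight = detector.detectAndDecodeMulti(image)
    if ok:
        found = [t for t in texts if t]
        if found:
            return found
    text, _points, _straight = detector.detectAndDecode(image)
    return [text] if text else []
```

`image` is expected to be a NumPy array such as a rendered and preprocessed PDF page.
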

