bsorrentino/pdf-tools

pdf-tools

Tools to extract/transform data from PDF

inspired by project: pdf-to-markdown

Installation

npm install @bsorrentino/pdf-tools -g

Requirements

  • Node.js >= 16
  • pdf-tools uses canvas, a Cairo-backed Canvas implementation for Node.js, so check the canvas requirements before installing
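
A minimal sketch for checking the Node.js requirement before installing. The helper names `node_major` and `node_ok` are made up for this example and are not part of pdf-tools; the script only compares the major version against 16 and does not check the canvas prerequisites.

```shell
# Print the major version from a string such as "v18.17.1".
node_major() {
  local v="${1#v}"   # drop the leading "v"
  echo "${v%%.*}"    # keep everything before the first dot
}

# Succeed when the given version string meets the documented minimum (16).
node_ok() {
  [ "$(node_major "$1")" -ge 16 ]
}

if command -v node >/dev/null 2>&1; then
  if node_ok "$(node --version)"; then
    echo "Node.js $(node --version) is supported"
  else
    echo "Node.js $(node --version) is too old; version 16 or newer is required"
  fi
else
  echo "Node.js is not installed"
fi
```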

pdftools Commands

common options

 -o, --outdir [folder]        output folder (default: "out")

pdfximages

extract images (as PNG) from a PDF and save them to the given folder

Usage:

pdftools pdfximages|pxi [options] <pdf>

pdf2images

create an image (as PNG) for each PDF page

Usage:

pdftools pdf2images|p2i <pdf>
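
A short sketch of how the two image commands above might be invoked, using the short aliases and the common `-o` option. The function names, `document.pdf` and the `images` folder are placeholders for illustration, and placing `-o` after the subcommand is an assumption based on the usage lines.

```shell
# Extract the images embedded in a PDF into a chosen folder.
# $1 = PDF path, $2 = output folder (both supplied by the caller).
extract_images() {
  pdftools pxi -o "$2" "$1"
}

# Render each page of a PDF as a PNG image.
# $1 = PDF path.
render_pages() {
  pdftools p2i "$1"
}

# "document.pdf" and "images" are placeholder names for illustration.
if command -v pdftools >/dev/null 2>&1 && [ -f document.pdf ]; then
  extract_images document.pdf images
  render_pages document.pdf
else
  echo "pdftools or document.pdf not found; skipping"
fi
```

Without `-o`, output presumably goes to the default `out` folder listed under common options.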

pdf2md

convert pdf to markdown format.

Usage:

pdftools pdf2md|p2md [options] <pdf>

Options:

  -ps, --pageseparator [separator]  add page separator (default: "---")
  --imageurl [url prefix]           image url prefix
  --stats                           print stats information
  --debug                           print debug information
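
As an illustration, the options above could be combined as follows. This is a sketch, not a verified invocation: the function name, `report.pdf`, the `markdown` folder, the `images/` prefix and the `* * *` separator are all placeholder values chosen for the example.

```shell
# Convert a PDF to Markdown using the documented pdf2md options.
# $1 = PDF path, $2 = output folder (both supplied by the caller).
convert_to_markdown() {
  # "* * *" and "images/" are example values, not defaults of the tool.
  pdftools p2md -o "$2" --pageseparator "* * *" --imageurl "images/" --stats "$1"
}

# "report.pdf" and "markdown" are placeholder names for illustration.
if command -v pdftools >/dev/null 2>&1 && [ -f report.pdf ]; then
  convert_to_markdown report.pdf markdown
else
  echo "pdftools or report.pdf not found; skipping"
fi
```

Adding `--debug` to the same command line should print the debug information described above.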

Conversion to Markdown

supported features

  • Detect headers
  • Detect and extract images
  • Extract plain text
  • Extract fonts and allow custom mapping through a generated file <document name>.font.json

    Supported font styles: bold, italic, monospace, bold+italic

  • Detect code blocks (i.e. ```)
  • Detect external links

TO DO

  • Detect TOC