PDF Extraction Resources

A list of tools and resources for working with PDF files. Adapted from https://pdfliberation.wordpress.com/

Open source PDF technologies

Apache PDFBox - General purpose PDF library written in Java.
Tabula - Open source PDF table extraction tool written in Java and Ruby by Manuel Aristarán. Makes calls to PDFBox. Table extraction powered by http://github.com/jazzido/tabula-extractor.
PDF Extraction Toolkit - Java framework built on PDFBox by Tamir Hassan for performing document analysis of PDF files and creating custom conversion methods to HTML and other formats.
PDFExtract - Text extraction library that extends both PDFBox and Poppler. Written in Java by Øyvind Berg, the tool is no longer under active development but may contain code that can be reused by hackathon participants. Download Page: http://elacin.github.io/PDFExtract/.
PDF2SVG - Java tool developed by Peter Murray-Rust that converts PDFs to Scalable Vector Graphics (SVG) files that can be rendered by most modern browsers. PDF2SVG, which is based on PDFBox, is a component of the larger AMI suite of open source tools created for the purpose of liberating scientific documents. Another component, SVG2XML, converts the SVG files to HTML and is currently under heavy development.
Poppler (pdftotext, pdfinfo, pdfimages) - Command line tools to extract text, metadata, and bitmap images from PDF files, written in C++, forked from Xpdf.
Ashima PDF Table Extractor - Table extraction tool built in Python and based on Poppler.
Coolwanglu - PDF to HTML converter based on Poppler.
PDF2XML - Open source converter based on XPDF library developed by Hervé Déjean.
Xpdf (pdftotext, pdfinfo, pdfimages) - Command line tools to extract text, metadata, and bitmap images from PDF files. Also includes a page rasterizer (pdftoppm).
MuPDF - General purpose, open source PDF toolkit written in C by Artifex, the developers of Ghostscript. The mudraw component has a basic text extraction utility.
PDFMiner - Open source PDF extraction library written in Python.
PDFTables - Table extraction tool based on PDFMiner and also written in Python.
Doc⚡split - A command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
DocHive - Open source tool based on Tesseract and ImageMagick that extracts data from scanned PDFs.
Node PDF Extract - JavaScript library that reads PDFs with embedded text as well as scanned PDFs. Built on both Poppler and Tesseract.
Ocrad - "GNU Ocrad is an OCR (Optical Character Recognition) program based on a feature extraction method. It reads images in pbm (bitmap), pgm (greyscale) or ppm (color) formats and produces text in byte (8-bit) or UTF-8 formats."
GOCR - "GOCR is an OCR (Optical Character Recognition) program, developed under the GNU Public License. It converts scanned images of text back to text files."
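The Poppler and Xpdf command line tools listed above (pdftotext, pdfinfo) are easy to drive from a script. The following is a minimal sketch, assuming the tools are installed and on the PATH. The helper names (`pdftotext_command`, `parse_pdfinfo`, `extract_text`) are hypothetical and not part of either project; only the `-layout`, `-f`, and `-l` options and the `-` output target come from the tools themselves.

```python
import subprocess


def pdftotext_command(pdf_path, first_page=None, last_page=None, layout=True):
    """Build a pdftotext command line that writes the text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")            # try to keep the physical page layout
    if first_page is not None:
        cmd += ["-f", str(first_page)]   # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]    # last page to convert
    cmd += [pdf_path, "-"]               # "-" sends the text to stdout
    return cmd


def parse_pdfinfo(output):
    """Turn the 'Key: value' lines printed by pdfinfo into a dict."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:                          # skip lines without a colon
            info[key.strip()] = value.strip()
    return info


def extract_text(pdf_path, **options):
    """Run pdftotext and return the extracted text (requires the tool)."""
    result = subprocess.run(
        pdftotext_command(pdf_path, **options),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Building the command as a list, rather than a shell string, avoids quoting problems with file names that contain spaces.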

OCR Technologies:

Tesseract - Open source OCR library. This tool does not work directly with PDFs, but a shell script or package can be used to convert a PDF to a TIFF which can be analyzed with Tesseract.
- A Java interface to Tesseract is available.
- Tesserwrap is a Python ctypes wrapper for tesseract.
ABBYY FineReader - Commercial OCR tool which works directly with PDFs. ABBYY also offers a cloud OCR API.
Nuance OmniPage - Commercial OCR tool which works directly with PDFs.
Captricity - Web based service that uses a mixture of technology and human labor to convert uploaded documents into structured data.
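The PDF-to-TIFF-to-Tesseract route mentioned in the Tesseract entry above can be sketched as follows. This is one possible approach, not the only one: it assumes Poppler's `pdftoppm` (which supports `-tiff` and `-r`) and the `tesseract` command line program are installed. The function names and the default resolution of 300 DPI are illustrative choices, not something the listed tools prescribe.

```python
import subprocess
import tempfile
from pathlib import Path


def rasterize_command(pdf_path, out_prefix, dpi=300):
    """Build a pdftoppm command that writes one TIFF per page."""
    return ["pdftoppm", "-r", str(dpi), "-tiff", str(pdf_path), str(out_prefix)]


def ocr_command(image_path, lang="eng"):
    """Build a tesseract command that prints recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def page_number(image_path):
    """Read the page number from a name such as 'page-03.tif'."""
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Rasterize each page, OCR it, and return a list of page texts."""
    pages = []
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(rasterize_command(pdf_path, prefix, dpi), check=True)
        # Sort numerically so page 10 does not come before page 2.
        for image in sorted(Path(tmp).glob("page-*.tif"), key=page_number):
            result = subprocess.run(
                ocr_command(image, lang),
                capture_output=True, text=True, check=True,
            )
            pages.append(result.stdout)
    return pages
```

Higher resolutions generally help recognition on small type at the cost of larger intermediate images, so the `dpi` value is worth tuning per document set.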

Low-cost commercial PDF technologies:

Adobe Acrobat XI Pro - The original general purpose GUI-based PDF tool that can convert PDFs to Excel, Word, PowerPoint and HTML.
Able2Extract - A line of tools from InvestInTech that extracts PDF content to Excel, Word, XML and other formats. GUI and Command Line tools available.
BCL Technologies - Free, online PDF to Word and PDF to HTML converters.
Cogniview - Extracts PDFs to Excel.
Docudesk deskUnPDF Converter - Converts PDFs to Excel, Word, XML and other formats. Trial download available.
Microsoft Word 2013 - The most recent version of this MS Office component supports direct opening of PDFs. The contents can then be saved in DOCX or other Word-supported formats.
NitroPDF - General purpose GUI-based PDF tool that can extract to spreadsheets and documents.
Nuance PDF Reader - Free PDF reader with a web service that converts PDFs to spreadsheets and documents.
Nuance PDF Converter
PDFLib Text Extraction Tool - Function library that makes available the text contents of a PDF as Unicode strings, plus detailed glyph and font information as well as the position on the page.
PDFTron - General purpose PDF manipulation library that includes text extraction capabilities. Sample Code Page
ScraperWiki Table XTract - Web based solution that returns tables extracted from uploaded PDFs.
Simx Text Converter - Extract, Transform and Load (ETL) solution that enables users to create custom routines for converting PDFs and other unstructured formats to database records.
Snow Tide PDF TextStream - Commercial PDF text extraction component that can be embedded in Java or .NET applications. Single threaded version is free.
Xpdf Commercial Libraries from Glyph and Cog - Including:
- XpdfText, a PDF text extraction library
- XpdfInfo, a PDF metadata extraction library
- XpdfImageExtract, a PDF image extraction library (contact info@glyphandcog.com for details)
- XpdfRasterizer, a library which converts PDF pages to images.
Aspose.Pdf for Java - How to Extract Text From All the Pages of a PDF Document
Big Faceless Java Library - PDF Text Extraction in Java
iText - Java PDF library.

Enterprise-Level ETL Solutions

Enterprise-Level (Cost > $1000) Extract, Transform, Load (ETL) Solutions that Directly Read PDFs

Datawatch Modeler (Formerly Known as Monarch)
IDR Solutions - Online PDF to SVG and PDF to HTML5 conversions. This vendor also maintained the open source JPedal library until last year.
Informatica B2B Data Transformation
Pradea

Reviews, Listings and Comparisons:

Duke University's Reporters Lab contains reviews of many of the tools listed above
PDFJailbreak provides a list of tools for extracting data from scientific papers in PDF format.
Peter Murray-Rust's Blog Post discussing software resources used at a May 2013 PDF Hackathon in Europe.
Comparison of iText, PDFBox and PDFTextExtractor by Madhura Oak
