This repository has been archived by the owner on Sep 29, 2024. It is now read-only.
Convert multi-column pdf to text with `poppler` and `tesseract`

tos-kamiya/mc-pdf2txt

mc-pdf2txt

NOTICE: As of September 2024, https://github.com/VikParuchuri/marker offers a far superior tool. This repository will be archived.

Converts multi-column PDF files to text using poppler and tesseract.
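
As a rough illustration of what a poppler-plus-tesseract conversion can look like, the sketch below builds the two command lines of a rasterize-then-OCR pipeline. It is an assumption about the general approach, not mc-pdf2txt's actual implementation: the helper names `rasterize_command` and `ocr_command` and the `page` file prefix are hypothetical, and any column-specific handling the tool performs is not shown. The defaults (600 DPI, page segmentation mode 3) mirror those listed in the usage text.

```python
import shlex


def rasterize_command(pdf_path, dpi=600, prefix="page"):
    """Command that renders every PDF page to a PNG named <prefix>-<n>.png.

    Uses pdftoppm from poppler-utils; whether mc-pdf2txt calls this exact
    program is an assumption made for illustration.
    """
    return ["pdftoppm", "-r", str(dpi), "-png", pdf_path, prefix]


def ocr_command(image_path, lang="eng", psm=3):
    """Command that runs tesseract on one page image and prints the text.

    "stdout" as the output base tells tesseract to write to standard output.
    """
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


if __name__ == "__main__":
    # Only print the commands; nothing is executed here.
    print(shlex.join(rasterize_command("paper.pdf")))
    print(shlex.join(ocr_command("page-1.png", lang="eng+jpn")))
```

To actually run the commands, pass each list to `subprocess.run(cmd, check=True)` once poppler and tesseract are installed.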

Install

(1) Install dependencies:

Install poppler.

sudo apt install poppler-utils

Install tesseract-ocr

sudo apt install tesseract-ocr

with the language data files of your choice, e.g.,

sudo apt install tesseract-ocr-jpn

(2) Install mc-pdf2txt:

pip3 install mc-pdf2txt
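
After both steps, a quick check that the expected executables are on `PATH` can save a confusing failure later. The sketch below is a minimal example under a few assumptions: `missing_tools` is a hypothetical helper, `pdftoppm` is one of the programs shipped in poppler-utils (the specific poppler program that mc-pdf2txt relies on is not stated here), and `mc-pdf2txt` is assumed to be installed as a command by `pip3`.

```python
import shutil


def missing_tools(tools=("pdftoppm", "tesseract", "mc-pdf2txt")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    print("missing:", ", ".join(missing) if missing else "none")
```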

Usage

Usage:
  mc-pdf2txt [options] <input>...

Options:
  -l LANG           Language, such as `eng`, `jpn`, or `eng+jpn`.
  <input>           Input PDF file.
  -o OUTPUT         Output text file.
  -r DPI            Resolution of temporary image file [default: 600].
  --page-separator LINE     String to be output as page separator [default: ---].
  --psm VALUE       Page segmentation mode of `tesseract-ocr` [default: 3].
  --verbose         Verbose.
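
For scripted use, the options above can be assembled into an argument list before calling the tool. The following is a minimal sketch, not part of mc-pdf2txt itself: `mc_pdf2txt_argv` is a hypothetical helper that uses only the flags listed in the usage text and omits any option left unset, so the tool's own defaults apply.

```python
import shlex


def mc_pdf2txt_argv(inputs, lang=None, output=None, dpi=None,
                    page_separator=None, psm=None, verbose=False):
    """Build an mc-pdf2txt command line from the documented options."""
    argv = ["mc-pdf2txt"]
    if lang is not None:
        argv += ["-l", lang]                      # e.g. "eng", "jpn", "eng+jpn"
    if output is not None:
        argv += ["-o", output]                    # output text file
    if dpi is not None:
        argv += ["-r", str(dpi)]                  # temporary image resolution
    if page_separator is not None:
        argv += ["--page-separator", page_separator]
    if psm is not None:
        argv += ["--psm", str(psm)]               # tesseract segmentation mode
    if verbose:
        argv.append("--verbose")
    argv += list(inputs)                          # one or more input PDF files
    return argv


if __name__ == "__main__":
    # Only print the command; nothing is executed here.
    argv = mc_pdf2txt_argv(["paper.pdf"], lang="eng+jpn", output="paper.txt")
    print(shlex.join(argv))
```

With mc-pdf2txt installed, the resulting list can be passed to `subprocess.run(argv, check=True)`.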
