mc-pdf2txt

NOTICE: As of September 2024, https://github.com/VikParuchuri/marker offers a far superior tool. This repository will be archived.

Convert multi-column PDFs to text with poppler and tesseract.

Install

(1) Install dependencies:

Install poppler.

sudo apt install poppler-utils

Install tesseract-ocr

sudo apt install tesseract-ocr

with the language data files of your choice, e.g.,

sudo apt install tesseract-ocr-jpn

(2) Install mc-pdf2txt

pip3 install mc-pdf2txt

Usage

Usage:
  mc-pdf2txt [options] <input>...

Options:
  -l LANG           Language, such as `eng`, `jpn`, or `eng+jpn`.
  <input>           Input PDF file.
  -o OUTPUT         Output text file.
  -r DPI            Resolution of temporary image file [default: 600].
  --page-separator LINE     String to be output as page separator [default: ---].
  --psm VALUE       Page segmentation mode of `tesseract-ocr` [default: 3].
  --verbose         Verbose.
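
As a rough illustration of how the options above combine, here is a minimal Python sketch that converts every PDF in a directory by calling the `mc-pdf2txt` command once per file. The helper names `build_command` and `convert_all`, and the choice to write each `.txt` next to its PDF, are assumptions made for this example and are not part of mc-pdf2txt itself. Only flags listed in the usage text are used.

```python
import subprocess
from pathlib import Path


def build_command(pdf, lang="eng", dpi=600, psm=3):
    """Build an mc-pdf2txt argument list for one PDF (hypothetical helper)."""
    pdf = Path(pdf)
    out = pdf.with_suffix(".txt")  # assumption: write the text next to the PDF
    return [
        "mc-pdf2txt",
        "-l", lang,          # language, e.g. "eng", "jpn", or "eng+jpn"
        "-r", str(dpi),      # resolution of the temporary image file
        "--psm", str(psm),   # tesseract page segmentation mode
        "-o", str(out),      # output text file
        str(pdf),            # input PDF file
    ]


def convert_all(directory, **options):
    """Run mc-pdf2txt once per PDF found in `directory` (hypothetical helper)."""
    for pdf in sorted(Path(directory).glob("*.pdf")):
        # check=True raises CalledProcessError if the command fails
        subprocess.run(build_command(pdf, **options), check=True)
```

For example, `convert_all("papers", lang="eng+jpn")` would process each PDF in a `papers` directory. Calling the tool once per file keeps one output file per input and does not rely on how `-o` behaves when several inputs are given.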
