Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]

Added

Embed base64-encoded images inline. Initial support covers JPEG and BMP. (#99, @HiromuHota)
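
As a rough illustration of what inline embedding means, the sketch below encodes image bytes into a data URI. The helper name `to_data_uri`, the `<img>` wrapper, and the sample bytes are hypothetical and not taken from pdftotree; the exact markup the library emits may differ.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str) -> str:
    """Return a data URI that carries the image bytes inline (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Placeholder payload: only the first three bytes of a JPEG header.
# A real caller would pass the full contents of a JPEG or BMP image.
uri = to_data_uri(b"\xff\xd8\xff", "image/jpeg")

# Assumed markup, for illustration only.
img_tag = f'<img src="{uri}" />'
```

Embedding the bytes this way keeps the output self-contained, at the cost of a larger file than one that links to external images.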

Changed

Suppress tabula-java's log messages unless pdftotree's logger is set to logging.DEBUG. (#103, @HiromuHota)
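
To see the tabula-java messages again, raise the logger to DEBUG. The sketch below assumes the logger is registered under the name "pdftotree", which is an assumption and not confirmed by this changelog; `should_silence` is a hypothetical helper that only illustrates the gating condition.

```python
import logging

# Assumption: pdftotree's logger is named after the package.
logger = logging.getLogger("pdftotree")


def should_silence(log: logging.Logger) -> bool:
    """Return True when third-party output should be suppressed (hypothetical helper)."""
    return not log.isEnabledFor(logging.DEBUG)


logger.setLevel(logging.INFO)
silenced_at_info = should_silence(logger)   # messages stay hidden

logger.setLevel(logging.DEBUG)
silenced_at_debug = should_silence(logger)  # messages are shown again
```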

Fixed

List a missing "ocrx_line" in the ocr-capabilities metadata field. (#94, @HiromuHota)
Use the centroid for the isContained check so that cell values are not missed. (#96, @HiromuHota)
Treat non-breaking space as a white space to prevent "Out of order" warnings. (#98, @HiromuHota)
Escape text only once. (#100, @HiromuHota)
Treat "(cid:%d)" as a possible char to reduce "Out of order" warnings. (#102, @HiromuHota)
Use sys.maxsize to avoid "OverflowError: cannot convert float infinity to integer". (#104, @HiromuHota)
Let TableExtractorML inherit TreeExtractor to use its updated parse(). (#105, @HiromuHota)
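
The centroid-based containment fix (#96) can be sketched as follows. The `Box` type and both functions are hypothetical stand-ins, not pdftotree's actual code, and the stricter check shown for comparison is only an assumption about how a value could previously be missed.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Axis-aligned bounding box (hypothetical type for this sketch)."""
    x0: float
    y0: float
    x1: float
    y1: float


def centroid(box: Box) -> Tuple[float, float]:
    """Return the center point of the box."""
    return ((box.x0 + box.x1) / 2, (box.y0 + box.y1) / 2)


def is_contained_strict(inner: Box, outer: Box) -> bool:
    """True only if every edge of inner lies within outer."""
    return (outer.x0 <= inner.x0 and outer.y0 <= inner.y0
            and inner.x1 <= outer.x1 and inner.y1 <= outer.y1)


def is_contained_by_centroid(inner: Box, outer: Box) -> bool:
    """True if the center of inner lies within outer."""
    cx, cy = centroid(inner)
    return outer.x0 <= cx <= outer.x1 and outer.y0 <= cy <= outer.y1


cell = Box(0, 0, 100, 20)
text = Box(5, 2, 102, 18)  # overhangs the cell's right edge by 2 units

strict_result = is_contained_strict(text, cell)
centroid_result = is_contained_by_centroid(text, cell)
```

In this example the text overhangs the cell slightly, so a check on all four edges rejects it while the centroid check still assigns it to the cell.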

0.5.0 - 2020-10-13

Added

Support for Python 3.8. (#86, @HiromuHota)

Changed

Switch the output format from "HTML-like" to hOCR. (#62, @HiromuHota)
Loosen Keras' version restriction, which is now unnecessarily strict. (#68, @HiromuHota)
Greedily extract contents from PDF even if it looks scanned. (#71, @HiromuHota)
Upgrade Keras to 2.4.0 or later (and TensorFlow 2.2 or later). (#86, @HiromuHota)

Removed

Remove "favor_figures" option and extract everything. (#77, @HiromuHota)
Remove "dry_run" option. (#89, @HiromuHota)

Fixed

Fix a bug where an HTML file was not created at the given path. (#64, @HiromuHota)
Extract LTChar objects even if they are not children of LTTextLine. (#79, @HiromuHota)

0.4.1 - 2020-09-21

Fixed

Temporarily add chardet to requirements until pdfminer/pdfminer.six#213 is fixed. (#47, @lukehsiao)
Fix ValueError when a Node instance is a single element. (#49, @mgoo)