All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Embed base64-encoded images inline, starting with support for JPEG and BMP. (#99, @HiromuHota)
- Suppress tabula-java's log messages unless pdftotree's logger is set to logging.DEBUG. (#103, @HiromuHota)
- Add the missing "ocrx_line" to the ocr-capabilities metadata field. (#94, @HiromuHota)
- Use the centroid for the isContained check so that cell values are not missed. (#96, @HiromuHota)
- Treat non-breaking space as whitespace to prevent "Out of order" warnings. (#98, @HiromuHota)
- Escape text only once. (#100, @HiromuHota)
- Treat "(cid:%d)" as a possible char to reduce "Out of order" warnings. (#102, @HiromuHota)
- Use sys.maxsize to avoid "OverflowError: cannot convert float infinity to integer". (#104, @HiromuHota)
- Let TableExtractorML inherit from TreeExtractor so that it uses the updated parse(). (#105, @HiromuHota)
- Support for Python 3.8. (#86, @HiromuHota)
- Switch the output format from "HTML-like" to hOCR. (#62, @HiromuHota)
- Loosen Keras' version restriction, which is now unnecessarily strict. (#68, @HiromuHota)
- Greedily extract contents from PDF even if it looks scanned. (#71, @HiromuHota)
- Upgrade Keras to 2.4.0 or later (and TensorFlow 2.2 or later). (#86, @HiromuHota)
- Remove "favor_figures" option and extract everything. (#77, @HiromuHota)
- Remove "dry_run" option. (#89, @HiromuHota)
- Fix a bug where an HTML file was not created at the given path. (#64, @HiromuHota)
- Extract LTChar objects even if they are not children of LTTextLine. (#79, @HiromuHota)
- Temporarily add chardet to requirements until pdfminer/pdfminer.six#213 is fixed. (#47, @lukehsiao)
- Fix ValueError when a Node instance is a single element. (#49, @mgoo)