All notable changes to this project will be documented in this file. The format is based on Keep a Changelog.
- Add `.extract_text(layout=True)`, an experimental feature which attempts to mimic the structural layout of the text on the page. (#10)
- Add `utils.merge_bboxes(bboxes)`, which returns the smallest bounding box that contains all bounding boxes in the `bboxes` argument. (f8d5e70)
- Add `--precision` argument to CLI (#520)
- Add `snap_x_tolerance` and `snap_y_tolerance` to table extraction settings. (#51 + #475) [h/t @dustindall]
- Add `join_x_tolerance` and `join_y_tolerance` to table extraction settings. (cbb34ce)
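To illustrate what `utils.merge_bboxes(bboxes)` is described as returning, here is a minimal standalone sketch. It assumes bounding boxes are `(x0, top, x1, bottom)` tuples. The name `merge_bboxes_sketch` is hypothetical, and this is not the library's own implementation.

```python
from typing import Iterable, Tuple

# Assumed coordinate order: (x0, top, x1, bottom)
BBox = Tuple[float, float, float, float]


def merge_bboxes_sketch(bboxes: Iterable[BBox]) -> BBox:
    """Return the smallest box that contains every box in `bboxes`."""
    boxes = list(bboxes)
    if not boxes:
        raise ValueError("at least one bounding box is required")
    # Transpose the boxes into four tuples, one per coordinate.
    x0s, tops, x1s, bottoms = zip(*boxes)
    # The enclosing box takes the smallest left/top and the largest right/bottom.
    return (min(x0s), min(tops), max(x1s), max(bottoms))


merged = merge_bboxes_sketch([(10, 20, 50, 60), (30, 5, 80, 40)])
print(merged)  # (10, 5, 80, 60)
```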
- Upgrade `pdfminer.six` from `20200517` to `20211012`; see that library's changelog for details, but a key difference is an improvement in how it assigns `line`, `rect`, and `curve` objects. (Diagonal two-point lines, for instance, are now `line` objects instead of `curve` objects.) (#515)
- Remove Decimal-ization of parsed object attributes, which are now represented with as much precision as is returned by `pdfminer.six` (#346 + #520)
- `.extract_text(...)` returns `""` instead of `None` when character list is empty. (#482 + cb9900b) [h/t @tungph]
- `.extract_words(...)` now includes `doctop` among the attributes it returns for each word. (66fef89)
- Change behavior of horizontal `text_strategy`, so that it uses the top and bottom of every word, not just the top of every word and the bottom of the last. (#467 + #466 + #265) [h/t @bobluda + @samkit-jain]
- Change `table.merge_edges(...)` behavior when `join_tolerance` (and `x`/`y` variants) `<= 0`, so that joining is attempted regardless, to handle cases of overlapping lines. (cbb34ce)
- Raise error if certain table-extraction settings are negative. (aa2d594)
- Fix slowdown in `.extract_words(...)`/`WordExtractor.iter_chars_to_words(...)` on very long words, caused by repeatedly re-calculating bounding box. (#483)
- Handle `UnicodeDecodeError` when trying to decode utf-16-encoded annotations (#463) [h/t @tungph]
- Fix crash when extracting tables with null values in `(text|intersection)_(x|y)_tolerance` settings. (#539) [h/t @yoavxyoav]
- Remove `pdfplumber.load(...)` method, which has been deprecated since `0.5.23` (54cbbc5)
- Add `CONTRIBUTING.md` (#428)
- Enforce import order via `isort` (d72b879)
- Update Pillow and Wand versions in `requirements.txt` (cae6924)
- Update all dependency versions in `requirements-dev.txt` (2f7e7ee)
- Add `--laparams` flag to CLI. (#407)
- Change `.convert_csv(...)` to order objects first by page number, rather than object type. (#407)
- Change `.convert_csv(...)`, `.convert_json(...)`, and CLI so that, by default, they return all available object types, rather than those in a predefined default list. (#407)
- Fix `.extract_text(...)` so that it can accept generator objects as its main parameter. (#385) [h/t @alexreg]
- Fix page-parsing so that `LTAnno` objects (which have no bounding-box coordinates) are not extracted. (Was only an issue when setting `laparams`.) (#388)
- Fix `Page.extract_table(...)` so that it honors text tolerance settings (#415) [h/t @trifling]
- Fix regression (introduced in `0.5.26`/b1849f4) in closing files opened by `PDF.open`
- Reinstate access to higher-level layout objects (such as `textboxhorizontal`) when `laparams` is passed to `pdfplumber.open(...)`. Had been removed in `0.5.24` via 1f87898. (#359 + #364)
- Add a `python setup.py build sdist` test to main GitHub action. (#365)
- Add `Page.close/__enter__/__exit__` methods, by generalizing that behavior through the `Container` class (b1849f4)
- Change handling of floating point numbers; no longer convert them to `Decimal` objects and do not round them
- Change `TableFinder` to return tables in order of topmost-and-then-leftmost, rather than leftmost-and-then-topmost (#336)
- Change `Page.to_image()`'s handling of alpha layer, to remove aliasing artifacts (#340) [h/t @arlyon]
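The change in table ordering can be pictured with a small standalone sketch. It assumes each table is represented only by its bounding box, as an `(x0, top, x1, bottom)` tuple. The helper names are hypothetical and do not come from the library.

```python
from typing import List, Tuple

# Assumed coordinate order: (x0, top, x1, bottom)
BBox = Tuple[float, float, float, float]


def sort_top_then_left(bboxes: List[BBox]) -> List[BBox]:
    """Newer ordering: compare top edges first, then left edges."""
    return sorted(bboxes, key=lambda b: (b[1], b[0]))


def sort_left_then_top(bboxes: List[BBox]) -> List[BBox]:
    """Earlier ordering: compare left edges first, then top edges."""
    return sorted(bboxes, key=lambda b: (b[0], b[1]))


tables = [(300, 50, 500, 200), (50, 400, 250, 600), (50, 50, 250, 200)]

# Reading order: across the top row, then down the page.
print(sort_top_then_left(tables))
# Column order: down the left column, then the right column.
print(sort_left_then_top(tables))
```

With two tables side by side at the top of a page and a third below them, the topmost-first ordering returns the two upper tables before the lower one, which matches the order a reader would usually scan them in.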
- Enforce `psf/black` and `flake8` on `tests/` (#327)
- Add new boolean argument `strict_metadata` (default `False`) to `pdfplumber.open(...)` method for handling metadata resolution failures (f2c510d)
- Fix metadata extraction to handle integer/floating-point values (cb32478) (#297)
- Fix metadata extraction to handle nested metadata values (2d9415) (#316)
- Explicitly load text as utf-8 in `setup.py` (7854328) (#304)
- Fix `pdfplumber.open(...)` so that it does not close file objects passed to it (408605f) (#312)
- Added `extra_attrs=[...]` parameter to `.extract_text(...)` (c8b200e) (#28)
- Added `utils/page.dedupe_chars(...)` (04fd56a + b132d45) (#71)
- Change character attribute `upright` from `int` to `bool` (per original `pdfminer.six` representation) (1f87898)
- Remove access and reference to `Container.figures`, given that they are not fundamental objects (8e74cb9)
- Decimalize "simple" `explicit_horizontal_lines`/`explicit_vertical_lines` descs passed to `TableFinder` methods (bc40779) (#290)
- Refactor/simplify `Page.process_objects` (1f87898), `utils.extract_words` (c8b200e), and `convert.serialize` (a74d3bc)
- Remove `test_issues.py:test_pr_77` (917467a) and narrow `test_ca_warn_report:test_objects` (6233bbd) to speed up tests
- Add `utils.resolve` (non-recursive .resolve_all) (7a90630)
- Add `page.annots` and `page.hyperlinks`, replacing non-functional `page.annos`, and mirroring pdfminer's language ("annot" vs. "anno"). (aa03961)
- Add `page/pdf.to_json` and `page/pdf.to_csv` (cbc91c6)
- Add `relative=True/False` parameter to `.crop` and `.within_bbox`; those methods also now raise exceptions for invalid and out-of-page bounding boxes. (047ad34) [h/t @samkit-jain]
- Remove `pdfplumber.from_path` and `pdfplumber.load` as deprecated; now `pdfplumber.open` is the canonical way to load a PDF. (00e789b)
- Simplify the logic in "text" table-finding strategies; in edge cases, may result in changes to results. (d224202)
- Drop support for Python 3.5 (baf1033)
- Fix `.extract_words`, which had been returning incorrect results when `horizontal_ltr = False` (d16aa13)
- Fix `utils.resize_object`, which had been failing in various permutations (d16aa13)
- Fix `lines_strict` table-finding strategy, which a typo had prevented from being usable (f0c9b85)
- Fix `utils.resolve_all` to guard against two known sources of infinite recursion (cbc91c6)
- Rename default branch to "stable," to clarify its purpose
- Reformat code with psf/black (1258e09)
- Add code linting via psf/black and flake8 (1258e09)
- Switch from nosetests to pytest (1ac16dd)
- Switch from pipenv to standard requirements.txt + python -m venv (48eaa51)
- Add GitHub action for tests + codecov (b148fd1)
- Add Makefile for building development virtual environment and running tests (4c69c58)
- Add badges to README.md (9e42dc3)
- Add Trove classifiers for Python versions to setup.py (6946e8d)
- Add MANIFEST.in (eafc15c)
- Add GitHub issue templates (c4156d6)
- Remove `pandas` from dev requirements and tests (a5e7d7f)
- Upgraded `pdfminer.six` requirement to `==20200517` (cddbff7) [h/t @youngquan]
- Add support for `non_stroking_color` attribute on `char` objects (0254da3) [h/t @idan-david]
- Fix `Page.extract_table(...)` to return `None` instead of crashing when no table is found (d64afa8) [h/t @stucka]
- Fix `.get_page_image` to prefer paths over streams, when possible (ab957de) [h/t @ubmarco]
- Local-fix pdfminer.six's `.resolve_all` to handle tuples and simplify (85f422d)
- Remove support for Python 2 and Python <3.3
- Add `utils.decimalize` performance improvement (830d117) [h/t @ubmarco]
- Fix un-referenced method when using "text" table-finding strategy (2a0c4a2)
- Add missing object type `rect_edge` to `obj_to_edges()` (0edc6bf)
- Allow `rect` and `curve` objects also to be passed to "explicit_..._lines" setting when table-finding. (And disallow other types of dicts to be passed.)
- Fix `utils.extract_text` bug introduced in prior version
- Fix and simplify obj-in-bbox logic (see commit 25672961)
- Improve/fix the way `utils.extract_text` handles vertical text (see commit 8a5d858b) [h/t @dwalton76]
- Have `Page.to_image` use bytes stream instead of file path (Issue #124 / PR #179) [h/t @cheungpat]
- Fix issue #176, in which `Page.extract_tables` did not pass kwargs to `Table.extract` [h/t @jsfenfen]
- Prevent custom LAParams from raising exception (Issue #168 / PR #169) [h/t @frascuchon]
- Add `six` as explicit dependency (for now)
- Upgrade `pdfminer.six` requirement to `==20200104`
- Upgrade `pillow` requirement to `>=7.0.0`
- Remove Python 2.7 and 3.4 from `tox` tests
- Fix sorting bug in `page.extract_table()`
- Fix support for password-protected PDFs (PR #138)
- Fixed PDF object resolution for rotation (PR #136)
- `cdecimal` support for Python 2
- Support for password-protected PDFs
- Caching for `.decimalize()` method
- Upgrade to `pdfminer.six==20181108`
- Make whitespace checking more robust (PR #88)
- Fix issue #75 (`.to_image()` custom arguments)
- Fix issue raised in PR #77 (PDFObjRef resolution), and general class of problems
- Fix issue #90, and general class of problems, by explicitly typecasting each kind of PDF Object
- Fix bug in which, when calling get_page_image(...), the alpha channel could make the whole page black out.
- Fix issue #67, in which bool-type metadata were handled incorrectly
- Fix issue #53, in which non-decimalize-able (non_)stroking_color properties were raising errors.
- `.travis.yml`, but failing on `.to_image()`
- Move from defunct `pycrypto` to `pycryptodome`
- Update `pdfminer.six` to `20170720`
- Fix issue #41, in which PDF-object-referenced cropboxes/mediaboxes weren't being fully resolved.
- Access to `__version__` from main namespace
- Fix issue #33, by checking `decode_text`'s argument type
- Pin `pdfminer.six` to version `20151013` (for now), fixing incompatibility
- Allow `import pdfplumber` even if ImageMagick not installed.
- Access to `curve` points. (E.g., `page.curves[0]["points"]`.)
- Ability for `.draw_line` to draw `curve` points.
- Disaggregated "min_words_vertical" (default: 3) and "min_words_horizontal" (default: 1), removing "text_word_threshold".
- Internally, made `utils.decimalize` a bit more robust; now throws errors on non-decimalizable items.
- Now explicitly ignoring some (obscure) `pdfminer` object attributes.
- Raw input for `.draw_line` from a bounding box to `((x, y), (x, y))`, for consistency with `curve["points"]` and with `Pillow`'s underlying method.
- Fixed typo bug when `.rect_edges` is called before `.edges`
- Quick-draw `PageImage` methods: `.draw_vline`, `.draw_vlines`, `.draw_hline`, and `.draw_hlines`.
- Boolean parameter `keep_blank_chars` for `.extract_words(...)` and `TableFinder` settings.
- Increased default `text_tolerance` and `intersection_tolerance` TableFinder values from 1 to 3.
- Properly handle conversion of PDFs with transparency to `pillow` images.
- Properly handle `pandas` DataFrames as inputs to multi-draw commands (e.g., `PageImage.draw_rects(...)`).
- Visual debugging features, via `Page.to_image(...)` and `PageImage`. (Introduces `wand` and `pillow` as package requirements.)
- More powerful options for extracting data from tables. See changes below.
- Entirely overhaul the table-extraction methods. Now based on Anssi Nurminen's master's thesis.
- Disentangle `.crop` from `.intersects_bbox` and `.within_bbox`.
- Change default `x_tolerance` and `y_tolerance` for word extraction from `5` to `3`
- Fix bug stemming from non-decimalized page heights. [h/t @jsfenfen]
- Provide access to `Page.page_number`
- Use `.page_number` instead of `.page_id` as primary identifier. [h/t @jsfenfen]
- Change default `x_tolerance` and `y_tolerance` for word extraction from `0` to `5`
- Provide proper support for rotated pages
- Fix bug stemming from when metadata includes a PostScript literal. [h/t @boblannon]
Whoops.
- When extracting table cells, use chars' midpoints instead of top-points.
- Fix `find_gutters`; it should ignore `" "` chars
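The switch from chars' top-points to their midpoints when extracting table cells can be pictured with the following standalone sketch. It assumes chars are dicts with `x0`, `x1`, `top`, and `bottom` keys and that a cell is an `(x0, top, x1, bottom)` tuple. The function names are hypothetical, and the exact rule the library applies may differ in detail.

```python
from typing import Dict, Tuple

# Assumed coordinate order: (x0, top, x1, bottom)
BBox = Tuple[float, float, float, float]


def char_midpoint(char: Dict[str, float]) -> Tuple[float, float]:
    """Return the (horizontal, vertical) midpoint of a char's box."""
    h_mid = (char["x0"] + char["x1"]) / 2
    v_mid = (char["top"] + char["bottom"]) / 2
    return (h_mid, v_mid)


def in_cell_by_midpoint(char: Dict[str, float], cell: BBox) -> bool:
    """Assign a char to a cell if the char's midpoint falls inside the cell."""
    h_mid, v_mid = char_midpoint(char)
    x0, top, x1, bottom = cell
    return x0 <= h_mid < x1 and top <= v_mid < bottom


def in_cell_by_top(char: Dict[str, float], cell: BBox) -> bool:
    """Earlier-style test: use the char's top edge instead of its vertical midpoint."""
    h_mid, _ = char_midpoint(char)
    x0, top, x1, bottom = cell
    return x0 <= h_mid < x1 and top <= char["top"] < bottom


cell = (0, 10, 100, 20)
# A char whose top edge pokes slightly above the cell's top border.
char = {"x0": 5, "x1": 11, "top": 8, "bottom": 16}

print(in_cell_by_midpoint(char, cell))  # True
print(in_cell_by_top(char, cell))       # False
```

A char that slightly overlaps a cell border is still assigned to the cell that holds most of it under the midpoint test, whereas the top-edge test drops it.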