All notable changes to this project will be documented in this file. The format is based on Keep a Changelog.
- Add `x_tolerance_ratio` parameter to `extract_text` and similar functions, to account for text size when spacing characters (instead of a fixed number of pixels) (h/t @afriedman412). (#1041)
- Add support for PDF 1.3 logical structure via `Page.structure_tree` (h/t @dhdaines). (#963)
- Add "gswin64c" as another possible Ghostscript executable in `repair.py` (h/t @echedey-ls). (#1032)
- Re-add `Page.close()` method, have `PDF.close()` close all pages as well, and improve relevant documentation (h/t @luketudge). (#1042)
- Add `force_mediabox` parameter to `Page.to_image(...)`. (#1054)
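
As an illustration of the `x_tolerance_ratio` entry above, here is a minimal sketch of the idea, not the library's actual code. It assumes the effective tolerance is the ratio multiplied by the previous character's font size; the helper `begins_new_word` and the sample character dictionaries are hypothetical.

```python
def begins_new_word(prev_char, curr_char, x_tolerance=3, x_tolerance_ratio=None):
    """Return True if the horizontal gap between two chars suggests a word break.

    Hypothetical helper: when a ratio is given, the tolerance scales with the
    previous character's font size instead of being a fixed distance.
    """
    if x_tolerance_ratio is not None:
        tolerance = x_tolerance_ratio * prev_char["size"]
    else:
        tolerance = x_tolerance
    gap = curr_char["x0"] - prev_char["x1"]
    return gap > tolerance


# Two character pairs, each separated by a gap of 2 units (assumed sample data).
small = ({"x0": 0, "x1": 4, "size": 6}, {"x0": 6, "x1": 10, "size": 6})
large = ({"x0": 0, "x1": 20, "size": 30}, {"x0": 22, "x1": 42, "size": 30})

# A fixed tolerance treats both gaps identically.
fixed = [begins_new_word(p, c, x_tolerance=1) for p, c in (small, large)]
# A ratio makes the same gap count as a break only for the small text.
scaled = [begins_new_word(p, c, x_tolerance_ratio=0.15) for p, c in (small, large)]
```

With a ratio, a gap that separates words in 6-point text can still be read as ordinary letter spacing in 30-point text.
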
- Standardize handling of cropbox, fixing various issues with PageImage. (#1054)
- Fix `Page.get_textmap` caching to allow for `extra_attrs=[...]`, by preconverting list kwargs to tuples. (#1030)
- Explicitly close `pypdfium2.PdfDocument` in `get_page_image` (h/t @dhdaines). (#1090)
- In `PDFPageAggregatorWithMarkedContent.tag_cur_item`, check `self.cur_item._objs` length before trying to access `[-1]`. (4f39d03)
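
The `Page.get_textmap` caching fix above relies on a general Python constraint: `functools.lru_cache` needs hashable arguments, and lists are not hashable. The sketch below shows one way to preconvert list kwargs to tuples. The decorator `tuplify_list_kwargs` and the function `describe` are hypothetical stand-ins, not the library's implementation.

```python
from functools import lru_cache, wraps


def tuplify_list_kwargs(func):
    """Convert list-valued keyword arguments to tuples so they are hashable."""

    @wraps(func)
    def wrapper(*args, **kwargs):
        hashable = {
            key: tuple(value) if isinstance(value, list) else value
            for key, value in kwargs.items()
        }
        return func(*args, **hashable)

    return wrapper


@tuplify_list_kwargs
@lru_cache(maxsize=None)
def describe(layout=False, extra_attrs=None):
    # Stand-in for an expensive computation keyed on its arguments.
    return (layout, extra_attrs)


# Passing a list directly to the cached function would raise TypeError;
# the wrapper converts it to a tuple first, so both calls share one cache entry.
first = describe(extra_attrs=["fontname", "size"])
second = describe(extra_attrs=["fontname", "size"])
```
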
- Add support for marked-content sequences, represented by `mcid` and `tag` attributes on `char`/`rect`/`line`/`curve`/`image` objects (h/t @dhdaines). (#961)
- Add `gs_path` argument to `pdfplumber.open(...)` and `pdfplumber.repair(...)`, to allow passing a custom Ghostscript path to be used for repairing. (#953)
- Respect `use_text_flow` in `extract_text` (h/t @dhdaines). (#983)
- Add `PDF.path`: A `Path` object for PDFs loaded by passing a path (unless `repair=True`), and `None` otherwise. (30a52cb + #948)
- Accept `Iterable` objects for geometry utils (h/t @dhdaines). (53bee23 + #945)
- Add `antialias` boolean parameter to `Page.to_image(...)` and associated methods (h/t @cmdlineluser). (7e28931)
- Normalize color representation to `tuple[float|int, ...]` (#917). (57d51bb)
- Replace Wand with pypdfium2 for page.to_image(...). (b049373)
- Add `pdfplumber.repair(...)` and `.open(repair=True)` (#824). (db6ae97)
- Add Page.find_table(...) (#873). (3772af6)
- Add `quantize=True`, `colors=256`, `bits=8` arguments/defaults to `PageImage.save(...)`. (b049373)
- Extract and handle patterns + (some) color spaces. (97ca4b0)
- Remove support for Python 3.7 (EOL'ed June 2023). (c9d24d5)
- Remove vestigial 'font' and 'name' properties from PDF objects. (6d62054)
- Fix bug for re-crops that use relative=True (#914). (0de6da9)
- Handle `use_text_flow` more consistently (#912). (b1db5b8)
- Make word segmentation (via `WordExtractor.char_begins_new_word(...)`) more explicit and rigorous; should help in catching edge-cases in the future. (6acd580 + ebb93ea + #840)
- Use `curve_edge` objects (instead of just `line` and `rect_edge` objects) in default table-detection strategy. (6f6b465 + #858)
- By default, expand ligatures into their constituent letters (e.g., `ﬃ` to `ffi`), and add the `expand_ligatures` boolean parameter to text-extraction methods. (86e935d + #598)
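
To make the ligature entry above concrete, here is a small sketch of expanding ligature characters into their constituent letters. The mapping covers the common Latin ligatures in the Unicode Alphabetic Presentation Forms block; the function name `expand_ligatures_sketch` and its `expand` flag are assumptions for illustration and may not match how the library does it.

```python
# Unicode ligature code points mapped to their constituent letters.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb06": "st",
}


def expand_ligatures_sketch(text, expand=True):
    """Replace each ligature character with its letters when `expand` is True."""
    if not expand:
        return text
    return "".join(LIGATURES.get(char, char) for char in text)


expanded = expand_ligatures_sketch("o\ufb03ce")
untouched = expand_ligatures_sketch("o\ufb03ce", expand=False)
```

Expanded text is easier to search and compare, which is presumably why expansion is the default; the flag keeps the raw characters available when they matter.
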
- Add `Page.extract_text_lines(...)` method. (4b37397 + #852)
- Add `main_group`, `return_groups`, `return_chars` parameters to `Page.search(...)`. (4b37397)
- Add `.curve_edges` property to `PDF` and `Page`. (6f6b465)
- Fix handling of bytes-typed fontnames. (9441ff7 + #461 + #842)
- Fix handling of whitespace-only and empty results of `Page.search(...)`. (6f6b465 + #853)
- Fix `x0>x1`/etc. error when drawing rect fills, per new Pillow version. (db136b7)
- Change the (still experimental) `Page/utils.extract_text(layout=True)` approach so that it pads, to the degree necessary, the ends of lines with spaces and the end of the text with blank lines to achieve better mimicry of page layout. (d3662de)
- Refactor handling of `pts` attribute and, in doing so, deprecate the `curve_obj["points"]` attribute, and fix `PageImage.draw_line(...)`'s handling of diagonal lines. (216bedd)
- Breaking change: In `Page.extract_table[s](...)`, `keep_blank_chars` must now be passed as `text_keep_blank_chars`, for consistency's sake. (c4e1b29)
- Add `Page.extract_table[s](...)` support for all `Page.extract_text(...)` keyword arguments. (c4e1b29)
- Add `height` and `width` keyword arguments to `Page.to_image(...)`. (#798 + 93f7dbd)
- Add `layout_width`, `layout_width_chars`, `layout_height`, and `layout_height_chars` parameters to `Page/utils.extract_text(layout=True)`. (d3662de)
- Add CITATION.cff. (#755) [h/t @joaoccruz]
- Fix simple edge-case for when page rotation is (incorrectly) set to `None`. (#811) [h/t @toshi1127]
- Convert `utils.py` into `utils/` submodules. Retains same interface, just an improvement in organization. (6351d97)
- Fix typing hints to include io.BytesIO. (d4107f6) [h/t @conitrade-as]
- Refactor text-extraction utilities, paving way for better consistency across various entrypoints to text extraction (e.g., via `utils.extract_text(...)`, via `Page.extract_text(...)`, via `Page.extract_table(...)`). (3424b57)
- Bump pinned `pdfminer.six` version to `20221105`. (e63a038)
- Restore `text` attribute to `.textboxhorizontal`/etc., regression introduced in `9587cc7`/`v0.6.2`. (8a0c126)
- Fix `lru_cache` usage, which is discouraged for class methods due to garbage-collection issues. (e3142a0)
- Upgrade `nbexec` development requirement from `0.1.0` to `0.2.0`. (30dac25)
- Fixed issue where `py.typed` file was not included in PyPI distribution. (#698 + #703 + 6908487) [h/t @jhonatan-lopes]
- Reinstated the ability to call `utils.cluster_objects(...)` with any hashable value (`str`, `int`, `tuple`, etc.) as the `key_fn` parameter, reverting breaking change in 58b1ab1. (#691 + 1e97656) [h/t @jfuruness]
- Add `utils.outside_bbox(...)` and `Page.outside_bbox(...)` methods, which are the inverse of `utils.within_bbox(...)` and `Page.within_bbox(...)`. (#369 + 3ab1cc4)
- Add `strict=True/False` parameter to `Page.crop(...)`, `Page.within_bbox(...)`, and `Page.outside_bbox(...)`; default is `True`, while `False` bypasses the `test_proposed_bbox(...)` check. (#421 + 71ad60f)
- Add more guidance to exception when `.to_image(...)` raises `PIL.Image.DecompressionBombError`. (#413 + b6ff9e8)
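
The relationship between the within-bbox and outside-bbox filters described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library's code: bounding boxes are taken to be `(x0, top, x1, bottom)` tuples, "within" means fully contained, "outside" means no overlap at all, and the function names are hypothetical.

```python
def within_bbox_sketch(objs, bbox):
    """Keep objects that lie entirely inside the bounding box."""
    x0, top, x1, bottom = bbox
    return [
        o for o in objs
        if o["x0"] >= x0 and o["top"] >= top and o["x1"] <= x1 and o["bottom"] <= bottom
    ]


def outside_bbox_sketch(objs, bbox):
    """Keep objects that do not overlap the bounding box at all."""
    x0, top, x1, bottom = bbox
    return [
        o for o in objs
        if o["x1"] <= x0 or o["x0"] >= x1 or o["bottom"] <= top or o["top"] >= bottom
    ]


inside = {"x0": 10, "top": 10, "x1": 20, "bottom": 20}
outside = {"x0": 200, "top": 10, "x1": 210, "bottom": 20}
straddling = {"x0": 95, "top": 10, "x1": 105, "bottom": 20}
objs = [inside, outside, straddling]
bbox = (0, 0, 100, 100)

kept_within = within_bbox_sketch(objs, bbox)
kept_outside = outside_bbox_sketch(objs, bbox)
```

Under these assumptions, an object that straddles the boundary is returned by neither filter.
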
- Fix `PageImage` conversions for PDFs with `cmyk` colorspaces; convert them to `rgb` earlier in the process. (28330da)
- Quick fix for transparency issue in visual debugging mode. (b98dd7c)
- Add `split_at_punctuation` parameter to `.extract_words(...)` and `.extract_text(...)`. (#682) [h/t @lolipopshock]
- Add README.md link to @hbh112233abc's Chinese translation of README.md. (#674)
- Change `.to_image(...)`'s approach, preferring to composite with a white background instead of removing the alpha channel. (1cd1f9a)
- Fix bug in `LayoutEngine.calculate(...)` when processing char objects with len>1 representations, such as ligatures. (#683)
- Fix bug when calling `PageImage.debug_tablefinder()` (i.e., with no parameters). (#659 + 063e2ed) [h/t @rneumann7]
- Add `Makefile` target for `examples`, as well as dev requirements to support re-running the example notebooks automatically. (ef065a7)
- Add `"matrix"` property to `char` objects, representing the current transformation matrix. (ae6f99e)
- Add `pdfplumber.ctm` submodule with class `CTM`, to calculate scale, skew, and translation of a current transformation matrix obtained from a `char`'s `"matrix"` property. (ae6f99e)
- Add `page.search(...)`, an experimental feature that allows you to search a page's text via regular expressions and non-regex strings, returning the text, any regex matches, the bounding box coordinates, and the char objects themselves. (#201 + 58b1ab1)
- Add `--include-attrs`/`--exclude-attrs` to CLI (and corresponding params to `.to_json(...)`, `.to_csv(...)`, and `Serializer`). (4deac25)
- Add `py.typed` for PEP561 compatibility and detection of typing hints by mypy. (ca795d1) [h/t @jhonatan-lopes]
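
For readers unfamiliar with transformation matrices, the sketch below shows one common way to derive scale, skew, and translation from a six-number PDF matrix `(a, b, c, d, e, f)`. The class `SimpleCTM` and its property names are hypothetical and written for illustration; the decomposition used by the `CTM` class mentioned above may differ.

```python
import math
from typing import NamedTuple


class SimpleCTM(NamedTuple):
    """Hypothetical holder for a PDF matrix [a b c d e f]."""

    a: float
    b: float
    c: float
    d: float
    e: float
    f: float

    @property
    def scale_x(self) -> float:
        # Length of the transformed x-axis unit vector.
        return math.hypot(self.a, self.b)

    @property
    def scale_y(self) -> float:
        # Length of the transformed y-axis unit vector.
        return math.hypot(self.c, self.d)

    @property
    def skew_x(self) -> float:
        # Degrees by which the transformed y-axis deviates from vertical.
        return math.degrees(math.atan2(self.d, self.c)) - 90

    @property
    def skew_y(self) -> float:
        # Degrees by which the transformed x-axis deviates from horizontal.
        return math.degrees(math.atan2(self.b, self.a))

    @property
    def translation(self) -> tuple:
        return (self.e, self.f)


# A matrix that scales by 2 horizontally and 3 vertically, then shifts by (10, 20).
ctm = SimpleCTM(2, 0, 0, 3, 10, 20)
```
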
- Bump pinned `pdfminer.six` version to `20220524`. (486cea8)
- Remove `utils.collate_chars(...)`, the old name (and then alias) for `utils.extract_text(...)`. (24f3532)
- Remove `utils._itemgetter(...)`, an internal-use method previously used by `utils.cluster_objects(...)`. (58b1ab1)
- Fix `IndexError` bug for `.extract_text(layout=True)` on pages without text. (#658 + ad3df11) [h/t @ethanscorey]
- Add type annotations, and refactor parts of the library accordingly. (9587cc7)
- Add enforcement of type annotations via `mypy --strict`. (cdfdb87)
- Add final bits of test coverage. (feb9d08)
- Add `TableSettings` class, a behind-the-scenes handler for managing and validating table-extraction settings. (9587cc7)
- Rename the positional argument to `.to_csv(...)` and `.to_json(...)` from `types` to `object_types`. (9587cc7)
- Tweak the output of `.to_json(...)` so that, if an object type is not present for a given page, it has no key in the page's object representation. (9587cc7)
- Remove `utils.filter_objects(...)` and move the functionality to within the `FilteredPage.objects` property calculation, the only part of the library that used it. (9587cc7)
- Remove code that sets `pdfminer.pdftypes.STRICT = True` and `pdfminer.pdfinterp.STRICT = True`, since that has now been the default for a while. (9587cc7)
- Bump pinned `pdfminer.six` version to `20220319`. (e434ed0)
- Bump minimum `Pillow` version to `>=9.1`. (d88eff1)
- Drop support for Python 3.6 (EOL Dec. 2021) (a32473e)
- If `pdfplumber.open(...)` opens a file but a `pdfminer.pdfparser.PSException` is raised during the process, `pdfplumber` now makes sure to close that file. (#581 + #578) [h/t @johnhuge]
- Fix incompatibility with `Pillow>=9.1`. (#637)
- Add `.extract_text(layout=True)`, an experimental feature which attempts to mimic the structural layout of the text on the page. (#10)
- Add `utils.merge_bboxes(bboxes)`, which returns the smallest bounding box that contains all bounding boxes in the `bboxes` argument. (f8d5e70)
- Add `--precision` argument to CLI (#520)
- Add `snap_x_tolerance` and `snap_y_tolerance` to table extraction settings. (#51 + #475) [h/t @dustindall]
- Add `join_x_tolerance` and `join_y_tolerance` to table extraction settings. (cbb34ce)
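
The bounding-box merge described above amounts to taking the minimum of the leading edges and the maximum of the trailing edges. A minimal sketch, assuming each box is an `(x0, top, x1, bottom)` tuple; the name `merge_bboxes_sketch` is hypothetical and this is not the library's own code.

```python
def merge_bboxes_sketch(bboxes):
    """Return the smallest (x0, top, x1, bottom) box containing every input box."""
    x0s, tops, x1s, bottoms = zip(*bboxes)
    return (min(x0s), min(tops), max(x1s), max(bottoms))


merged = merge_bboxes_sketch([(10, 20, 30, 40), (5, 25, 20, 60)])
```
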
- Upgrade `pdfminer.six` from `20200517` to `20211012`; see that library's changelog for details, but a key difference is an improvement in how it assigns `line`, `rect`, and `curve` objects. (Diagonal two-point lines, for instance, are now `line` objects instead of `curve` objects.) (#515)
- Remove Decimal-ization of parsed object attributes, which are now represented with as much precision as is returned by `pdfminer.six` (#346 + #520)
- `.extract_text(...)` returns `""` instead of `None` when character list is empty. (#482 + cb9900b) [h/t @tungph]
- `.extract_words(...)` now includes `doctop` among the attributes it returns for each word. (66fef89)
- Change behavior of horizontal `text_strategy`, so that it uses the top and bottom of every word, not just the top of every word and the bottom of the last. (#467 + #466 + #265) [h/t @bobluda + @samkit-jain]
- Change `table.merge_edges(...)` behavior when `join_tolerance` (and `x`/`y` variants) `<= 0`, so that joining is attempted regardless, to handle cases of overlapping lines. (cbb34ce)
- Raise error if certain table-extraction settings are negative. (aa2d594)
- Fix slowdown in `.extract_words(...)`/`WordExtractor.iter_chars_to_words(...)` on very long words, caused by repeatedly re-calculating bounding box. (#483)
- Handle `UnicodeDecodeError` when trying to decode utf-16-encoded annotations (#463) [h/t @tungph]
- Fix crash when extracting tables with null values in `(text|intersection)_(x|y)_tolerance` settings. (#539) [h/t @yoavxyoav]
- Remove `pdfplumber.load(...)` method, which has been deprecated since `0.5.23` (54cbbc5)
- Add `CONTRIBUTING.md` (#428)
- Enforce import order via `isort` (d72b879)
- Update Pillow and Wand versions in `requirements.txt` (cae6924)
- Update all dependency versions in `requirements-dev.txt` (2f7e7ee)
- Add `--laparams` flag to CLI. (#407)
- Change `.convert_csv(...)` to order objects first by page number, rather than object type. (#407)
- Change `.convert_csv(...)`, `.convert_json(...)`, and CLI so that, by default, they return all available object types, rather than those in a predefined default list. (#407)
- Fix `.extract_text(...)` so that it can accept generator objects as its main parameter. (#385) [h/t @alexreg]
- Fix page-parsing so that `LTAnno` objects (which have no bounding-box coordinates) are not extracted. (Was only an issue when setting `laparams`.) (#388)
- Fix `Page.extract_table(...)` so that it honors text tolerance settings (#415) [h/t @trifling]
- Fix regression (introduced in `0.5.26`/b1849f4) in closing files opened by `PDF.open`
- Reinstate access to higher-level layout objects (such as `textboxhorizontal`) when `laparams` is passed to `pdfplumber.open(...)`. Had been removed in `0.5.24` via 1f87898. (#359 + #364)
- Add a `python setup.py build sdist` test to main GitHub action. (#365)
- Add `Page.close/__enter__/__exit__` methods, by generalizing that behavior through the `Container` class (b1849f4)
- Change handling of floating point numbers; no longer convert them to `Decimal` objects and do not round them
- Change `TableFinder` to return tables in order of topmost-and-then-leftmost, rather than leftmost-and-then-topmost (#336)
- Change `Page.to_image()`'s handling of alpha layer, to remove aliasing artifacts (#340) [h/t @arlyon]
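
The difference between the two table orderings mentioned above comes down to the sort key. A small sketch with made-up table records, assuming each has an `(x0, top, x1, bottom)` bounding box; the record layout and function names are hypothetical.

```python
# Hypothetical table records; only the bounding boxes matter here.
tables = [
    {"name": "A", "bbox": (300, 50, 500, 150)},
    {"name": "B", "bbox": (50, 50, 250, 150)},
    {"name": "C", "bbox": (20, 200, 500, 300)},
]


def topmost_then_leftmost(items):
    # Sort by top edge first, then by left edge to break ties.
    return sorted(items, key=lambda t: (t["bbox"][1], t["bbox"][0]))


def leftmost_then_topmost(items):
    # The earlier ordering: left edge first, then top edge.
    return sorted(items, key=lambda t: (t["bbox"][0], t["bbox"][1]))


new_order = [t["name"] for t in topmost_then_leftmost(tables)]
old_order = [t["name"] for t in leftmost_then_topmost(tables)]
```

Sorting by the top edge first follows reading order more closely: the two tables that share the top row come before the table beneath them.
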
- Enforce `psf/black` and `flake8` on `tests/` (#327)
- Add new boolean argument `strict_metadata` (default `False`) to `pdfplumber.open(...)` method for handling metadata resolution failures (f2c510d)
- Fix metadata extraction to handle integer/floating-point values (cb32478) (#297)
- Fix metadata extraction to handle nested metadata values (2d9415) (#316)
- Explicitly load text as utf-8 in `setup.py` (7854328) (#304)
- Fix `pdfplumber.open(...)` so that it does not close file objects passed to it (408605f) (#312)
- Added `extra_attrs=[...]` parameter to `.extract_text(...)` (c8b200e) (#28)
- Added `utils/page.dedupe_chars(...)` (04fd56a + b132d45) (#71)
- Change character attribute `upright` from `int` to `bool` (per original `pdfminer.six` representation) (1f87898)
- Remove access and reference to `Container.figures`, given that they are not fundamental objects (8e74cb9)
- Decimalize "simple" `explicit_horizontal_lines`/`explicit_vertical_lines` descs passed to `TableFinder` methods (bc40779) (#290)
- Refactor/simplify `Page.process_objects` (1f87898), `utils.extract_words` (c8b200e), and `convert.serialize` (a74d3bc)
- Remove `test_issues.py:test_pr_77` (917467a) and narrow `test_ca_warn_report:test_objects` (6233bbd) to speed up tests
- Add `utils.resolve` (non-recursive .resolve_all) (7a90630)
- Add `page.annots` and `page.hyperlinks`, replacing non-functional `page.annos`, and mirroring pdfminer's language ("annot" vs. "anno"). (aa03961)
- Add `page/pdf.to_json` and `page/pdf.to_csv` (cbc91c6)
- Add `relative=True/False` parameter to `.crop` and `.within_bbox`; those methods also now raise exceptions for invalid and out-of-page bounding boxes. (047ad34) [h/t @samkit-jain]
- Remove `pdfplumber.from_path` and `pdfplumber.load` as deprecated; now `pdfplumber.open` is the canonical way to load a PDF. (00e789b)
- Simplify the logic in "text" table-finding strategies; in edge cases, may result in changes to results. (d224202)
- Drop support for Python 3.5 (baf1033)
- Fix `.extract_words`, which had been returning incorrect results when `horizontal_ltr = False` (d16aa13)
- Fix `utils.resize_object`, which had been failing in various permutations (d16aa13)
- Fix `lines_strict` table-finding strategy, which a typo had prevented from being usable (f0c9b85)
- Fix `utils.resolve_all` to guard against two known sources of infinite recursion (cbc91c6)
- Rename default branch to "stable," to clarify its purpose
- Reformat code with psf/black (1258e09)
- Add code linting via psf/black and flake8 (1258e09)
- Switch from nosetests to pytest (1ac16dd)
- Switch from pipenv to standard requirements.txt + python -m venv (48eaa51)
- Add GitHub action for tests + codecov (b148fd1)
- Add Makefile for building development virtual environment and running tests (4c69c58)
- Add badges to README.md (9e42dc3)
- Add Trove classifiers for Python versions to setup.py (6946e8d)
- Add MANIFEST.in (eafc15c)
- Add GitHub issue templates (c4156d6)
- Remove `pandas` from dev requirements and tests (a5e7d7f)
- Upgraded `pdfminer.six` requirement to `==20200517` (cddbff7) [h/t @youngquan]
- Add support for `non_stroking_color` attribute on `char` objects (0254da3) [h/t @idan-david]
- Fix `Page.extract_table(...)` to return `None` instead of crashing when no table is found (d64afa8) [h/t @stucka]
- Fix `.get_page_image` to prefer paths over streams, when possible (ab957de) [h/t @ubmarco]
- Local-fix pdfminer.six's `.resolve_all` to handle tuples and simplify (85f422d)
- Remove support for Python 2 and Python <3.3
- Add `utils.decimalize` performance improvement (830d117) [h/t @ubmarco]
- Fix un-referenced method when using "text" table-finding strategy (2a0c4a2)
- Add missing object type `rect_edge` to `obj_to_edges()` (0edc6bf)
- Allow `rect` and `curve` objects also to be passed to "explicit_..._lines" setting when table-finding. (And disallow other types of dicts to be passed.)
- Fix `utils.extract_text` bug introduced in prior version
- Fix and simplify obj-in-bbox logic (see commit 25672961)
- Improve/fix the way `utils.extract_text` handles vertical text (see commit 8a5d858b) [h/t @dwalton76]
- Have `Page.to_image` use bytes stream instead of file path (Issue #124 / PR #179) [h/t @cheungpat]
- Fix issue #176, in which `Page.extract_tables` did not pass kwargs to `Table.extract` [h/t @jsfenfen]
- Prevent custom LAParams from raising exception (Issue #168 / PR #169) [h/t @frascuchon]
- Add `six` as explicit dependency (for now)
- Upgrade `pdfminer.six` requirement to `==20200104`
- Upgrade `pillow` requirement to `>=7.0.0`
- Remove Python 2.7 and 3.4 from `tox` tests
- Fix sorting bug in `page.extract_table()`
- Fix support for password-protected PDFs (PR #138)
- Fixed PDF object resolution for rotation (PR #136)
- `cdecimal` support for Python 2
- Support for password-protected PDFs
- Caching for `.decimalize()` method
- Upgrade to `pdfminer.six==20181108`
- Make whitespace checking more robust (PR #88)
- Fix issue #75 (`.to_image()` custom arguments)
- Fix issue raised in PR #77 (PDFObjRef resolution), and general class of problems
- Fix issue #90, and general class of problems, by explicitly typecasting each kind of PDF Object
- Fix bug in which, when calling get_page_image(...), the alpha channel could make the whole page black out.
- Fix issue #67, in which bool-type metadata were handled incorrectly
- Fix issue #53, in which non-decimalize-able (non_)stroking_color properties were raising errors.
- `.travis.yml`, but failing on `.to_image()`
- Move from defunct `pycrypto` to `pycryptodome`
- Update `pdfminer.six` to `20170720`
- Fix issue #41, in which PDF-object-referenced cropboxes/mediaboxes weren't being fully resolved.
- Access to `__version__` from main namespace
- Fix issue #33, by checking `decode_text`'s argument type
- Pin `pdfminer.six` to version `20151013` (for now), fixing incompatibility
- Allow `import pdfplumber` even if ImageMagick not installed.
- Access to `curve` points. (E.g., `page.curves[0]["points"]`.)
- Ability for `.draw_line` to draw `curve` points.
- Disaggregated "min_words_vertical" (default: 3) and "min_words_horizontal" (default: 1), removing "text_word_threshold".
- Internally, made `utils.decimalize` a bit more robust; now throws errors on non-decimalizable items.
- Now explicitly ignoring some (obscure) `pdfminer` object attributes.
- Raw input for `.draw_line` from a bounding box to `((x, y), (x, y))`, for consistency with `curve["points"]` and with `Pillow`'s underlying method.
- Fixed typo bug when `.rect_edges` is called before `.edges`
- Quick-draw `PageImage` methods: `.draw_vline`, `.draw_vlines`, `.draw_hline`, and `.draw_hlines`.
- Boolean parameter `keep_blank_chars` for `.extract_words(...)` and `TableFinder` settings.
- Increased default `text_tolerance` and `intersection_tolerance` TableFinder values from 1 to 3.
- Properly handle conversion of PDFs with transparency to `pillow` images.
- Properly handle `pandas` DataFrames as inputs to multi-draw commands (e.g., `PageImage.draw_rects(...)`).
- Visual debugging features, via `Page.to_image(...)` and `PageImage`. (Introduces `wand` and `pillow` as package requirements.)
- More powerful options for extracting data from tables. See changes below.
- Entirely overhaul the table-extraction methods. Now based on Anssi Nurminen's master's thesis.
- Disentangle `.crop` from `.intersects_bbox` and `.within_bbox`.
- Change default `x_tolerance` and `y_tolerance` for word extraction from `5` to `3`
- Fix bug stemming from non-decimalized page heights. [h/t @jsfenfen]
- Provide access to `Page.page_number`
- Use `.page_number` instead of `.page_id` as primary identifier. [h/t @jsfenfen]
- Change default `x_tolerance` and `y_tolerance` for word extraction from `0` to `5`
- Provide proper support for rotated pages
- Fix bug stemming from when metadata includes a PostScript literal. [h/t @boblannon]
Whoops.
- When extracting table cells, use chars' midpoints instead of top-points.
- Fix find_gutters; it should ignore `" "` chars