Many parsing issues can be traced back to malformed PDFs.
Malformed PDFs can often be fixed via Ghostscript.
pdfplumber lets you automatically run those repairs, in several ways:

- pdfplumber.open(..., repair=True) will repair your PDF on the fly (but not save the repaired version to disk).
- pdfplumber.repair(path_to_pdf) will return a BytesIO object holding the bytes of a repaired version of the original file.
- pdfplumber.repair(path_to_pdf, outfile="path/to/repaired.pdf") will write a repaired version of the original file to the indicated outfile path.
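
The three repair routes can be exercised side by side. What follows is a minimal sketch, not an official recipe: the function name repair_three_ways and its src/dst arguments are hypothetical, and it assumes Ghostscript is installed where pdfplumber can find it.

```python
def repair_three_ways(src, dst):
    """Run each repair entry point on `src`; return (page_count, repaired_size).

    `src` and `dst` are hypothetical paths supplied by the caller.
    """
    import pdfplumber  # imported lazily so this module loads even without it

    # 1. Repair on the fly: the repaired PDF is parsed but never saved to disk.
    with pdfplumber.open(src, repair=True) as pdf:
        page_count = len(pdf.pages)

    # 2. Repair in memory: the result is a BytesIO holding the repaired bytes.
    repaired = pdfplumber.repair(src)
    repaired_size = len(repaired.getvalue())

    # 3. Repair to disk: the repaired bytes are written to `dst`.
    pdfplumber.repair(src, outfile=dst)

    return page_count, repaired_size
```

Calling repair_three_ways("malformed.pdf", "repaired.pdf") (both file names are placeholders) leaves the original file untouched; only the third call writes anything to disk.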
gs_path=...: You can pass a custom path for the Ghostscript executable, helpful in case pdfplumber is unable to auto-detect your copy of Ghostscript.
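
If auto-detection fails, one option is to locate the executable yourself and pass it along. The sketch below is an illustration under assumptions: find_ghostscript and open_repaired are hypothetical helpers, and the candidate names (gs, gswin64c, gswin32c) are common Ghostscript executable names that may not match your installation.

```python
import shutil


def find_ghostscript(candidates=("gs", "gswin64c", "gswin32c")):
    """Return the full path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


def open_repaired(src, gs_path=None):
    """Open `src` with repair enabled, using an explicit Ghostscript path if known."""
    import pdfplumber  # imported lazily so the helper above works without it

    # An explicit gs_path wins; otherwise fall back to our own PATH search.
    # If both are None, pdfplumber is left to auto-detect Ghostscript itself.
    return pdfplumber.open(src, repair=True, gs_path=gs_path or find_ghostscript())
```

For a Ghostscript binary that is not on PATH at all, pass its location directly, for example open_repaired("malformed.pdf", gs_path="/opt/ghostscript/bin/gs"), where both the file name and the install path are placeholders.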