Add `repair` method? #824

jsvine · 2023-02-24T19:45:01Z

In issue #799, @sandzone had the suggestion to add PDF-repairing as a pdfplumber feature. Although it would be impractical for pdfplumber to do the repairing itself, it seems feasible to have the library shell out to Ghostscript and/or other command-line tools (e.g., poppler, mutool, etc.) that do PDF repair. I could see having two interfaces:

pdfplumber.repair("path/to/malformed.pdf", "path/to/fixed.pdf", engine="ghostscript"), which would write the fixed version to the given output path.
pdfplumber.open("path/to/malformed.pdf", repair: Optional[Union[bool, str]] = False), where you would pass either True or the name of the engine ("ghostscript", etc.). This would write the fixed PDF to a temporary file and load that instead of the original PDF.

Whatever the interface, this would probably need some clear exception handling for when the user had not installed Ghostscript/etc.
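To make the first interface and the missing-tool handling concrete, here is a minimal sketch. It is not pdfplumber's implementation. The exception class, the engine-to-executable mapping, and the helper names are all hypothetical, and it assumes Ghostscript is installed under the executable name `gs` (on Windows the name typically differs).

```python
import shutil
import subprocess
from typing import Callable, List, Optional


class RepairEngineNotFound(RuntimeError):
    """Hypothetical exception: the requested repair tool is not on PATH."""


# Hypothetical mapping from engine name to executable name.
# Assumes Ghostscript is installed as "gs".
ENGINE_EXECUTABLES = {"ghostscript": "gs"}


def locate_engine(
    engine: str = "ghostscript",
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return the full path to the engine's executable, or raise a clear error."""
    if engine not in ENGINE_EXECUTABLES:
        raise ValueError(f"Unknown repair engine: {engine!r}")
    executable = ENGINE_EXECUTABLES[engine]
    path = which(executable)
    if path is None:
        raise RepairEngineNotFound(
            f"Could not find {executable!r} on PATH; "
            f"install {engine} to use PDF repair."
        )
    return path


def build_ghostscript_command(gs_path: str, infile: str, outfile: str) -> List[str]:
    """Build a command that re-writes infile to outfile via the pdfwrite device."""
    return [gs_path, "-o", outfile, "-sDEVICE=pdfwrite", infile]


def repair(infile: str, outfile: str, engine: str = "ghostscript") -> None:
    """Write a repaired copy of infile to outfile by shelling out to the engine."""
    gs_path = locate_engine(engine)
    # check=True raises CalledProcessError if the tool exits with an error.
    subprocess.run(build_ghostscript_command(gs_path, infile, outfile), check=True)
```

Keeping the lookup separate from the subprocess call means the "tool not installed" case can be reported before any file is touched.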


sandzone · 2023-02-26T03:27:18Z

Thanks for opening a separate thread @jsvine.

Adding a repair preprocessing step also improves the general quality of PDF parsing. For me, the repair step also solved common text issues: the parsed text was different (jumbled) from what Okular displayed, and repairing the file first resolved that too.

On Linux machines (this also works on AWS Lambda), my current solution is to invoke the preprocessing step via subprocess.call().

samkit-jain · 2023-02-26T14:43:40Z

My preference would be for the second option. When the repair fails, the PDF should still be loaded as usual, and the failure to repair should be reported as a warning.
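A minimal sketch of that warn-and-fall-back behaviour, under the assumption that the repair step is a callable that takes the original path and returns the path of the repaired file. The function name and the choice of `RuntimeWarning` are illustrative, not part of pdfplumber.

```python
import warnings
from typing import Callable


def choose_path_to_load(path: str, repair_fn: Callable[[str], str]) -> str:
    """Try to repair the PDF; on any failure, warn and fall back to the original.

    repair_fn is a hypothetical callable that returns the repaired file's path.
    """
    try:
        return repair_fn(path)
    except Exception as exc:
        warnings.warn(
            f"Could not repair {path!r} ({exc}); loading the original file instead.",
            RuntimeWarning,
        )
        return path
```

With this shape, a failed repair never prevents the file from being opened; the caller sees a warning and gets the unrepaired PDF.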

Passing a boolean or a string to the repair keyword might be a bit confusing.

Option 1: Add a new parameter, repair_method, that accepts a string. The disadvantage is one more parameter; the advantage is that it keeps things simple.
Option 2: Change the repair param to repair_method, typed as Optional[str]. If None, don't repair. If a string, repair using that technique.
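The second of the two options above could be resolved roughly as follows. This is only a sketch: the tuple of supported methods and the function name are hypothetical, and only "ghostscript" is listed because it is the one engine named throughout the thread.

```python
from typing import Optional

# Hypothetical list of accepted method names.
SUPPORTED_REPAIR_METHODS = ("ghostscript",)


def resolve_repair_method(repair_method: Optional[str] = None) -> Optional[str]:
    """Return None for 'do not repair', or a validated, lower-cased method name."""
    if repair_method is None:
        return None
    method = repair_method.lower()
    if method not in SUPPORTED_REPAIR_METHODS:
        raise ValueError(
            f"Unsupported repair_method {repair_method!r}; "
            f"expected one of {SUPPORTED_REPAIR_METHODS}."
        )
    return method
```

A single Optional[str] avoids the mixed bool-or-string type, at the cost of having no way to say "repair with whatever the default engine is" without naming it.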

jsvine · 2023-03-09T16:14:48Z

@samkit-jain, I like your proposal for breaking out repair: bool and repair_method: str into separate parameters.

Re. this:

> My preference would be for the second option

I was actually proposing implementing both interfaces; the second interface could, internally, use the code written for the first interface. Or do you think better just to have the second, without the first?

samkit-jain · 2023-03-20T17:41:17Z

> I was actually proposing implementing both interfaces; the second interface could, internally, use the code written for the first interface. Or do you think better just to have the second, without the first?

I am sorry I am unable to understand. Could you please elaborate maybe with an example?

jsvine · 2023-03-21T20:13:00Z

Sure! Interface 1:

import pdfplumber
repaired_pdf_bytes = pdfplumber.repair("corrupted.pdf")
with open("fixed.pdf", "wb") as f:
  f.write(repaired_pdf_bytes)

... or similarly:

import pdfplumber
pdfplumber.repair("corrupted.pdf", outfile="repaired.pdf")

Interface 2:

import pdfplumber
pdf = pdfplumber.open("corrupted.pdf", repair=True)
page_text = pdf.pages[0].extract_text()

samkit-jain · 2023-03-23T12:08:04Z

Thanks, @jsvine. I understand now. Yes, it makes more sense and gives more convenience to the user. Also, I think we can add a new property, repaired_pdf_path, that gives the path to the repaired PDF. I think the majority of use cases will be solved by interface 2. If a user wants to access the repaired PDF after using interface 2, then instead of re-repairing the PDF with interface 1, they can use the exposed property to get the path to the already repaired PDF.

PS: I can also take up implementing this repair functionality, unless of course you have already started working on it :)
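As an illustration of the repaired_pdf_path idea, the sketch below uses a hypothetical stand-in class rather than pdfplumber's real PDF object, and exposes the path as a plain attribute. The default `repair_fn` is only a placeholder that copies the file; a real implementation would shell out to a repair tool instead.

```python
import os
import shutil
import tempfile
from typing import Callable, Optional


class RepairablePDF:
    """Hypothetical stand-in for the object returned by pdfplumber.open()."""

    def __init__(
        self,
        path: str,
        repair: bool = False,
        # Placeholder repair step: copies the file unchanged.
        repair_fn: Callable[[str, str], object] = shutil.copyfile,
    ) -> None:
        self.path = path
        self.repaired_pdf_path: Optional[str] = None
        if repair:
            # Write the repaired copy to a temporary file and remember where.
            fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
            os.close(fd)
            repair_fn(path, tmp_path)
            self.repaired_pdf_path = tmp_path

    @property
    def path_to_parse(self) -> str:
        """The file that would actually be parsed: repaired copy if present."""
        return self.repaired_pdf_path or self.path

    def close(self) -> None:
        """Remove the temporary repaired copy, if one was created."""
        if self.repaired_pdf_path and os.path.exists(self.repaired_pdf_path):
            os.remove(self.repaired_pdf_path)
```

A caller who opened with repair=True could then copy the file at repaired_pdf_path elsewhere before closing, without running the repair a second time.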

jsvine · 2023-03-23T13:03:27Z

Thanks, @samkit-jain! That additional property sounds good to me. And thank you for offering to implement this! I haven't started on it yet.

jsvine · 2023-07-17T03:10:11Z

Now available in v0.10.0, with explanation added in https://github.com/jsvine/pdfplumber/blob/stable/docs/repairing.md

This is a new feature and somewhat experimental, so I haven't yet added it to the main documentation. I have, however, mentioned it in the bug-report issue template.

I didn't end up adding that additional property, but I'm still open to it!

jsvine added the enhancement label Feb 24, 2023

jsvine mentioned this issue Feb 24, 2023

Wrong coordinates of words when using function extract_words() #799

Closed

samkit-jain self-assigned this Jun 19, 2023

jsvine mentioned this issue Jul 16, 2023

v0.10.0 #936

Merged

jsvine closed this as completed Jul 17, 2023
