Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add repair method? #824

Closed
jsvine opened this issue Feb 24, 2023 · 8 comments
Closed

Add repair method? #824

jsvine opened this issue Feb 24, 2023 · 8 comments
Assignees

Comments

@jsvine
Copy link
Owner

jsvine commented Feb 24, 2023

In issue #799, @sandzone had the suggestion to add PDF-repairing as a pdfplumber feature. Although it would be impractical for pdfplumber to do the repairing itself, it seems feasible to have the library shell out to Ghostscript and/or other command-line tools (e.g., poppler, mutool, etc.) that do PDF repair. I could see having two interfaces:

  • pdfplumber.repair("path/to/malformed.pdf", "path/to/fixed.pdf", engine="ghostscript"), which would write the fixed version.

  • pdfplumber.open("path/to/malformed.pdf", repair=Optional[Union[bool, str]]: False), where you would either pass True or the name of the engine ("ghostscript", etc.). This would write the fixed PDF to a temporary file, and load that instead of the original PDF.

Whatever the interface, this would probably need some clear exception handling for when the user had not installed Ghostscript/etc.

@sandzone
Copy link

Thanks for opening a separate thread @jsvine.

Adding a repair preprocess is also improving the general quality of pdf parsing. For me, repair process also solved common 'text' issues. Parsed text was different (jumbled) from what was being displayed in Okular. Repair pre-process resolved that issue too.

For linux machines (also works with AWS lambda), invoking the preprocess step via subprocess.call() is my current solution.

@samkit-jain
Copy link
Collaborator

My preference would be for the second option. When the repair fails, the PDF should still be loaded correctly and the failure to repair be notified as a warning.

Passing a boolean or a string to the repair keyword might be a bit confusing.

  • Can add a new parameter repair_method which would accept a string. It has the disadvantage of having a new parameter added but has the advantage of keeping things a bit simple.
  • Can change the repair param to repair_method which would be Optional[str]. If None, don't repair. If a string, repair using that technique.

@jsvine
Copy link
Owner Author

jsvine commented Mar 9, 2023

@samkit-jain, I like your proposal for breaking out repair: bool and repair_method: str into separate parameters.

Re. this:

My preference would be for the second option

I was actually proposing implementing both interfaces; the second interface could, internally, use the code written for the first interface. Or do you think better just to have the second, without the first?

@samkit-jain
Copy link
Collaborator

I was actually proposing implementing both interfaces; the second interface could, internally, use the code written for the first interface. Or do you think better just to have the second, without the first?

I am sorry I am unable to understand. Could you please elaborate maybe with an example?

@jsvine
Copy link
Owner Author

jsvine commented Mar 21, 2023

Sure! Interface 1:

import pdfplumber
repaired_pdf_bytes = pdfplumber.repair("corrupted.pdf")
with open("fixed.pdf", "wb") as f:
  f.write(repaired_pdf_bytes)

... or similarly:

import pdfplumber
pdfplumber.repair("corrupted.pdf", outfile="repaired.pdf")

Interface 2:

import pdfplumber
pdf = pdfplumber.open("corrupted.pdf", repair=True)
page_text = pdf.pages[0].extract_text()

@samkit-jain
Copy link
Collaborator

Thanks @jsvine Able to understand now. Yes, it makes more sense. Gives more convenience to the user. Also, I think that we can add a new property repaired_pdf_path that will give the path to the repaired PDF. I think that the majority of the use-cases will be solved by interface 2. If there comes a use-case that the user wants to access the repaired PDF after using interface 2, instead of re-repairing the PDF using interface 1, they can use the exposed property and get the path to the already repaired PDF.

PS: I can also take up implementing this repair functionality unless of course you haven't already started working on it :)

@jsvine
Copy link
Owner Author

jsvine commented Mar 23, 2023

Thanks, @samkit-jain! That additional property sounds good to me. And thank you for offering to implement this! I haven't started on it yet.

@samkit-jain samkit-jain self-assigned this Jun 19, 2023
@jsvine jsvine mentioned this issue Jul 16, 2023
@jsvine
Copy link
Owner Author

jsvine commented Jul 17, 2023

Now available in v0.10.0, with explanation added in https://github.com/jsvine/pdfplumber/blob/stable/docs/repairing.md

This is a new feature and somewhat experimental, so I haven't yet added it to the main documentation. I have, however, mentioned it in the bug-report issue template.

I didn't end up adding that additional property, but I'm still open to it!

@jsvine jsvine closed this as completed Jul 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants