Add repair method? #824
Comments
Thanks for opening a separate thread, @jsvine. Adding a repair preprocessing step also improves the general quality of PDF parsing. For me, the repair step also solved common 'text' issues: the parsed text was different (jumbled) from what was displayed in Okular, and repairing the file first resolved that too. On Linux machines (this also works on AWS Lambda), my current solution is to invoke the preprocessing step via subprocess.call(). |
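For readers who want to try the subprocess approach described above, here is a minimal sketch. It assumes Ghostscript is installed and reachable as `gs`; the function names `build_gs_repair_command` and `repair_with_subprocess` are hypothetical, and the commenter's actual command line is not shown in this thread.

```python
import subprocess
from pathlib import Path


def build_gs_repair_command(src, dst, gs_executable="gs"):
    """Return the argument list for a Ghostscript rewrite of `src` into `dst`."""
    return [
        gs_executable,
        "-o", str(Path(dst)),   # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",    # re-emit the document as a fresh PDF
        str(Path(src)),
    ]


def repair_with_subprocess(src, dst, gs_executable="gs"):
    """Run the rewrite; return True when Ghostscript exits with status 0."""
    command = build_gs_repair_command(src, dst, gs_executable)
    return subprocess.call(command) == 0
```

Keeping the command builder separate from the call makes it easy to inspect or log the exact arguments before anything is executed.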
My preference would be for the second option. When the repair fails, the PDF should still be loaded correctly and the failure to repair should be reported as a warning. Passing a boolean or a string to the
|
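The "warn, then fall back to the original file" behaviour suggested above could look roughly like this. It is only a sketch: `open_with_optional_repair`, `opener`, and `repairer` are hypothetical names, and the two callables stand in for whatever actually loads and repairs the PDF.

```python
import warnings


def open_with_optional_repair(path, opener, repairer, repair=False):
    """Open `path`, trying `repairer` first when `repair` is truthy.

    `opener(path)` loads a PDF. `repairer(path)` returns the path of a
    repaired copy, or raises if the repair cannot be performed.
    """
    if repair:
        try:
            path = repairer(path)
        except Exception as exc:
            # A failed repair is reported, but it does not block loading.
            warnings.warn(
                f"PDF repair failed ({exc}); loading the original file instead."
            )
    return opener(path)
```

With this shape, a caller who passes `repair=True` on a machine where the repair cannot run still gets the original document, plus a warning explaining why it was not repaired.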
@samkit-jain, I like your proposal for breaking out Re. this:
I was actually proposing implementing both interfaces; the second interface could, internally, use the code written for the first interface. Or do you think it would be better to have just the second, without the first? |
I am sorry I am unable to understand. Could you please elaborate maybe with an example? |
Sure!

Interface 1:

    import pdfplumber
    repaired_pdf_bytes = pdfplumber.repair("corrupted.pdf")
    with open("fixed.pdf", "wb") as f:
        f.write(repaired_pdf_bytes)

... or similarly:

    import pdfplumber
    pdfplumber.repair("corrupted.pdf", outfile="repaired.pdf")

Interface 2:

    import pdfplumber
    pdf = pdfplumber.open("corrupted.pdf", repair=True)
    page_text = pdf.pages[0].extract_text()

|
Thanks, @jsvine. I understand now. Yes, that makes more sense, and it is more convenient for the user. Also, I think that we can add a new property PS: I can also take up implementing this repair functionality, unless of course you have already started working on it :) |
Thanks, @samkit-jain! That additional property sounds good to me. And thank you for offering to implement this! I haven't started on it yet. |
Now available in v0.10.0, with an explanation added in https://github.com/jsvine/pdfplumber/blob/stable/docs/repairing.md
This is a new feature and somewhat experimental, so I haven't yet added it to the main documentation. I have, however, mentioned it in the bug-report issue template. I didn't end up adding that additional property, but I'm still open to it! |
In issue #799, @sandzone had the suggestion to add PDF-repairing as a pdfplumber feature. Although it would be impractical for pdfplumber to do the repairing itself, it seems feasible to have the library shell out to Ghostscript and/or other command-line tools (e.g., poppler, mutool, etc.) that do PDF repair. I could see having two interfaces:

- pdfplumber.repair("path/to/malformed.pdf", "path/to/fixed.pdf", engine="ghostscript"), which would write the fixed version.
- pdfplumber.open("path/to/malformed.pdf", repair=Optional[Union[bool, str]]: False), where you would either pass True or the name of the engine ("ghostscript", etc.). This would write the fixed PDF to a temporary file, and load that instead of the original PDF.

Whatever the interface, this would probably need some clear exception handling for when the user had not installed Ghostscript/etc.