ENH: add incremental capability to PdfWriter #2811

Merged 45 commits on Sep 11, 2024

Conversation

pubpub-zz
Collaborator

@pubpub-zz pubpub-zz commented Aug 23, 2024

This PR introduces a capability I have wanted to propose for a while: you can now write a PDF as an incremental update to an existing PDF. This keeps the signatures of existing forms and documents valid.

Closes #2780 (partially: the XFA form still has to be modified manually)
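
The reason an incremental write can preserve signatures is that a PDF signature covers fixed byte ranges of the file, so an update that only appends bytes leaves the signed bytes untouched. Below is a minimal sketch of how one might check that property on two files, using only the standard library. The helper name `appended_suffix` and the sample byte strings are illustrative assumptions and are not part of pypdf.

```python
from typing import Optional


def appended_suffix(original: bytes, updated: bytes) -> Optional[bytes]:
    """Return the bytes appended to `original`, or None if `updated`
    does not start with `original` byte for byte."""
    if not updated.startswith(original):
        return None
    return updated[len(original):]


# Illustrative stand-ins for the contents of PDF files (not real PDFs).
original = b"%PDF-1.7\n...signed content...\n%%EOF\n"
incremental = original + b"xref\n0 1\n...\n%%EOF\n"
rewritten = b"%PDF-1.7\n...reordered content...\n%%EOF\n"

print(appended_suffix(original, incremental) is not None)  # True
print(appended_suffix(original, rewritten) is not None)    # False
```

In practice the two arguments would be the bytes of the input file and of the file produced with `PdfWriter(..., incremental=True)`, read with `Path.read_bytes()`.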

@pubpub-zz pubpub-zz marked this pull request as draft August 23, 2024 21:19

codecov bot commented Aug 24, 2024

Codecov Report

Attention: Patch coverage is 98.16514% with 4 lines in your changes missing coverage. Please review.

Project coverage is 95.91%. Comparing base (b85c171) to head (64cf1f3).
Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pypdf/_writer.py | 97.72% | 0 Missing and 3 partials ⚠️ |
| pypdf/_doc_common.py | 96.87% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2811      +/-   ##
==========================================
+ Coverage   95.88%   95.91%   +0.02%     
==========================================
  Files          51       51              
  Lines        8576     8735     +159     
  Branches     1696     1744      +48     
==========================================
+ Hits         8223     8378     +155     
  Misses        210      210              
- Partials      143      147       +4     


@pubpub-zz pubpub-zz marked this pull request as ready for review August 26, 2024 15:20
@stefan6419846
Collaborator

Are we able to fix the partial coverage? At least the _doc_common.py case seems to be easy enough to fix, while I am unsure about the remaining ones.

@pubpub-zz
Collaborator Author

> Are we able to fix the partial coverage? At least the _doc_common.py case seems to be easy enough to fix, while I am unsure about the remaining ones.

I left the remaining ones because I would like to fix them in small follow-up changes that I want to handle separately:

  • _doc_common will be covered by a change to prevent page tree flattening and page update to flow down inherited properties
  • inf will be tested when adding the capability to wipe out /Info from the PdfWriter object

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this pull request Aug 28, 2024
@ljbergmann
Copy link

Hey, I was following this PR and tried it with the PDF I mentioned in #2780. I modified my test code to use the new incremental feature, and the created PDF has some issues.

When opening the created PDF in Acrobat, two things happen:

  1. There is an error message.
    [screenshot]
  2. The form is empty.
    [screenshot]

Firefox:
[screenshot]

Normally I would say screw Acrobat, but I have to use Acrobat, so maybe this is happening because I made an error in my code?

import urllib.request
from pathlib import Path

import pypdf
from pypdf import PdfReader, PdfWriter

# Download the IRS form once if it is not already present.
irs_form = Path("f5471sm.pdf")
if not irs_form.is_file():
    urllib.request.urlretrieve("https://www.irs.gov/pub/irs-pdf/f5471sm.pdf", "f5471sm.pdf")

# Collect the names of the text fields from the original document.
form = PdfReader("f5471sm.pdf")
fields = form.get_form_text_fields()
form.close()

# Open the same file for incremental writing and fill every field with its own name.
writer = PdfWriter("f5471sm.pdf", incremental=True)
fields = {key: key for key in fields}

# Passing None as the page processes all pages.
writer.update_page_form_field_values(None, fields)

with open("f5471sm-" + pypdf.__version__ + ".pdf", "wb") as file:
    writer.write(file)
writer.close()

@pubpub-zz
Collaborator Author

@ljbergmann I have not been able to identify why Acrobat Reader says the file is damaged. I ran other tests where no message is reported.
My current guess is that this form is an XFA form and you are only loading data into the fields, not into the XFA part.
I compared the result with filling in the form manually and noticed only this change, apart from the fact that Acrobat uses a different syntax for the reference. Help identifying the discrepancy would be welcome 😀

@ljbergmann

@pubpub-zz I'm doing my best to identify these discrepancies and contribute to the PR / project, but I have to admit Python and PDF are not my strong suit. I had a quick look at the mentioned PDF and can confirm that an XFA structure exists. Is there any documentation available on interacting with XFA?

@pubpub-zz
Collaborator Author

A few hints:
a) You should refer to the PDF reference (2.0 is quite heavy to handle; for daily work I personally prefer version 1.7).
b) A good hex editor to look at the raw data might be necessary.
c) To display an object on the console, you can use .get_object(idnum).

The simpler your document is, the better.
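
For hint b), when no hex editor is at hand, a few lines of standard-library Python can show the raw bytes at the end of a file, which is where an incremental update is appended. This is only a sketch: the function names `hexdump` and `hexdump_tail` and the default of 256 bytes are arbitrary choices, not something pypdf provides.

```python
import os


def hexdump(data: bytes, base_offset: int = 0, width: int = 16) -> str:
    """Format bytes as offset, hex values and printable ASCII, one row per line."""
    pad = width * 3 - 1  # length of a full row of "xx xx ... xx"
    rows = []
    for start in range(0, len(data), width):
        chunk = data[start:start + width]
        hex_part = " ".join(f"{byte:02x}" for byte in chunk)
        text_part = "".join(chr(byte) if 32 <= byte < 127 else "." for byte in chunk)
        rows.append(f"{base_offset + start:08x}  {hex_part.ljust(pad)}  {text_part}")
    return "\n".join(rows)


def hexdump_tail(path: str, count: int = 256) -> str:
    """Dump the last `count` bytes of a file, labelled with real file offsets."""
    size = os.path.getsize(path)
    start = max(0, size - count)
    with open(path, "rb") as handle:
        handle.seek(start)
        return hexdump(handle.read(), base_offset=start)


# Example on an in-memory byte string that resembles the end of a PDF.
print(hexdump(b"startxref\n108321\n%%EOF\n"))
```

Calling `hexdump_tail("some.pdf")` on the original and on the rewritten file makes it easy to compare their endings side by side.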

@ljbergmann

To investigate the error message further, I reduced my test script and removed the field update.

import urllib.request
from pathlib import Path

import pypdf
from pypdf import PdfWriter

# Download the IRS form once if it is not already present.
irs_form = Path("f5471sm.pdf")
if not irs_form.is_file():
    urllib.request.urlretrieve("https://www.irs.gov/pub/irs-pdf/f5471sm.pdf", "f5471sm.pdf")

# Open the file incrementally and write it back without any modification.
writer = PdfWriter("f5471sm.pdf", incremental=True)
with open("f5471sm-" + pypdf.__version__ + ".pdf", "wb") as file:
    writer.write(file)
writer.close()

The error message occurs even in this case. If you compare the original and the created PDF, the files differ only in the last 12 lines. The new PDF contains the following lines:

xref
0 1
0000000000 65535 f 
trailer
<<
/Size 1175
/Root 942 0 R
/Info 940 0 R
/Prev 107946
/ID [ <80e0541ba5885549bb7658f058d887ad> <a32415fa560cf24bb549379eed243641> ]
>>
startxref
108321
%%EOF

[screenshot]

If I remove them manually, the error message is gone. I hope this helps.
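
For readers unfamiliar with the structure of these appended lines: the `xref` subsection `0 1` holds a single entry, the free-list head for object 0, so this update adds or changes no objects. `/Prev` is the byte offset of the previous cross-reference section, and `startxref` is the offset of the new one. The following sketch pulls those numbers out of such a tail with regular expressions. It assumes a classic, uncompressed trailer like the one quoted above, and the helper name `summarize_update` is made up for illustration.

```python
import re
from typing import Dict, Optional

# The lines quoted above, joined explicitly so trailing spaces are preserved.
TAIL = b"\n".join([
    b"xref",
    b"0 1",
    b"0000000000 65535 f ",
    b"trailer",
    b"<<",
    b"/Size 1175",
    b"/Root 942 0 R",
    b"/Info 940 0 R",
    b"/Prev 107946",
    b"/ID [ <80e0541ba5885549bb7658f058d887ad> <a32415fa560cf24bb549379eed243641> ]",
    b">>",
    b"startxref",
    b"108321",
    b"%%EOF",
    b"",
])


def summarize_update(tail: bytes) -> Dict[str, Optional[int]]:
    """Extract /Size, /Prev and the startxref offset from a classic trailer."""

    def find_int(pattern: bytes) -> Optional[int]:
        match = re.search(pattern, tail)
        return int(match.group(1)) if match else None

    return {
        "size": find_int(rb"/Size\s+(\d+)"),
        "prev": find_int(rb"/Prev\s+(\d+)"),
        "startxref": find_int(rb"startxref\s+(\d+)"),
    }


print(summarize_update(TAIL))
# {'size': 1175, 'prev': 107946, 'startxref': 108321}
```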

pubpub-zz and others added 6 commits September 8, 2024 09:37
Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>
@stefan6419846
Collaborator

@pubpub-zz Could you please have a look at the remaining remarks to allow continuing with this PR and the PR built on top of this?

@pubpub-zz
Collaborator Author

@stefan6419846
Sorry, the comments were folded/hidden and I missed them.

@stefan6419846 stefan6419846 merged commit 98d4425 into py-pdf:main Sep 11, 2024
16 checks passed
@stefan6419846
Collaborator

@pubpub-zz Thanks for your patience. Could you please rebase your other PRs accordingly and point to the changes where the next reviews are possible?

@pubpub-zz pubpub-zz mentioned this pull request Sep 15, 2024
pubpub-zz added a commit that referenced this pull request Sep 17, 2024
## Version 5.0.0, 2024-09-15

This version drops support for Python 3.7 (not maintained since July 2023), PdfMerger (use PdfWriter instead) and AnnotationBuilder (use annotations instead).


### Deprecations (DEP)
- Remove the deprecated PdfMerger and AnnotationBuilder classes and other deprecations cleanup (#2813)
- Drop Python 3.7 support (#2793)

### New Features (ENH)
- Add capability to remove /Info from PDF (#2820)
- Add incremental capability to PdfWriter (#2811)
- Add UniGB-UTF16 encodings (#2819)
- Accept utf strings for metadata (#2802)
- Report PdfReadError instead of RecursionError (#2800)
- Compress PDF files merging identical objects (#2795)

### Bug Fixes (BUG)
- Fix sheared image (#2801)

### Robustness (ROB)
- Robustify .set_data() (#2821)
- Raise PdfReadError when missing /Root in trailer (#2808)
- Fix extract_text() issues on damaged PDFs (#2760)
- Handle images with empty data when processing an image from bytes (#2786)

### Developer Experience (DEV)
- Fix coverage uploads (#2832)
- Test against Python 3.13 (#2776)


[Full Changelog](4.3.1...5.0.0)
Development

Successfully merging this pull request may close these issues.

PDF-Form not editable after filling out text field (after upgrade from 3.9.* to 4.3*)