ENH: add incremental capability to PdfWriter #2811

pubpub-zz · 2024-08-23T21:08:15Z

This PR introduces a capability I have been meaning to propose for a while: you can now write a PDF as an incremental update to an existing PDF. This allows the signature validation of existing forms and documents to be preserved.

closes #2780 (partially: XFA forms still have to be modified manually)

closes py-pdf#2780
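
As a rough illustration of why an incremental write can keep existing signatures valid: a PDF signature covers a fixed byte range of the file as it was when signed, and an incremental update only appends bytes after the existing content. The sketch below is not pypdf code. `is_incremental_update` is a hypothetical helper and the byte strings are toy placeholders, not valid PDFs.

```python
def is_incremental_update(original: bytes, updated: bytes) -> bool:
    """Return True if `updated` is `original` followed by appended bytes only."""
    return len(updated) > len(original) and updated.startswith(original)


# Toy byte strings standing in for real files (assumption: not valid PDFs).
original = b"%PDF-1.7\n...signed content...\n%%EOF\n"
appended = original + b"...new objects...\nstartxref\n123\n%%EOF\n"
rewritten = b"%PDF-1.7\n...re-serialised content...\n%%EOF\n"

print(is_incremental_update(original, appended))   # True
print(is_incremental_update(original, rewritten))  # False
```

A full rewrite changes the bytes a signature was computed over, whereas the appended variant leaves them untouched.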

codecov · 2024-08-24T09:31:21Z

Codecov Report

Attention: Patch coverage is 98.16514% with 4 lines in your changes missing coverage. Please review.

Project coverage is 95.91%. Comparing base (b85c171) to head (64cf1f3).
Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pypdf/_writer.py | 97.72% | 0 Missing and 3 partials ⚠️ |
| pypdf/_doc_common.py | 96.87% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2811      +/-   ##
==========================================
+ Coverage   95.88%   95.91%   +0.02%     
==========================================
  Files          51       51              
  Lines        8576     8735     +159     
  Branches     1696     1744      +48     
==========================================
+ Hits         8223     8378     +155     
  Misses        210      210              
- Partials      143      147       +4

stefan6419846 · 2024-08-27T06:27:46Z

Are we able to fix the partial coverage? At least the _doc_common.py case seems to be easy enough to fix, while I am unsure about the remaining ones.

pubpub-zz · 2024-08-27T07:43:12Z

> Are we able to fix the partial coverage? At least the _doc_common.py case seems to be easy enough to fix, while I am unsure about the remaining ones.

I left the remaining ones because I would like to fix them in small follow-up changes handled separately:

- _doc_common will be covered by a change that prevents page tree flattening and updates pages to flow down inherited properties
- inf will be tested when adding the capability to wipe out /Info from the PdfWriter object

to be merged after py-pdf#2811

ljbergmann · 2024-08-28T13:10:37Z

Hey, I was following this PR and tried it with the PDF I mentioned in #2780. I modified my test code to use the new incremental feature, and the created PDF has some issues.

When opening the created PDF in Acrobat, two things happen:

- there is an error message
- the form is empty

Firefox

Normally I would say screw Acrobat, but I have to use Acrobat, so maybe this is happening because I made an error in my code?

import urllib.request
from pathlib import Path

import pypdf
from pypdf import PdfReader, PdfWriter

# Download the IRS form once if it is not already present.
irs_form = Path("f5471sm.pdf")
if not irs_form.is_file():
    urllib.request.urlretrieve("https://www.irs.gov/pub/irs-pdf/f5471sm.pdf", "f5471sm.pdf")

# Collect the names of all text fields.
form = PdfReader("f5471sm.pdf")
fields = form.get_form_text_fields()
form.close()

writer = PdfWriter("f5471sm.pdf", incremental=True)

# Fill every text field with its own name so the values are easy to spot.
fields = {key: key for key in fields}

writer.update_page_form_field_values(None, fields)

with open(f"f5471sm-{pypdf.__version__}.pdf", "wb") as file:
    writer.write(file)
writer.close()

pubpub-zz · 2024-08-28T14:42:07Z

@ljbergmann I have not been able to identify why Acrobat Reader says the file is damaged. In my other tests no message is reported.
My current guess is that this form is an XFA form and you are only loading data into the fields, not into the XFA part.
I compared the output with a manually filled form and noticed only this change, apart from the fact that Acrobat uses a different syntax for the reference. Help identifying the discrepancy would be welcome 😀
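
One way to check the XFA guess is to look for an /XFA entry in the /AcroForm dictionary of the document catalog. The sketch below is only an illustration: `has_xfa` is a hypothetical helper, and it runs on plain dictionaries standing in for a parsed catalog rather than on pypdf objects.

```python
from typing import Any, Mapping


def has_xfa(catalog: Mapping[str, Any]) -> bool:
    """Return True if the catalog's /AcroForm dictionary has an /XFA entry."""
    if "/AcroForm" not in catalog:
        return False
    acro_form = catalog["/AcroForm"]
    return "/XFA" in acro_form


# Plain dictionaries standing in for a parsed catalog (placeholder content).
print(has_xfa({"/AcroForm": {"/Fields": [], "/XFA": ["template", "..."]}}))  # True
print(has_xfa({"/AcroForm": {"/Fields": []}}))  # False
print(has_xfa({}))  # False
```

With pypdf, the catalog would presumably be obtained as `reader.trailer["/Root"]`; that mapping is an assumption here and was not tested against the IRS form.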

ljbergmann · 2024-08-28T15:08:19Z

@pubpub-zz I'm doing my best to identify these discrepancies and contribute to the PR / project, but I have to admit Python and PDF are not my strong suit. I've had a quick look at the mentioned PDF and can confirm that an XFA structure exists. Is there any documentation available on interacting with XFA?

pubpub-zz · 2024-08-29T06:39:43Z

A few hints:
a) you should refer to the PDF reference (2.0 is quite heavy to handle; for daily work I personally prefer version 1.7)
b) a good hex editor to look at the raw data might be necessary
c) to display an object on the console, you can use .get_object(idnum)

The simpler your document is, the better.
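
Hint c) can be sketched as a small loop over object numbers. `dump_objects` and `FakeReader` are hypothetical names introduced for this illustration; the only call taken from the hint is `.get_object(idnum)`. The object numbers 942 and 940 are the /Root and /Info references from the trailer quoted later in this thread, while the dictionary contents are invented placeholders.

```python
from typing import Any, Dict, Iterable


def dump_objects(reader: Any, idnums: Iterable[int]) -> Dict[int, Any]:
    """Fetch objects by number via reader.get_object(idnum) and print each one."""
    found = {}
    for idnum in idnums:
        obj = reader.get_object(idnum)
        print(f"object {idnum}: {obj!r}")
        found[idnum] = obj
    return found


class FakeReader:
    """Stand-in exposing the same get_object(idnum) shape, for demonstration only."""

    def __init__(self, objects: Dict[int, Any]) -> None:
        self._objects = objects

    def get_object(self, idnum: int) -> Any:
        return self._objects.get(idnum)


# Placeholder objects; a real session would pass a PdfReader instead.
reader = FakeReader({942: {"/Type": "/Catalog"}, 940: {"/Producer": "example"}})
objects = dump_objects(reader, [942, 940])
```

In a real session, `reader` would be a `PdfReader` opened on the file under investigation.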

ljbergmann · 2024-08-30T10:56:10Z

To investigate the error message a bit more, I reduced my test script further and removed the update of the fields.

import pypdf
from pypdf import PdfWriter
from pathlib import Path
import urllib.request

irs_form = Path("f5471sm.pdf")
if not irs_form.is_file():
    urllib.request.urlretrieve("https://www.irs.gov/pub/irs-pdf/f5471sm.pdf", "f5471sm.pdf")

writer = PdfWriter("f5471sm.pdf", incremental=True)
with open("f5471sm-"+pypdf.__version__+".pdf","wb") as file:
    writer.write(file)
writer.close()

The error message still occurs in this case. If you compare the original and the created PDF, the files differ only in the last 12 lines. The new PDF contains the following additional lines:

xref
0 1
0000000000 65535 f 
trailer
<<
/Size 1175
/Root 942 0 R
/Info 940 0 R
/Prev 107946
/ID [ <80e0541ba5885549bb7658f058d887ad> <a32415fa560cf24bb549379eed243641> ]
>>
startxref
108321
%%EOF

If I remove them manually, the error message is gone. I hope this helps.
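
For anyone who wants to inspect such an appended section programmatically, here is a minimal sketch using only the standard library. `summarise_tail` is a hypothetical helper, and it only handles a classic `trailer` dictionary like the one quoted above. In the PDF format, `startxref` gives the byte offset of the last cross-reference section and `/Prev` the offset of the previous one.

```python
import re
from typing import Dict, Optional

# The appended section exactly as quoted in this comment.
TAIL = b"""xref
0 1
0000000000 65535 f 
trailer
<<
/Size 1175
/Root 942 0 R
/Info 940 0 R
/Prev 107946
/ID [ <80e0541ba5885549bb7658f058d887ad> <a32415fa560cf24bb549379eed243641> ]
>>
startxref
108321
%%EOF
"""


def summarise_tail(tail: bytes) -> Dict[str, Optional[int]]:
    """Extract /Size, /Prev and the startxref offset from a classic trailer."""

    def find(pattern: bytes) -> Optional[int]:
        match = re.search(pattern, tail)
        return int(match.group(1)) if match else None

    return {
        "size": find(rb"/Size\s+(\d+)"),
        "prev": find(rb"/Prev\s+(\d+)"),
        "startxref": find(rb"startxref\s+(\d+)"),
    }


summary = summarise_tail(TAIL)
print(summary)  # {'size': 1175, 'prev': 107946, 'startxref': 108321}
```

Here the new cross-reference section sits at offset 108321 and chains back to the original one at 107946 through `/Prev`.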

stefan6419846 · 2024-09-10T13:53:06Z

@pubpub-zz Could you please have a look at the remaining remarks to allow continuing with this PR and the PR built on top of this?

pubpub-zz · 2024-09-10T17:06:40Z

@stefan6419846
Sorry, the comments were folded/hidden and I missed them.

stefan6419846 · 2024-09-11T15:58:40Z

@pubpub-zz Thanks for your patience. Could you please rebase your other PRs accordingly and point to the changes where the next reviews are possible?

## Version 5.0.0, 2024-09-15

This version drops support for Python 3.7 (not maintained since July 2023), PdfMerger (use PdfWriter instead) and AnnotationBuilder (use annotations instead).

### Deprecations (DEP)
- Remove the deprecated PdfMerger and AnnotationBuilder classes and other deprecations cleanup (#2813)
- Drop Python 3.7 support (#2793)

### New Features (ENH)
- Add capability to remove /Info from PDF (#2820)
- Add incremental capability to PdfWriter (#2811)
- Add UniGB-UTF16 encodings (#2819)
- Accept utf strings for metadata (#2802)
- Report PdfReadError instead of RecursionError (#2800)
- Compress PDF files merging identical objects (#2795)

### Bug Fixes (BUG)
- Fix sheared image (#2801)

### Robustness (ROB)
- Robustify .set_data() (#2821)
- Raise PdfReadError when missing /Root in trailer (#2808)
- Fix extract_text() issues on damaged PDFs (#2760)
- Handle images with empty data when processing an image from bytes (#2786)

### Developer Experience (DEV)
- Fix coverage uploads (#2832)
- Test against Python 3.13 (#2776)

[Full Changelog](4.3.1...5.0.0)

ENH: add incremental capability to PdfWriter

fba73a4

closes py-pdf#2780

pubpub-zz marked this pull request as draft August 23, 2024 21:19

fix test

0543709

pubpub-zz added 11 commits August 25, 2024 17:10

fixes + first test

29030d4

coverage

1067b74

coverage

f1d3fbe

cope with multiple level pages

ae97bc7

test + doc

d9a99d9

coverage

3c4cfdc

coverage

38d4b35

coverage

79eca73

coverage

290c5a6

coverage

173578d

Merge branch 'main' into incremental

b2b0c9e

pubpub-zz marked this pull request as ready for review August 26, 2024 15:20

pubpub-zz requested a review from stefan6419846 August 26, 2024 15:20

simplification

1a6eda5

coverage

d43d25b

pubpub-zz added 2 commits August 27, 2024 17:20

Merge branch 'main' into incremental

7e2e74d

Merge branch 'main' into incremental

708e449

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this pull request Aug 28, 2024

ENH: add capability to remove /Info from pypdf

c9a6c95

to be merged after py-pdf#2811

pubpub-zz mentioned this pull request Aug 28, 2024

ENH: add capability to remove /Info from pypdf #2820

Merged

stefan6419846 reviewed Sep 8, 2024

pypdf/generic/_data_structures.py

stefan6419846 reviewed Sep 8, 2024

pypdf/generic/_data_structures.py

stefan6419846 reviewed Sep 8, 2024

tests/test_reader.py

pubpub-zz and others added 6 commits September 8, 2024 09:37

Update pypdf/_writer.py

fbe54d0

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

Update pypdf/_writer.py

e3c1e2c

Co-authored-by: Stefan <96178532+stefan6419846@users.noreply.github.com>

clarify assert mypy

6e65943

doc hash_bin

4121672

doc hash_bin

bcc5c1d

Merge branch 'main' into incremental

02ac507

pubpub-zz requested a review from stefan6419846 September 8, 2024 08:42

stefan6419846 added 2 commits September 8, 2024 16:39

Update pypdf/_page.py

bc6caba

Update pypdf/_writer.py

8659de2

stefan6419846 reviewed Sep 8, 2024

pypdf/_writer.py (outdated)

stefan6419846 reviewed Sep 8, 2024

pypdf/constants.py (outdated)

Apply suggestions from code review

99e6dfc

fix in accordance with comments

3b81ee5

pubpub-zz requested a review from stefan6419846 September 10, 2024 17:08

fix doc

efd948b

stefan6419846 reviewed Sep 11, 2024

pypdf/_doc_common.py (outdated)

stefan6419846 reviewed Sep 11, 2024

pypdf/_writer.py (outdated)

stefan6419846 reviewed Sep 11, 2024

pypdf/_writer.py (outdated)

stefan6419846 added 2 commits September 11, 2024 08:51

fix typos

5b030dc

Update pypdf/_writer.py

64cf1f3

stefan6419846 approved these changes Sep 11, 2024

stefan6419846 merged commit 98d4425 into py-pdf:main Sep 11, 2024
16 checks passed

pubpub-zz mentioned this pull request Sep 15, 2024

REL: 5.0.0 #2851

Merged
