DOC: Notes about form fields and annotations #1945

Merged
merged 9 commits into from
Dec 23, 2023
41 changes: 41 additions & 0 deletions docs/user/add-watermark.md
@@ -81,3 +81,44 @@ def watermark(
```

![watermark.png](watermark.png)

## Stamping images not in PDF format

The above code only works for images that are already in PDF format. However, you can easily convert an image to a PDF using [Pillow](https://pypi.org/project/Pillow/).

```python
from io import BytesIO
from pathlib import Path
from typing import List, Literal, Union

from PIL import Image
from pypdf import PdfReader, PdfWriter, Transformation


def stamp_img(
    content_pdf: Path,
    stamp_img: Path,
    pdf_result: Path,
    page_indices: Union[Literal["ALL"], List[int]] = "ALL",
):
    # Convert the image to a PDF
    img = Image.open(stamp_img)
    img_as_pdf = BytesIO()
    img.save(img_as_pdf, "pdf")
    stamp_pdf = PdfReader(img_as_pdf)

    # Then use the same stamp code from above
    stamp_page = stamp_pdf.pages[0]

    writer = PdfWriter()

    reader = PdfReader(content_pdf)
    if page_indices == "ALL":
        page_indices = list(range(0, len(reader.pages)))
    for index in page_indices:
Collaborator:
Can you review your code based on the exchanges in #1902? That provides a nicer solution that does not change pages from the reader object and also keeps the PDF structure.

Contributor Author:
I updated the PR to use writer.append() based on your suggestion.

Collaborator:
The way you are using .append() will not capture the outlines. You should rather pass the indices (as a list) as the second parameter and remove the loop:
writer.append(reader, indices)

Contributor Author:
Okay, I think I understand now. Wouldn't it be the pages parameter though? writer.append(reader, pages=page_indices)

Collaborator:
Correct

        content_page = reader.pages[index]
        content_page.merge_transformed_page(
            stamp_page,
            Transformation(),
        )
        writer.add_page(content_page)

    with open(pdf_result, "wb") as fp:
        writer.write(fp)
```
34 changes: 34 additions & 0 deletions docs/user/forms.md
@@ -8,6 +8,9 @@ from pypdf import PdfReader
reader = PdfReader("form.pdf")
fields = reader.get_form_text_fields()
fields == {"key": "value", "key2": "value2"}

# Or get Field objects instead of just text values:
MartinThoma marked this conversation as resolved.
fields = reader.get_fields()
```

## Filling out forms
@@ -27,7 +30,38 @@ writer.update_page_form_field_values(
    writer.pages[0], {"fieldname": "some filled in text"}
)

# If you want to fill out *all* pages, it is also safe to do this:
data = {"fieldname": "some filled in text", "othername": "more text for an input on a different page"}
for page in writer.pages:
    writer.update_page_form_field_values(page, data)
Collaborator:
Can you have a look at #1903? More details about auto_regenerate = False may help many people.

Contributor Author:
I read over the linked thread and honestly I failed to fully understand it. At least, not well enough to be able to write any useful documentation about it. Someone more knowledgeable than I should probably provide those additional details as to why auto_regenerate = False is recommended.

Collaborator:
Previously, update_page_form_field_values() computed the rendering of the field but also set NeedAppearances for the whole document, forcing the viewer to recompute the rendering of the fields. The side effect is that the document is marked as modified and the user is asked to save the file. If you set the auto_regenerate parameter to False, NeedAppearances is set to false.
The default value of auto_regenerate is True to keep compatibility with the legacy behavior.


# write "output" to filled-out.pdf
with open("filled-out.pdf", "wb") as output_stream:
    writer.write(output_stream)
```

## A note about form fields and annotations

PDF forms store form fields as annotations with the subtype "/Widget". This means that the following two blocks of code will give fairly similar results:

```python
from pypdf import PdfReader
reader = PdfReader("form.pdf")
fields = reader.get_fields()
```

```python
from pypdf import PdfReader
from pypdf.constants import AnnotationDictionaryAttributes

reader = PdfReader("form.pdf")
fields = []
for page in reader.pages:
    # page.annotations is None for pages without annotations
    for annot in page.annotations or []:
        annot = annot.get_object()
        if annot[AnnotationDictionaryAttributes.Subtype] == "/Widget":
            fields.append(annot)
```

However, while similar, there are some very important differences between the two blocks of code above. Most importantly, the first block returns a dictionary of Field objects keyed by field name, whereas the second returns a list of more generic dictionary-like objects. The objects will *mostly* reference the same object in the underlying PDF, meaning you'll find that `obj_taken_from_first_block.indirect_reference == obj_taken_from_second_block.indirect_reference`. Field objects are generally more ergonomic, as the exposed data can be accessed via clearly named properties. However, the more generic dictionary-like objects contain data that the Field object does not expose, such as the Rect (the widget's position on the page). So, which to use depends on your use case.

It's also important to note that the two results do not *always* refer to the same underlying PDF objects. For example, if the form contains radio buttons, `reader.get_fields()` returns the parent object (the group of radio buttons), whereas `page.annotations` returns all the child objects (the individual radio buttons).