DOC: Notes about form fields and annotations #1945
@@ -8,6 +8,9 @@ from pypdf import PdfReader
reader = PdfReader("form.pdf")
fields = reader.get_form_text_fields()
fields == {"key": "value", "key2": "value2"}

# Or get Field objects instead of just text values:
fields = reader.get_fields()
```

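A small sketch of how the result of `reader.get_fields()` might be inspected. It relies only on the standard field dictionary keys `/FT` (field type) and `/V` (value); the helper name `summarize_fields` is made up for illustration and is not part of pypdf.

```python
from typing import Any, Dict, Mapping, Optional, Tuple


def summarize_fields(
    fields: Optional[Mapping[str, Mapping[str, Any]]]
) -> Dict[str, Tuple[Any, Any]]:
    """Map each field name to a (field type, value) tuple.

    `fields` is expected to look like the result of `reader.get_fields()`:
    a mapping from field name to a dictionary-like field object, or None
    when the document has no form fields.
    """
    summary = {}
    for name, field in (fields or {}).items():
        # "/FT" is e.g. "/Tx" (text) or "/Btn" (button); "/V" is the current value.
        summary[name] = (field.get("/FT"), field.get("/V"))
    return summary


# Usage with pypdf (assumes a local file named "form.pdf"):
#     from pypdf import PdfReader
#     print(summarize_fields(PdfReader("form.pdf").get_fields()))
```

Fields that have never been filled in typically have no `/V` entry, so their value shows up as `None` here.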
## Filling out forms

@@ -27,7 +30,38 @@ writer.update_page_form_field_values(
    writer.pages[0], {"fieldname": "some filled in text"}
)

# If you want to fill out *all* pages, it is also safe to do this:
data = {"fieldname": "some filled in text", "othername": "more text for an input on a different page"}
for page in writer.pages:
    writer.update_page_form_field_values(page, data)
can you have a look at #1903. more details about

I read over the linked thread and honestly I failed to fully understand it. At least, not well enough to be able to write any useful documentation about it. Someone more knowledgeable than I should probably provide those additional details as to why

previously update_page_form_field_values() is computing the rendering of the field but also sets for the whole document the field

# Write the filled-out form to filled-out.pdf
with open("filled-out.pdf", "wb") as output_stream:
    writer.write(output_stream)
```

## A note about form fields and annotations

PDF forms store their fields as annotations with the subtype "/Widget". This means that the following two blocks of code give broadly similar results:

```python
from pypdf import PdfReader

reader = PdfReader("form.pdf")
fields = reader.get_fields()
```

```python
from pypdf import PdfReader
from pypdf.constants import AnnotationDictionaryAttributes

reader = PdfReader("form.pdf")
fields = []
for page in reader.pages:
    # page.annotations is None for pages without annotations
    for annot in page.annotations or []:
        annot = annot.get_object()
        if annot[AnnotationDictionaryAttributes.Subtype] == "/Widget":
            fields.append(annot)
```

However, while similar, the two blocks differ in important ways. Most importantly, the first block returns a dictionary that maps field names to Field objects, whereas the second builds a list of more generic dictionary-like objects. The two collections will *mostly* reference the same objects in the underlying PDF, meaning you'll find that `obj_taken_from_first_block.indirect_reference == obj_taken_from_second_block.indirect_reference`. Field objects are generally more ergonomic, as the exposed data can be accessed via clearly named properties. The more generic dictionary-like objects, however, contain data that the Field object does not expose, such as the Rect (the widget's position on the page). Which to use depends on your use case.

Note also that the two collections do not *always* refer to the same underlying PDF objects. For example, if the form contains radio buttons, `reader.get_fields()` returns the parent object (the group of radio buttons), whereas `page.annotations` returns all the child objects (the individual radio buttons).
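One way to line the two views up is to resolve each widget to the field that carries its name. The sketch below assumes that a widget without its own `/T` entry (such as an individual radio button) points to its field through `/Parent`; `owning_field` and `widget_rects_by_field` are hypothetical helper names, not pypdf API, and the function takes any reader-like object with a `pages` attribute.

```python
from typing import Any, Dict, List


def owning_field(widget: Any) -> Any:
    """Walk up "/Parent" until a dictionary with a "/T" (partial name) is found."""
    node = widget
    while "/T" not in node and "/Parent" in node:
        node = node["/Parent"]
    return node


def widget_rects_by_field(reader: Any) -> Dict[str, List[Any]]:
    """Group widget rectangles ("/Rect") by the partial name of their field."""
    rects: Dict[str, List[Any]] = {}
    for page in reader.pages:
        # "/Annots" is absent on pages without annotations.
        annots = page["/Annots"] if "/Annots" in page else []
        for ref in annots:
            annot = ref.get_object()
            if "/Subtype" not in annot or annot["/Subtype"] != "/Widget":
                continue  # not a form widget (e.g. a link annotation)
            field = owning_field(annot)
            name = str(field["/T"]) if "/T" in field else "<unnamed>"
            rect = annot["/Rect"] if "/Rect" in annot else None
            rects.setdefault(name, []).append(rect)
    return rects


# Usage with pypdf (assumes a local file named "form.pdf"):
#     from pypdf import PdfReader
#     print(widget_rects_by_field(PdfReader("form.pdf")))
```

With this grouping, a radio button group appears once, under the parent's name, with one rectangle per individual button.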
can you review your code based on the exchanges in #1902
this will provide a nicer solution not changing pages from the reader object and also keeping the pdf structure

I updated the PR to use `writer.append()` based on your suggestion.

The way you are using `.append()` will not capture the outlines, you should more likely pass the indices (as a list) as second parameter and remove the loop: `writer.append(reader, indices)`

Okay, I think I understand now. Wouldn't it be the `pages` parameter though? `writer.append(reader, pages=page_indices)`

Correct
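The suggestion the thread converges on could look like the sketch below. It collects the page indices first and hands them to `writer.append()` in a single call through the `pages` parameter, instead of appending page by page. Which pages to select is not stated in the thread; picking pages that contain at least one widget annotation is an assumption made here for illustration, as are the helper names `pages_with_widgets` and `append_form_pages`.

```python
from typing import Any, List


def pages_with_widgets(reader: Any) -> List[int]:
    """Return the indices of pages that hold at least one "/Widget" annotation."""
    indices = []
    for index, page in enumerate(reader.pages):
        # "/Annots" is absent on pages without annotations.
        annots = page["/Annots"] if "/Annots" in page else []
        for ref in annots:
            annot = ref.get_object()
            if "/Subtype" in annot and annot["/Subtype"] == "/Widget":
                indices.append(index)
                break  # one widget is enough to select the page
    return indices


def append_form_pages(reader: Any, writer: Any) -> List[int]:
    """Copy the selected pages from `reader` into `writer` in one call."""
    page_indices = pages_with_widgets(reader)
    if page_indices:
        # One append() call with a list of indices, as suggested in the
        # review, rather than a loop that appends each page separately.
        writer.append(reader, pages=page_indices)
    return page_indices


# Usage with pypdf (assumes a local file named "form.pdf"):
#     from pypdf import PdfReader, PdfWriter
#     writer = PdfWriter()
#     append_form_pages(PdfReader("form.pdf"), writer)
```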