Wrong field names when dot in names #1468

Closed
PhunkyBob opened this issue Dec 2, 2022 · 8 comments · Fixed by #1529

Comments

PhunkyBob commented Dec 2, 2022

I would like to retrieve all elements of a PDF form and display the content of each one.
When a field has a "." in its name, the key returned is only the last part of the name, and the value is that of the last field sharing this (wrong) key.

Environment

Windows 11 (Windows-10-10.0.22623-SP0)
Python 3.10
pypdf 3.2.0 / PyPDF2-2.11.2

Code + PDF

I have a PDF file with the following fields:

  • customer.name
  • customer.lastname
  • other_field
  • company.name

fields_with_dots.pdf

I want to get all fields and the corresponding values.

from PyPDF2 import PdfFileReader

if __name__ == "__main__":
    pdf_file_name = "fields_with_dots.pdf"
    with open(pdf_file_name, "rb") as pdfobject:
        pdf = PdfFileReader(pdfobject)
        fields = pdf.getFormTextFields()
        for k, v in fields.items():
            print(f"{k:32} : {v}")

What I expected

customer.name                    : My Name
customer.lastname                : My Lastname
other_field                      : Hello world!
company.name                     : My company

What the result is

name                             : My company
lastname                         : My Lastname
other_field                      : Hello world!

"customer.name" is identified as "name". "company.name" is also identified as "name".
The value kept is the last one found in the document: "My company".
--> The value for "customer.name" is lost.
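
The loss described above can be reproduced without any PDF library. The following is a minimal sketch, not pypdf's actual code: it assumes the reader keys each value by the last dotted component of the field name, and the helper names `index_by_partial_name` and `index_by_qualified_name` are hypothetical.

```python
# Hypothetical (qualified name, value) pairs mirroring the fields in this report.
fields = [
    ("customer.name", "My Name"),
    ("customer.lastname", "My Lastname"),
    ("other_field", "Hello world!"),
    ("company.name", "My company"),
]


def index_by_partial_name(pairs):
    """Key each value by the last dotted component only (the lossy behaviour)."""
    result = {}
    for qualified, value in pairs:
        partial = qualified.rsplit(".", 1)[-1]
        # A later field with the same partial name silently overwrites the earlier one.
        result[partial] = value
    return result


def index_by_qualified_name(pairs):
    """Key each value by the full dotted name, so no two fields collide."""
    return dict(pairs)


by_partial = index_by_partial_name(fields)
by_qualified = index_by_qualified_name(fields)

for k, v in by_partial.items():
    print(f"{k:32} : {v}")
```

With the partial-name index, "customer.name" and "company.name" both map to "name", so only three entries survive; the qualified-name index keeps all four.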

@PhunkyBob
Author

I switched to pypdf instead of PyPDF2.
The same problem occurs.

@MartinThoma
Member

Of course. It's the same software in the end. But in pypdf the development will continue and the issue will eventually be resolved.

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Jan 4, 2023
@pubpub-zz
Collaborator

@PhunkyBob,
I've implemented the handling of hierarchical fields. You will have to modify your code to pass full_qualified_name=True to get_form_text_fields to get the fully qualified names.
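
For readers unfamiliar with hierarchical fields: in the PDF format, a field's fully qualified name is formed by joining its own partial name (the /T entry) with those of its ancestors (reached through /Parent), separated by periods. The sketch below illustrates that idea on plain dictionaries. It is not the code from the PR; the helper `qualified_name` and the sample field tree are assumptions made for illustration.

```python
def qualified_name(field):
    """Join the partial names (/T) from the root ancestor down to this field."""
    parts = []
    node = field
    while node is not None:
        if "/T" in node:
            parts.append(node["/T"])
        # Walk up the hierarchy; top-level fields have no /Parent entry.
        node = node.get("/Parent")
    return ".".join(reversed(parts))


# Hypothetical field tree mirroring the form in this issue.
customer = {"/T": "customer"}
company = {"/T": "company"}
leaves = [
    {"/T": "name", "/V": "My Name", "/Parent": customer},
    {"/T": "lastname", "/V": "My Lastname", "/Parent": customer},
    {"/T": "other_field", "/V": "Hello world!"},
    {"/T": "name", "/V": "My company", "/Parent": company},
]

values = {qualified_name(leaf): leaf["/V"] for leaf in leaves}
for k, v in values.items():
    print(f"{k:32} : {v}")
```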

@PhunkyBob
Author

Thank you.
How can I test it?

@pubpub-zz
Collaborator

> Thank you.
> How can I test it?

Yes, you can load the files from the PR.
It is not merged yet because it is still waiting for a test to be added.

@PhunkyBob
Author

PhunkyBob commented Jan 5, 2023

The result with your hierachical_fields branch:

customer.name                    : My Name
customer.lastname                : My Lastname
other_field                      : Hello world!
company.name                     : My company

-> it's what I expected!

Thank you!
I hope this will be merged soon.

Code to reproduce:

from pypdf import PdfReader

if __name__ == "__main__":
    pdf_file_name = "fields_with_dots.pdf"
    pdf = PdfReader(pdf_file_name)
    fields = pdf.get_form_text_fields(full_qualified_name=True)
    for k, v in fields.items():
        print(f"{k:32} : {v}")
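
Once the keys are fully qualified, they can be regrouped by parent if that is more convenient downstream. This is only a sketch: the helper `nest` is hypothetical, the `flat` dictionary is hand-written to mirror the output above rather than read from a PDF, and it assumes no name is used both as a parent and as a leaf.

```python
def nest(flat):
    """Turn {"a.b": v} into {"a": {"b": v}} by splitting keys on periods."""
    tree = {}
    for dotted, value in flat.items():
        *parents, leaf = dotted.split(".")
        node = tree
        for parent in parents:
            # Create the intermediate level on first use, then descend into it.
            node = node.setdefault(parent, {})
        node[leaf] = value
    return tree


# Hand-written stand-in for the result of get_form_text_fields(full_qualified_name=True).
flat = {
    "customer.name": "My Name",
    "customer.lastname": "My Lastname",
    "other_field": "Hello world!",
    "company.name": "My company",
}

nested = nest(flat)
print(nested["customer"]["name"])
```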

@ahmedshabib

> @PhunkyBob, I've implemented the handling of hierarchical fields, you will have to modify your code to add a true parameter to get_form_text_fields to get the fully_qualified_name

I checked out your PR and tried it locally. If I use get_fields, I now get all of the keys, but the result still contains some stray keys.

@pubpub-zz
Collaborator

> I checked your PR, and tried it locally, if I use get_fields , this gives me all of the keys now, but still has some stray keys.

@ahmedshabib
Can you provide an example PDF with the "stray keys"?

MartinThoma pushed a commit that referenced this issue Jan 8, 2023
Indexed names are implemented with `.` not `_` (possible mix up with names).
An optional parameter `full_qualified_name` was added to get_form_text_fields.

Fixes #1468

4 participants