Decode Integer Metadata #297

Closed
prgx-csmith01 opened this issue Oct 27, 2020 · 6 comments

@prgx-csmith01
I have received this error message for a PDF file:

Traceback (most recent call last):
  File "####", line 145, in ####
    pdf = pdfplumber.load( #### )
  File "/opt/venv-app-microservice/lib/python3.7/site-packages/pdfplumber/__init__.py", line 11, in load
    return PDF(file_or_buffer, **kwargs)
  File "/opt/venv-app-microservice/lib/python3.7/site-packages/pdfplumber/pdf.py", line 42, in __init__
    self.metadata[k] = decode_text(v)
  File "/opt/venv-app-microservice/lib/python3.7/site-packages/pdfplumber/utils.py", line 70, in decode_text
    ords = (ord(c) if type(c) == str else c for c in s)
TypeError: 'int' object is not iterable

It seems that `PDF.__init__` in pdf.py does not handle integer metadata values: it passes every value to `decode_text`, which tries to iterate over it.

A similar bug, #67, was raised previously for boolean objects.

I cannot provide the PDF used that caused this error as it is client data. The metadata of the file contains { ... , "Copies" : 0 }.
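Since the file itself cannot be shared, here is a minimal sketch that reproduces the same `TypeError` without any PDF. It is an assumption-laden stand-in, not pdfplumber's code: `decode_text_sketch` and `try_decode` are hypothetical names, only the generator expression is copied from the traceback above, and the `info` dictionary is made-up sample data that mimics the `"Copies": 0` entry.

```python
def decode_text_sketch(s):
    # Same generator expression as utils.py line 70 in the traceback.
    # Iterating over an int fails as soon as the generator is created.
    ords = (ord(c) if type(c) == str else c for c in s)
    # Assumption: joining the code points is enough for this illustration.
    return "".join(map(chr, ords))


def try_decode(value):
    # Hypothetical helper: report the error instead of crashing.
    try:
        return decode_text_sketch(value)
    except TypeError as exc:
        return f"TypeError: {exc}"


# Made-up metadata resembling the problematic file.
info = {"Title": b"Report", "Copies": 0}
results = {key: try_decode(value) for key, value in info.items()}

for key, outcome in results.items():
    print(key, "->", outcome)
```

The bytes value decodes normally, while the integer value triggers the error seen in the traceback.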

@samkit-jain
Collaborator

Hi @prgx-csmith01, would it be possible for you to redact everything from the PDF and then share it, so that it can be added as a test to PR #298?

@prgx-csmith01
Author

Hi @samkit-jain, I can't share the PDF, but we have created a test file for you with an example of the metadata issue. I hope this helps. Thanks!

test_int_metadata.pdf

samkit-jain added a commit that referenced this issue Oct 29, 2020
h/t @prgx-csmith01 for providing the PDF
@samkit-jain
Collaborator

Many thanks, @prgx-csmith01. I have updated PR #298 with the test case.

@mkl-public
As an aside: That integer value of the Copies entry is invalid.

According to the specification:

14.3.3 Document Information Dictionary

...

The value associated with any key not specifically mentioned in Table 317 shall be a text string.

(ISO 32000-1)

Table 317 contains no Copies entry, nor any other entry with a numeric type; it defines only text strings, dates, and names.

Thus, strictly speaking, this issue is not a bug (as it is currently labeled) but a request to support one more type of invalid PDF.
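To make the rule quoted above concrete, here is a rough sketch of a check that flags Info dictionary entries which are outside Table 317 and are not text strings. It assumes the metadata has already been parsed into a plain Python dictionary in which text strings appear as `str` or `bytes`. The function name `non_text_custom_entries` and the sample data are hypothetical; the key set lists the entries defined in Table 317 of ISO 32000-1.

```python
# Keys defined in Table 317 of ISO 32000-1 (document information dictionary).
TABLE_317_KEYS = {
    "Title", "Author", "Subject", "Keywords", "Creator",
    "Producer", "CreationDate", "ModDate", "Trapped",
}


def non_text_custom_entries(info):
    # Hypothetical helper: return {key: type name} for every entry that is
    # not listed in Table 317 and whose value is not a text string.
    # Assumption: text strings arrive as str or bytes after parsing.
    return {
        key: type(value).__name__
        for key, value in info.items()
        if key not in TABLE_317_KEYS and not isinstance(value, (str, bytes))
    }


# Made-up sample resembling the metadata described in this issue.
sample = {"Title": b"Report", "Department": "Sales", "Copies": 0}
flagged = non_text_custom_entries(sample)
print(flagged)
```

With this sample, only the `Copies` entry is flagged, because its value is an integer rather than a text string.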

@jsvine
Owner

jsvine commented Oct 30, 2020

@mkl-public That's a good point, and thank you for raising it. I think your diagnosis is correct. I certainly don't want to slide down the slippery slope of trying to handle all malformed PDFs. In this case, however, @samkit-jain has PR'ed an efficient solution — it's a simple adjustment, and one that hopefully will accommodate a few other classes of invalid metadata entries in the future (without becoming a burden on the processing of valid PDFs).
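For readers who want a feel for what a tolerant approach could look like, here is a minimal sketch. It is not the change made in PR #298, whose contents are not shown in this thread. The names `decode_metadata_value` and `decode_metadata` are hypothetical, and decoding bytes as UTF-8 is a simplifying assumption (real PDF text strings may use PDFDocEncoding or UTF-16).

```python
def decode_metadata_value(value):
    # Hypothetical helper: decode only what looks like text and pass
    # everything else through unchanged.
    if isinstance(value, bytes):
        # Assumption: UTF-8 with replacement, purely for illustration.
        return value.decode("utf-8", errors="replace")
    if isinstance(value, str):
        return value
    if isinstance(value, (list, tuple)):
        # Decode each element of a sequence recursively.
        return [decode_metadata_value(item) for item in value]
    # Integers, booleans, None, and other types are kept as-is,
    # so invalid entries such as "Copies": 0 no longer raise.
    return value


def decode_metadata(info):
    # Apply the tolerant decoder to every entry of a metadata dict.
    return {key: decode_metadata_value(value) for key, value in info.items()}


# Made-up sample data, including the invalid integer entry.
decoded = decode_metadata({"Title": b"Report", "Copies": 0, "Flag": False})
print(decoded)
```

The type checks add very little work for valid metadata, which is in line with the concern above about not burdening the processing of valid PDFs.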

samkit-jain added a commit that referenced this issue Oct 30, 2020
h/t @prgx-csmith01 for providing the PDF
@jsvine
Owner

jsvine commented Nov 1, 2020

Closed via #298; now available in develop and will appear in the next release.
