Issues when decrypting a PDF with empty metadata values #766

Closed
apollo13 opened this issue Jun 7, 2022 · 1 comment
Labels
component:document Related to PDFDocument status: accepted type:anomaly Errors caused by deviations from the PDF Reference

Comments

@apollo13

apollo13 commented Jun 7, 2022

I am trying to run extract_text against an encrypted PDF (the password is simply the default PASSWORD_PADDING) and get the following traceback:

Traceback (most recent call last):
  File "/home/florian/sources/pdfminer.six/test.py", line 3, in <module>
    extract_text("/home/florian/Downloads/test.pdf")
  File "/home/florian/sources/pdfminer.six/pdfminer/high_level.py", line 157, in extract_text
    for page in PDFPage.get_pages(
  File "/home/florian/sources/pdfminer.six/pdfminer/pdfpage.py", line 151, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "/home/florian/sources/pdfminer.six/pdfminer/pdfdocument.py", line 746, in __init__
    self.info.append(dict_value(trailer["Info"]))
  File "/home/florian/sources/pdfminer.six/pdfminer/pdftypes.py", line 206, in dict_value
    x = resolve1(x)
  File "/home/florian/sources/pdfminer.six/pdfminer/pdftypes.py", line 118, in resolve1
    x = x.resolve(default=default)
  File "/home/florian/sources/pdfminer.six/pdfminer/pdftypes.py", line 106, in resolve
    return self.doc.getobj(self.objid)
  File "/home/florian/sources/pdfminer.six/pdfminer/pdfdocument.py", line 868, in getobj
    obj = decipher_all(self.decipher, objid, genno, obj)
  File "/home/florian/sources/pdfminer.six/pdfminer/pdftypes.py", line 146, in decipher_all
    x[k] = decipher_all(decipher, objid, genno, v)  # if v else b""
  File "/home/florian/sources/pdfminer.six/pdfminer/pdftypes.py", line 141, in decipher_all
    return decipher(objid, genno, x)
  File "/home/florian/sources/pdfminer.six/pdfminer/pdfdocument.py", line 534, in decrypt
    return self.cfm[name](objid, genno, data)
  File "/home/florian/sources/pdfminer.six/pdfminer/pdfdocument.py", line 551, in decrypt_aes128
    cipher = Cipher(
  File "/home/florian/.local/share/virtualenvs/e3be3bc38212999/lib64/python3.10/site-packages/cryptography/hazmat/primitives/ciphers/base.py", line 92, in __init__
    mode.validate_for_algorithm(algorithm)
  File "/home/florian/.local/share/virtualenvs/e3be3bc38212999/lib64/python3.10/site-packages/cryptography/hazmat/primitives/ciphers/modes.py", line 101, in _check_iv_and_key_length
    _check_iv_length(self, algorithm)
  File "/home/florian/.local/share/virtualenvs/e3be3bc38212999/lib64/python3.10/site-packages/cryptography/hazmat/primitives/ciphers/modes.py", line 75, in _check_iv_length
    raise ValueError(
ValueError: Invalid IV size (0) for CBC.

Digging in I realized that this happens when decrypting the metadata; most notably the (encrypted) metadata looks like this (added print statements):

{'Author': b'\xda)m>b\xe0\x96\x124\xbf9\xa6?\x89^\xf1\xef\xfe\xa6\xafD<\x0f\x9c\xc4x>R\tI\xb9V', 'CreationDate': b'm\x0e\xe3l(\xa1\x1e \x1d\xcb\xc2\x03?A\x07\x84$,\xf2q\xbbr\xe1X(\xd6Q\xdf\x8c\xd1\xd7\x9f', 'Keywords': b'', ...

As you can see Keywords is an empty string (the same goes for Subject).

I do not know enough about the PDF specification to say whether this is allowed (i.e., whether those empty keys should be there at all), but the cause of the error is now clear: the IV is taken from the first 16 bytes of the data, and in this case there is nothing there. One fix is:

diff --git a/pdfminer/pdftypes.py b/pdfminer/pdftypes.py
index f4543b9..8063108 100644
--- a/pdfminer/pdftypes.py
+++ b/pdfminer/pdftypes.py
@@ -138,6 +138,8 @@ def resolve_all(x: object, default: object = None) -> Any:
 def decipher_all(decipher: DecipherCallable, objid: int, genno: int, x: object) -> Any:
     """Recursively deciphers the given object."""
     if isinstance(x, bytes):
+        if not x:  # Do not attempt to decipher empty data (seen in the wild)
+            return x
         return decipher(objid, genno, x)
     if isinstance(x, list):
         x = [decipher_all(decipher, objid, genno, v) for v in x]
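
To illustrate why the empty value trips the cipher, here is a minimal sketch, assuming the layout described above where the IV is the first 16 bytes of each encrypted string. `split_iv` is a hypothetical helper written for this example and is not part of pdfminer.six.

```python
from typing import Tuple

AES_BLOCK_SIZE = 16  # AES-CBC expects a 16-byte IV


def split_iv(data: bytes) -> Tuple[bytes, bytes]:
    """Split an encrypted string into (iv, ciphertext).

    Hypothetical helper: assumes the IV is stored in the first 16 bytes.
    """
    return data[:AES_BLOCK_SIZE], data[AES_BLOCK_SIZE:]


# A 32-byte value (like 'Author' above) yields a full 16-byte IV.
iv, ciphertext = split_iv(bytes(range(32)))
print(len(iv), len(ciphertext))

# An empty value (like 'Keywords' above) yields a zero-length IV,
# which is the size that CBC mode rejects in the traceback.
empty_iv, empty_ciphertext = split_iv(b"")
print(len(empty_iv), len(empty_ciphertext))
```

Slicing an empty bytes object never raises, so the problem only surfaces later when the cipher validates the IV length.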

Would this be an acceptable fix for you? If yes I could prepare a PR to fix this.

Thank you for your work on pdfminer!

@pietermarsman
Member

The specification of the trailer is in Section 10.2.1 of the PDF reference. The subject and keywords can be empty strings without encryption.

However, I'm pretty sure that encrypting an empty string will not return an empty string. That wouldn't be a very good encryption ;)

It's OK to add the if statement. I would prefer that it explicitly check the length of the byte string.
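
A standalone sketch of what the explicit length check could look like. The `DecipherCallable` alias and `fake_decipher` are assumptions made for this example (the real pdfminer.six type and ciphers may differ), and the list and dict branches are reconstructed from the diff and traceback above.

```python
from typing import Any, Callable

# Assumed shape of the callable: (objid, genno, data) -> decrypted bytes.
DecipherCallable = Callable[[int, int, bytes], bytes]


def decipher_all(decipher: DecipherCallable, objid: int, genno: int, x: object) -> Any:
    """Recursively deciphers the given object, skipping empty byte strings."""
    if isinstance(x, bytes):
        if len(x) == 0:  # explicit length check: nothing to decipher
            return x
        return decipher(objid, genno, x)
    if isinstance(x, list):
        x = [decipher_all(decipher, objid, genno, v) for v in x]
    elif isinstance(x, dict):
        for k, v in x.items():
            x[k] = decipher_all(decipher, objid, genno, v)
    return x


def fake_decipher(objid: int, genno: int, data: bytes) -> bytes:
    """Hypothetical stand-in for a CBC decipher: needs a 16-byte IV prefix."""
    if len(data) < 16:
        raise ValueError(f"Invalid IV size ({len(data)}) for CBC.")
    return data[16:]  # pretend the remainder is the plaintext


info = {"Author": bytes(range(32)), "Keywords": b""}
result = decipher_all(fake_decipher, 1, 0, info)
print(result["Keywords"])
```

With the guard in place the empty `Keywords` value is passed through unchanged instead of reaching the cipher.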

@pietermarsman pietermarsman added component:document Related to PDFDocument type:anomaly Errors caused by deviations from the PDF Reference status: accepted labels Jun 25, 2022
apollo13 added a commit to apollo13/pdfminer.six that referenced this issue Jun 26, 2022
apollo13 added a commit to apollo13/pdfminer.six that referenced this issue Jun 26, 2022
Beants added a commit to HiTalentAlgorithms/pdfminer.six that referenced this issue Aug 5, 2022
* commit '8f52578e85b27831ab8a68a6d86721ea3348a553':
  Run black locally with nox (pdfminer#776)
  Install typing_extensions on Python 3.6 and 3.7 (pdfminer#775)
  Fix `TypeError` by Ignoring null characters in PSBaseParser (pdfminer#768)
  Fix `ValueError` with unencrypted metadata values (Fixes pdfminer#766). (pdfminer#774)
  Fix `TypeError` when getting default width of font (pdfminer#772)
  Deprecate usage of `if __name__ == "__main__"` in scripts that are not documented. Also deprecate usage of scripts that are only there for testing purposes. (pdfminer#756)
  Fix Sphinx warnings and error (pdfminer#760)
  Update CHANGELOG.md for pdfminer#755
  Remove upper version bounds (pdfminer#755)
  Ignore path constructors that do not begin with  m (pdfminer#749)
  Bump version 20220506 & fix small issue with types