Issues when decrypting a PDF with empty metadata values #766

apollo13 · 2022-06-07T09:17:52Z

I am trying to run extract_text against an encrypted PDF (the password is simply the default PASSWORD_PADDING) and get the following traceback:

Traceback (most recent call last):
  File "/home/florian/sources/pdfminer.six/test.py", line 3, in <module>
    extract_text("/home/florian/Downloads/test.pdf")
  File "/home/florian/sources/pdfminer.six/pdfminer/high_level.py", line 157, in extract_text
    for page in PDFPage.get_pages(
  File "/home/florian/sources/pdfminer.six/pdfminer/pdfpage.py", line 151, in get_pages
    doc = PDFDocument(parser, password=password, caching=caching)
  File "/home/florian/sources/pdfminer.six/pdfminer/pdfdocument.py", line 746, in __init__
    self.info.append(dict_value(trailer["Info"]))
  File "/home/florian/sources/pdfminer.six/pdfminer/pdftypes.py", line 206, in dict_value
    x = resolve1(x)
  File "/home/florian/sources/pdfminer.six/pdfminer/pdftypes.py", line 118, in resolve1
    x = x.resolve(default=default)
  File "/home/florian/sources/pdfminer.six/pdfminer/pdftypes.py", line 106, in resolve
    return self.doc.getobj(self.objid)
  File "/home/florian/sources/pdfminer.six/pdfminer/pdfdocument.py", line 868, in getobj
    obj = decipher_all(self.decipher, objid, genno, obj)
  File "/home/florian/sources/pdfminer.six/pdfminer/pdftypes.py", line 146, in decipher_all
    x[k] = decipher_all(decipher, objid, genno, v)  # if v else b""
  File "/home/florian/sources/pdfminer.six/pdfminer/pdftypes.py", line 141, in decipher_all
    return decipher(objid, genno, x)
  File "/home/florian/sources/pdfminer.six/pdfminer/pdfdocument.py", line 534, in decrypt
    return self.cfm[name](objid, genno, data)
  File "/home/florian/sources/pdfminer.six/pdfminer/pdfdocument.py", line 551, in decrypt_aes128
    cipher = Cipher(
  File "/home/florian/.local/share/virtualenvs/e3be3bc38212999/lib64/python3.10/site-packages/cryptography/hazmat/primitives/ciphers/base.py", line 92, in __init__
    mode.validate_for_algorithm(algorithm)
  File "/home/florian/.local/share/virtualenvs/e3be3bc38212999/lib64/python3.10/site-packages/cryptography/hazmat/primitives/ciphers/modes.py", line 101, in _check_iv_and_key_length
    _check_iv_length(self, algorithm)
  File "/home/florian/.local/share/virtualenvs/e3be3bc38212999/lib64/python3.10/site-packages/cryptography/hazmat/primitives/ciphers/modes.py", line 75, in _check_iv_length
    raise ValueError(
ValueError: Invalid IV size (0) for CBC.

Digging in, I realized that this happens while decrypting the metadata. With some print statements added, the (still encrypted) metadata looks like this:

{'Author': b'\xda)m>b\xe0\x96\x124\xbf9\xa6?\x89^\xf1\xef\xfe\xa6\xafD<\x0f\x9c\xc4x>R\tI\xb9V', 'CreationDate': b'm\x0e\xe3l(\xa1\x1e \x1d\xcb\xc2\x03?A\x07\x84$,\xf2q\xbbr\xe1X(\xd6Q\xdf\x8c\xd1\xd7\x9f', 'Keywords': b'', ...

As you can see, Keywords is an empty byte string (the same goes for Subject).

I do not know the PDF specification well enough to say whether this is allowed (i.e. whether those empty keys should be there at all), but the cause of the error is now clear: the IV is taken from the first 16 bytes of the data, and for an empty value there are no bytes to take. One possible fix is:

diff --git a/pdfminer/pdftypes.py b/pdfminer/pdftypes.py
index f4543b9..8063108 100644
--- a/pdfminer/pdftypes.py
+++ b/pdfminer/pdftypes.py
@@ -138,6 +138,8 @@ def resolve_all(x: object, default: object = None) -> Any:
 def decipher_all(decipher: DecipherCallable, objid: int, genno: int, x: object) -> Any:
     """Recursively deciphers the given object."""
     if isinstance(x, bytes):
+        if not x:  # Do not attempt to decipher empty data (seen in the wild)
+            return x
         return decipher(objid, genno, x)
     if isinstance(x, list):
         x = [decipher_all(decipher, objid, genno, v) for v in x]

Would this be an acceptable fix for you? If yes I could prepare a PR to fix this.

Thank you for your work on pdfminer!
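
To illustrate why an empty value trips the IV check, here is a minimal stdlib-only sketch. It assumes the layout described above (the first 16 bytes of an encrypted string are the IV, the rest is ciphertext). The helper names `split_iv` and `has_valid_iv` are hypothetical and not part of pdfminer.

```python
from typing import Tuple

AES_BLOCK_SIZE = 16  # bytes; CBC needs an IV of exactly one AES block


def split_iv(data: bytes) -> Tuple[bytes, bytes]:
    """Split an encrypted string into (iv, ciphertext). Hypothetical helper."""
    return data[:AES_BLOCK_SIZE], data[AES_BLOCK_SIZE:]


def has_valid_iv(data: bytes) -> bool:
    """Return True if the data is long enough to carry a full IV."""
    iv, _ = split_iv(data)
    return len(iv) == AES_BLOCK_SIZE


author = bytes(range(32))  # stand-in for a 32-byte encrypted value
keywords = b""             # the empty value seen in the metadata

print(has_valid_iv(author))    # True
print(has_valid_iv(keywords))  # False: slicing b"" yields a 0-byte IV
```

Slicing an empty byte string never raises, so the problem only surfaces later, when the zero-length IV is handed to the CBC mode.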

pietermarsman · 2022-06-25T20:28:52Z

The specification of the trailer is in Section 10.2.1 of the PDF reference. The subject and keywords can be empty strings without encryption.

However, I'm pretty sure that encrypting an empty string will not return an empty string. That wouldn't be a very good encryption ;)

It's ok to add the if statement. I would prefer that it explicitly checks the length of the byte string.
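
A guard with an explicit length check, as suggested here, could look roughly like the sketch below. It is a simplified stand-in rather than the code that was merged: it returns new containers instead of mutating them in place, and `fake_decipher` is a hypothetical stub that reverses bytes so the example runs without any cryptography dependency.

```python
from typing import Any, Callable, List

DecipherCallable = Callable[[int, int, bytes], bytes]


def decipher_all(decipher: DecipherCallable, objid: int, genno: int, x: Any) -> Any:
    """Recursively decipher bytes inside x, skipping empty byte strings."""
    if isinstance(x, bytes):
        if len(x) == 0:  # explicit length check: nothing to decipher
            return x
        return decipher(objid, genno, x)
    if isinstance(x, list):
        return [decipher_all(decipher, objid, genno, v) for v in x]
    if isinstance(x, dict):
        return {k: decipher_all(decipher, objid, genno, v) for k, v in x.items()}
    return x


calls: List[bytes] = []


def fake_decipher(objid: int, genno: int, data: bytes) -> bytes:
    """Hypothetical stub: record the call and reverse the bytes."""
    calls.append(data)
    return data[::-1]


info = {"Author": b"abc", "Keywords": b"", "Subject": b""}
result = decipher_all(fake_decipher, 1, 0, info)
print(result)  # {'Author': b'cba', 'Keywords': b'', 'Subject': b''}
print(calls)   # [b'abc'] -- the empty values never reach the decipher
```

Checking `len(x) == 0` rather than `not x` makes the intent explicit: only zero-length byte strings are passed through untouched.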

* commit '8f52578e85b27831ab8a68a6d86721ea3348a553':
  Run black locally with nox (pdfminer#776)
  Install typing_extensions on Python 3.6 and 3.7 (pdfminer#775)
  Fix `TypeError` by Ignoring null characters in PSBaseParser (pdfminer#768)
  Fix `ValueError` with unencrypted metadata values (Fixes pdfminer#766). (pdfminer#774)
  Fix `TypeError` when getting default width of font (pdfminer#772)
  Deprecate usage of `if __name__ == "__main__"` in scripts that are not documented. Also deprecate usage of scripts that are only there for testing purposes. (pdfminer#756)
  Fix Sphinx warnings and error (pdfminer#760)
  Update CHANGELOG.md for pdfminer#755
  Remove upper version bounds (pdfminer#755)
  Ignore path constructors that do not begin with m (pdfminer#749)
  Bump version 20220506 & fix small issue with types

pietermarsman added the labels component:document (Related to PDFDocument), type:anomaly (Errors caused by deviations from the PDF Reference) and status: accepted · Jun 25, 2022

apollo13 added a commit to apollo13/pdfminer.six that referenced this issue Jun 26, 2022

Fix crash with unencrypted metadata values. Fixes pdfminer#766

331136e

apollo13 added a commit to apollo13/pdfminer.six that referenced this issue Jun 26, 2022

Fix crash with unencrypted metadata values (pdfminer#766).

ee58051

This was referenced Jun 26, 2022

Fix crash with unencrypted metadata values (Fixes #766). apollo13/pdfminer.six#1

Closed

Fix crash with unencrypted metadata values (Fixes #766). #774

Merged

pietermarsman closed this as completed in f63e9fb Jun 26, 2022
