TypeError: cannot unpack non-iterable PDFObjRef object, when unpacking the value of 'DW2' for font CMAP #518

Closed
EucliTs0 opened this issue Oct 6, 2020 · 3 comments
Labels
type:anomaly Errors caused by deviations from the PDF Reference

Comments

@EucliTs0
EucliTs0 commented Oct 6, 2020

Hello,
I encountered an error while parsing a PDF. It occurs in pdffont.py, inside the block that runs when cmap.is_vertical() is True (line 708).

Once execution enters that block, the following traceback is produced:

`File "/home/dtsolakidis/workspace/OCR-1183-Pdf-to-xml-crash-when-empty-page/pdfminer_in_script.py", line 43, in <module>
    interpreter.process_page(page)

  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 897, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)

  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 908, in render_contents
    self.init_resources(resources)

  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 356, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)

  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 202, in get_font
    font = self.get_font(None, subspec)

  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 193, in get_font
    font = PDFCIDFont(self, spec)

  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdffont.py", line 718, in __init__
    (vy, w) = spec.get('DW2', [880, -1000])

TypeError: cannot unpack non-iterable PDFObjRef object`

I printed the whole 'spec' dictionary to see the type of 'DW2':
{'BaseFont': /'MS-Gothic', 'CIDSystemInfo': <PDFObjRef:151>, 'CIDToGIDMap': /'Identity', 'DW': 500, 'DW2': <PDFObjRef:152>, 'FontDescriptor': <PDFObjRef:153>, 'Subtype': /'CIDFontType2', 'Type': /'Font', 'W2': <PDFObjRef:154>, 'Encoding': /'Identity-V', 'ToUnicode': <PDFStream(155): len=507, {'Filter': /'FlateDecode', 'Length': <PDFObjRef:156>}>}

I would expect 'DW2' to be a list, but here it is a PDFObjRef. I have not seen anyone else report this. Could it be a bug?
The list value can be obtained by calling spec['DW2'].resolve().
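To illustrate the failure mode without the confidential file, here is a minimal sketch. `FakeObjRef` is a hypothetical stand-in for pdfminer's `PDFObjRef`, not the real class, and the `[880, -1000]` payload is an illustrative value, since the actual contents of object 152 are not shown in this report.

```python
class FakeObjRef:
    """Hypothetical stand-in for pdfminer's PDFObjRef: it wraps a value
    that has to be fetched with resolve() before it can be used."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target


# Mimics the reported font spec, where 'DW2' is an indirect reference
# rather than a direct two-element list. The payload is made up.
spec = {"DW": 500, "DW2": FakeObjRef([880, -1000])}

try:
    # Same shape as the failing line in pdffont.py.
    (vy, w) = spec.get("DW2", [880, -1000])
    unpack_failed = False
except TypeError:
    # The reference object itself is not iterable, so unpacking fails.
    unpack_failed = True

# Resolving the reference first yields the list, which unpacks cleanly.
(vy, w) = spec["DW2"].resolve()
print(unpack_failed, vy, w)
```

The default passed to `spec.get` is only used when the key is absent, so it does not help when the key is present but holds a reference.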

The code I use is just standard code to read the PDF:

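from pdfminer.converter import XMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

path = 'document_issue.pdf'

rsrcmgr = PDFResourceManager()
laparams = LAParams()

# Context managers close both files even if a page fails to parse.
with open(path, 'rb') as fp, open('test_out.xml', 'wb') as output:
    device = XMLConverter(rsrcmgr, outfp=output, laparams=laparams,
                          stripcontrol=True)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    # check_extractable=False processes pages even if extraction is disallowed.
    for page in PDFPage.get_pages(fp, check_extractable=False):
        interpreter.process_page(page)

    # Writes the closing XML footer before the output file is closed.
    device.close()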

Unfortunately, I cannot provide the PDF file because it is a confidential document. I am using the latest version of pdfminer.six.
Thank you!

@pietermarsman pietermarsman added the type:anomaly Errors caused by deviations from the PDF Reference label Oct 11, 2020
@pietermarsman
Member

Hi @EucliTs0, thanks for sharing the bug. Could you copy paste the stacktrace directly from the output? The last line shows

  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdffont.py", line 718, in __init__
    (vy, w) = spec.get('DW2', [880, -1000])

But you mention line 708, and (vy, w) = spec.get('DW2', [880, -1000]) is indeed on line 708, not 718 as the stack trace shows. So I'm not sure whether this is about line 708 or 718.

Anyway, I think this can be solved by using resolve1(spec.get('DW2', [880, -1000])). Can you test that? If it is successful, do you have time to create a PR?
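A rough sketch of how the suggested change behaves, under stated assumptions: `FakeObjRef`, `resolve1_sketch`, and `vertical_metrics` are hypothetical names written for this illustration. pdfminer.six provides the real `resolve1` in `pdfminer.pdftypes`, and that is what the actual fix would call; the stand-in here only mirrors the idea of following references until a direct value is reached.

```python
class FakeObjRef:
    """Hypothetical stand-in for an indirect PDF object reference."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target


def resolve1_sketch(x):
    """Follow references until a direct (non-reference) value is reached.

    Direct values are returned unchanged, so the helper is safe to apply
    whether or not the PDF stored the value indirectly.
    """
    while isinstance(x, FakeObjRef):
        x = x.resolve()
    return x


def vertical_metrics(spec):
    """Hypothetical wrapper around the patched line from pdffont.py."""
    (vy, w) = resolve1_sketch(spec.get("DW2", [880, -1000]))
    return vy, w


# The same call works for a direct list, an indirect reference,
# and a spec that omits 'DW2' and falls back to the default.
direct = vertical_metrics({"DW2": [900, -1100]})
indirect = vertical_metrics({"DW2": FakeObjRef([900, -1100])})
missing = vertical_metrics({})
print(direct, indirect, missing)
```

Wrapping the whole `spec.get(...)` expression, rather than only the stored value, keeps the default path and the resolved path on a single line.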

@pietermarsman
Member

@EucliTs0 this is a friendly reminder to upload extra details about this issue.

@EucliTs0
Author

EucliTs0 commented Oct 21, 2020

@pietermarsman Hello, sorry for the delay, I was on holiday. Here is the full traceback, as you requested:

``Traceback (most recent call last):
  File "/home/dtsolakidis/workspace/OCR-1183-Pdf-to-xml-crash-when-empty-page/pdfminer_in_script.py", line 42, in <module>
    interpreter.process_page(page)
  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 895, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)
  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 906, in render_contents
    self.init_resources(resources)
  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 354, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)
  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 202, in get_font
    font = self.get_font(None, subspec)
  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 193, in get_font
    font = PDFCIDFont(self, spec)
  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdffont.py", line 709, in __init__
    (vy, w) = spec.get('DW2', [880, -1000])
TypeError: cannot unpack non-iterable PDFObjRef object``

Also, my mistake regarding the line number: in the latest version of pdfminer.six, the failing statement is on line 709.

I tried resolve1() and it solved the issue. I can create a PR for this small fix.

EucliTs0 pushed a commit to EucliTs0/pdfminer.six that referenced this issue Oct 21, 2020
…cking the value of 'DW2'

An error occurs when the 'DW2' key contains a PDFObjRef object instead of a list of int values, e.g. 'DW2': <PDFObjRef:152>.
To solve this issue, we utilise the resolve1() function.

See: pdfminer#518