TypeError: cannot unpack non-iterable PDFObjRef object, when unpacking the value of 'DW2' for font CMAP #518

EucliTs0 · 2020-10-06T10:02:56Z

Hello,
I encountered an error while parsing a PDF. It occurs in pdffont.py when cmap.is_vertical() is True (line 708).

Inside that block, the following traceback is produced:

`File "/home/dtsolakidis/workspace/OCR-1183-Pdf-to-xml-crash-when-empty-page/pdfminer_in_script.py", line 43, in <module>
    interpreter.process_page(page)

  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 897, in process_page
    self.render_contents(page.resources, page.contents, ctm=ctm)

  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 908, in render_contents
    self.init_resources(resources)

  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 356, in init_resources
    self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)

  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 202, in get_font
    font = self.get_font(None, subspec)

  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 193, in get_font
    font = PDFCIDFont(self, spec)

  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdffont.py", line 718, in __init__
    (vy, w) = spec.get('DW2', [880, -1000])

TypeError: cannot unpack non-iterable PDFObjRef object`

I printed the whole 'spec' dictionary to see the type of 'DW2':
{'BaseFont': /'MS-Gothic', 'CIDSystemInfo': <PDFObjRef:151>, 'CIDToGIDMap': /'Identity', 'DW': 500, 'DW2': <PDFObjRef:152>, 'FontDescriptor': <PDFObjRef:153>, 'Subtype': /'CIDFontType2', 'Type': /'Font', 'W2': <PDFObjRef:154>, 'Encoding': /'Identity-V', 'ToUnicode': <PDFStream(155): len=507, {'Filter': /'FlateDecode', 'Length': <PDFObjRef:156>}>}

Normally I would expect it to be a list, but here it is a PDFObjRef. I have not seen anyone else report this; could it be a bug?
The list value can be obtained by calling spec['DW2'].resolve().
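
The failure and the workaround can be reproduced without the confidential PDF. The following is a minimal sketch, not pdfminer.six code: `FakeObjRef` is a hypothetical stand-in that, like `PDFObjRef`, only exposes its value through `resolve()`.

```python
class FakeObjRef:
    """Hypothetical stand-in for pdfminer's PDFObjRef (assumption, not the real class)."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        # Return the object that the indirect reference points to.
        return self._target


# 'DW2' holds an indirect reference instead of a plain two-element list.
spec = {"DW2": FakeObjRef([880, -1000])}

# Unpacking the reference directly fails, as in the traceback above.
try:
    vy, w = spec.get("DW2", [880, -1000])
    error = None
except TypeError as exc:
    error = str(exc)

# Resolving the reference first yields the list, which unpacks normally.
vy, w = spec["DW2"].resolve()
```

Calling `.resolve()` unconditionally would break for PDFs where 'DW2' is stored as a direct list, so this only demonstrates the cause, not a general fix.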

The code I use is just standard code to read the PDF:

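from pdfminer.converter import XMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

path = 'document_issue.pdf'

rsrcmgr = PDFResourceManager()
laparams = LAParams()

# Open both files in context managers so they are closed even if parsing fails.
with open(path, 'rb') as fp, open('test_out.xml', 'wb') as output:
    device = XMLConverter(rsrcmgr, outfp=output, laparams=laparams,
                          stripcontrol=True)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    for page in PDFPage.get_pages(fp, check_extractable=False):
        interpreter.process_page(page)

    # Finish the XML output before the file is closed.
    device.close()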

Unfortunately, I cannot provide the PDF file because it is a confidential document. I use the latest version of pdfminer.six.
Thank you!

pietermarsman · 2020-10-11T16:00:02Z

Hi @EucliTs0, thanks for sharing the bug. Could you copy and paste the stack trace directly from the output? The last line shows

  File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdffont.py", line 718, in __init__
    (vy, w) = spec.get('DW2', [880, -1000])

But you mention line 708, and in the current source (vy, w) = spec.get('DW2', [880, -1000]) is on line 708, not 718 as the stack trace shows. So I'm not sure whether this is about line 708 or 718.

Anyway, I think this can be solved by using resolve1(spec.get('DW2', [880, -1000])). Can you test that? If successful, do you have time to create a PR?
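
As a rough illustration of why wrapping the lookup helps, the sketch below resolves a value only when it is an indirect reference and passes direct values through unchanged. `FakeObjRef`, `resolve_if_ref` and `read_dw2` are hypothetical names used for illustration; in pdfminer.six the corresponding helper is `resolve1`, and its real implementation may differ in detail.

```python
class FakeObjRef:
    """Hypothetical stand-in for pdfminer's PDFObjRef (assumption, not the real class)."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target


def resolve_if_ref(value):
    """Follow indirect references until a direct object is reached."""
    while isinstance(value, FakeObjRef):
        value = value.resolve()
    return value


def read_dw2(spec):
    """Read the 'DW2' pair whether it is stored directly, indirectly, or not at all."""
    vy, w = resolve_if_ref(spec.get("DW2", [880, -1000]))
    return vy, w


direct = read_dw2({"DW2": [900, -900]})                 # plain list
indirect = read_dw2({"DW2": FakeObjRef([880, -1000])})  # indirect reference
chained = read_dw2({"DW2": FakeObjRef(FakeObjRef([700, -800]))})
missing = read_dw2({})                                  # falls back to the default
```

Because the helper is applied to the result of `spec.get(...)`, the default list is handled by the same code path as a direct or indirect value.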

pietermarsman · 2020-10-18T10:44:37Z

@EucliTs0 this is a friendly reminder to upload extra details about this issue.

EucliTs0 · 2020-10-21T06:44:09Z

@pietermarsman Hello, sorry for the delay, I was on holiday. Here is the full traceback, as you requested:

``Traceback (most recent call last):

File "/home/dtsolakidis/workspace/OCR-1183-Pdf-to-xml-crash-when-empty-page/pdfminer_in_script.py", line 42, in <module>
interpreter.process_page(page)

File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 895, in process_page
self.render_contents(page.resources, page.contents, ctm=ctm)

File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 906, in render_contents
self.init_resources(resources)

File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 354, in init_resources
self.fontmap[fontid] = self.rsrcmgr.get_font(objid, spec)

File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 202, in get_font
font = self.get_font(None, subspec)

File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdfinterp.py", line 193, in get_font
font = PDFCIDFont(self, spec)

File "/home/dtsolakidis/anaconda3/lib/python3.7/site-packages/pdfminer/pdffont.py", line 709, in __init__
(vy, w) = spec.get('DW2', [880, -1000])

TypeError: cannot unpack non-iterable PDFObjRef object``

Also, my bad regarding the line numbers: in the latest version of pdfminer.six, the failing line is 709.

I tried resolve1() and it solved the issue. I can create a PR for this small fix.

…cking the value of 'DW2'. An error occurs when the 'DW2' key contains a PDFObjRef object instead of a list of int values, e.g. 'DW2': <PDFObjRef:152>. To solve this issue, we use the resolve1() function. See: pdfminer#518

pietermarsman added the type:anomaly Errors caused by deviations from the PDF Reference label Oct 11, 2020

EucliTs0 mentioned this issue Oct 21, 2020

Fix TypeError: cannot unpack non-iterable PDFObjRef object, when unpacking the value of 'DW2' #529

Merged

pietermarsman closed this as completed in fc75972 Oct 25, 2020
