Version pdfminer.six 20191107 incorrectly orders some text #334

lithiumFlower · 2019-11-08T16:06:56Z

After upgrading from version 20181108 to 20191107, pdfminer parses some words out of order.
In version 20181108 the ordering was correct, see first output below.
In version 20191107 the ordering is incorrect, see second output below.

In this pdf: http://unec.edu.az/application/uploads/2014/12/pdf-sample.pdf

The fourth bullet point ends with the word "anyone". When parsing, "anyone" now ends up at the end of the third bullet point instead (directly following "incompatibilities").

Sample parsing logic:

import os
import sys
from io import StringIO
from pprint import pprint

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.layout import LAParams
from pdfminer.converter import TextConverter

def get_text(path):
    # Open the file in a context manager so it is closed after parsing
    with open(path, 'rb') as fd:
        parser = PDFParser(fd)
        document = PDFDocument(parser)

        # Refuse documents that do not permit text extraction
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed

        rsrcmgr = PDFResourceManager()
        text = StringIO()
        device = TextConverter(rsrcmgr, text, codec='utf-8', laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)

        # Run layout analysis on every page and collect the text
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)

    return text.getvalue()

txt = get_text(os.path.abspath(sys.argv[1]))
pprint(txt)

20181108 output

$ python testit.py pdf-sample.pdf
('Adobe Acrobat PDF Files\n'
 '\n'
 'Adobe® Portable Document Format (PDF) is a universal file format that '
 'preserves all\n'
 'of the fonts, formatting, colours and graphics of any source document, '
 'regardless of\n'
 'the application and platform used to create it.\n'
 '\n'
 'Adobe PDF is an ideal format for electronic document distribution as it '
 'overcomes the\n'
 'problems commonly encountered with electronic file sharing.\n'
 '\n'
 '•  Anyone, anywhere can open a PDF file. All you need is the free Adobe '
 'Acrobat\n'
 "Reader. Recipients of other file formats sometimes can't open files because "
 'they\n'
 "don't have the applications used to create the documents.\n"
 '\n'
 '•  PDF files always print correctly on any printing device.\n'
 '\n'
 '•  PDF  files  always  display  exactly  as  created,  regardless  of  '
 'fonts,  software,  and\n'
 'operating systems. Fonts, and graphics are not lost due to platform, '
 'software, and\n'
 'version incompatibilities.\n'
 '\n'
 '•  The  free  Acrobat  Reader  is  easy  to  download  and  can  be  freely  '
 'distributed  by\n'
 '\n'
 'anyone.\n'
 '\n'
 '•  Compact  PDF  files  are  smaller  than  their  source  files  and  '
 'download  a\n'
 '\n'
 'page at a time for fast display on the Web.\n'
 '\n'
 '\x0c')

20191107 output

$ python testit.py pdf-sample.pdf
('Adobe Acrobat PDF Files\n'
 '\n'
 'Adobe® Portable Document Format (PDF) is a universal file format that '
 'preserves all\n'
 'of the fonts, formatting, colours and graphics of any source document, '
 'regardless of\n'
 'the application and platform used to create it.\n'
 '\n'
 'Adobe PDF is an ideal format for electronic document distribution as it '
 'overcomes the\n'
 'problems commonly encountered with electronic file sharing.\n'
 '\n'
 '•  Anyone, anywhere can open a PDF file. All you need is the free Adobe '
 'Acrobat\n'
 "Reader. Recipients of other file formats sometimes can't open files because "
 'they\n'
 "don't have the applications used to create the documents.\n"
 '\n'
 '•  PDF files always print correctly on any printing device.\n'
 '\n'
 '•  PDF  files  always  display  exactly  as  created,  regardless  of  '
 'fonts,  software,  and\n'
 'operating systems. Fonts, and graphics are not lost due to platform, '
 'software, and\n'
 'version incompatibilities.\n'
 '\n'
 'anyone.\n'
 '\n'
 '•  The  free  Acrobat  Reader  is  easy  to  download  and  can  be  freely  '
 'distributed  by\n'
 '\n'
 '•  Compact  PDF  files  are  smaller  than  their  source  files  and  '
 'download  a\n'
 '\n'
 'page at a time for fast display on the Web.\n'
 '\n'
 '\x0c')
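
The two outputs above differ only in where "anyone." lands, which is easy to miss by eye. As a rough sketch (not part of the original report), the standard-library difflib module can point at the lines that moved. The helper name changed_lines and the abbreviated old_output / new_output strings are assumptions made for illustration; in practice they would be the full text returned by get_text under each version.

```python
import difflib

def changed_lines(old_text, new_text):
    """Return the lines that a unified diff marks as removed (-) or added (+)."""
    diff = list(difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(), lineterm="", n=0))
    # The first two entries are the '---' / '+++' file headers; skip them.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]

# Abbreviated, hypothetical stand-ins for the two outputs above
old_output = "version incompatibilities.\n\ndistributed  by\n\nanyone.\n\ndownload  a\n"
new_output = "version incompatibilities.\n\nanyone.\n\ndistributed  by\n\ndownload  a\n"

for line in changed_lines(old_output, new_output):
    print(line)
```

An empty result means both versions produced identical text for the document.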


pietermarsman · 2019-11-08T16:35:44Z

Hi @lithiumFlower, thanks for raising this issue. I think this is caused by PR #315, which sped up layout analysis by 20% to 500%, depending on the PDF. It should not have degraded the results, but in this case it clearly did.

I will try to pinpoint what's going wrong.

pietermarsman · 2019-11-09T10:26:09Z

I've used git bisect, and 44b223c is the first commit that has this bug, so my suspicion about PR #315 was right.

pietermarsman · 2019-11-09T10:47:56Z

Aha! I've figured it out. PR #315 changes the distance list. @mikkkee, you might want to see this.

old version starts with (0, dist(obj1, obj2), ...)
new version starts with (True, dist(obj1, obj2), ...)

In the new version it will always prefer grouping text boxes that have not been grouped yet.
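
Why the first tuple element matters can be illustrated with plain Python tuple ordering. The following is a minimal sketch and not pdfminer's code: processing_order, the flag values, and the pair labels are hypothetical. It relies only on two language facts: tuples compare element by element, and True compares equal to 1 (so it sorts after 0).

```python
def processing_order(candidates):
    """Return labels in the order a sorted list of (flag, distance, label) yields them."""
    return [label for _flag, _dist, label in sorted(candidates)]

# Same flag everywhere: the order is decided purely by distance.
uniform = [(0, 5.0, "far pair"), (0, 1.0, "near pair")]

# Mixed flags: the flag dominates, so the nearest pair is handled last
# because its leading element is True (equal to 1).
mixed = [(True, 0.5, "nearest pair"), (0, 5.0, "far pair"), (0, 1.0, "near pair")]

print(processing_order(uniform))  # ['near pair', 'far pair']
print(processing_order(mixed))    # ['near pair', 'far pair', 'nearest pair']
```

Because the leading element is compared before the distance, changing its initial value or meaning changes which pairs are considered first even when the distances themselves are unchanged.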


pietermarsman · 2019-11-10T11:57:53Z

It is fixed in the latest version!

lithiumFlower · 2019-11-11T15:30:47Z

I'm not used to open source moving quickly - thanks @pietermarsman for the fix

pietermarsman · 2019-11-11T15:53:58Z

👍

pietermarsman added the "type: bug" and "component: converter" labels Nov 8, 2019

pietermarsman mentioned this issue Nov 9, 2019

Fix ordering of grouping textboxes. The first grouping of textboxes should be skipped if there are intermediate textboxes. #335

Merged

pietermarsman closed this as completed in #335 Nov 10, 2019

pietermarsman added a commit that referenced this issue Nov 10, 2019

Fix wrong ordering of grouping textboxes introduced by #315. The first grouping of textboxes should be skipped if there are intermediate textboxes. (#335) Fixes #334

2bee7d8
