ValueError: min() arg is an empty sequence #53

Closed
gtholpadiperitusai opened this issue Apr 17, 2019 · 3 comments

@gtholpadiperitusai

Describe the bug
When I run pdftotree on a PDF file, I get a runtime exception: ValueError: min() arg is an empty sequence.

To Reproduce
Steps to reproduce the behavior:

  1. Download this PDF file: performance-smart-networks.pdf

  2. Execute the following code: html = pdftotree.parse(pdf_file="performance-smart-networks.pdf", html_path=None, model_type=None, model_path=None, favor_figures=True, visualize=False)

Expected behavior
The variable html should contain the HTML mark-up with the text from the PDF.

Error Logs/Screenshots
Here is the full error stack trace:

~/anaconda3/lib/python3.6/site-packages/pdftotree/core.py in parse(pdf_file, html_path, model_type, model_path, favor_figures, visualize)
     63     if not extractor.is_scanned():
     64         log.info("Digitized PDF detected, building tree structure...")
---> 65         pdf_tree = extractor.get_tree_structure(model_type, model, favor_figures)
     66         log.info("Tree structure built, creating html...")
     67         pdf_html = extractor.get_html_tree()

~/anaconda3/lib/python3.6/site-packages/pdftotree/TreeExtract.py in get_tree_structure(self, model_type, model, favor_figures)
    236                 ref_page_seen,
    237                 tables[page_num],
--> 238                 favor_figures,
    239             )
    240         return self.tree

~/anaconda3/lib/python3.6/site-packages/pdftotree/utils/pdf/pdf_parsers.py in parse_tree_structure(elems, font_stat, page_num, ref_page_seen, tables, favor_figures)
    760     # Figures for this page
    761     figures_page = get_figures(
--> 762         mentions, elems.layout.bbox, page_num, boxes_figures, page_width, page_height
    763     )
    764 

~/anaconda3/lib/python3.6/site-packages/pdftotree/utils/pdf/pdf_parsers.py in get_figures(boxes, page_bbox, page_num, boxes_figures, page_width, page_height)
   1244 
   1245     for fig_box in boxes_figures:
-> 1246         node_fig = Node(fig_box)
   1247         nodes_figures.append(node_fig)
   1248 

~/anaconda3/lib/python3.6/site-packages/pdftotree/utils/pdf/node.py in __init__(self, elems)
     45         #     # self.sum_elem_bbox = self.sum_elem_bbox + len(elem.get_text())
     46         self.table_area_threshold = 0.7
---> 47         self.set_bbox(bound_elems(elems))
     48         # self.table_indicator = True
     49         self.type_counts = Counter(map(elem_type, elems))

~/anaconda3/lib/python3.6/site-packages/pdftotree/utils/pdf/vector_utils.py in bound_elems(elems)
    119     Finds the minimal bbox that contains all given elems
    120     """
--> 121     group_x0 = min(map(lambda l: l.x0, elems))
    122     group_y0 = min(map(lambda l: l.y0, elems))
    123     group_x1 = max(map(lambda l: l.x1, elems))

ValueError: min() arg is an empty sequence

Environment (please complete the following information):

  • OS: Ubuntu 18.04
  • Python: 3.6.4 (Anaconda distribution)
  • pdftotree Version: v0.4.0
@scherbatsky-jr

I have run into the same issue with my PDF, which is not a scanned document. I have checked all the bug fixes, and the problem still persists.

@aschetsen

I had a similar error and found that my PDF contained empty LTFigures. These empty objects cause your error: an empty figure has no child elements, so there are no l.x0, l.y0, l.x1 or l.y1 values to read, the mapping is empty, and min() fails with "min() arg is an empty sequence".
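The failure mode described here can be reproduced without pdftotree. The sketch below copies the min/max pattern visible in the traceback; the fourth line (group_y1), the returned tuple, the Box stand-in, and the bound_elems_or_none helper are assumptions added for illustration and are not pdftotree code.

```python
from collections import namedtuple

# Hypothetical stand-in for a layout element; only the four bbox
# attributes read by bound_elems are modelled.
Box = namedtuple("Box", ["x0", "y0", "x1", "y1"])


def bound_elems(elems):
    """Same min/max pattern as in the traceback; raises on empty input."""
    group_x0 = min(map(lambda l: l.x0, elems))
    group_y0 = min(map(lambda l: l.y0, elems))
    group_x1 = max(map(lambda l: l.x1, elems))
    group_y1 = max(map(lambda l: l.y1, elems))  # assumed, not shown in the traceback
    return (group_x0, group_y0, group_x1, group_y1)


def bound_elems_or_none(elems):
    """Illustrative variant: return None when there is nothing to bound."""
    elems = list(elems)
    if not elems:
        return None
    return bound_elems(elems)


boxes = [Box(1, 2, 5, 6), Box(0, 3, 4, 9)]
print(bound_elems(boxes))  # (0, 2, 5, 9)

try:
    bound_elems([])  # an empty figure contributes no elements
except ValueError as exc:
    print("ValueError:", exc)

print(bound_elems_or_none([]))  # None
```

The exact wording of the ValueError message depends on the Python version, so the sketch prints it rather than asserting it.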

I solved it by not adding empty LTFigures while collecting the elements of the PDF. You need to add a single if statement to the function processor(m) in the module pdf_utils.py (pdftotree.utils.pdf.pdf_utils). See the line marked # ADD THIS.

def processor(m):
    # Normalizes the coordinate system to be consistent with
    # image library conventions (top left as origin)
    if isinstance(m, LTComponent):
        m.set_bbox(normalize_bbox(m.bbox, height, scaler))

        if isinstance(m, LTCurve):
            m.pts = normalize_pts(m.pts, height, scaler)
            # Only keep longer lines here
            if isinstance(m, LTLine) and max(m.width, m.height) > pts_thres:
                segments.append(m)
                return
            # Here we exclude straight lines from curves
            curves.append(m)
            return

        if isinstance(m, LTFigure):
            if len(m) > 0:  # ADD THIS
                figures.append(m)
                return

        # Collect stats on the chars
        if isinstance(m, LTChar):
            chars.append(m)
            # fonts could be rotated 90/270 degrees
            font_size = _font_size_of(m)
            font_size_counter[font_size] += 1
            return

        if isinstance(m, LTTextLine):
            mention_text = keep_allowed_chars(m.get_text()).strip()
            # Skip empty and invalid lines
            if mention_text:
                # TODO: add subscript detection and use latex underscore
                # or superscript
                m.clean_text = mention_text
                m.font_name, m.font_size = _font_of_mention(m)
                mentions.append(m)
            return
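The effect of the added guard can be seen in isolation with a small stand-in. This is only a sketch: FakeFigure and collect_figures are hypothetical names, and FakeFigure merely imitates a figure container that supports len(), as the patch above assumes LTFigure does.

```python
class FakeFigure(list):
    """Hypothetical stand-in for a figure container holding child elements."""


def collect_figures(layout_objects):
    """Keep only figures that contain at least one child element."""
    figures = []
    for m in layout_objects:
        if isinstance(m, FakeFigure):
            if len(m) > 0:  # same guard as the "# ADD THIS" line
                figures.append(m)
    return figures


page = [
    FakeFigure(),             # empty figure: dropped by the guard
    FakeFigure(["char"]),     # kept
    "not a figure",           # ignored, not a FakeFigure
    FakeFigure(["a", "b"]),   # kept
]
kept = collect_figures(page)
print(len(kept))  # 2
```

Because empty figures never reach the figures list, a later bounding-box computation is not asked to take min() or max() over zero elements.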

@HiromuHota
Contributor

HiromuHota commented Oct 2, 2020

Duplicate of #42

HiromuHota marked this as a duplicate of #42 on Oct 2, 2020