Rect model #30

juiwenchen · 2024-01-22T09:31:04Z

@kreuzberger I cherry-picked the commit from your upgrade branch into this PR to keep the change atomic. Credit for the rect feature goes to you.

#25

-textboxes are not excluded for rects -add rect model -extract rect

kreuzberger · 2024-01-22T11:06:13Z

Hi! Thanks for the integration! I found a small issue in core.py:

LOG.info('Text rects crop: %s', 'no' if crop_rects_text else 'no')
The first one should be a "yes"

But only affects logging, so very minor

ubmarco

Good change, seems to work in a small test locally.
Please run tox -e format

ubmarco · 2024-01-22T11:29:33Z

libpdf/core.py

+        LOG.info('Extract tables: %s', 'no' if no_tables else 'yes')
+        LOG.info('Extract figures: %s', 'no' if no_figures else 'yes')
+        LOG.info('Extract rects: %s', 'no' if no_rects else 'yes')
+        LOG.info('Text rects crop: %s', 'no' if crop_rects_text else 'no')


As @kreuzberger already mentioned, is this a typo?

Suggested change

-        LOG.info('Text rects crop: %s', 'no' if crop_rects_text else 'no')
+        LOG.info('Text rects crop: %s', 'yes' if crop_rects_text else 'no')
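The typo is easy to miss because both branches of the conditional expression return the same string, so the log line never changes. A minimal sketch of the difference, using two hypothetical helper functions that are not part of libpdf:

```python
def crop_label_buggy(crop_rects_text: bool) -> str:
    # Mirrors the original line: both branches yield 'no',
    # so the flag value is never visible in the log.
    return 'no' if crop_rects_text else 'no'


def crop_label_fixed(crop_rects_text: bool) -> str:
    # Mirrors the suggested change: the label follows the flag.
    return 'yes' if crop_rects_text else 'no'


labels_buggy = [crop_label_buggy(flag) for flag in (True, False)]
labels_fixed = [crop_label_fixed(flag) for flag in (True, False)]
```

With the buggy version, `labels_buggy` holds the same label twice; with the fix, the two labels differ.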

ubmarco · 2024-01-22T12:51:17Z

libpdf/extract.py

+
+                rect_path = os.path.abspath(os.path.join(figure_dir, rect_name))
+
+                #figure = Figure(idx_figure + 1, image_path, fig_pos, links, textboxes, 'None')


This is commented-out code; please remove it.

ubmarco · 2024-01-22T13:03:56Z

Please rebase to check the new format-check tox env.

juiwenchen · 2024-01-22T13:24:04Z

@kreuzberger Sorry, I didn't pay attention to the crop_rects_text flag. In our internal discussion, we concluded that we would still like to treat the text covered by rects as paragraphs or chapters, because chapters and paragraphs may also have a background color. I have adopted this concept in this PR, so text covered by rects is not excluded from chapter and paragraph extraction. What do you think?

kreuzberger · 2024-01-22T15:18:52Z

@juiwenchen I think this is a good concept; rects should not be excluded from chapters and paragraphs by default. This option was added because the old figure handling behaved differently (figure text was removed from paragraphs and chapters), so I wanted it as an option. The option (crop_rects_text) could therefore also be removed.

ubmarco · 2024-01-22T16:35:28Z

docs/contents/pdf_model.puml


        Paragraph "+b_source  1" *-- "+links  *" Link
        Figure "+b_source  1" *-- "+links  *" Link
        Cell "+b_source  1" *-- "+links  *" Link
+        Rect "+b_source  1" *-- "+links  *" Link


I think a Rect cannot be the target of a link. It's just a graphical element.
When we search for link targets, I think what people want is a paragraph, table or figure.
So if a Rect contains text, that text is also a paragraph, and the algorithm will find it in the search area.

-remove crop_rects_text flag -text within the rect is extracted

juiwenchen · 2024-01-23T11:09:04Z

@kreuzberger I finalized the PR. Apart from the change mentioned above, I adjusted the rect model so that a rect can contain at most one textbox. Only the text covered by the rect is extracted into a newly instantiated textbox, because I don't know the best way to handle lt_textboxes that overflow the rect; this is the simplest solution from our side. What do you think?

The following summarizes the changes in this PR based on your commit. If you are happy with this PR, would you mind running it against your test case and letting us know whether we should merge it?

- Remove crop_rects_text, as the text covered by the rect is considered paragraphs or chapters.
- Extract the text covered by the rect into a newly instantiated textbox.
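To illustrate the second point, here is a minimal sketch of collecting only the characters that lie fully inside a rect. It is not the code in this PR: `BBox`, `Char` and `text_inside_rect` are hypothetical names, and the rule "a character counts only if its box is fully contained" is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BBox:
    """Axis-aligned bounding box (hypothetical helper type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, other: "BBox") -> bool:
        # True only if `other` lies completely inside this box.
        return (
            self.x0 <= other.x0
            and self.y0 <= other.y0
            and other.x1 <= self.x1
            and other.y1 <= self.y1
        )


@dataclass(frozen=True)
class Char:
    """A single character with its position (hypothetical helper type)."""
    text: str
    bbox: BBox


def text_inside_rect(rect: BBox, chars: List[Char]) -> str:
    # Keep characters covered by the rect; characters of a textbox
    # that overflows the rect are simply left out.
    return "".join(c.text for c in chars if rect.contains(c.bbox))


rect = BBox(0, 0, 10, 10)
chars = [
    Char("a", BBox(1, 1, 2, 2)),
    Char("b", BBox(3, 1, 4, 2)),
    Char("c", BBox(11, 1, 12, 2)),  # lies outside the rect
]
covered = text_inside_rect(rect, chars)
```

Filtering at character level sidesteps the question of what to do with a textbox that is only partly covered, which is the open point raised above.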

kreuzberger · 2024-01-23T13:00:35Z

Hi! I tried to test the branch, and it failed.
It seems that the sphinx-simplepdf files themselves fail in the initial parse. This was fixed in the original PR.

pdf = <pdfplumber.pdf.PDF object at 0x7fb9e9f3a350>

    def get_named_destination(pdf):  # pylint: disable=too-many-branches
        """
        Extract Name destination catalog.
    
        Extracts Name destination catalog (link target) from pdf.doc.catalog['Name'] to obtain
        the coordinates (x,y) and page for the corresponding destination's name.
    
        PDFPlumber does not provide explict 'Named Destinations of Document Catalog' like py2pdf, so it needs to be obtained
        by resolving the hierarchical indirect objects.
    
        The first step in this function is to check if the name destination exist in the PDF. If it does not, no extraction
        is executed.
    
        :param pdf: pdf object of pdfplumber.pdf.PDF
        :return: named destination dictionary mapping reference of destination by name object
        """
        LOG.info("Catalog extraction: name destination ...")
    
        # check if name tree exist in catalog and extract name tree
        name_tree = {}
        named_destination = {}
        pdf_catalog = pdf.doc.catalog
        if "Names" in pdf_catalog:
            # PDF 1.2
            if (
                isinstance(pdf_catalog["Names"], PDFObjRef)
                and "Dests" in pdf_catalog["Names"].resolve()
            ):
                name_tree = pdf_catalog["Names"].resolve()["Dests"].resolve()
            elif isinstance(pdf_catalog["Names"], dict) and "Dests" in pdf_catalog["Names"]:
>               name_tree = pdf_catalog["Names"]["Dests"].resolve()
E               AttributeError: 'dict' object has no attribute 'resolve'

The original code in the PR was:

    if 'Names' in pdf_catalog:
        # PDF 1.2
        if isinstance(pdf_catalog['Names'], PDFObjRef) and 'Dests' in pdf_catalog['Names'].resolve():
            name_tree = pdf_catalog['Names'].resolve()['Dests'].resolve()
        elif isinstance(pdf_catalog['Names'], dict) and 'Dests' in pdf_catalog['Names']:
            name_tree = resolve1(pdf_catalog['Names']['Dests'])
            # name_tree = pdf_catalog['Names']['Dests'].resolve()
            # LOG.debug(f"{name_tree}")
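The failure comes from calling `.resolve()` on a value that is already a plain dict. A rough sketch of the idea behind the fix, using a stand-in class instead of the real PDFObjRef so it runs without any PDF library; `FakeObjRef` and `resolve_if_ref` are hypothetical names, and the loop only approximates what a helper such as `resolve1` is assumed to do:

```python
class FakeObjRef:
    """Stand-in for an indirect object reference (illustration only)."""

    def __init__(self, target):
        self._target = target

    def resolve(self):
        return self._target


def resolve_if_ref(obj):
    # Follow indirect references until a direct object is reached;
    # direct objects such as dicts are returned unchanged.
    while isinstance(obj, FakeObjRef):
        obj = obj.resolve()
    return obj


dests = {"Names": ["chapter1", "dest-a"]}
direct_catalog = {"Names": {"Dests": dests}}
indirect_catalog = {"Names": {"Dests": FakeObjRef(dests)}}

# Works for both the direct and the indirect form.
from_direct = resolve_if_ref(direct_catalog["Names"]["Dests"])
from_indirect = resolve_if_ref(indirect_catalog["Names"]["Dests"])

# Calling .resolve() unconditionally fails for the direct form.
try:
    direct_catalog["Names"]["Dests"].resolve()
    direct_call_failed = False
except AttributeError:
    direct_call_failed = True
```

A helper that checks the type before resolving accepts both layouts, which is why the second branch no longer depends on 'Dests' being an indirect reference.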

@juiwenchen Suggestion: should we add a test for it in this repository, on this branch, by adding tests/test_rects.py and testing this suggested PDF?
test_rects_extraction.pdf

And how can I support this?

juiwenchen · 2024-01-23T13:23:28Z

@kreuzberger Interesting. I will apply a quick fix taken from your original PR, and then we can merge this PR first. Afterwards, you can create a PR to add the test case for test_rects_extraction.pdf. What do you think?

kreuzberger · 2024-01-23T14:54:31Z

I would suggest adding a test case before the merge to ensure the file can be opened.
It could be simple. Here are the file contents:

"""Test rects extraction."""
import libpdf
from tests.conftest import PDF_RECTS_EXTRACTION


def test_rects_extraction():
    # Smoke test: the PDF must load and expose extracted rects.
    objects = libpdf.load(PDF_RECTS_EXTRACTION)
    assert objects.flattened.rects is not None

I would then add more tests in a new PR. It is also OK if I do it all later.

While running the tests, I had severe problems executing them:

wand.exceptions.PolicyError: attempt to perform an operation not allowed by the security policy

I had to patch my etc/configs as superuser to get them running! See https://bugs.archlinux.org/task/60580
Was this due to changes in the code or the libraries used? I didn't run into those problems before...

kreuzberger · 2024-01-23T15:11:04Z

@juiwenchen OK, after the latest update the test runs locally! You can merge; I will add the tests later 😄 👍

kreuzberger and others added 6 commits January 20, 2024 14:02

cherry pick the commit from upgrade branch

5433f17

-textboxes are not excluded for rects -add rect model -extract rect

adapted code to include textboxes of rects

1381108

fixed typo

e57322a

make rect visual obvious

fc488f7

improved comments

8b4447f

fixed docs

f45b86d

juiwenchen requested a review from ubmarco January 22, 2024 10:24

adapted CLI

2ab1fd7

ubmarco requested changes Jan 22, 2024

ubmarco added 2 commits January 23, 2024 09:08

Fix ruff issues for new rect file

99d95b9

Improved docstring

b06bdcb

kreuzberger mentioned this pull request Jan 23, 2024

Color Information for Paragraphs #25

Closed

juiwenchen added 5 commits January 23, 2024 10:16

rect extraction not saved in figures

8484832

adapted model

1d2a721

changed rect

5cd2e64

-remove crop_rects_text flag -text within the rect is extracted

reformatted by ruff

b6b9661

versioned

1aa7224

juiwenchen requested a review from ubmarco January 23, 2024 11:10

juiwenchen and others added 4 commits January 23, 2024 14:23

fix for resolve()

64fc19f

Improve docs for non_stroking_color

7681390

Revert version update

7335d53

Updated changelog

946c820

ubmarco approved these changes Jan 23, 2024

juiwenchen linked an issue Jan 23, 2024 that may be closed by this pull request

Color Information for Paragraphs #25

Closed

juiwenchen merged commit aa2999f into master Jan 23, 2024
15 checks passed

ubmarco deleted the rect-model branch January 24, 2024 08:44
