Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rect model #30

Merged
merged 18 commits into from
Jan 23, 2024
Merged

Rect model #30

merged 18 commits into from
Jan 23, 2024

Conversation

juiwenchen
Copy link
Contributor

@juiwenchen juiwenchen commented Jan 22, 2024

@kreuzberger I cherry picked your branch upgrade to this PR in order to make the change atomic. The credit for this rect feature is for you.

#25

@kreuzberger
Copy link
Contributor

kreuzberger commented Jan 22, 2024

Hi! Thanks for integration!. If found a small issue in core.py

LOG.info('Text rects crop: %s', 'no' if crop_rects_text else 'no')
The first one should be a "yes"

But only affects logging, so very minor

Copy link
Member

@ubmarco ubmarco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good change, seems to work in a small test locally.
Please run tox -e format

libpdf/core.py Outdated
LOG.info('Extract tables: %s', 'no' if no_tables else 'yes')
LOG.info('Extract figures: %s', 'no' if no_figures else 'yes')
LOG.info('Extract rects: %s', 'no' if no_rects else 'yes')
LOG.info('Text rects crop: %s', 'no' if crop_rects_text else 'no')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as @kreuzberger already mentioned, is this a typo?

Suggested change
LOG.info('Text rects crop: %s', 'no' if crop_rects_text else 'no')
LOG.info('Text rects crop: %s', 'yes' if crop_rects_text else 'no')


rect_path = os.path.abspath(os.path.join(figure_dir, rect_name))

#figure = Figure(idx_figure + 1, image_path, fig_pos, links, textboxes, 'None')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commented code

@ubmarco
Copy link
Member

ubmarco commented Jan 22, 2024

Please rebase to check the new format-check tox env.

@juiwenchen
Copy link
Contributor Author

juiwenchen commented Jan 22, 2024

Hi! Thanks for integration!. If found a small issue in core.py

LOG.info('Text rects crop: %s', 'no' if crop_rects_text else 'no') The first one should be a "yes"

But only affects logging, so very minor

@kreuzberger Sorry, I didn't pay attention to the flag crop_rects_text. In our internal discussion, we would still like to consider the text in the coverage of rects as paragraphs or chapters. The reason is that chapters and paragraphs may also have the background color. I have adapted this concept in this RP, so rects are not excluded in chapter and paragraphs extraction. What do you think?

@kreuzberger
Copy link
Contributor

Hi! Thanks for integration!. If found a small issue in core.py
LOG.info('Text rects crop: %s', 'no' if crop_rects_text else 'no') The first one should be a "yes"
But only affects logging, so very minor

@kreuzberger Sorry, I didn't pay attention to the flag crop_rects_text. In our internal discussion, we would still like to consider the text in the coverage of rects as paragraphs or chapters. The reason is that chapters and paragraphs may also have the background color. I have adapted this concept in this RP, so rects are not excluded in chapter and paragraphs extraction. What do you think?

@juiwenchen I think this is a good concept, rects should not be excluded in chapter and paragrapsh by default. This option was added cause the old figure handling behaved different (figure text was removed from paragraphs and chapters) and i wanted it therefore as an option. So this options could also be removed (crop_rects_text).


Paragraph "+b_source 1" *-- "+links *" Link
Figure "+b_source 1" *-- "+links *" Link
Cell "+b_source 1" *-- "+links *" Link
Rect "+b_source 1" *-- "+links *" Link
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Rect cannot be a target of a link. It's just a graphical element.
If we do target search for links I think what people want is a paragraph, table or figure.
So if a Rect contains text, it is also a paragraph and the alorithm will find this in the search area.

@juiwenchen
Copy link
Contributor Author

juiwenchen commented Jan 23, 2024

Hi! Thanks for integration!. If found a small issue in core.py
LOG.info('Text rects crop: %s', 'no' if crop_rects_text else 'no') The first one should be a "yes"
But only affects logging, so very minor

@kreuzberger Sorry, I didn't pay attention to the flag crop_rects_text. In our internal discussion, we would still like to consider the text in the coverage of rects as paragraphs or chapters. The reason is that chapters and paragraphs may also have the background color. I have adapted this concept in this RP, so rects are not excluded in chapter and paragraphs extraction. What do you think?

@juiwenchen I think this is a good concept, rects should not be excluded in chapter and paragrapsh by default. This option was added cause the old figure handling behaved different (figure text was removed from paragraphs and chapters) and i wanted it therefore as an option. So this options could also be removed (crop_rects_text).

@kreuzberger I finalized the PR. Apart from the above-mentioned change, I adjusted the rect model that only one textbox at maximum can be in the rect. In this case, only the text covered in the rect is extracted to a newly instantiated textbox as I don't know what is the best way to address the lt_textboxes which are overflowed the rect, so it is the simplest solution from our side. What do you think?

The following is to summarize the changes in this PR based on your commit. If you are happy with this PR, do you mind running it against your test case and let us know if we should merge it.

  • remove crop_rects_text as the text covered by the rect is considered as paragraphs or chapters
  • extract the text covered by the rect to a newly instantiated textbox.

@kreuzberger
Copy link
Contributor

Hi! I tried to test the branch, and it failed.
It seems that the sphinx-simplepdf files itself will fail in the initial parse. This was fixed in the original pr

pdf = <pdfplumber.pdf.PDF object at 0x7fb9e9f3a350>

    def get_named_destination(pdf):  # pylint: disable=too-many-branches
        """
        Extract Name destination catalog.
    
        Extracts Name destination catalog (link target) from pdf.doc.catalog['Name'] to obtain
        the coordinates (x,y) and page for the corresponding destination's name.
    
        PDFPlumber does not provide explict 'Named Destinations of Document Catalog' like py2pdf, so it needs to be obtained
        by resolving the hierarchical indirect objects.
    
        The first step in this function is to check if the name destination exist in the PDF. If it does not, no extraction
        is executed.
    
        :param pdf: pdf object of pdfplumber.pdf.PDF
        :return: named destination dictionary mapping reference of destination by name object
        """
        LOG.info("Catalog extraction: name destination ...")
    
        # check if name tree exist in catalog and extract name tree
        name_tree = {}
        named_destination = {}
        pdf_catalog = pdf.doc.catalog
        if "Names" in pdf_catalog:
            # PDF 1.2
            if (
                isinstance(pdf_catalog["Names"], PDFObjRef)
                and "Dests" in pdf_catalog["Names"].resolve()
            ):
                name_tree = pdf_catalog["Names"].resolve()["Dests"].resolve()
            elif isinstance(pdf_catalog["Names"], dict) and "Dests" in pdf_catalog["Names"]:
>               name_tree = pdf_catalog["Names"]["Dests"].resolve()
E               AttributeError: 'dict' object has no attribute 'resolve

The orignal code in the PR was:

   if 'Names' in pdf_catalog:
        # PDF 1.2
        if isinstance(pdf_catalog['Names'], PDFObjRef) and 'Dests' in pdf_catalog['Names'].resolve():
            name_tree = pdf_catalog['Names'].resolve()['Dests'].resolve()
        elif isinstance(pdf_catalog['Names'], dict) and 'Dests' in pdf_catalog['Names']:
            name_tree = resolve1(pdf_catalog['Names']['Dests'])
            # name_tree = pdf_catalog['Names']['Dests'].resolve()
            # LOG.debug(f"{name_tree}")

@juiwenchen Suggestion: we add a test for it in this repository on this branch?
test_rects_extraction.pdf

by adding tests/test_rects.py and test this suggested pdf?

And how can i support this?

@juiwenchen
Copy link
Contributor Author

juiwenchen commented Jan 23, 2024

PDF 1.2

@kreuzberger Interesting. I will have a quick fix from your original PR, and then we can merge this PR first. Afterwards, you can create a PR to add the test case for test_rects_extraction.pdf. What do you think

@juiwenchen juiwenchen linked an issue Jan 23, 2024 that may be closed by this pull request
@kreuzberger
Copy link
Contributor

kreuzberger commented Jan 23, 2024

i would suggest to add a testcase before merge to ensure the file could be opened
Could be simple. Here are the file contents:

"""Test rects extraction."""
from click.testing import CliRunner

import libpdf
from tests.conftest import (
    PDF_RECTS_EXTRACTION,
)

def test_rects_extraction():
    objects = libpdf.load(PDF_RECTS_EXTRACTION)
    assert objects.flattened.rects is not None

I would then add tests on a new PR. But also ok if i do it all later.

While running the tests i got severe problems executing them:

wand.exceptions.PolicyError: attempt to perform an operation not allowed by the security policy

I have to patch as superuser my etc/configs to get it running! See https://bugs.archlinux.org/task/60580
Was this due to changes in the code / libraries used? I didnt run into those problems before...

@kreuzberger
Copy link
Contributor

@juiwenchen ok, after latest update the test runs locally! You can merge, i add the tests later 😄 👍

@juiwenchen juiwenchen merged commit aa2999f into master Jan 23, 2024
15 checks passed
@ubmarco ubmarco deleted the rect-model branch January 24, 2024 08:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Color Information for Paragraphs
3 participants