Rect model #30
- textboxes are not excluded for rects
- add rect model
- extract rect
Hi! Thanks for the integration! I found a small issue in core.py.
It only affects logging, so it is very minor.
Good change, seems to work in a small test locally.
Please run tox -e format
libpdf/core.py
Outdated
LOG.info('Extract tables: %s', 'no' if no_tables else 'yes')
LOG.info('Extract figures: %s', 'no' if no_figures else 'yes')
LOG.info('Extract rects: %s', 'no' if no_rects else 'yes')
LOG.info('Text rects crop: %s', 'no' if crop_rects_text else 'no')
As @kreuzberger already mentioned, is this a typo? Suggested change:
- LOG.info('Text rects crop: %s', 'no' if crop_rects_text else 'no')
+ LOG.info('Text rects crop: %s', 'yes' if crop_rects_text else 'no')
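One way to make this kind of typo harder to write is to route every flag through a single helper. The following is only a sketch and not code from libpdf: `yes_no` and `log_options` are hypothetical names, and only the flag names and log messages are taken from the diff above.

```python
import logging

LOG = logging.getLogger(__name__)


def yes_no(flag: bool) -> str:
    """Render a boolean as 'yes' or 'no' for log output."""
    return 'yes' if flag else 'no'


def log_options(no_tables: bool, no_figures: bool, no_rects: bool, crop_rects_text: bool) -> None:
    """Log the extraction options (hypothetical helper, not part of libpdf)."""
    # The no_* flags are negative switches, so they are inverted before rendering.
    LOG.info('Extract tables: %s', yes_no(not no_tables))
    LOG.info('Extract figures: %s', yes_no(not no_figures))
    LOG.info('Extract rects: %s', yes_no(not no_rects))
    # crop_rects_text is a positive switch and is rendered as-is.
    LOG.info('Text rects crop: %s', yes_no(crop_rects_text))
```

With a helper like this, both branches of the conditional live in one place, so a line cannot print 'no' for both values of a flag.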
libpdf/extract.py
Outdated
rect_path = os.path.abspath(os.path.join(figure_dir, rect_name))

#figure = Figure(idx_figure + 1, image_path, fig_pos, links, textboxes, 'None')
commented code
Please rebase to check the new format-check tox env.
@kreuzberger Sorry, I didn't pay attention to the flag
@juiwenchen I think this is a good concept: rects should not be excluded from chapters and paragraphs by default. This option was added because the old figure handling behaved differently (figure text was removed from paragraphs and chapters), so I wanted it as an option. The option (crop_rects_text) could therefore also be removed.
docs/contents/pdf_model.puml
Outdated
Paragraph "+b_source 1" *-- "+links *" Link
Figure "+b_source 1" *-- "+links *" Link
Cell "+b_source 1" *-- "+links *" Link
Rect "+b_source 1" *-- "+links *" Link
I think Rect cannot be a target of a link. It's just a graphical element.
If we do a target search for links, I think what people want is a paragraph, table or figure.
So if a Rect contains text, it is also a paragraph, and the algorithm will find it in the search area.
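The idea in this comment can be illustrated with a small sketch. It is not libpdf's actual target-search code: the `Element` class, the `kind` strings and `find_link_targets` are hypothetical stand-ins, assuming each element carries a bounding box.

```python
from dataclasses import dataclass
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

# Assumption: only these element kinds are meaningful link targets.
LINK_TARGET_KINDS = {'paragraph', 'table', 'figure'}


@dataclass
class Element:
    """Hypothetical stand-in for an extracted PDF element."""
    kind: str
    bbox: BBox


def overlaps(a: BBox, b: BBox) -> bool:
    """Return True if two axis-aligned boxes share a non-empty area."""
    return a[0] < b[2] and a[2] > b[0] and a[1] < b[3] and a[3] > b[1]


def find_link_targets(elements: List[Element], search_area: BBox) -> List[Element]:
    """Return link-target candidates in the search area, skipping rects."""
    return [
        e for e in elements
        if e.kind in LINK_TARGET_KINDS and overlaps(e.bbox, search_area)
    ]
```

A rect that contains text would still be reachable through the paragraph extracted from the same region, so leaving rects out of the candidate kinds loses nothing.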
- remove crop_rects_text flag
- text within the rect is extracted
@kreuzberger I finalized the PR. Apart from the above-mentioned change, I adjusted the rect model so that a rect can contain at most one textbox. In this case, only the text covered by the rect is extracted into a newly instantiated textbox. I don't know the best way to handle lt_textboxes that overflow the rect, so this is the simplest solution from our side. What do you think? The following summarizes the changes in this PR based on your commit. If you are happy with this PR, would you mind running it against your test case and letting us know whether we should merge it?
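The "at most one textbox per rect" behaviour described above might look roughly like the sketch below. It is not the code from this PR: `Char` and `text_in_rect` are hypothetical, and the rule that a character counts as covered when its centre lies inside the rect is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Char:
    """Hypothetical stand-in for a single extracted character."""
    text: str
    bbox: BBox


def center_inside(inner: BBox, outer: BBox) -> bool:
    """Return True if the centre of `inner` lies within `outer` (assumed coverage rule)."""
    cx = (inner[0] + inner[2]) / 2
    cy = (inner[1] + inner[3]) / 2
    return outer[0] <= cx <= outer[2] and outer[1] <= cy <= outer[3]


def text_in_rect(chars: List[Char], rect_bbox: BBox) -> Optional[str]:
    """Collect the covered characters into one text; None if the rect holds no text."""
    covered = [c.text for c in chars if center_inside(c.bbox, rect_bbox)]
    return ''.join(covered) if covered else None
```

Characters from a textbox that overflows the rect are simply dropped here, which matches the "simplest solution" framing rather than any attempt to split or reflow the overflowing textbox.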
Hi! I tried to test the branch, and it failed.
The original code in the PR was:
@juiwenchen Suggestion: should we add a test for it in this repository on this branch, by adding tests/test_rects.py and testing the suggested PDF? And how can I support this?
@kreuzberger Interesting. I will apply a quick fix from your original PR, and then we can merge this PR first. Afterwards, you can create a PR to add the test case for
I would suggest adding a test case before the merge to ensure the file can be opened.
I would then add tests in a new PR, but it is also OK if I do it all later. While running the tests I hit severe problems executing them:
I have to patch my etc/configs as superuser to get them running! See https://bugs.archlinux.org/task/60580
@juiwenchen OK, after the latest update the tests run locally! You can merge; I will add the tests later 😄 👍
@kreuzberger I cherry-picked your branch upgrade into this PR in order to make the change atomic. The credit for this rect feature goes to you. #25