ENH: consider images inside PDF made with onlyoffice #2637

0xNath · 2024-05-11T21:52:24Z

Added code to detect patterns in "_get_ids_image".
To avoid any conflicts with images that could be located directly in a page or images using the same ID in different patterns, images ids under patterns are returned in this form : "/Pattern/patternNameHere/imageNameHere"

Added code to deal with Pattern images in "_get_image".

Added a new test to verify ids and image data returned related to the modifications done above.

0xNath · 2024-05-11T22:05:17Z

I made a PR to add the sample files :
py-pdf/sample-files#30

Instead of putting the files directly into this repository, should I host the files on github or somewhere else so they get downloaded ?

stefan6419846 · 2024-05-12T08:21:26Z

Instead of putting the files directly into this repository, should I host the files on github or somewhere else so they get downloaded?

In general, files should go into the sample files if you own all copyrights and are okay with the license terms there. Unfortunately, merging permissions are more restricted over there, thus I am not able to merge anything there myself.

The usual approach we take apart from this is to upload to an issue or PR and download the corresponding files from the GitHub URLs. Unfortunately, this will not work for image files any more as far as I know as GitHub started to add JWTs to their URLs which expire accordingly and are user-specific.

pubpub-zz · 2024-05-12T08:39:11Z

Can you rename the title of your PR : this is not a a STYle but an ENHancement.
You should also edit your first thread to set you "closes " info to ease closing

you should be able to "factorize" your code to use the code for image extraction. the current code looking for "/Resources/Xobjects/'Forms'" should be very similar to "/Resources/Patterns/" adjusting x_object

You should also confirm your code works with patterns within forms

Also, It might be interesting to extract the thumbnail and confirm that no other images are ignored

0xNath · 2024-05-12T08:56:30Z

Instead of putting the files directly into this repository, should I host the files on github or somewhere else so they get downloaded?

In general, files should go into the sample files if you own all copyrights and are okay with the license terms there. Unfortunately, merging permissions are more restricted over there, thus I am not able to merge anything there myself.

The usual approach we take apart from this is to upload to an issue or PR and download the corresponding files from the GitHub URLs. Unfortunately, this will not work for image files any more as far as I know as GitHub started to add JWTs to their URLs which expire accordingly and are user-specific.

I reused images that were already present in the repo so it should be good in term of rights. The issue is that only office convert images to jpeg when included inside a document, so I had to re-upload them once converted to pass content verification in the test unit.

Let's wait to see if my PR on the other repo get merged then, thanks.

0xNath · 2024-05-12T09:09:06Z

Can you rename the title of your PR : this is not a a STYle but an ENHancement. You should also edit your first thread to set you "closes " info to ease closing

I was not sure which one to use, thanks.

you should be able to "factorize" your code to use the code for image extraction. the current code looking for "/Resources/Xobjects/'Forms'" should be very similar to "/Resources/Patterns/" adjusting x_object

You're right, I didn't thought about it, let's do that.

You should also confirm your code works with patterns within forms

It will probably not in the current state, I'll check what are forms, thank you.

Also, It might be interesting to extract the thumbnail and confirm that no other images are ignored

Yes I agree. I'll need to read a bit more the PDF standard, currently I've only read the part about patterns and image objects so I'm lacking a lot about how images object are used/located. Maybe it would be nicer to have an issue for each case where images are missing and have their matching PR ?

0xNath · 2024-05-17T16:39:36Z

I have tested onlyoffice forms and images inside it are not taken into account.

Images inside PDF from libreoffice without and with forms are taken into account, as well as images from word forms.

pubpub-zz · 2024-05-17T17:58:00Z

@0xNath
Can you please save your test files in this thread and modify your PR to get them from the URLs. for the name, use iss2613a.pdf, iss2613b.pdf, ...

0xNath · 2024-05-17T18:16:18Z

I'll add the code for extracting the images from the form PDF and the corresponding test. I'm going to follow the same logic as for the patterns, but with "/Annots/.../..." as image identifiers

iss2613-onlyoffice-form.pdf
iss2613-onlyoffice-standardImages.pdf

0xNath · 2024-05-17T18:25:12Z

Images to test against

codecov · 2024-05-17T20:56:26Z

Codecov Report

Attention: Patch coverage is 92.00000% with 4 lines in your changes missing coverage. Please review.

Project coverage is 95.10%. Comparing base (582557e) to head (cc1600d).

Files	Patch %	Lines
pypdf/_page.py	92.00%	0 Missing and 4 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2637      +/-   ##
==========================================
- Coverage   95.12%   95.10%   -0.03%     
==========================================
  Files          51       51              
  Lines        8557     8593      +36     
  Branches     1707     1725      +18     
==========================================
+ Hits         8140     8172      +32     
  Misses        263      263              
- Partials      154      158       +4

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Added code to detect patterns in "_get_ids_image". To avoid any conflicts with images that could be located directly in a page or images using the same ID in differents patterns, images ids under patterns are returned in this form : "/Pattern/patternNameHere/imageNameHere" Added code to deal with Pattern images in "_get_image".

fixed code style

Fix typo

commit d0493ae Author: Nathanaël Renaud <perso@renaudna.fr> Date: Fri May 17 23:25:13 2024 +0200 Modified _get_ids_image and _get_image so they work with onlyoffice images commit 53a3781 Author: Nathanaël Renaud <perso@renaudna.fr> Date: Fri May 17 23:22:27 2024 +0200 Added tests units about images extractions from PDF generated using onlyoffice

MartinThoma · 2024-06-23T09:17:48Z

Let's wait to see if my PR on the other repo get merged then, thanks.

@0xNath I don't see a PR in https://github.com/py-pdf/sample-files

MartinThoma · 2024-06-23T09:18:11Z

Unfortunately, merging permissions are more restricted over there, thus I am not able to merge anything there myself.

@stefan6419846 I'll give you the permissions in a second :-)

0xNath · 2024-06-23T09:21:05Z

Let's wait to see if my PR on the other repo get merged then, thanks.

@0xNath I don't see a PR in https://github.com/py-pdf/sample-files

Hello,
I removed my PR on the samples repo and used the files in this conversation instead

stefan6419846 · 2024-06-25T08:02:25Z

I am currently trying to understand what the current state of this PR this:

There currently is a merge conflict which should be easy to solve.
Are there any benefits of using ENH: add decode_as_image() to ContentStreams #2615 here?
What is the current state of the remarks from ENH: consider images inside PDF made with onlyoffice #2637 (comment)?

…-patterns

0xNath · 2024-08-03T15:29:50Z

Hey

I am currently trying to understand what the current state of this PR this:
* There currently is a merge conflict which should be easy to solve.

It's done.

* Are there any benefits of using [ENH: add decode_as_image() to ContentStreams #2615](https://github.com/py-pdf/pypdf/pull/2615) here?

I don't think PageObject.images and decode_as_image have the same purpose, and consequent changes would need to be done in the first one to implement the second one in it, with no real advantages in my opinion.

decode_as_image is great when an image is not covered by PageObject.images.

PageObject.images should be expended to improve cases were images are not detected so that user don't have to rely on decode_as_image.

* What is the current state of the remarks from [ENH: consider images inside PDF made with onlyoffice #2637 (comment)](https://github.com/py-pdf/pypdf/pull/2637#issuecomment-2106168376)?

Can you rename the title of your PR : this is not a a STYle but an ENHancement.

This is done.

you should be able to "factorize" your code to use the code for image extraction. the current code looking for "
Resources/Xobjects/'Forms'" should be very similar to "/Resources/Patterns/" adjusting x_object

This as already been done in commit 0e81eee if I'm not mistaken.

You should also confirm your code works with patterns within forms

This has been implemented as well with the test that goes with it.

Also, It might be interesting to extract the thumbnail

In my opinion thumbnails should be extracted by another method than PageObject.images, like PageObject.thumbnail.

and confirm that no other images are ignored

I had tried other softwares like Word and Libreoffice, and they were fine, now OnlyOffice should be fine too. Maybe there are other special case of images that we still don't cover but that's gonna be it for my PR. If i ever encounter those other case I will probably submit issue and PR if it's at my reach.

pubpub-zz · 2024-08-04T10:34:33Z

pypdf/_page.py

+ if PG.ANNOTS in obj:
+ for annot in cast(DictionaryObject, obj[PG.ANNOTS]):


I would more likey propose

Suggested change

if PG.ANNOTS in obj:

for annot in cast(DictionaryObject, obj[PG.ANNOTS]):

for annot_idx,annot in enumerate(cast(DictionaryObject, obj.get(PG.ANNOTS, {}))):

annot_idx should be use to name of the image instead of ["/T"]
Also I would prefer to parse annotation after processing the page original resources in order to keeps "direct" images and inlines before annots'.

pubpub-zz · 2024-08-04T10:39:04Z

pypdf/_page.py

+
+ if PG.ANNOTS in obj:
+ for annot in cast(DictionaryObject, obj[PG.ANNOTS]):
+ if (


add:

annot = annot.get_object()

and simplify your code below (specially the .keys() should not be required
I agree here the wallrus operator would ease readability 😉

walrus can now be used as python 3.7 is over. if it helps you

pubpub-zz · 2024-08-04T10:52:47Z

pypdf/_page.py

+ DictionaryObject, annot["/AP"]["/N"][PG.RESOURCES][RES.XOBJECT]
+ )
+ ):
+ frame = annot["/AP"]["/N"][PG.RESOURCES][RES.XOBJECT]["/FRM"]


I do not understand the ["\FRM"]
I've produced a test with a stamp and the image is called ["\Icon"]:
tttt.pdf

pubpub-zz · 2024-08-04T10:54:42Z

pypdf/_page.py

+ )
+ ):
+ frame = annot["/AP"]["/N"][PG.RESOURCES][RES.XOBJECT]["/FRM"]
+


The best would be to call recursively _get_image_ids()

pubpub-zz · 2024-08-04T10:55:16Z

pypdf/_page.py

+ lst.extend(
+ [
+ (
+ f"{PG.ANNOTS}/{annot['/T']}{o}"


see comment about naming

pubpub-zz · 2024-08-04T11:26:15Z

tests/example_files.yaml

@@ -112,5 +112,7 @@
 url: https://github.com/py-pdf/pypdf/files/12050253/tt.pdf
 - local_filename: Pesquisa-de-Precos-Combustiveis-novembro-2023.pdf
 url: https://www.joinville.sc.gov.br/wp-content/uploads/2023/11/Pesquisa-de-Precos-Combustiveis-novembro-2023.pdf
- local_filename: iss2138.pdf


why these lines dissaperared?

pubpub-zz · 2024-08-04T11:26:58Z

tests/test_images.py

@@ -214,6 +214,66 @@ def test_image_extraction(src, page_index, image_key, expected):
 assert image_similarity(BytesIO(actual_image.data), expected) >= 0.99


+@pytest.mark.enable_socket()
+def test_onlyoffice_standard_images_extraction():
+ reader = PdfReader(


a reference to the issue numbers would be apreciated to ease later analysis

pubpub-zz · 2024-08-04T11:28:55Z

tests/test_images.py

+ str(reader.pages[0].images)
+ == "[Image_0=/Pattern/P1/X1, Image_1=/Pattern/P2/X1, Image_2=/Pattern/P3/X1]"
+ )


Suggested change

str(reader.pages[0].images)

== "[Image_0=/Pattern/P1/X1, Image_1=/Pattern/P2/X1, Image_2=/Pattern/P3/X1]"

)

reader.pages[0].images

== ["Image_0":"/Pattern/P1/X1", "Image_1":"/Pattern/P2/X1", "Image_2"="/Pattern/P3/X1"]

)

pubpub-zz · 2024-08-04T11:32:01Z

tests/test_images.py

+@pytest.mark.enable_socket()
+def test_onlyoffice_standard_images_extraction():
+ reader = PdfReader(
+ BytesIO(get_data_from_url(name="iss2613-onlyoffice-standardImages.pdf"))


as you are having url/name defined locally I recommend to do the same for the pdf and not change the yaml file above

pubpub-zz · 2024-08-04T11:32:21Z

tests/test_images.py

+
+@pytest.mark.enable_socket()
+def test_onlyoffice_form_images_extraction():
+ reader = PdfReader(BytesIO(get_data_from_url(name="iss2613-onlyoffice-form.pdf")))


pubpub-zz · 2024-08-14T16:30:06Z

@0xNath do you know when you will be able to complete the PR ?

0xNath · 2024-08-14T17:37:36Z

@0xNath do you know when you will be able to complete the PR ?

Hey, my apologies for not replying, thank you a lot for all those advices

I will tackle this down this weekend.

0xNath · 2024-08-18T18:24:29Z

Sorry, I have not found the motivation to work on it this weekend...

pubpub-zz · 2024-09-17T21:17:01Z

@0xNath
any time to finish your PR ?

0xNath · 2024-09-25T21:04:55Z

@0xNath
any time to finish your PR ?

I have the time but I lack motivation. I have started to get back on the little project I have with this lib so it will get back eventually

pubpub-zz · 2024-09-25T21:12:31Z

@0xNath

Heigh ho, Heigh ho, it's on PR to work we go!

take courage😉, I will be happy to see you as a contributor

0xNath marked this pull request as draft May 11, 2024 22:05

0xNath changed the title ~~STY: consider images inside patterns~~ ENH: consider images inside patterns May 12, 2024

0xNath marked this pull request as ready for review May 17, 2024 16:19

0xNath marked this pull request as draft May 17, 2024 16:38

0xNath changed the title ~~ENH: consider images inside patterns~~ ENH: consider images inside PDF made with onlyoffice May 17, 2024

Nathanaël Renaud and others added 12 commits May 17, 2024 23:09

Update _page.py

557d77d

fixed code style

Update _page.py

524ffc3

fixed code style

fix code style...

46adbe1

fix typing

76b545f

fixed typo

6edfa2d

refactored code

0e81eee

fixed typo

b6ba820

Add test for forms and standard onlyoffice pages

07752a1

Fix decorator for form test

a5b980b

Fix test for form

a23b272

Add Annot parsing for onlyoffice forms

d524672

Fix typo

0xNath force-pushed the ENH-#2613-integration-of-images-inside-patterns branch from fcb3b1f to d524672 Compare May 17, 2024 21:09

0xNath marked this pull request as ready for review May 17, 2024 21:16

Nathanaël Renaud and others added 3 commits August 3, 2024 15:43

Try fix issue with codecov and PERF401 ruff warning

e5585d7

Merge branch 'main' into ENH-py-pdf#2613-integration-of-images-inside…

88d2223

…-patterns

Ruff not happy about whitespaces

cc1600d

pubpub-zz reviewed Aug 4, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: consider images inside PDF made with onlyoffice #2637

ENH: consider images inside PDF made with onlyoffice #2637

0xNath commented May 11, 2024 •

edited

Loading

0xNath commented May 11, 2024

stefan6419846 commented May 12, 2024 •

edited

Loading

pubpub-zz commented May 12, 2024

0xNath commented May 12, 2024

0xNath commented May 12, 2024

0xNath commented May 17, 2024 •

edited

Loading

pubpub-zz commented May 17, 2024

0xNath commented May 17, 2024

0xNath commented May 17, 2024

codecov bot commented May 17, 2024 •

edited

Loading

MartinThoma commented Jun 23, 2024

MartinThoma commented Jun 23, 2024

0xNath commented Jun 23, 2024

stefan6419846 commented Jun 25, 2024

0xNath commented Aug 3, 2024

pubpub-zz Aug 4, 2024 •

edited by stefan6419846

Loading

pubpub-zz Aug 4, 2024

pubpub-zz Aug 14, 2024

pubpub-zz Aug 4, 2024

pubpub-zz Aug 4, 2024

pubpub-zz Aug 4, 2024

pubpub-zz Aug 4, 2024

pubpub-zz Aug 4, 2024

pubpub-zz Aug 4, 2024

pubpub-zz Aug 4, 2024

pubpub-zz Aug 4, 2024

pubpub-zz commented Aug 14, 2024

0xNath commented Aug 14, 2024

0xNath commented Aug 18, 2024

pubpub-zz commented Sep 17, 2024

0xNath commented Sep 25, 2024

pubpub-zz commented Sep 25, 2024

		if PG.ANNOTS in obj:
		for annot in cast(DictionaryObject, obj[PG.ANNOTS]):

	if PG.ANNOTS in obj:
	for annot in cast(DictionaryObject, obj[PG.ANNOTS]):
	for annot_idx,annot in enumerate(cast(DictionaryObject, obj.get(PG.ANNOTS, {}))):

ENH: consider images inside PDF made with onlyoffice #2637

Are you sure you want to change the base?

ENH: consider images inside PDF made with onlyoffice #2637

Conversation

0xNath commented May 11, 2024 • edited Loading

0xNath commented May 11, 2024

stefan6419846 commented May 12, 2024 • edited Loading

pubpub-zz commented May 12, 2024

0xNath commented May 12, 2024

0xNath commented May 12, 2024

0xNath commented May 17, 2024 • edited Loading

pubpub-zz commented May 17, 2024

0xNath commented May 17, 2024

0xNath commented May 17, 2024

codecov bot commented May 17, 2024 • edited Loading

Codecov Report

MartinThoma commented Jun 23, 2024

MartinThoma commented Jun 23, 2024

0xNath commented Jun 23, 2024

stefan6419846 commented Jun 25, 2024

0xNath commented Aug 3, 2024

pubpub-zz Aug 4, 2024 • edited by stefan6419846 Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pubpub-zz commented Aug 14, 2024

0xNath commented Aug 14, 2024

0xNath commented Aug 18, 2024

pubpub-zz commented Sep 17, 2024

0xNath commented Sep 25, 2024

pubpub-zz commented Sep 25, 2024

0xNath commented May 11, 2024 •

edited

Loading

stefan6419846 commented May 12, 2024 •

edited

Loading

0xNath commented May 17, 2024 •

edited

Loading

codecov bot commented May 17, 2024 •

edited

Loading

pubpub-zz Aug 4, 2024 •

edited by stefan6419846

Loading