Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merging of rotated pages leads to unexpected results #1280

Closed
Nikita7x opened this issue Aug 26, 2022 · 38 comments
Closed

Merging of rotated pages leads to unexpected results #1280

Nikita7x opened this issue Aug 26, 2022 · 38 comments
Labels
Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF

Comments

@Nikita7x
Copy link

Nikita7x commented Aug 26, 2022

When I try to get width and height of the page, I encounter a problem: width and height are misplaced in SOME (but not all!) of the PDFs. The page looks like completely normal A4 portrait document, but using reader.pages[0].mediabox returns wrong dimensions [0, 0, 841.68, 595.2] instead of [0, 0, 595.2, 841.68]

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
# macOS-12.5.1-arm64-arm-64bit
# and also on every system on docker (Ubuntu, CentOS)

$ python -c "import PyPDF2;print(PyPDF2.__version__)"
# 2.10.3

Code + PDF

This is a minimal, complete example that shows the issue:

from PyPDF2 import PdfReader

reader = PdfReader(path)
dimensions = reader.pages[0].mediabox

width = dimensions.width
height = dimensions.height

dim.pdf

I removed sensitive data.

Output

Xnapper-2022-08-26-12 11 02

@pubpub-zz
Copy link
Collaborator

When you check all the entries in the page, you have:
{'/Type': '/Page', '/Parent': IndirectObject(2, 0, 2357934091920), '/Resources': IndirectObject(4, 0, 2357934091920), '/Contents': IndirectObject(3, 0, 2357934091920), '/MediaBox': [0, 0, 841.68, 595.2], '/Rotate': 270}
The page is rotated at display. the MediaBox reported is correct

@Nikita7x
Copy link
Author

The page is rotated at display. the MediaBox reported is correct
@pubpub-zz

Is there a way to "rotate" it correctly? Because If I merge this PDF with my stamp, it will be misplaced and rotated. But still, the PDF itself is looking as it should, portrait orientation, no any rotation

@Nikita7x
Copy link
Author

Nikita7x commented Aug 26, 2022

Because this is what happens now with this PDF
Снимок экрана 2022-08-26 в 14 44 56

@pubpub-zz
Copy link
Collaborator

From my point of view you issue is with the stamp file.
My first thoughts :
a) try to change your paper orientation at a different position (at 'printer' level for example)
b) create a small page just to fit the stamp with a non standard file, and then merge it using translations (https://pypdf2.readthedocs.io/en/latest/user/cropping-and-transforming.html)
c) use your existing file and find the adequate transformation before merging/rotating

@MartinThoma MartinThoma added the Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests label Aug 27, 2022
@MartinThoma MartinThoma changed the title Wrong page dimensions in MediaBox Merging of rotated pages leads to unexpected results Aug 27, 2022
@MartinThoma
Copy link
Member

@Nikita7x Thank you for reporting this issue 🙏

I took the freedom to adjust your code so that it uses the new syntax. You might want to check the deprecation warnings (python -Wall yourscript.py).

I've also adjusted the title of the issue as I think it describes the issue better. Do you think so as well or did I get something wrong? If the new title is ok, maybe you can adjust the example code (I guess your're using something similar to https://pypdf2.readthedocs.io/en/latest/user/add-watermark.html )

I fully understand why the current behavior of PyPDF2 is not as expected. At the moment, PyPDF2 is essentially doing the simplest way. I think we could (and likely should) make PyPDF2 behave as you expect, but we need to make sure that we don't break existing code if we adjust it.

@MartinThoma
Copy link
Member

MartinThoma commented Aug 27, 2022

By the way: I love your schematic image for the current stamp behavior 😍 Did you create them yourself? May we use them in the official docs?

@MartinThoma
Copy link
Member

@pubpub-zz We could add an orientation property to the boxes (mediabox, cropbox, bleedbox, trimbox, artbox). That would simply be the orientation of the page itself. Adding this property would come with almost no computational / memory / maintenance cost, but it might make this error source easier to spot for users. I'm undecided what I think about the idea myself... what do you think about it?

@Nikita7x
Copy link
Author

From my point of view you issue is with the stamp file.

I thought so, but it turns out, it's not about the stamp, it's about the first document. What I've tried also is:

  • Rotating main document (using .rotate, .rotateCounterClockwise)
  • Rotating the stamp file (using .rotate, .rotateCounterClockwise)
  • Rotating main document and the stamp file

The result is always the same.
Снимок экрана 2022-08-27 в 14 24 11

@Nikita7x
Copy link
Author

As far as I understand the origin of these files: they are being put into a scanner, but with wrong rotation. If you had an experience scanning documents, then you probably remember the guides on the scanning plate, that asks to lay paper vertically. So, some people lay paper down horizontally and after the scanning procedure it appears to be landscape. So they rotate it using software. And after that PDF “looks” normal: vertical (portrait) orientation, but on deeper lever it is still a landscape one. And when I use PyPDF2, it grabs information not from “upper level” of this document, but from “deeper level”, what causes these problems.
All of this is only my suspicion and deduction.

@Nikita7x
Copy link
Author

Did you create them yourself? May we use them in the official docs?

Yeah, I made it myself in Sketch. Of course, feel free to use it everywhere

@Nikita7x
Copy link
Author

Nikita7x commented Aug 27, 2022

@MartinThoma Apparently the business I'm working at really relies on PyPDF2 and stamping ability. Maybe I can help somehow to resolve this issue? What can I do so that stamp will be applied in place in any document "rotation"? @pubpub-zz mentioned that

'/Rotate': 270
The page is rotated at display.

So, what can I do with this Rotate parameter? It seems like applying .rotate() (.rotateClockwise) rotates it, but after .merge everything is as messed up as it was, just rotated.

@MartinThoma
Copy link
Member

@Nikita7x Would you mind sharing the "Rotated PDF" (without the text) and the "Stamp" image in a bit higher resolution with me? I will create some examples + add them to the docs.

I also get confused by this and I need to give it a try. But it's important to note that PDF has two mechanisms for rotating stuff: https://pypdf2.readthedocs.io/en/latest/user/cropping-and-transforming.html#page-rotation

@Nikita7x
Copy link
Author

@MartinThoma to make a "Rotated PDF" I simply use .rotate() function. So you can download original PDF and apply .rotate() function.
And here comes the Stamp, this is the exact file I use. I just removed private data
stamp.pdf

@MartinThoma
Copy link
Member

And here comes the Stamp, this is the exact file I use. I just removed private data

I was talking about the sketch images :-) You could share them just as PNG; I can convert them to PDF and show the effects

@Nikita7x
Copy link
Author

archive.zip

@carecavoador
Copy link

Nice! I think I have an issue related to this topic. So far, I understood that when I use .rotate() only the page display is rotated. That's the reason it doesn't work when merged. But I haven't figured out how to make it work.

@MartinThoma
Copy link
Member

It seems as if there is a bug and thus the solution I proposed doesn't work: #1301

@carecavoador
Copy link

Thank you! I wish I could contribute to solve the issue, but my python knowledge is not enough so far. But I really appreciate the effort you and all the contributors are putting in this project.

@Nikita7x
Copy link
Author

Nikita7x commented Sep 6, 2022

@MartinThoma how's it going? Do you know maybe how can I fix this issue myself? My work really relies on that functionality. Maybe I need to rotate it differently somehow?

@pubpub-zz
Copy link
Collaborator

@Nikita7x
Under Analysis

@MartinThoma
Copy link
Member

@Nikita7x I'm out of ideas for this one (and currently I only have the energy for simple changes in PyPDF2; this one needs more than I can give at the moment)

But you're lucky - @pubpub-zz is the top contributor to PyPDF2 :-)

pubpub-zz added a commit to pubpub-zz/pypdf that referenced this issue Sep 10, 2022
Error detected during analysis of py-pdf#1280
@pubpub-zz
Copy link
Collaborator

pubpub-zz commented Sep 10, 2022

this are the test files I've used:
Stamp files
MyStampFin.pdf : vertical page with no /Rotate
MyStampFin90a.pdf : vertical displayed, landscape layout with /Rotate =90

Page to be stamped
0004.pdf : Layout page
998719.pdf : Vertical page

the use of page.rotate() is not adequate as the flag is not taken into account in the merging.
we have to apply transformation before merging. Many points have to be considered for that:

  • an error has been found in in transformation.ctm (fixed in FIX : Fix Error in transformations #1341)
  • be carefull about signs /Rotate is clockwise positive where as transformation are trigonometric (counterclockwise positive)
  • rotate transformation center is at 0,0 therefore you need to translate before and after
  • the (visual) origin 0,0 is normally on lower/left corner but when /Rotate exists the center moves to upper/left (Rotate=90), upper/right(Rotate =180), lower/right (Rotate=270) and be aware that the axis are turning simultaneously
  • when applying a rotation, the cropping area is introduced in content. To prevent that the easiest is to set expand to True

with the following code, I get:

import PyPDF2

reader = PyPDF2.PdfReader("e:/Downloads/MyStampFin.pdf")
st = r.pages[0]
writer = PyPDF2.PdfWriter()
org = PyPDF2.PdfReader("e:/Downloads/0004.pdf")
page = org.pages[0]
st.add_transformation(
    PyPDF2.Transformation()
    .translate(-float(st.mediabox.getWidth()) / 2, -float(st.mediabox.getHeight()) / 2)
    .rotate(90)
    .translate(float(st.mediabox.getHeight()) / 2, float(st.mediabox.getWidth()) / 2),
    True,
)
page.merge_page(st)
writer.add_page(page)
with open("e:/Downloads/MyStamp0a.pdf", "wb") as fo:
    writer.write(fo)

image

This can also be applied with the stamp with /Rotate=90 (note the rotation with is reversed:

import PyPDF2

reader = PyPDF2.PdfReader("e:/Downloads/MyStampFin90a.pdf")
st = r.pages[0]
writer = PyPDF2.PdfWriter()
org = PyPDF2.PdfReader("e:/Downloads/998719.pdf")
page = org.pages[0]
st.add_transformation(
    PyPDF2.Transformation()
    .translate(-float(st.mediabox.getWidth()) / 2, -float(st.mediabox.getHeight()) / 2)
    .rotate(-90)
    .translate(float(st.mediabox.getHeight()) / 2, float(st.mediabox.getWidth()) / 2),
    True,
)
page.merge_page(st)
writer.add_page(page)
with open("e:/Downloads/MyStamp90a.pdf", "wb") as fo:
    writer.write(fo)

image

So @Nikita7x you should be able to get a quick solution

@MartinThoma, @MasterOdin
In order to improve the API I propose 2 new functions:
a) add a page.get_rotate() function : to retrieve the page /Rotate value
b) add a page.transfer_rotate_to_content() function : to transfer the /Rotate flag into content (apply the good translation/rotations, change MediaBox/... and reset the translation field

Your opinion

MartinThoma pushed a commit that referenced this issue Sep 10, 2022
Error detected during analysis of #1280
@Nikita7x
Copy link
Author

@pubpub-zz thanks a lot for your effort! I'll try this on my examples and see if that will fix it

@MartinThoma
Copy link
Member

First of all: VERY nice analysis and explanation @pubpub-zz ! Thank you so much for that 🙏

a) add a page.get_rotate() function : to retrieve the page /Rotate value

I would prefer a rotation property as this feels overall more consistent / cleaner to me. Especially if we just use the /Rotate value. We could make that property even writable.

b) add a page.transfer_rotate_to_content() function

That is interesting! I was just wondering if there are other "standardizations" that people typically want to do. That function would return None, right?

Sounds reasonable to me 👍

@pubpub-zz
Copy link
Collaborator

I would prefer a rotation property as this feels overall more consistent / cleaner to me. Especially if we just use the /Rotate value. We could make that property even writable.

I then propose to use rotation with a getter and a setter . .rotate would be possible with .rotation += 90. Just wanting to avoid slightly different wording for getting and setting.

b) add a page.transfer_rotate_to_content() function

That is interesting! I was just wondering if there are other "standardizations" that people typically want to do. That function would return None, right?

Morelickely returning self to allow cascading

@carecavoador
Copy link

carecavoador commented Sep 14, 2022

Wow! This is going, people. Thank you so much for your time on this problem. The last modification almost solved my issue.

I did this based on @pubpub-zz example:

import PyPDF2

a4_file = PyPDF2.PdfReader('a4_file.pdf')
a4_page = a4_file.pages[0]

blue_file = PyPDF2.PdfReader('blue_file.pdf')
blue_page = blue_file.pages[0]

pink_file = PyPDF2.PdfReader('pink_file.pdf')
pink_page = pink_file.pages[0]

pink_page.add_transformation(
    PyPDF2.Transformation()
    .translate(-float(pink_page.mediabox.getWidth()) / 2, -float(pink_page.mediabox.getHeight()) / 2)
    .rotate(90)
    .translate(float(pink_page.mediabox.getHeight()) / 2, float(pink_page.mediabox.getWidth()) / 2)
    .translate(float(blue_page.mediabox.getWidth()), 0),
    True,
)
a4_page.merge_page(blue_page)
a4_page.merge_page(pink_page)

writer = PyPDF2.PdfWriter()
writer.add_page(a4_page)

with open('result.pdf', 'wb') as final:
    writer.write(final)

Considering I have these files:
a4_file
blue_file
pink_file

This is the result I'd like to get:
expected

Insted, this is the output I'm getting with the code above:
result

Upon further inspection on several PDF editig software, I can see the transformation is happening, but somehow there is a clipping mask on the pink_page that cuts it. We can observe it on the file wireframe:
result_wf

At this point, I feel like I'm asking too much. Sorry for that. But, if somehow, someone could take some time on this issue, it would be much appreciated.

Anyway, thank you all for your time on this package. It is awesome, and I love it!

@carecavoador
Copy link

People, I think I found a solution! 😄

At least now I know what the issue is: when transformations are applied, the page boxes '/ArtBox', '/BleedBox', '/CropBox', '/MediaBox', '/TrimBox' are not being updated acordingly. So you have to do it manually.

Here is an example where I successfully centered an A4 page inside an A3 page on the X axis while keeping it aligned on the botton.

Example files:
A3.pdf
A4.pdf

Code:

from PyPDF2 import PdfReader, Transformation, PdfWriter
from PyPDF2.generic import RectangleObject

a3_file = PdfReader('A3.pdf')
a3_page = a3_file.pages[0]

a4_file = PdfReader('A4.pdf')
a4_page = a4_file.pages[0]

# Rotate 'a4_page.pdf' 90 degrees and centers it on 'a3_page.pdf':
a4_page.add_transformation(
    Transformation()
    .translate(-float(a4_page.mediaBox.getWidth()) / 2, -float(a4_page.mediaBox.getHeight()) / 2)
    .rotate(90)
    .translate(float(a4_page.mediaBox.getHeight()) / 2, float(a3_page.mediaBox.getLowerLeft_y() + a4_page.mediaBox.getWidth() / 2))
)

# Manually update page boxes:
new_width = a4_page.mediaBox.getHeight()
new_heigth = a4_page.mediaBox.getWidth()
new_y = a3_page.mediaBox.getLowerLeft_y()

a4_page.update({'/ArtBox': RectangleObject([0, new_y, new_width, new_y + new_heigth])})
a4_page.update({'/BleedBox': RectangleObject([0, new_y, new_width, new_y + new_heigth])})
a4_page.update({'/CropBox': RectangleObject([0, new_y, new_width, new_y + new_heigth])})
a4_page.update({'/MediaBox': RectangleObject([0, new_y, new_width, new_y + new_heigth])})
a4_page.update({'/TrimBox': RectangleObject([0, new_y, new_width, new_y + new_heigth])})

a3_page.merge_page(a4_page)

writer = PdfWriter()
writer.add_page(a3_page)
with open('merged_a3.pdf', 'wb') as fo:
    writer.write(fo)

@Nikita7x
Copy link
Author

@pubpub-zz thank you! Your solution with rotating the stamp file works perfectly!

@pubpub-zz
Copy link
Collaborator

pubpub-zz commented Sep 16, 2022

@pubpub-zz thank you! Your solution with rotating the stamp file works perfectly!

@Nikita7x
Have you applied the change from the PR ?

MartinThoma pushed a commit that referenced this issue Sep 17, 2022
@Nikita7x
Copy link
Author

Nikita7x commented Sep 19, 2022

I updated PyPDF2 to a newer version which supports method **transfer_rotation_to_content** and now I'm getting Exception while trying to use this method on pages with rotation

first_page = original_pdf.getPage(0)
first_page.transfer_rotation_to_content()
transfer_rotation_to_content

    pt1 = trsf.apply_on(cast(RectangleObject, self[b]).lower_left)
AttributeError: 'ArrayObject' object has no attribute 'lower_left'

The exact file: rot_file.pdf

@pubpub-zz
Copy link
Collaborator

pubpub-zz commented Sep 19, 2022

@Nikita7x ,
I've just checked with version 2.10.9 and I'm not getting any issue.
using pdb.pm(), can you display the content of self, b and type(b)
Can you indicate also with version of Python are you using

@Nikita7x
Copy link
Author

@Nikita7x ,
I've just checked with version 2.10.9 and I'm not getting any issue.
using pdb.pm(), can you display the content of self, b and type(b)
Can you indicate also with version of Python are you using

I've checked and the version is 2.10.9, Python 3.9.
Now I see. When I cover sensitive data on this PDF, it is then processed normally. But original PDF is still raises Exception. I unfortunately can't send this PDF here, because it has a lot of sensitive data, but I can send it privately and directly to you so you can see it yourself, @pubpub-zz

@Nikita7x
Copy link
Author

I've noticed I can't reproduce this problem while debugging. But using Django trace I was able to get local variables.
This Exception occurs when b is '/CropBox' while iterating through line 402 in file _page.py.

b | '/CropBox'
mb | RectangleObject([0, 0, 841.68, 595.08])
pt1 | (595.0800000000002, 0.0)
pt2 | (0.0, 841.6799999999998)
r | -270
self | {'/Annots': [],  '/Contents': {},  '/CropBox': [0, 0, 841.68, 595.08],  '/MediaBox': RectangleObject([0, 0, 595.080000000000154614099301397800445556640625, 841.67999999999983629095368087291717529296875]),  '/Parent': IndirectObject(1, 0, 4419751216),  '/Resources': {'/ProcSet': ['/PDF', '/ImageC'],                 '/XObject': {'/Im0': IndirectObject(6, 0, 4419751216)}},  '/Rotate': 0,  '/Type': '/Page'}

@Nikita7x
Copy link
Author

Nikita7x commented Sep 20, 2022

Maybe this is some kind of a system bug, because this Exception raises sometimes while running the same code and the same file. I've tried rebooting the system, but it didn't help, it's really strange.

Django Version: 4.1.1
Python Version: 3.9.5

@pubpub-zz
Copy link
Collaborator

Understood, will come back with a fix tonight

@pubpub-zz
Copy link
Collaborator

@Nikita7x,
can you try this patch in _page.py, at about line 400, can you replace the for look with this code and confirm the error goes away:

        for b in ["/MediaBox", "/CropBox", "/BleedBox", "/TrimBox", "/ArtBox"]:
            if b in self:
                r=RectangleObject(self[b])
                pt1 = trsf.apply_on(r.lower_left)
                pt2 = trsf.apply_on(r.upper_right)
                self[NameObject(b)] = RectangleObject(
                    (
                        min(pt1[0], pt2[0]),
                        min(pt1[1], pt2[1]),
                        max(pt1[0], pt2[0]),
                        max(pt1[1], pt2[1]),
                    )
                )

@Nikita7x
Copy link
Author

can you try this patch in _page.py, at about line 400, can you replace the for look with this code and confirm the error goes away:

Yeah, it seems like it resolves the issue!

@MartinThoma MartinThoma added the is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF label Sep 24, 2022
@pubpub-zz
Copy link
Collaborator

PR delivered and issued

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Has MCVE A minimal, complete and verifiable example helps a lot to debug / understand feature requests is-bug From a users perspective, this is a bug - a violation of the expected behavior with a compliant PDF
Projects
None yet
Development

No branches or pull requests

4 participants