Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Avoid isolating the graphics state multiple times #2224

Closed

Conversation

stefan6419846
Copy link
Collaborator

Fixes #2219.

@stefan6419846
Copy link
Collaborator Author

The test failures appear to be unrelated to this PR.

@pubpub-zz
Copy link
Collaborator

We have to be careful: we may be in a situation where we are on two isolated sections next to each other's

@stefan6419846
Copy link
Collaborator Author

Do you have an example? Judging from the fact that PyMuPDF basically recommends the approach implemented in this PR, I would assume that this indeed is the correct way.

Citing from https://pymupdf.readthedocs.io/en/latest/recipes-common-issues-and-their-solutions.html#id3:

Solution 3. is completely under your control and only does the minimum corrective action. There is a handy utility method Page.wrap_contents() which – as twe [sic] name suggests – wraps the page’s contents object(s) by the PDF commands q and Q.

This solution is extremely fast and the changes to the PDF are minimal. This is useful in situations where incrementally saving the file is desirable – or even a must when the PDF has been digitally signed and you cannot change this status.

We recommend the following snippet to get the situation under control:

>>> if not page.is_wrapped:
        page.wrap_contents()
>>> # start inserting text, images and other objects here

@pubpub-zz
Copy link
Collaborator

here you are an example
tt.pdf
here is the output with Pdf Debugger:
(beginning)
image
(end)
image

the two q/Q sequences are at same level and even separated with some operations

@stefan6419846
Copy link
Collaborator Author

This PDF file will not be considered to have an isolated graphics state at the moment due to q 1 1 0.94 rg 1 [...] 144.727 704.276 cm\nQ \nQ \nQ \nn \nQ \n', but manually editing the content stream makes this PDF being considered as wrapped, yes. The page geometry does not seem to get modified in your example in a weird manner, thus this would not cause any real issue when adding a watermark for example.

The current check is not cheap, but rather simple to do. What you are proposing seems to be that this check should look at every byte and wrap under the following conditions:

  • The file does not start with a variation of q and whitespace.
  • The file does not end with a variation of Q and whitespace.
  • The file contains multiple top-level q commands.

Am I correct? While this is doable, this will have to scan the whole page content in the worst case (= correct PDF). With the current solution (without the fix from this PR), the PDF file might grow each time I add another layer, which does not seem to be a good solution either, although much easier to implement.

@stefan6419846
Copy link
Collaborator Author

The following code snippets would correctly detect the most common variations of the tt.pdf example from above as well, although introducing quite some computation overhead and thus I did not yet commit it:

def has_isolated_graphics_state_in_operations(operations: List[Tuple[Any, Any]]) -> bool:
    level = 0
    last_operation_index = len(operations) - 1
    if last_operation_index == -1:
        # There are no operations, thus no isolation required.
        return True

    for index, (_, operator) in enumerate(operations):
        if index == 0 and operator != b"q":
            # Not isolated at the start.
            return False
        if index == last_operation_index and operator != b"Q":
            # Not isolated at the end.
            return False
        if operator not in {b"q", b"Q"}:
            continue
        # `q` commands are at the start, thus increasing the level.
        level += (1 if operator == b"q" else -1)
        if level <= 0:
            # We have detected a second `q ... Q` pair on the root level.
            return False
    return False


def has_isolated_graphics_state_in_bytes(data: bytes) -> bool:
    data = data.strip(b"".join(WHITESPACES))
    byte_count = len(data)
    if byte_count == 0:
        # Whitespace-only data is considered isolated.
        return True
    isolation_commands = re.finditer(rb"(?P<character>[qQ])" + WHITESPACES_AS_REGEXP + rb"*", data)

    # Fast path: If it does not start isolated, return directly.
    first_command = next(isolation_commands, None)
    if first_command is None:
        # We have some non-whitespace content, but no isolation call at all.
        return False
    if first_command.start("character") != 0 or first_command.group("character") != b"q":
        # The first non-whitespace character should be the isolation call.
        return False
    # We have consumed the first `q` command, thus start at level 1.
    level = 1

    for command in isolation_commands:
        character = command.group("character")
        if command.end("character") == byte_count and character == b"Q":
            # We have a `Q` command at the end. Due to having stopped for non-`q` at the
            # start, this is the expected case.
            return True
        # `q` commands are at the start, thus increasing the level.
        level += (1 if character == b"q" else -1)
        if level <= 0:
            # We have detected a second `q ... Q` pair on the root level.
            return False
    return False

@codecov
Copy link

codecov bot commented Sep 30, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (8d3f879) 94.37% compared to head (a7883b5) 94.42%.
Report is 5 commits behind head on main.

❗ Current head a7883b5 differs from pull request most recent head 4cbd1ab. Consider uploading reports for the commit 4cbd1ab to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2224      +/-   ##
==========================================
+ Coverage   94.37%   94.42%   +0.05%     
==========================================
  Files          43       43              
  Lines        7660     7604      -56     
  Branches     1515     1502      -13     
==========================================
- Hits         7229     7180      -49     
+ Misses        267      260       -7     
  Partials      164      164              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pubpub-zz
Copy link
Collaborator

@MartinThoma
I'm still not convince with this PR :
The code required to identify if isolation is required or not, requires to parse the whole contents even if it is spreaded within multiple stream objects, which will have a strong impact on performances.
The only advantage is to stay acceptable nesting blocks (used to be >28 ? no limit found in PDF2.0). not sure it worth it as no errors have been reported up to now

@MartinThoma
Copy link
Member

@pubpub-zz Thank you for your assessment of the situation 🙏

@stefan6419846 It makes sense to me what pubpub-zz is writing. I tend to close this PR. Is there anything I miss / should additionally consider?

@stefan6419846
Copy link
Collaborator Author

pubpub-zz usually has more insights into the PDF spec, so I am going to trust the aforementioned comments in this case.

As some context, which is outlined in #2219: I have seen this behavior with the graphics state isolation only being done once in PyMuPDF and from my experience, MuPDF being one of the most stable/fault tolerant PDF library, I considered this to be a good solution. There they only check the first and last character of the content stream, although I have not looked at how they handle content streams which are split up.

In some of my use cases, I am using page merges with three or more overlays for one page, so this will quickly lead to quite some nested graphics state isolation calls, which might not always be necessary.

Having said that, I am fine with closing this PR and the corresponding issue as long as their is no real need and you as the maintainer decide that it is not worth the hassle or lead to quite some unnecessary overhead. As a side effect, implementing this PR allowed me to have a quick look at the inner workings of PDF/pypdf and allow me to get a basic grasp of the low-level primitives of PDF, so this PR will always have a positive learning effect for me nevertheless.

@MartinThoma
Copy link
Member

Thank you for your work with this PR and thank you for taking the time to respond.

I do agree with your comment - MuPDF is good and pubpub-zz has very good understanding of the PDF specs + pypdf.

For this reason, I'm closing this PR and the issue now. If people stumble over that issue, we can re-evaluate / re-open :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Each page merge will add another graphics state isolation call
3 participants