BUG: Avoid isolating the graphics state multiple times #2224

stefan6419846 · 2023-09-28T06:56:13Z

stefan6419846 · 2023-09-28T07:07:08Z

The test failures appear to be unrelated to this PR.

pubpub-zz · 2023-09-28T11:18:53Z

We have to be careful: we may be in a situation where we are on two isolated sections next to each other's

stefan6419846 · 2023-09-28T14:21:35Z

Do you have an example? Judging from the fact that PyMuPDF basically recommends the approach implemented in this PR, I would assume that this indeed is the correct way.

Citing from https://pymupdf.readthedocs.io/en/latest/recipes-common-issues-and-their-solutions.html#id3:

Solution 3. is completely under your control and only does the minimum corrective action. There is a handy utility method Page.wrap_contents() which – as twe [sic] name suggests – wraps the page’s contents object(s) by the PDF commands q and Q.

This solution is extremely fast and the changes to the PDF are minimal. This is useful in situations where incrementally saving the file is desirable – or even a must when the PDF has been digitally signed and you cannot change this status.

We recommend the following snippet to get the situation under control:
>>> if not page.is_wrapped:
        page.wrap_contents()
>>> # start inserting text, images and other objects here

pubpub-zz · 2023-09-28T17:21:53Z

here you are an example
tt.pdf
here is the output with Pdf Debugger:
(beginning)

(end)

the two q/Q sequences are at same level and even separated with some operations

stefan6419846 · 2023-09-28T17:46:15Z

This PDF file will not be considered to have an isolated graphics state at the moment due to q 1 1 0.94 rg 1 [...] 144.727 704.276 cm\nQ \nQ \nQ \nn \nQ \n', but manually editing the content stream makes this PDF being considered as wrapped, yes. The page geometry does not seem to get modified in your example in a weird manner, thus this would not cause any real issue when adding a watermark for example.

The current check is not cheap, but rather simple to do. What you are proposing seems to be that this check should look at every byte and wrap under the following conditions:

The file does not start with a variation of q and whitespace.
The file does not end with a variation of Q and whitespace.
The file contains multiple top-level q commands.

Am I correct? While this is doable, this will have to scan the whole page content in the worst case (= correct PDF). With the current solution (without the fix from this PR), the PDF file might grow each time I add another layer, which does not seem to be a good solution either, although much easier to implement.

stefan6419846 · 2023-09-30T10:04:46Z

The following code snippets would correctly detect the most common variations of the tt.pdf example from above as well, although introducing quite some computation overhead and thus I did not yet commit it:

def has_isolated_graphics_state_in_operations(operations: List[Tuple[Any, Any]]) -> bool:
    level = 0
    last_operation_index = len(operations) - 1
    if last_operation_index == -1:
        # There are no operations, thus no isolation required.
        return True

    for index, (_, operator) in enumerate(operations):
        if index == 0 and operator != b"q":
            # Not isolated at the start.
            return False
        if index == last_operation_index and operator != b"Q":
            # Not isolated at the end.
            return False
        if operator not in {b"q", b"Q"}:
            continue
        # `q` commands are at the start, thus increasing the level.
        level += (1 if operator == b"q" else -1)
        if level <= 0:
            # We have detected a second `q ... Q` pair on the root level.
            return False
    return False


def has_isolated_graphics_state_in_bytes(data: bytes) -> bool:
    data = data.strip(b"".join(WHITESPACES))
    byte_count = len(data)
    if byte_count == 0:
        # Whitespace-only data is considered isolated.
        return True
    isolation_commands = re.finditer(rb"(?P<character>[qQ])" + WHITESPACES_AS_REGEXP + rb"*", data)

    # Fast path: If it does not start isolated, return directly.
    first_command = next(isolation_commands, None)
    if first_command is None:
        # We have some non-whitespace content, but no isolation call at all.
        return False
    if first_command.start("character") != 0 or first_command.group("character") != b"q":
        # The first non-whitespace character should be the isolation call.
        return False
    # We have consumed the first `q` command, thus start at level 1.
    level = 1

    for command in isolation_commands:
        character = command.group("character")
        if command.end("character") == byte_count and character == b"Q":
            # We have a `Q` command at the end. Due to having stopped for non-`q` at the
            # start, this is the expected case.
            return True
        # `q` commands are at the start, thus increasing the level.
        level += (1 if character == b"q" else -1)
        if level <= 0:
            # We have detected a second `q ... Q` pair on the root level.
            return False
    return False

codecov · 2023-09-30T10:12:33Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (8d3f879) 94.37% compared to head (a7883b5) 94.42%.
Report is 5 commits behind head on main.

❗ Current head a7883b5 differs from pull request most recent head 4cbd1ab. Consider uploading reports for the commit 4cbd1ab to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2224      +/-   ##
==========================================
+ Coverage   94.37%   94.42%   +0.05%     
==========================================
  Files          43       43              
  Lines        7660     7604      -56     
  Branches     1515     1502      -13     
==========================================
- Hits         7229     7180      -49     
+ Misses        267      260       -7     
  Partials      164      164

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

pubpub-zz · 2023-11-29T21:57:41Z

@MartinThoma
I'm still not convince with this PR :
The code required to identify if isolation is required or not, requires to parse the whole contents even if it is spreaded within multiple stream objects, which will have a strong impact on performances.
The only advantage is to stay acceptable nesting blocks (used to be >28 ? no limit found in PDF2.0). not sure it worth it as no errors have been reported up to now

MartinThoma · 2023-12-09T18:38:54Z

@pubpub-zz Thank you for your assessment of the situation 🙏

@stefan6419846 It makes sense to me what pubpub-zz is writing. I tend to close this PR. Is there anything I miss / should additionally consider?

stefan6419846 · 2023-12-09T18:51:23Z

pubpub-zz usually has more insights into the PDF spec, so I am going to trust the aforementioned comments in this case.

As some context, which is outlined in #2219: I have seen this behavior with the graphics state isolation only being done once in PyMuPDF and from my experience, MuPDF being one of the most stable/fault tolerant PDF library, I considered this to be a good solution. There they only check the first and last character of the content stream, although I have not looked at how they handle content streams which are split up.

In some of my use cases, I am using page merges with three or more overlays for one page, so this will quickly lead to quite some nested graphics state isolation calls, which might not always be necessary.

Having said that, I am fine with closing this PR and the corresponding issue as long as their is no real need and you as the maintainer decide that it is not worth the hassle or lead to quite some unnecessary overhead. As a side effect, implementing this PR allowed me to have a quick look at the inner workings of PDF/pypdf and allow me to get a basic grasp of the low-level primitives of PDF, so this PR will always have a positive learning effect for me nevertheless.

MartinThoma · 2023-12-10T11:45:32Z

Thank you for your work with this PR and thank you for taking the time to respond.

I do agree with your comment - MuPDF is good and pubpub-zz has very good understanding of the PDF specs + pypdf.

For this reason, I'm closing this PR and the issue now. If people stumble over that issue, we can re-evaluate / re-open :-)

BUG: Avoid isolating the graphics state multiple times (fixes py-pdf#…

89debe6

…2219)

stefan6419846 added 2 commits September 30, 2023 11:57

fix detectionn in operation-based ContentStream

17ca0ce

fix line length

a7883b5

MartinThoma added 2 commits October 8, 2023 12:17

Merge branch 'main' into graphics_state_isolation

42aaf66

Merge branch 'main' into graphics_state_isolation

74dc3fd

Merge branch 'main' into graphics_state_isolation

4cbd1ab

MartinThoma closed this Dec 10, 2023

MartinThoma reopened this Dec 10, 2023

MartinThoma closed this Dec 10, 2023

MartinThoma mentioned this pull request Dec 10, 2023

Each page merge will add another graphics state isolation call #2219

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: Avoid isolating the graphics state multiple times #2224

BUG: Avoid isolating the graphics state multiple times #2224

stefan6419846 commented Sep 28, 2023

stefan6419846 commented Sep 28, 2023

pubpub-zz commented Sep 28, 2023

stefan6419846 commented Sep 28, 2023

pubpub-zz commented Sep 28, 2023

stefan6419846 commented Sep 28, 2023

stefan6419846 commented Sep 30, 2023

codecov bot commented Sep 30, 2023 •

edited

Loading

pubpub-zz commented Nov 29, 2023

MartinThoma commented Dec 9, 2023

stefan6419846 commented Dec 9, 2023

MartinThoma commented Dec 10, 2023

BUG: Avoid isolating the graphics state multiple times #2224

BUG: Avoid isolating the graphics state multiple times #2224

Conversation

stefan6419846 commented Sep 28, 2023

stefan6419846 commented Sep 28, 2023

pubpub-zz commented Sep 28, 2023

stefan6419846 commented Sep 28, 2023

pubpub-zz commented Sep 28, 2023

stefan6419846 commented Sep 28, 2023

stefan6419846 commented Sep 30, 2023

codecov bot commented Sep 30, 2023 • edited Loading

Codecov Report

pubpub-zz commented Nov 29, 2023

MartinThoma commented Dec 9, 2023

stefan6419846 commented Dec 9, 2023

MartinThoma commented Dec 10, 2023

codecov bot commented Sep 30, 2023 •

edited

Loading