Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow over-writing entries, in XRef.indexObjects, only when the generation number matches (issues 11230, 11139, 9552, 9129, 7303) #11231

Merged

Conversation

Snuffleupagus
Copy link
Collaborator

@Snuffleupagus Snuffleupagus commented Oct 11, 2019

TODO:

  • Manually check open/closed issues tagged with 4-corrupted-pdf such that there's no obvious regressions.
    I've done a quick pass already, but additional help would be very much appreciated here.
  • Convert (relevant) pre-existing load tests to eq tests instead, to help avoid unnecessary regressions. Note that some files will render, but not as intended when compared with e.g. Adobe Reader, depending on which objects are chosen (see e.g. issues with incorrect total number of pages).
  • Add as many new eq tests as possible, based on the list above.

This patch is making me somewhat worried about future regressions, since it's certainly easy to imagine this completely breaking certain kinds of corrupt/edited PDF documents while fixing others.[1]

Obviously it passes all existing reference tests (and even improves one), however compared to many other patches there's no telling how much it could break. Edit: Obviously any breakage would be limited to corrupt PDF documents, since valid ones does not enter this method at all. Edit2: Also, this patch should mostly affect the "multiple documents in one PDF file" case, but not necessarily PDF documents with just a broken XRef table.
The only reason that I'm even submitting this patch, is because of the number of open issues that it would address.

Generally speaking though, the best course of action would probably be if XRef.indexObjects was re-written to be much more robust (since it currently feels somewhat hand-wavy in parts). E.g. by actually checking/validating more of the objects before committing to them.

Fixes #11230
Fixes #11139
Fixes #9552
Fixes #9129
Fixes #7303

Also partially improves #6243, however parts of it regresses as well. (A couple of pages now renders "better", but one page no longer renders and an image looks worse.)
Generally though it seems that none of the PDF viewers I've tried agree completely on how the document should look, and this might be a case where the document is just too damaged to "fix".


[1] Especially given that it's reverting part of PR #5910, however in the case of issue #5909 it seems that other (more recent) changes have actually made that PR redundant.

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_preview from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/e76013aabbe15e1/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/e76013aabbe15e1/output.txt

Total script time: 1.71 mins

Published

@Snuffleupagus
Copy link
Collaborator Author

Snuffleupagus commented Oct 12, 2019

Temporarily dumping some PDF files here, to have a (slightly) better place to point linked test-cases to.
issue5909_original.pdf
issue9552.pdf
issue9129.pdf

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_preview from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/e61ebbad97b67e1/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/e61ebbad97b67e1/output.txt

Total script time: 1.69 mins

Published

@Snuffleupagus
Copy link
Collaborator Author

/botio test

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.215.176.217:8877/66a1741ad676488/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/5012ec3eb0199db/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Failed

Full output at http://54.67.70.0:8877/5012ec3eb0199db/output.txt

Total script time: 17.91 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.67.70.0:8877/5012ec3eb0199db/reftest-analyzer.html#web=eq.log

@pdfjsbot
Copy link

From: Bot.io (Windows)


Failed

Full output at http://54.215.176.217:8877/66a1741ad676488/output.txt

Total script time: 26.31 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: FAILED

Image differences available at: http://54.215.176.217:8877/66a1741ad676488/reftest-analyzer.html#web=eq.log

@Snuffleupagus Snuffleupagus force-pushed the indexObjects-entries-gen branch 2 times, most recently from 794c7e9 to ff4593e Compare October 13, 2019 14:33
@Snuffleupagus Snuffleupagus marked this pull request as ready for review October 13, 2019 14:40
@Snuffleupagus Snuffleupagus changed the title Allow over-writing entries, in XRef.indexObject, only when the generation number matches (issues 11230, 11139, 9552, 9129, 7303) Allow over-writing entries, in XRef.indexObjects, only when the generation number matches (issues 11230, 11139, 9552, 9129, 7303) Oct 13, 2019
…eration number matches (issues 11230, 11139, 9552, 9129, 7303)

This patch is making me somewhat worried about future regressions, since it's certainly easy to imagine this completely breaking certain kinds of corrupt/edited PDF documents while fixing others.[1]

Obviously it passes all existing reference tests (and even improves one), however compared to many other patches there's no telling how much it could break.
The only reason that I'm even submitting this patch, is because of the number of open issues that it would address.

Generally speaking though, the best course of action would probably be if `XRef.indexObjects` was re-written to be much more robust (since it currently feels somewhat hand-wavy in parts). E.g. by actually checking/validating more of the objects before committing to them.

---
[1] Especially given that it's reverting part of PR 5910, however in the case of issue 5909 it seems that other (more recent) changes have actually made that PR redundant.
@timvandermeij
Copy link
Contributor

/botio-linux preview

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_preview from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/cb83521b5c0723f/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/cb83521b5c0723f/output.txt

Total script time: 1.70 mins

Published

@timvandermeij timvandermeij merged commit c54bb22 into mozilla:master Oct 17, 2019
@timvandermeij
Copy link
Contributor

I think this is a very nice improvement. Even though you're absolutely right about other breakage that may still occur, given that those PDF files are usually quite broken, we have a lot more test coverage now than we had before, so future changes will be less risky. Thanks!

@timvandermeij
Copy link
Contributor

/botio makeref

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Received

Command cmd_makeref from @timvandermeij received. Current queue size: 0

Live output at: http://54.67.70.0:8877/875b84455911221/output.txt

@pdfjsbot
Copy link

From: Bot.io (Windows)


Received

Command cmd_makeref from @timvandermeij received. Current queue size: 2

Live output at: http://54.215.176.217:8877/9707c85a90f21da/output.txt

@pdfjsbot
Copy link

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/875b84455911221/output.txt

Total script time: 16.74 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@pdfjsbot
Copy link

From: Bot.io (Windows)


Success

Full output at http://54.215.176.217:8877/9707c85a90f21da/output.txt

Total script time: 23.93 mins

  • Lint: Passed
  • Make references: Passed
  • Check references: Passed

@Snuffleupagus Snuffleupagus deleted the indexObjects-entries-gen branch October 17, 2019 22:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants