
Get "unnecessary" range on first page #14570

Closed
liu-dongyu opened this issue Feb 16, 2022 · 6 comments

Comments


liu-dongyu commented Feb 16, 2022

Attach (recommended) or Link to PDF file here:
https://github.com/liu-dongyu/demo/blob/main/pdf/test.pdf

Configuration:

  • Web browser and its version: Chrome 98.0.4758.80
  • Operating system and its version: macOS 10.15.5
  • PDF.js version: v2.12.313 and v2.3.200
  • Is a browser extension: no

Steps to reproduce the problem:

  1. Use an online PDF URL that supports range requests
  2. Pass disableAutoFetch: true, disableStream: true, rangeChunkSize: 65536 * 16 to getDocument
  3. Call getPage(1) and render the page

What is the expected behavior? (add screenshot)
With v2.12.313, the first page needs 60 range requests (online demo)

source code

<script src="https://cdn.bootcdn.net/ajax/libs/pdf.js/2.12.313/pdf.min.js"></script>
pdfjsLib.GlobalWorkerOptions.workerSrc = "https://cdn.bootcdn.net/ajax/libs/pdf.js/2.12.313/pdf.worker.min.js";

const loadingTask = pdfjsLib.getDocument({
  url: "xx",
  disableAutoFetch: true,
  disableStream: true,
  rangeChunkSize: 65536 * 16,
});

(async function () {
  const pdfDocument = await loadingTask.promise;
  const page = await pdfDocument.getPage(1);

  page.render({
    canvasContext: document
      .getElementById("pageContainer")
      .getContext("2d"),
    viewport: page.getViewport({ scale: 1 }),
  });
})();

With v2.3.200, the first page needs only 3 range requests, which is the expected number (online demo)

source code

<script src="https://cdn.bootcdn.net/ajax/libs/pdf.js/2.3.200/pdf.min.js"></script>
pdfjsLib.GlobalWorkerOptions.workerSrc = "https://cdn.bootcdn.net/ajax/libs/pdf.js/2.3.200/pdf.worker.min.js";

const loadingTask = pdfjsLib.getDocument({
  url: "xx",
  disableAutoFetch: true,
  disableStream: true,
  rangeChunkSize: 65536 * 16,
});

(async function () {
  const pdfDocument = await loadingTask.promise;
  const page = await pdfDocument.getPage(1);

  page.render({
    canvasContext: document
      .getElementById("pageContainer")
      .getContext("2d"),
    viewport: page.getViewport({ scale: 1 }),
  });
})();

What went wrong? (add screenshot)
Why does v2.12.313 need so many chunks to render the first page, and is it possible to optimize this?

@Snuffleupagus (Collaborator)

set disableAutoFetch: true, disableStream: true, rangeChunkSize: 65536 * 16 to getDocument

Please note that disabling of streaming will generally lead to worse performance, compared to the default values.
Furthermore, note that using such a large chunk-size may also affect things negatively.

Why does v2.12.313 need so many chunks to render the first page, and is it possible to optimize it?

Unfortunately this is the expected result of PRs such as #14311, #14335, #14358, and #14411, which were necessary to avoid serious problems with corrupt PDF documents.
Note in particular #14311 (comment), which contains the following passage where the second point applies to your PDF document:

Unfortunately these changes will have a number of somewhat negative side-effects, please see a possibly incomplete list below, however I cannot see a better way to address this bug.

  • This will slow down initial loading/rendering of all documents, at least by some amount, since we now need to fetch/parse more of the /Pages-tree in order to be able to access the last page of the PDF documents.
  • For poorly generated PDF documents, where the entire /Pages-tree only has one level, we'll unfortunately need to fetch/parse the entire /Pages-tree to get to the last page. While there's a cache to help reduce repeated data lookups, this will affect initial loading/rendering of some long PDF documents.
  • This will affect the disableAutoFetch = true mode negatively, since we now need to fetch/parse more data during document initialization. While the disableAutoFetch = true mode should still be helpful in larger/longer PDF documents, for smaller ones the effect/usefulness may unfortunately be lost.

Closing as INVALID, since we cannot "fix" this without breaking other PDF documents, and this issue is specific to badly generated PDF documents.

@liu-dongyu (Author)

liu-dongyu commented Feb 18, 2022

@Snuffleupagus I already know that pdf.js needs checkLastPage during document initialization to ensure that numPages is correct.

pdf.js/src/core/worker.js

Lines 177 to 179 in c37d785

// Check that the last page can be successfully loaded, to ensure that
// `numPages` is correct, and fallback to walking the entire /Pages-tree.
await pdfManager.ensureDoc("checkLastPage", [recoveryMode]);

To improve initialization performance: if I can obtain the correct numPages from somewhere else, such as a custom backend request, can I just run catalog.setActualNumPages(correctNumPage); and then return directly?

pdf.js/src/core/document.js

Lines 1362 to 1366 in 263c895

async checkLastPage(recoveryMode = false) {
  const { catalog, pdfManager } = this;
  catalog.setActualNumPages(); // Ensure that it's always reset.
  let numPages;

becomes 👇
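The snippet that followed was collapsed in the thread; below is a minimal, self-contained sketch of the idea being proposed, assuming a hypothetical trustedNumPages value supplied by the application. This is not actual pdf.js code:

```javascript
// Simplified stand-in for pdf.js's Catalog, just enough to show the idea.
class Catalog {
  setActualNumPages(numPages = null) {
    this._actualNumPages = numPages;
  }
  get numPages() {
    return this._actualNumPages;
  }
}

// Hypothetical variant of checkLastPage: if the application already knows
// the correct page count (e.g. from its own backend), record it and return
// early instead of fetching/parsing the last page or the whole /Pages-tree.
async function checkLastPage(catalog, trustedNumPages, recoveryMode = false) {
  catalog.setActualNumPages(); // Ensure that it's always reset.
  if (Number.isInteger(trustedNumPages) && trustedNumPages > 0) {
    catalog.setActualNumPages(trustedNumPages);
    return; // Skip the expensive last-page validation entirely.
  }
  // ...otherwise fall through to the normal validation, which loads the
  // last page and, on failure, walks the entire /Pages-tree.
}
```

Note that skipping the check trades safety for speed: a wrong trusted count would surface later as rendering errors instead of being caught at initialization.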

@legistek

legistek commented Mar 17, 2023

Having the same issue with a 2000+ page PDF that is definitely linearized. I'm also in the same position as @liu-dongyu, in that we already do a server-side integrity check. Removing the line

await pdfManager.ensureDoc("checkLastPage", [recoveryMode]);

solves the problem for the first page, but not for subsequent pages. For example, loading page 1,000 requires loading everything prior to page 1,000.

I think it's hard to argue that this is not a bug worth addressing. Most people would prefer that all linearized PDFs load pages on demand, which is the point of linearizing, rather than have foolproof error checking to deal with corrupt PDFs. Couldn't we be given the option to skip this integrity check as one of the getDocument parameters?


@GStPierreMT

GStPierreMT commented Jul 27, 2023

Leaving a comment for those who end up here from a search or a link on Stack Overflow.

We also had this issue with PDFs that we were creating and linearizing with QPDF.

We found that if we force min-version=1.6 while creating the document, the document loads as expected in PDF.js with the settings disableAutoFetch: true and disableStream: true.

The test.pdf in the original post is at PDF version 1.3. I downloaded it and used QPDF to transform it to version 1.6:
qpdf --min-version=1.6 test.pdf test1_6.pdf

That document renders the first page after 3 range requests.

For anyone looking at this I believe you should still consider Snuffleupagus' comment:

Please note that disabling of streaming will generally lead to worse performance, compared to the default values.
Furthermore, note that using such a large chunk-size may also affect things negatively.

If the user is likely to read/scroll through most of the PDF, or perform a search, then streaming mode is probably better.

Edit: My testing performed with Windows 11, PDF.js 3.7.107, Chrome 115.0.5790.110, Apache 2.4 webserver

@richard-smith-preservica

richard-smith-preservica commented Aug 21, 2024

Leaving a comment here as well, as I have a similar issue.

Rewriting problem PDFs with QPDF does help, but it's not related to the PDF version. It's because QPDF restructures the file.

A summary of the cause of the problem:

  • pdf.js loads the "last" page (i.e. page "count - 1") to validate that the PDF is at least somewhat correct
  • because of how /Pages dicts' children work, finding information about page N requires traversing the /Pages tree for all previous pages. In the case of a PDF with all the pages in a flat structure this means reading every /Page dictionary for pages 0-N (i.e., for the last page, every /Page dictionary)
  • problem PDFs (at least the ones I have problems with) interleave /Page dictionaries and content, i.e. the PDF looks like this

/Pages << /Kids [ all the /Page refs ] >>
/Page for page 0
Content for page 0 (big)
/Page for page 1
Content for page 1 (big)
etc

  • so every Page dictionary is in a different chunk and needs to be fetched separately.

In addition, pdf.js awaits each of those requests in turn, so for a 300-page document you are waiting for 300 XHRs one after the other. I recently had a PR accepted (#18627) that sends the XHRs in parallel, which mitigates the issue, but you'll still be sending 300 XHRs.
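The cost of awaiting each request in turn can be demonstrated with plain promises; fetchChunk below is a stand-in for a real ranged XHR, with an invented latency:

```javascript
const LATENCY_MS = 10; // invented per-request latency

// Stand-in for one ranged XHR.
function fetchChunk(i) {
  return new Promise(resolve => setTimeout(() => resolve(i), LATENCY_MS));
}

// One request at a time: total time grows linearly with the request count.
async function sequential(n) {
  const results = [];
  for (let i = 0; i < n; i++) {
    results.push(await fetchChunk(i));
  }
  return results;
}

// All requests in flight at once: total time is roughly one round-trip.
function parallel(n) {
  return Promise.all(Array.from({ length: n }, (_, i) => fetchChunk(i)));
}

(async () => {
  let t = Date.now();
  await sequential(30);
  console.log(`sequential: ~${Date.now() - t} ms`); // ~30 × LATENCY_MS

  t = Date.now();
  await parallel(30);
  console.log(`parallel: ~${Date.now() - t} ms`); // ~1 × LATENCY_MS
})();
```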

QPDF fixes this because when it rewrites the PDF the structure is

/Pages << /Kids [ all the /Page refs ] >>
/Page for page 0
/Page for page 1
etc (other /Page dicts)
Content for page 0 (big)
Content for page 1 (big)
etc (other content objects)

... and so all the /Page dicts are within a small number of chunks.
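The difference can be illustrated with a rough simulation of how many distinct chunks the /Page dictionaries touch in each layout (assuming pdf.js's default 64 KiB chunk size; the object sizes are invented for illustration):

```javascript
const CHUNK_SIZE = 65536;
const PAGE_DICT_SIZE = 200;  // a small /Page dictionary (invented size)
const CONTENT_SIZE = 300000; // a "big" content stream spanning several chunks

// Count the distinct chunks that must be fetched to read every /Page
// dictionary, given a linear layout of objects in the file.
function chunksForPageDicts(layout) {
  const chunks = new Set();
  let offset = 0;
  for (const obj of layout) {
    if (obj === "dict") {
      chunks.add(Math.floor(offset / CHUNK_SIZE));
      offset += PAGE_DICT_SIZE;
    } else {
      offset += CONTENT_SIZE;
    }
  }
  return chunks.size;
}

const pages = 300;
// Interleaved: each /Page dict is followed by its (big) content stream.
const interleaved = Array.from({ length: pages }, () => ["dict", "content"]).flat();
// Grouped (QPDF-style): all /Page dicts first, then all content streams.
const grouped = [...Array(pages).fill("dict"), ...Array(pages).fill("content")];

console.log(chunksForPageDicts(interleaved)); // 300 — one fetch per page
console.log(chunksForPageDicts(grouped));     // 1 — all dicts share a chunk
```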

Although there's an argument that such interleaved PDFs are 'bad', they seem to be quite common from digitisation. I will start a discussion about whether there are circumstances where we can trust a PDF enough to use the /Pages->/Kids as a random access lookup table.

Edit: #18637
