
Get "unnecessary" range on first page #14570

Closed
liu-dongyu opened this issue Feb 16, 2022 · 6 comments

Comments


liu-dongyu commented Feb 16, 2022

Attach (recommended) or Link to PDF file here:
https://github.com/liu-dongyu/demo/blob/main/pdf/test.pdf

Configuration:

  • Web browser and its version: Chrome 98.0.4758.80
  • Operating system and its version: macOS 10.15.5
  • PDF.js version: v2.12.313 and v2.3.200
  • Is a browser extension: no

Steps to reproduce the problem:

  1. Use an online PDF URL that supports range requests
  2. Pass disableAutoFetch: true, disableStream: true, rangeChunkSize: 65536 * 16 to getDocument
  3. Call getPage(1) and render the page

What is the expected behavior? (add screenshot)
With v2.12.313, the first page needs 60 range requests (online demo)

source code

<script src="https://cdn.bootcdn.net/ajax/libs/pdf.js/2.12.313/pdf.min.js"></script>
pdfjsLib.GlobalWorkerOptions.workerSrc = "https://cdn.bootcdn.net/ajax/libs/pdf.js/2.12.313/pdf.worker.min.js";

const loadingTask = pdfjsLib.getDocument({
  url: "xx",
  disableAutoFetch: true,
  disableStream: true,
  rangeChunkSize: 65536 * 16,
});

(async function () {
  const pdfDocument = await loadingTask.promise;
  const page = await pdfDocument.getPage(1);

  page.render({
    canvasContext: document
      .getElementById("pageContainer")
      .getContext("2d"),
    viewport: page.getViewport({ scale: 1 }),
  });
})();

With v2.3.200, the first page needs only 3 range requests, which is the expected number (online demo)

source code

<script src="https://cdn.bootcdn.net/ajax/libs/pdf.js/2.3.200/pdf.min.js"></script>
pdfjsLib.GlobalWorkerOptions.workerSrc = "https://cdn.bootcdn.net/ajax/libs/pdf.js/2.3.200/pdf.worker.min.js";

const loadingTask = pdfjsLib.getDocument({
  url: "xx",
  disableAutoFetch: true,
  disableStream: true,
  rangeChunkSize: 65536 * 16,
});

(async function () {
  const pdfDocument = await loadingTask.promise;
  const page = await pdfDocument.getPage(1);

  page.render({
    canvasContext: document
      .getElementById("pageContainer")
      .getContext("2d"),
    viewport: page.getViewport({ scale: 1 }),
  });
})();

What went wrong? (add screenshot)
Why does v2.12.313 need so many chunks to render the first page, and is it possible to optimize this?

@Snuffleupagus (Collaborator)

set disableAutoFetch: true, disableStream: true, rangeChunkSize: 65536 * 16 to getDocument

Please note that disabling of streaming will generally lead to worse performance, compared to the default values.
Furthermore, note that using such a large chunk-size may also affect things negatively.

Why does v2.12.313 need so many chunks to render the first page, and is it possible to optimize it?

Unfortunately this is the expected result of PRs such as #14311, #14335, #14358, and #14411, which were necessary to avoid serious problems with corrupt PDF documents.
Note in particular #14311 (comment), which contains the following passage where the second point applies to your PDF document:

Unfortunately these changes will have a number of somewhat negative side-effects, please see a possibly incomplete list below, however I cannot see a better way to address this bug.

  • This will slow down initial loading/rendering of all documents, at least by some amount, since we now need to fetch/parse more of the /Pages-tree in order to be able to access the last page of the PDF documents.
  • For poorly generated PDF documents, where the entire /Pages-tree only has one level, we'll unfortunately need to fetch/parse the entire /Pages-tree to get to the last page. While there's a cache to help reduce repeated data lookups, this will affect initial loading/rendering of some long PDF documents.
  • This will affect the disableAutoFetch = true mode negatively, since we now need to fetch/parse more data during document initialization. While the disableAutoFetch = true mode should still be helpful in larger/longer PDF documents, for smaller ones the effect/usefulness may unfortunately be lost.

Closing as INVALID, since we cannot "fix" this without breaking other PDF documents, and this issue is specific to badly generated PDF documents.

@liu-dongyu (Author)

liu-dongyu commented Feb 18, 2022

@Snuffleupagus I already know that pdf.js needs checkLastPage during document initialization to ensure that numPages is correct.

pdf.js/src/core/worker.js

Lines 177 to 179 in c37d785

// Check that the last page can be successfully loaded, to ensure that
// `numPages` is correct, and fallback to walking the entire /Pages-tree.
await pdfManager.ensureDoc("checkLastPage", [recoveryMode]);

To improve initialization performance: if I can obtain the correct numPages from somewhere else, such as a custom backend request, can I just run catalog.setActualNumPages(correctNumPage); and then return directly?

pdf.js/src/core/document.js

Lines 1362 to 1366 in 263c895

async checkLastPage(recoveryMode = false) {
  const { catalog, pdfManager } = this;
  catalog.setActualNumPages(); // Ensure that it's always reset.
  let numPages;

becomes 👇
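The snippet that followed was collapsed in the thread; below is a minimal, self-contained sketch of the idea being proposed, assuming a hypothetical trustedNumPages value supplied by the application. This is not actual pdf.js code:

```javascript
// Simplified stand-in for pdf.js's Catalog, just enough to show the idea.
class Catalog {
  setActualNumPages(numPages = null) {
    this._actualNumPages = numPages;
  }
  get numPages() {
    return this._actualNumPages;
  }
}

// Hypothetical variant of checkLastPage: if the application already knows
// the correct page count (e.g. from its own backend), record it and return
// early instead of fetching/parsing the last page or the whole /Pages-tree.
async function checkLastPage(catalog, trustedNumPages, recoveryMode = false) {
  catalog.setActualNumPages(); // Ensure that it's always reset.
  if (Number.isInteger(trustedNumPages) && trustedNumPages > 0) {
    catalog.setActualNumPages(trustedNumPages);
    return; // Skip the expensive last-page validation entirely.
  }
  // ...otherwise fall through to the normal validation, which loads the
  // last page and, on failure, walks the entire /Pages-tree.
}
```

Note that skipping the check trades safety for speed: a wrong trusted count would surface later as rendering errors instead of being caught at initialization.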

@legistek

legistek commented Mar 17, 2023

Having the same issue with a 2000+ page PDF that is definitely linearized. I'm also in the same position as @liu-dongyu, in that we already do a server-side integrity check. Removing the line

await pdfManager.ensureDoc("checkLastPage", [recoveryMode]);

solves the problem for the first page, but not for subsequent pages. For example, loading page 1,000 requires loading everything prior to page 1,000.

I think it's hard to argue that this is not a bug worth addressing. Most people would prefer that all linearized PDFs load pages on demand, which is the point of linearizing, rather than have foolproof error checking to deal with corrupt PDFs. Couldn't we be given the option to skip this integrity check as one of the getDocument parameters?


@GStPierreMT

GStPierreMT commented Jul 27, 2023

Leaving a comment for those who end up here from a search or a link on Stack Overflow.

We also had this issue with PDFs that we were creating and linearizing with QPDF.

We found that if we force min-version=1.6 while creating the document, the document loads as expected in PDF.js with the settings disableAutoFetch: true and disableStream: true.

The test.pdf in the original post is at PDF version 1.3. I downloaded it and used QPDF to transform it to version 1.6:
qpdf --min-version=1.6 test.pdf test1_6.pdf

That document renders the first page after 3 range requests.

For anyone looking at this I believe you should still consider Snuffleupagus' comment:

Please note that disabling of streaming will generally lead to worse performance, compared to the default values.
Furthermore, note that using such a large chunk-size may also affect things negatively.

If the user is likely to read/scroll through most of the PDF, or perform a search, then streaming mode is probably better.

Edit: My testing performed with Windows 11, PDF.js 3.7.107, Chrome 115.0.5790.110, Apache 2.4 webserver

@richard-smith-preservica

richard-smith-preservica commented Aug 21, 2024

Leaving a comment here as well, as I have a similar issue.

Rewriting problem PDFs with QPDF does help, but it's not related to the PDF version. It's because QPDF restructures the file.

A summary of the cause of the problem:

  • pdf.js loads the "last" page (i.e. page "count - 1") to validate that the PDF is at least somewhat correct
  • because of how /Pages dicts' children work, finding information about page N requires traversing the /Pages tree for all previous pages. In the case of a PDF with all the pages in a flat structure this means reading every /Page dictionary for pages 0-N (i.e., for the last page, every /Page dictionary)
  • problem PDFs (at least the ones I have problems with) interleave /Page dictionaries and content, i.e. the PDF looks like this

/Pages << /Kids [ all the /Page refs ] >>
/Page for page 0
Content for page 0 (big)
/Page for page 1
Content for page 1 (big)
etc

  • so every Page dictionary is in a different chunk and needs to be fetched separately.

In addition, pdf.js awaits each of those requests in turn, so for a 300-page document you are waiting for 300 XHRs one after the other. I recently had a PR accepted (#18627) that sends the XHRs in parallel, which mitigates the issue, but you'll still be sending 300 XHRs.
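The cost of awaiting each request in turn can be demonstrated with plain promises; fetchChunk below is a stand-in for a real ranged XHR, with an invented latency:

```javascript
const LATENCY_MS = 10; // invented per-request latency

// Stand-in for one ranged XHR.
function fetchChunk(i) {
  return new Promise(resolve => setTimeout(() => resolve(i), LATENCY_MS));
}

// One request at a time: total time grows linearly with the request count.
async function sequential(n) {
  const results = [];
  for (let i = 0; i < n; i++) {
    results.push(await fetchChunk(i));
  }
  return results;
}

// All requests in flight at once: total time is roughly one round-trip.
function parallel(n) {
  return Promise.all(Array.from({ length: n }, (_, i) => fetchChunk(i)));
}

(async () => {
  let t = Date.now();
  await sequential(30);
  console.log(`sequential: ~${Date.now() - t} ms`); // ~30 × LATENCY_MS

  t = Date.now();
  await parallel(30);
  console.log(`parallel: ~${Date.now() - t} ms`); // ~1 × LATENCY_MS
})();
```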

QPDF fixes this because when it rewrites the PDF the structure is

/Pages << /Kids [ all the /Page refs ] >>
/Page for page 0
/Page for page 1
etc (other /Page dicts)
Content for page 0 (big)
Content for page 1 (big)
etc (other content objects)

... and so all the /Page dicts are within a small number of chunks.
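The difference can be illustrated with a rough simulation of how many distinct chunks the /Page dictionaries touch in each layout (assuming pdf.js's default 64 KiB chunk size; the object sizes are invented for illustration):

```javascript
const CHUNK_SIZE = 65536;
const PAGE_DICT_SIZE = 200;  // a small /Page dictionary (invented size)
const CONTENT_SIZE = 300000; // a "big" content stream spanning several chunks

// Count the distinct chunks that must be fetched to read every /Page
// dictionary, given a linear layout of objects in the file.
function chunksForPageDicts(layout) {
  const chunks = new Set();
  let offset = 0;
  for (const obj of layout) {
    if (obj === "dict") {
      chunks.add(Math.floor(offset / CHUNK_SIZE));
      offset += PAGE_DICT_SIZE;
    } else {
      offset += CONTENT_SIZE;
    }
  }
  return chunks.size;
}

const pages = 300;
// Interleaved: each /Page dict is followed by its (big) content stream.
const interleaved = Array.from({ length: pages }, () => ["dict", "content"]).flat();
// Grouped (QPDF-style): all /Page dicts first, then all content streams.
const grouped = [...Array(pages).fill("dict"), ...Array(pages).fill("content")];

console.log(chunksForPageDicts(interleaved)); // 300 — one fetch per page
console.log(chunksForPageDicts(grouped));     // 1 — all dicts share a chunk
```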

Although there's an argument that such interleaved PDFs are 'bad', they seem to be quite common from digitisation. I will start a discussion about whether there are circumstances where we can trust a PDF enough to use the /Pages->/Kids as a random access lookup table.

Edit: #18637
