Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Browser caching does not work for the first range request on Chrome and Safari #11624

Closed
alexandre-sarfati opened this issue Feb 24, 2020 · 11 comments

Comments

@alexandre-sarfati
Copy link

alexandre-sarfati commented Feb 24, 2020

Dear community,

When one activates range requests with disableAutoFetch: true and disableStream: true and http server configured to cache contents with public, max-age=31536000, immutable, browser caching does work on Chrome, nor on Safari.

But the first range request return a code 200 and is not cached by the browser, until all the ranges has been download.

Attach (recommended) or Link to PDF file here: not file specific, but one can use https://public.fays.io/public/580af700-5f79-44c4-8a0c-f7e976ee528f.pdf

The server configuration is:

Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, HEAD, PUT, POST
Access-Control-Expose-Headers: Accept-Ranges, Content-Range, Content-Length, x-amz-request-id, x-amz-id-2, ETag
Access-Control-Max-Age: 31536000
Vary: Origin, Access-Control-Request-Headers, Access-Control-Request-Method
Cache-Control: public, max-age=31536000, immutable

Configuration:

Web browser and its version: Chrome Version 80.0.3987.116. Firefox does not yet cache range requests, so it is out of scope.
Operating system and its version: mac os x 10.15.3 
PDF.js version: 2.2.228
Is a browser extension: no

Steps to reproduce the problem:

  1. Clear the browser cache
  2. Go to https://app.fays.io/bf205bf6-43d2-4140-8140-ab4b6f48c2a5 and open the developper tool. This viewer should work like the integrated viewer of PDF.JS
  3. Refresh the tab

What is the expected behavior?
The expected behavior is that the browser caches all the HTTP requests, including the first one with response code 200.

What went wrong?
After step 3., one can see in the developer tool that the range requests have been cached but not the first request.
Capture d’écran 2020-02-24 à 13 21 15

The first request has specific header for CORS:

:authority: public.fays.io
:method: GET
:path: /public/2ce2766d-b491-43a0-a0a9-96c56c2bd667.pdf
:scheme: https
accept: */*
accept-encoding: gzip, deflate, br
accept-language: fr-FR,fr;q=0.9,en-US;q=0.8,en;q=0.7,de;q=0.6
if-modified-since: Sun, 23 Feb 2020 23:43:01 GMT
if-none-match: "dd09a498518c55f91fe48b29bec2ffde"
origin: https://app.fays.io
range: bytes=0-196607
referer: https://app.fays.io/bf205bf6-43d2-4140-8140-ab4b6f48c2a5
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: same-site
user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36

The request is not cached by the browser since PDF.JS cancels the first request.

if (!this._isStreamingSupported && this._isRangeSupported) {
this.cancel(new AbortException("Streaming is disabled."));
}

**Link to a viewer (if hosted on a site other than mozilla.github.io/pdf.js or as Firefox/Chrome extension): **
https://app.fays.io/bf205bf6-43d2-4140-8140-ab4b6f48c2a5

Best, A.

@Snuffleupagus
Copy link
Collaborator

The request is not cached by the browser since PDF.JS cancels the first request ?

Well, if you set disableStream = true then you're explicitly telling the PDF.js code to abort the initial (200) request and instead fetch the PDF document using range (206) requests. Quite frankly, if the initial request wasn't aborted there'd be no point in actually having the disableStream option.

Still, when the browser has downloaded all the ranges (step 4.), then the request is finally cached On Chrome (but not on Safari):

It seems that, again, your problem is related to the differences in how various browsers have chosen to implement caching of network requests.


All-in-all, I'm afraid that it does not seem to be much (if anything) actionable here from a PDF.js perspective.

@alexandre-sarfati
Copy link
Author

alexandre-sarfati commented Feb 24, 2020

Well, if you set disableStream = true then you're explicitly telling the PDF.js code to abort the initial (200) request and instead fetch the PDF document using range (206) requests. Quite frankly, if the initial request wasn't aborted there'd be no point in actually having the disableStream option.

Yes this is the expected behavior of disableStream in this scenario. I am wondering if this is possible to replace the first request by another type of request.

It seems that, again, your problem is related to the differences in how various browsers have chosen to implement caching of network requests.

Yes

@Snuffleupagus
Copy link
Collaborator

Snuffleupagus commented Feb 24, 2020

I am wondering if this is possible to replace the first request by another type of request.

First of all, you need to dispatch a regular (200) request to check if range requests are actually supported and correctly implemented by the server, see

const {
allowRangeRequests,
suggestedLength,
} = validateRangeRequestCapabilities({
getResponseHeader,
isHttp: this._stream.isHttp,
rangeChunkSize: this._rangeChunkSize,
disableRange: this._disableRange,
});
this._isRangeSupported = allowRangeRequests;
, or
const {
allowRangeRequests,
suggestedLength,
} = validateRangeRequestCapabilities({
getResponseHeader,
isHttp: this._manager.isHttp,
rangeChunkSize: this._rangeChunkSize,
disableRange: this._disableRange,
});
if (allowRangeRequests) {
this._isRangeSupported = true;
}
, and also
function validateRangeRequestCapabilities({
getResponseHeader,
isHttp,
rangeChunkSize,
disableRange,
}) {
assert(rangeChunkSize > 0, "Range chunk size must be larger than zero");
const returnValues = {
allowRangeRequests: false,
suggestedLength: undefined,
};
const length = parseInt(getResponseHeader("Content-Length"), 10);
if (!Number.isInteger(length)) {
return returnValues;
}
returnValues.suggestedLength = length;
if (length <= 2 * rangeChunkSize) {
// The file size is smaller than the size of two chunks, so it does not
// make any sense to abort the request and retry with a range request.
return returnValues;
}
if (disableRange || !isHttp) {
return returnValues;
}
if (getResponseHeader("Accept-Ranges") !== "bytes") {
return returnValues;
}
const contentEncoding = getResponseHeader("Content-Encoding") || "identity";
if (contentEncoding !== "identity") {
return returnValues;
}
returnValues.allowRangeRequests = true;
return returnValues;
}

Second of all, in the default case where streaming is used the initial (200) request is simply allowed to continue. Not only is that the most efficient solution, but it also avoids errors with servers that don't allow a PDF document to be fetched more than once.

@alexandre-sarfati
Copy link
Author

alexandre-sarfati commented Feb 24, 2020

I found that the following is not mandatory on my computer. I skip this part with an if:

fetch(
url,
createFetchOptions(
this._headers,
this._withCredentials,
this._abortController
)
)
.then(response => {
if (!validateResponseStatus(response.status)) {
throw createResponseStatusError(response.status, url);
}
this._reader = response.body.getReader();
this._headersCapability.resolve();
const getResponseHeader = name => {
return response.headers.get(name);
};
const {
allowRangeRequests,
suggestedLength,
} = validateRangeRequestCapabilities({
getResponseHeader,
isHttp: this._stream.isHttp,
rangeChunkSize: this._rangeChunkSize,
disableRange: this._disableRange,
});
this._isRangeSupported = allowRangeRequests;
// Setting right content length.
this._contentLength = suggestedLength || this._contentLength;
this._filename = extractFilenameFromHeader(getResponseHeader);
// We need to stop reading when range is supported and streaming is
// disabled.
if (!this._isStreamingSupported && this._isRangeSupported) {
this.cancel(new AbortException("Streaming is disabled."));
}
})
.catch(this._headersCapability.reject);

And an else condition this._headersCapability.resolve();
It works if the parameter "length" (e.g. { url: URL, length: 1000 }) is set to the contentLength of the file.

@Snuffleupagus
Copy link
Collaborator

Snuffleupagus commented Feb 24, 2020

I don't really understand what #11624 (comment) even suggests removing/skipping, but it's obviously not appropriate to simply remove a fetch call (assuming that's what was actually suggested).

Furthermore, as was clearly mentioned in #11624 (comment) and is hopefully evident if you carefully read through the validateRangeRequestCapabilities function, there's a fair amount of validation of the server response that needs to happen before range requests can actually be used.

I found that the following is not mandatory on my computer.

It's hopefully easy to see why a single example doesn't really mean anything in general, and why you cannot assume that random servers on the internet are always correctly configured.

And an else condition this._headersCapability.resolve();

It's not really clear what you're suggesting here, but it's obviously not a good idea to purposely add options that would allow users to essentially "shot themselves in the foot" by disabling validation which is necessary for the correct function of the PDF.js library.


All-in-all, I'm really not seeing anything that can be fixed here (from a PDF.js perspective) and would thus suggest that the issue should be closed.

@alexandre-sarfati
Copy link
Author

@Snuffleupagus I totally agree with you. I am looking for a specific solution for a specific server setting to cache this preflight 200 and be able to load it offline!

@Snuffleupagus
Copy link
Collaborator

Snuffleupagus commented Feb 24, 2020

I am looking for a specific solution for a specific server setting to cache this preflight 200 and be able to load it offline!

Given that caching of range requests are not well supported across browsers, have you instead considered not depending so heavily on that and simply load the PDF document with default values for the relevant disable... options.
With modern browsers (i.e. those supporting the Fetch API), that would thus imply that most (if not all) of the PDF document is fetched with the initial (200) request, which browsers should have no trouble caching. Note also that, generally speaking, streaming is usually faster than range requests too.

@alexandre-sarfati
Copy link
Author

alexandre-sarfati commented Feb 24, 2020

@Snuffleupagus In fact, I have successfully tested offline with #11624 (comment) for the following document and range request activated: https://public.fays.io/public/2ce2766d-b491-43a0-a0a9-96c56c2bd667.pdf 💯

I try with the PDF.JS integrated viewer. When one turns off Wifi and refreshes the tab 10 times it still loads the document and one may consult the previously requested pages. 🥇

Still, I did not yet manage to make it works when one shutdowns the browser. 👎 One may try to use a production environment and see what happens with the ServiceWorker.

Of course, as you mentionned, this is not a general solution at all. Just a try. This feature can not be defaulted for every PDF.JS users.

But I find out that the first request (code 200) is not mandatory because the preflight CORS check is also performed by the browser if the request has a "range" header (for all browser, this is in the specs, who knows ?).

As you said:

if you carefully read through the validateRangeRequestCapabilities function, there's a fair amount of validation of the server response that needs to happen before range requests can actually be used.

The first request is mandatory when one does not know if the server accept ranges and to get the size of the document, this is specified on validateRangeRequestCapabilities

Still, the first request is an option if the user can ensure by itself that validateRangeRequestCapabilities is always true.
For example, one can access the size using an alternative strategy (not related to PDF.JS) and if one wants to force range requests and set disableStream and disableAutoFetch.

But all of this requires more investigation 🔢

💯 Therefore, it will be nice to have a way to override this part easily outside PDF.JS but I don't know how to do it properly. Can one extends PDFFetchStreamReader ? 💯

have you instead considered not depending so heavily on that and simply load the PDF document with default values for the relevant disable... options.

I confirm that offline already works when disableStream and disableAutoFetch are not set. 👍

Given that caching of range requests are not well supported across browsers

In fact, it works well for range requests on Safari, Chrome, Opera (including mobile?), I have not tested it on Edge or any alternatives and they plan to implement the feature on Firefox.

@Snuffleupagus
Copy link
Collaborator

Still, the first request is an option if the user can ensure by itself that validateRangeRequestCapabilities is always true.

You want to trust the server, rather than the user here[1], to avoid a support nightmare later on.
As was clearly mentioned in #11624 (comment) we definitely do not want any foot guns in the options here, hence that won't happen!

Besides, it's really not at all correct to make the changes you're suggesting since it'd make PDFFetchStreamReader unable to read any data (which is bound to break lots of other things).

Therefore, it will be nice to have a way to override this part easily outside PDF.JS but I don't know how to do it properly. Can one extends PDFFetchStreamReader?

No, that's considered internal functionality and it's consequently not exposed in the public API. (You'd need to fork the code, but as I've tried to explain multiple times the changes you're suggesting aren't really a good idea.)

The "correct" way of handling special cases, with regards to fetching of data, would be to utilize PDFDataRangeTransport instead. That allows completely custom data delivery, that you thus can implement in what ever way you want/need in your case (it's being used in the PDF Viewer that's built-in to the Firefox browser); see #11453 (comment) for some relevant links.

This issue really ought to remain closed. /cc @timvandermeij


[1] Previous experiences have shown that users may, more often than you'd like, set certain options without always fully understanding the ramifications of doing so.

@alexandre-sarfati
Copy link
Author

alexandre-sarfati commented Feb 25, 2020

But all of this requires more investigation 🔢

The "correct" way of handling special cases, with regards to fetching of data, would be to utilize PDFDataRangeTransport instead. That allows completely custom data delivery, that you thus can implement in what ever way you want/need in your case (it's being used in the PDF Viewer that's built-in to the Firefox browser); see #11453 (comment) for some relevant links.

Thank you for this @Snuffleupagus
Can one extends PDFDataRangeTransport in Typescript ?

@timvandermeij
Copy link
Contributor

timvandermeij commented Feb 25, 2020

Yes, it's part of the public API; see also

* @param {string|TypedArray|DocumentInitParameters|PDFDataRangeTransport} src

Closing as answered since there is not really anything we can/should do here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants