Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix incorrect processing of titles with question marks and anchor parameters #807

Conversation

Jaifroid
Copy link
Member

Fixes #806.

This initial fix needs very thorough testing.

@Jaifroid Jaifroid added the bug label Jan 22, 2022
@Jaifroid Jaifroid added this to the v3.3 milestone Jan 22, 2022
@Jaifroid Jaifroid self-assigned this Jan 22, 2022
@Jaifroid Jaifroid changed the title Fix incorrect processing of titles with question marks in SW mode Fix incorrect processing of titles with question marks and anchor parameters Jan 23, 2022
@Jaifroid
Copy link
Member Author

Jaifroid commented Jan 23, 2022

Now added processing of anchor parameters on links in JQuery mode.

Tested on many titles with question marks and quotation marks, both single and double.

Tested on links with anchor parameter (e.g. ../An_article#Heading_to_jump_to) in both SW and JQuery modes.

service-worker.js Outdated Show resolved Hide resolved
@Jaifroid Jaifroid requested a review from mossroy January 23, 2022 16:45
Copy link
Contributor

@mossroy mossroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Congratulations for this PR.
I know these issues can be quite difficult to handle.

But I would suggest to try to use the URL object https://developer.mozilla.org/en-US/docs/Web/API/URL instead of regular expressions where/when it's possible (for sure in SW mode, maybe in jQuery mode too).
It would certainly be safer, and handle all cases.
For IE11 (that does not support this object properly), we could fallback to the regexp. And never mind if some edge cases are not properly supported for IE11 (let's not waste time on it)

@Jaifroid
Copy link
Member Author

I've been experimenting a bit with the URL option. It's not as clever as we might like to think. We're supposed to feed it a bare URL string, and it will return a parsed and encoded URL according to RFC3986. So I tried our problematic URL, and we get:

image

It's reproduced the error we were trying to avoid, and it believes ?.38140.htm is a querystring (look at search property) rather than a question mark that needs to be encoded. Additionally, it has swallowed the relative part of the URL string ../. It's not possible to feed this function just the pathname without the protocol:

image

So we have to encode the URL, and then decode it:

image

NB This constructor fails with the entity references that our uiUtil.htmlEscapeChars() produced. But I've abandoned that as outdated (and this is a good example of why -- those codes need themselves to be encoded before putting them in a URL, or else the ampersand and the hash are misconstrued).

In sum, I can work with this constructor, though it is not doing anything more sophisticated than finding everything after a ? as a querystring, up to a # for an anchor, and we still have to take care not to encode accidentally any querystring or anchor.

@mossroy
Copy link
Contributor

mossroy commented Jan 25, 2022

I'm not surprised it's expecting an encoded URL (else it would have no way to see the difference between a question mark for the querystring and a question mark in the path/filename).
But it's what we already have in SW mode, I suppose.

Wouldn't be enough to do the following?

urlObject = new URL("http://localhost/" + url);
return urlObject.pathname;

(untested)

My concern is to make sure we don't miss edge cases like the one you found with encoded characters (ex: ' ). Are there any other ones we did not think of? Using the built-in object of the browser makes sure all are covered.

But if it's becoming too complicated, never mind.

NB : I don't mind if we keep the regexp for jQuery mode (for IE11 compatibility), and use URL object for SW mode

@Jaifroid
Copy link
Member Author

I agree, and I'm testing a solution along these lines.

@Jaifroid
Copy link
Member Author

OK, after many frustrating tests trying to create a universal function that could be used in the same way in SW mode and in jQuery mode, I came to the conclusion that the ONLY place that the URL() method can be used viably is in the Service Worker. And the reason we can use it there is only because the Service Worker provides us with a fully qualified URL ('http://domain/path/to/article.html'), which is what is needed for the URL method to work with unless we can provide a base URL.

In jQuery mode, we are working with relative URLs and ZIM URLs, and it is unreliable and complicated to construct artificial fully qualified URLs, since we are inside an iframe which has a fixed src . I wrote a ridiculously complicated function to work around the issues in calculating a full URL, or in using a dummy + base URL. Then I realized that this would introduce a lot of overhead in long documents with hundreds if not thousands of hyperlinks, as we would have to do calculations and create URLs for every single one of them, and in IE11 it would be worse using the workaround of creating a DOM (like in deriveZIMUrlFromRelativeUrl). I was introducing complexity where the regexes are in fact extremely simple and elegant and are fully road-tested.

So I now agree with you that the most elegant solution is to use the URL function in the Service Worker, which is in fact very simple, and to use our road-tested regexes in jQuery mode. This is what I've done in the latest.

I've finessed the stripping of or conserving of the querystring in the SW as well. I had been stripping it for the APP_:CACHE, but this also stripped them for ZIM assets, which is incorrect, given that a querystring on the end of a JS file can be significant. They are now being correctly conserved (there is a test case in StackExchange ZIMs that use MathJax, where a MathJax JS file requires a significant querystring and so it could not be found in the cache before).

@Jaifroid
Copy link
Member Author

Wouldn't be enough to do the following?

urlObject = new URL("http://localhost/" + url);
return urlObject.pathname;

(untested)

Just to explain why this doesn't work in jQuery mode (or with a dummy URL): because we lose the path information if we add the relative URL on to this dummy. It gets stripped back to a bare ZIM URL in the best scenario, and sometimes loses the namespace in the worst scenario.

We in fact take advantage of the relative URL calculations to derive the ZIM URL once we've clicked on a link (but the dummy is added to the base URL, not to the asset URL). However, it is onerous to do it for every link in an article.

In SW, we have event.request.url, which is a fully qualified string (not an object), so we don't need any trickery to get an accurate URL object from this using the URL method.

Copy link
Contributor

@mossroy mossroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!
And thanks for the unit test, too.

I did not test, but the code looks good

www/js/lib/uiUtil.js Outdated Show resolved Hide resolved
@mossroy
Copy link
Contributor

mossroy commented Jan 26, 2022

This looks good.
After merging this, maybe we could start the 3.3.0 release?

@Jaifroid
Copy link
Member Author

I just want to do some careful manual testing in both modes and and different browsers, and some inspecting of the caches produced in SW mode to be sure it's all working as intended, as the code dependency is a bit tricky, as we've seen from the number of errors we've had to correct over the years coming from non-obvious gotchas in URLs.. It shouldn't take too long. Then, yes, I agree we can continue with the checklist at #802. I'll also need to add some changelog entries for what this PR does. It fixes two or three things to do with URL parameters, including scrolling to URLs with anchors in jQuery mode.

service-worker.js Outdated Show resolved Hide resolved
@Jaifroid
Copy link
Member Author

Jaifroid commented Jan 28, 2022

My testing has detected an issue with some fonts that are being requested in jQuery mode that are not being detected as ZIM assets and so are being trapped as if they were requests for the app's own files. I need to sort this out before merging this. It's not hugely serious but is causing a lot of notified failures in console.debug.

UPDATE: This is something of an edge case. It concerns the PWA (not within a Firefox extension). If a user had been running it in SW mode, but has switched to JQuery mode, then the Service Worker continues to be available to serve the app's own code in case the app is offline (this is basic PWA functionality). If, additionally, the ZIM contains incorrectly referenced assets, or assets that we don't correctly intercept in JQuery moe (e.g. assets referenced inside CSS stylesheets, which we currently don't process), then the Service Worker will see these as Fetch requests, and because they are not prefixed with the ZIM name, it will pass them through and attempt to fetch them from the cache or the Network. The responses, of course, fail.

I've put a block on these failed responses being cached, by testing Response.ok before Cache.put(). However, I need to test whether it might be better to cache the failed response, to prevent an attempted Network request each time for these failed assets. Although this is an edge case, I think it's important to get it right as it future-proofs the app IMHO.

@Jaifroid
Copy link
Member Author

It is indeed better in terms of UX to cache a 404 response, as it prevents a slow Network request (slow to fail) when that asset is requested again. The symptom, when the app is offline, is that update of the UI is delayed and slow while the browser is waiting for Network. If we have cached the 404 response, then we retrieve the 404 much faster when the app is offline.

To be clear, this issue only affects the PWA if the user switches to jQuery mode, and the Service Worker tries to retrieve an unprocessed asset from the Network (unprocessed because we do not currently process ZIM assets in CSS files, for example).

Also, this will not affect the correct extraction of these assets in Service Worker mode, because in that mode the asset URL is prefixed with the ZIM archive's name. Caching 404s from the Network will also not affect the app's own code (APP_CACHE assets) because if one of those fails to fetch when the Service Worker is registered, then the caching of the entire app will throw and report an error (this is part of the PWA spec).

So, the changes in d88b523 appear safe to me. But let me know if you have concerns. And let me know if it's OK to merge (pending final tests).

@Jaifroid
Copy link
Member Author

Apologies, @mossroy, but I'm getting cold feet about caching 404 responses to work around an edge case. I'm worried about unintended consequences, and I note that the Cache API specification bars 404 responses from being cached when the Cache.add() method is used (we use Cache.put() because it is more permissive, as we are constructing artificial Responses).

It just feels wrong and is also out of scope of this PR, so I'm going to revert d88b523.

@Jaifroid Jaifroid force-pushed the Fix-incorrect-processing-of-titles-with-question-marks-in-SW-mode branch from eb1a119 to e4503f0 Compare January 31, 2022 18:22
@Jaifroid
Copy link
Member Author

Jaifroid commented Jan 31, 2022

Interesting error exposed by the rebase of this PR. We are not dealing correctly with a colon in the domain + port, at least in jQuery mode. It might explain #790. Or could be an artefact introduced by rebase.

image

@Jaifroid
Copy link
Member Author

OK, this turned out to be nothing. It's just the way Chromium displays the failed network requests for these fonts, which are not processed by us at all.

So, I've rebased this on master and have tidied up a bit, and as far as I'm concerned, it's ready to squash and merge. Tested multiple times on Edge Chromium, Firefox and IE11, and as a Chromium extension, both for titles with question marks in and for hash anchors, which with this PR now work in both jQuery and SW modes.

@mossroy Let me know if you're happy for me to do squash/merge.

Copy link
Contributor

@mossroy mossroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can merge

@Jaifroid Jaifroid merged commit b143e83 into master Jan 31, 2022
@Jaifroid Jaifroid deleted the Fix-incorrect-processing-of-titles-with-question-marks-in-SW-mode branch January 31, 2022 21:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Title with question mark not handled correctly in Service Worker mode
2 participants