Fix incorrect processing of titles with question marks and anchor parameters #807

Jaifroid · 2022-01-22T23:43:46Z

Fixes #806.

This initial fix needs very thorough testing.

Jaifroid · 2022-01-23T09:58:33Z

Now added processing of anchor parameters on links in JQuery mode.

Tested on many titles with question marks and quotation marks, both single and double.

Tested on links with anchor parameter (e.g. ../An_article#Heading_to_jump_to) in both SW and JQuery modes.

service-worker.js

mossroy

Congratulations for this PR.
I know these issues can be quite difficult to handle.

But I would suggest to try to use the URL object https://developer.mozilla.org/en-US/docs/Web/API/URL instead of regular expressions where/when it's possible (for sure in SW mode, maybe in jQuery mode too).
It would certainly be safer, and handle all cases.
For IE11 (that does not support this object properly), we could fallback to the regexp. And never mind if some edge cases are not properly supported for IE11 (let's not waste time on it)

Jaifroid · 2022-01-24T21:02:30Z

I've been experimenting a bit with the URL option. It's not as clever as we might like to think. We're supposed to feed it a bare URL string, and it will return a parsed and encoded URL according to RFC3986. So I tried our problematic URL, and we get:

It's reproduced the error we were trying to avoid, and it believes ?.38140.htm is a querystring (look at search property) rather than a question mark that needs to be encoded. Additionally, it has swallowed the relative part of the URL string ../. It's not possible to feed this function just the pathname without the protocol:

So we have to encode the URL, and then decode it:

NB This constructor fails with the entity references that our uiUtil.htmlEscapeChars() produced. But I've abandoned that as outdated (and this is a good example of why -- those codes need themselves to be encoded before putting them in a URL, or else the ampersand and the hash are misconstrued).

In sum, I can work with this constructor, though it is not doing anything more sophisticated than finding everything after a ? as a querystring, up to a # for an anchor, and we still have to take care not to encode accidentally any querystring or anchor.

mossroy · 2022-01-25T15:43:31Z

I'm not surprised it's expecting an encoded URL (else it would have no way to see the difference between a question mark for the querystring and a question mark in the path/filename).
But it's what we already have in SW mode, I suppose.

Wouldn't be enough to do the following?

urlObject = new URL("http://localhost/" + url);
return urlObject.pathname;

(untested)

My concern is to make sure we don't miss edge cases like the one you found with encoded characters (ex: ' ). Are there any other ones we did not think of? Using the built-in object of the browser makes sure all are covered.

But if it's becoming too complicated, never mind.

NB : I don't mind if we keep the regexp for jQuery mode (for IE11 compatibility), and use URL object for SW mode

Jaifroid · 2022-01-25T15:55:31Z

I agree, and I'm testing a solution along these lines.

Jaifroid · 2022-01-26T00:00:16Z

OK, after many frustrating tests trying to create a universal function that could be used in the same way in SW mode and in jQuery mode, I came to the conclusion that the ONLY place that the URL() method can be used viably is in the Service Worker. And the reason we can use it there is only because the Service Worker provides us with a fully qualified URL ('http://domain/path/to/article.html'), which is what is needed for the URL method to work with unless we can provide a base URL.

In jQuery mode, we are working with relative URLs and ZIM URLs, and it is unreliable and complicated to construct artificial fully qualified URLs, since we are inside an iframe which has a fixed src . I wrote a ridiculously complicated function to work around the issues in calculating a full URL, or in using a dummy + base URL. Then I realized that this would introduce a lot of overhead in long documents with hundreds if not thousands of hyperlinks, as we would have to do calculations and create URLs for every single one of them, and in IE11 it would be worse using the workaround of creating a DOM (like in deriveZIMUrlFromRelativeUrl). I was introducing complexity where the regexes are in fact extremely simple and elegant and are fully road-tested.

So I now agree with you that the most elegant solution is to use the URL function in the Service Worker, which is in fact very simple, and to use our road-tested regexes in jQuery mode. This is what I've done in the latest.

I've finessed the stripping of or conserving of the querystring in the SW as well. I had been stripping it for the APP_:CACHE, but this also stripped them for ZIM assets, which is incorrect, given that a querystring on the end of a JS file can be significant. They are now being correctly conserved (there is a test case in StackExchange ZIMs that use MathJax, where a MathJax JS file requires a significant querystring and so it could not be found in the cache before).

Jaifroid · 2022-01-26T00:12:44Z

Wouldn't be enough to do the following?
urlObject = new URL("http://localhost/" + url);
return urlObject.pathname;
(untested)

Just to explain why this doesn't work in jQuery mode (or with a dummy URL): because we lose the path information if we add the relative URL on to this dummy. It gets stripped back to a bare ZIM URL in the best scenario, and sometimes loses the namespace in the worst scenario.

We in fact take advantage of the relative URL calculations to derive the ZIM URL once we've clicked on a link (but the dummy is added to the base URL, not to the asset URL). However, it is onerous to do it for every link in an article.

In SW, we have event.request.url, which is a fully qualified string (not an object), so we don't need any trickery to get an accurate URL object from this using the URL method.

mossroy

Great!
And thanks for the unit test, too.

I did not test, but the code looks good

www/js/lib/uiUtil.js

mossroy · 2022-01-26T17:22:45Z

This looks good.
After merging this, maybe we could start the 3.3.0 release?

Jaifroid · 2022-01-26T18:18:13Z

I just want to do some careful manual testing in both modes and and different browsers, and some inspecting of the caches produced in SW mode to be sure it's all working as intended, as the code dependency is a bit tricky, as we've seen from the number of errors we've had to correct over the years coming from non-obvious gotchas in URLs.. It shouldn't take too long. Then, yes, I agree we can continue with the checklist at #802. I'll also need to add some changelog entries for what this PR does. It fixes two or three things to do with URL parameters, including scrolling to URLs with anchors in jQuery mode.

service-worker.js

Jaifroid · 2022-01-28T08:36:20Z

My testing has detected an issue with some fonts that are being requested in jQuery mode that are not being detected as ZIM assets and so are being trapped as if they were requests for the app's own files. I need to sort this out before merging this. It's not hugely serious but is causing a lot of notified failures in console.debug.

UPDATE: This is something of an edge case. It concerns the PWA (not within a Firefox extension). If a user had been running it in SW mode, but has switched to JQuery mode, then the Service Worker continues to be available to serve the app's own code in case the app is offline (this is basic PWA functionality). If, additionally, the ZIM contains incorrectly referenced assets, or assets that we don't correctly intercept in JQuery moe (e.g. assets referenced inside CSS stylesheets, which we currently don't process), then the Service Worker will see these as Fetch requests, and because they are not prefixed with the ZIM name, it will pass them through and attempt to fetch them from the cache or the Network. The responses, of course, fail.

I've put a block on these failed responses being cached, by testing Response.ok before Cache.put(). However, I need to test whether it might be better to cache the failed response, to prevent an attempted Network request each time for these failed assets. Although this is an edge case, I think it's important to get it right as it future-proofs the app IMHO.

Jaifroid · 2022-01-29T08:28:14Z

It is indeed better in terms of UX to cache a 404 response, as it prevents a slow Network request (slow to fail) when that asset is requested again. The symptom, when the app is offline, is that update of the UI is delayed and slow while the browser is waiting for Network. If we have cached the 404 response, then we retrieve the 404 much faster when the app is offline.

To be clear, this issue only affects the PWA if the user switches to jQuery mode, and the Service Worker tries to retrieve an unprocessed asset from the Network (unprocessed because we do not currently process ZIM assets in CSS files, for example).

Also, this will not affect the correct extraction of these assets in Service Worker mode, because in that mode the asset URL is prefixed with the ZIM archive's name. Caching 404s from the Network will also not affect the app's own code (APP_CACHE assets) because if one of those fails to fetch when the Service Worker is registered, then the caching of the entire app will throw and report an error (this is part of the PWA spec).

So, the changes in d88b523 appear safe to me. But let me know if you have concerns. And let me know if it's OK to merge (pending final tests).

Jaifroid · 2022-01-29T10:23:04Z

Apologies, @mossroy, but I'm getting cold feet about caching 404 responses to work around an edge case. I'm worried about unintended consequences, and I note that the Cache API specification bars 404 responses from being cached when the Cache.add() method is used (we use Cache.put() because it is more permissive, as we are constructing artificial Responses).

It just feels wrong and is also out of scope of this PR, so I'm going to revert d88b523.

Jaifroid · 2022-01-31T18:25:45Z

Interesting error exposed by the rebase of this PR. We are not dealing correctly with a colon in the domain + port, at least in jQuery mode. It might explain #790. Or could be an artefact introduced by rebase.

Jaifroid · 2022-01-31T20:38:01Z

OK, this turned out to be nothing. It's just the way Chromium displays the failed network requests for these fonts, which are not processed by us at all.

So, I've rebased this on master and have tidied up a bit, and as far as I'm concerned, it's ready to squash and merge. Tested multiple times on Edge Chromium, Firefox and IE11, and as a Chromium extension, both for titles with question marks in and for hash anchors, which with this PR now work in both jQuery and SW modes.

@mossroy Let me know if you're happy for me to do squash/merge.

mossroy

You can merge

Jaifroid added the bug label Jan 22, 2022

Jaifroid added this to the v3.3 milestone Jan 22, 2022

Jaifroid self-assigned this Jan 22, 2022

Jaifroid changed the title ~~Fix incorrect processing of titles with question marks in SW mode~~ Fix incorrect processing of titles with question marks and anchor parameters Jan 23, 2022

Jaifroid commented Jan 23, 2022

View reviewed changes

service-worker.js Outdated Show resolved Hide resolved

Jaifroid requested a review from mossroy January 23, 2022 16:45

mossroy reviewed Jan 24, 2022

View reviewed changes

mossroy approved these changes Jan 26, 2022

View reviewed changes

www/js/lib/uiUtil.js Outdated Show resolved Hide resolved

Jaifroid mentioned this pull request Jan 26, 2022

Articles containing "?" cannot be opened kiwix/kiwix-android#2671

Closed

Jaifroid commented Jan 27, 2022

View reviewed changes

service-worker.js Outdated Show resolved Hide resolved

Jaifroid mentioned this pull request Jan 29, 2022

Add downloadable fonts (from ZIM) to files processed in jQuery mode #790

Open

Jaifroid added 8 commits January 31, 2022 17:53

Initial fixes - need testing

3b8d237

Comments and cleanup

7eed74f

Deal correctly with anchor parameters in JQuery mode

6694105

Make removeUrlParameters more robust

b5d0644

Make anchor stripping more robust

73077b9

Apply more robust regex also to jQuery mode

2726c9f

Ignore javascript links

e32bf37

Align function in uiUtil with same function in Service Worker

bacb852

Jaifroid added 11 commits January 31, 2022 17:54

Use URL function in Service Worker

bece1a7

Clean up

bd550f1

Finesse querystrings in cache

ce15388

Tidying and avoid repetition

710ca07

Remove unused htmlEscapeChars

ff5889c

Rationalize code and prevent caching not found responses

fca24d5

Cache 404 responses

fb10769

Revert "Cache 404 responses"

9612e82

Fix potential invalid messaging

d1c0fa2

Fix exclusion of certain URL schemata

1e09617

Cleanup after rebase

e4503f0

Jaifroid force-pushed the Fix-incorrect-processing-of-titles-with-question-marks-in-SW-mode branch from eb1a119 to e4503f0 Compare January 31, 2022 18:22

Further cleanup after rebase

133c259

mossroy approved these changes Jan 31, 2022

View reviewed changes

Jaifroid merged commit b143e83 into master Jan 31, 2022

Jaifroid deleted the Fix-incorrect-processing-of-titles-with-question-marks-in-SW-mode branch January 31, 2022 21:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix incorrect processing of titles with question marks and anchor parameters #807

Fix incorrect processing of titles with question marks and anchor parameters #807

Jaifroid commented Jan 22, 2022

Jaifroid commented Jan 23, 2022 •

edited

Loading

mossroy left a comment

Jaifroid commented Jan 24, 2022

mossroy commented Jan 25, 2022

Jaifroid commented Jan 25, 2022

Jaifroid commented Jan 26, 2022

Jaifroid commented Jan 26, 2022

mossroy left a comment

mossroy commented Jan 26, 2022

Jaifroid commented Jan 26, 2022

Jaifroid commented Jan 28, 2022 •

edited

Loading

Jaifroid commented Jan 29, 2022

Jaifroid commented Jan 29, 2022

Jaifroid commented Jan 31, 2022 •

edited

Loading

Jaifroid commented Jan 31, 2022

mossroy left a comment

Fix incorrect processing of titles with question marks and anchor parameters #807

Fix incorrect processing of titles with question marks and anchor parameters #807

Conversation

Jaifroid commented Jan 22, 2022

Jaifroid commented Jan 23, 2022 • edited Loading

mossroy left a comment

Choose a reason for hiding this comment

Jaifroid commented Jan 24, 2022

mossroy commented Jan 25, 2022

Jaifroid commented Jan 25, 2022

Jaifroid commented Jan 26, 2022

Jaifroid commented Jan 26, 2022

mossroy left a comment

Choose a reason for hiding this comment

mossroy commented Jan 26, 2022

Jaifroid commented Jan 26, 2022

Jaifroid commented Jan 28, 2022 • edited Loading

Jaifroid commented Jan 29, 2022

Jaifroid commented Jan 29, 2022

Jaifroid commented Jan 31, 2022 • edited Loading

Jaifroid commented Jan 31, 2022

mossroy left a comment

Choose a reason for hiding this comment

Jaifroid commented Jan 23, 2022 •

edited

Loading

Jaifroid commented Jan 28, 2022 •

edited

Loading

Jaifroid commented Jan 31, 2022 •

edited

Loading