Dynamic URL rewriting: rewrite only when ZIM path exists #276

benoit74 · 2024-05-24T07:12:33Z

In dynamic URL rewriting (in JS with wombat), all URLs are rewritten.

This is a fair assumption because in most cases the associated resources have also been automatically fetched at crawled time.

It also makes the things way simpler since we do not need to pass the list of existing ZIM entries to JS.

Question is: do we want to rewrite only existing URLs?

While this would allow less 404 client-side, it also comes with the big drawback that the browser will suddenly begin to fetch online resources because the URL has not been rewritten because ZIM entry was missing. The user will not even know/realize that, and might incur data costs (for instance). This will also make testing a warc2zim ZIM even harder, because it might looks like the ZIM is working, but indeed some resources have been fetched online and will hence not be available to all our users.

benoit74 · 2024-05-31T12:20:19Z

There is one interesting business case on mesquartierschinois where the facebook/twitter/... share buttons are dynamically generated by JS code, and hence dynamically rewritten.

Currently the link is broken since these resources are obviously not present inside the ZIM. Rewriting these links properly would help to not have broken links.

Jaifroid · 2024-05-31T13:34:08Z

This is again the "boundary problem" (see #65 (comment)). The wider issue should be tackled upstream IMHO, but I remember @mgautierfr added a bit of code in libkiwix (see kiwix/libkiwix#1036 (review)) for turning off Wombat rewriting temporarily for the purposes of inspecting the non-rewritten link. That code could be revisited to make this a bit more intuitive for end-user?

What I do in KJS is to check a link clicked by a user, and if it's not in the ZIM, assume it's an external link (or a missing resource, which from an end-user perspective is the same thing). If not in the ZIM, and the external link can easily be reconstructed (usually the case), then I throw up the "external link" dialogue box, which I think is probably a bit clearer for the end user (see screenshot below). If user clicks on that, they get taken to the original WordPress site at least (though in this case, it seems I need to decode the querystring!!! -- forgot to do that for zimit2!).

benoit74 · 2024-05-31T14:28:17Z

I'm sorry but I don't agree, this issue needs to be solved in zimit, not in readers.

The link is incorrectly rewritten (it is rewritten while it should not since it is not present inside the ZIM), period. This is particularly visible in kiwix-serve where the link is prefixed with kiwix-serve domain, port, ZIM name, ...

We cannot assume every reader will fix incoherence produced by scrapers (it already showed to fail in the past with the SW approach).

And I strongly discourage to decode the over-encoded querystring, this is only going to lead to other issues when it is not over-encoded but kiwix-js believe it is.

Jaifroid · 2024-05-31T16:11:56Z

@benoit74, I don't think we're in radical disagreement here! If there is a good solution to the boundary problem that can be found at the level of warc2zim, then that would be perfect. Either way would be good for readers like Kiwix Serve and KJS. They still have to check whether a link is an external link or not, because we must block opening external links inside the iframe for the very reasons you state... and more. For both Kiwix Serve and KJS, it would end up violating the sandbox. For KJS, it's even worse, as it leads to fatal CORS issues and the deactivation of the app's UI from which the app can't recover. So, I only do this because I'm forced to (and not just for zimit, but it's harder to determine in zimit)...🤔

Pre-requisites for a zimit2 solution here would be:

Find a way to prevent Wombat from rewriting links that are not in the ZIM
Do not URI-encode the querystring for external links (do not rewrite them at all, so they will work as-is).

I'm very happy to wait for whatever solution is adopted here before "fixing" that in KJS.

Jaifroid · 2024-07-10T07:57:13Z

@benoit74 As we've gone ahead and released Zimit2 ZIMs with overencoded querystrings on external links, I think I now have to handle this situation. We are preparing an upcoming release of the Browser Extension, so I'll (reluctantly) need to work around these querystrings in Zimit2 ZIMs only (see kiwix/kiwix-js#1258). I know you disagree with handling these things in readers (and I don't dissent from that). So, what I propose is that I mark the workaround code as a temporary workaround for periodic review in case a solution is found here, and link the code back to this issue. I hope that sounds reasonable.

benoit74 added enhancement New feature or request question Further information is requested labels May 24, 2024

benoit74 added this to the 2.1.0 milestone May 24, 2024

benoit74 mentioned this issue May 24, 2024

Add support for onxxx HTML events #270

Merged

benoit74 mentioned this issue May 31, 2024

External links seems not (always) kept in there original form #163

Closed

Jaifroid mentioned this issue May 31, 2024

When displaying external link dialogue box from zimit2 archives, we need to decode the over-encoded querystring kiwix/kiwix-js#1258

Closed

Jaifroid mentioned this issue Jun 2, 2024

mailto: links are blocked on Chrome kiwix/libkiwix#1138

Closed

benoit74 modified the milestones: 2.1.0, 2.2.0, later Jun 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dynamic URL rewriting: rewrite only when ZIM path exists #276

Dynamic URL rewriting: rewrite only when ZIM path exists #276

benoit74 commented May 24, 2024

benoit74 commented May 31, 2024

Jaifroid commented May 31, 2024

benoit74 commented May 31, 2024

Jaifroid commented May 31, 2024 •

edited

Loading

Jaifroid commented Jul 10, 2024

Dynamic URL rewriting: rewrite only when ZIM path exists #276

Dynamic URL rewriting: rewrite only when ZIM path exists #276

Comments

benoit74 commented May 24, 2024

benoit74 commented May 31, 2024

Jaifroid commented May 31, 2024

benoit74 commented May 31, 2024

Jaifroid commented May 31, 2024 • edited Loading

Jaifroid commented Jul 10, 2024

Jaifroid commented May 31, 2024 •

edited

Loading