Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dynamic URL rewriting: rewrite only when ZIM path exists #276

Open
benoit74 opened this issue May 24, 2024 · 5 comments
Open

Dynamic URL rewriting: rewrite only when ZIM path exists #276

benoit74 opened this issue May 24, 2024 · 5 comments
Labels
enhancement New feature or request question Further information is requested
Milestone

Comments

@benoit74
Copy link
Collaborator

In dynamic URL rewriting (in JS with wombat), all URLs are rewritten.

This is a fair assumption because in most cases the associated resources have also been automatically fetched at crawled time.

It also makes the things way simpler since we do not need to pass the list of existing ZIM entries to JS.

Question is: do we want to rewrite only existing URLs?

While this would allow less 404 client-side, it also comes with the big drawback that the browser will suddenly begin to fetch online resources because the URL has not been rewritten because ZIM entry was missing. The user will not even know/realize that, and might incur data costs (for instance). This will also make testing a warc2zim ZIM even harder, because it might looks like the ZIM is working, but indeed some resources have been fetched online and will hence not be available to all our users.

@benoit74 benoit74 added enhancement New feature or request question Further information is requested labels May 24, 2024
@benoit74 benoit74 added this to the 2.1.0 milestone May 24, 2024
@benoit74
Copy link
Collaborator Author

There is one interesting business case on mesquartierschinois where the facebook/twitter/... share buttons are dynamically generated by JS code, and hence dynamically rewritten.

image

Currently the link is broken since these resources are obviously not present inside the ZIM. Rewriting these links properly would help to not have broken links.

@Jaifroid
Copy link

This is again the "boundary problem" (see #65 (comment)). The wider issue should be tackled upstream IMHO, but I remember @mgautierfr added a bit of code in libkiwix (see kiwix/libkiwix#1036 (review)) for turning off Wombat rewriting temporarily for the purposes of inspecting the non-rewritten link. That code could be revisited to make this a bit more intuitive for end-user?

What I do in KJS is to check a link clicked by a user, and if it's not in the ZIM, assume it's an external link (or a missing resource, which from an end-user perspective is the same thing). If not in the ZIM, and the external link can easily be reconstructed (usually the case), then I throw up the "external link" dialogue box, which I think is probably a bit clearer for the end user (see screenshot below). If user clicks on that, they get taken to the original WordPress site at least (though in this case, it seems I need to decode the querystring!!! -- forgot to do that for zimit2!).

image

@benoit74
Copy link
Collaborator Author

I'm sorry but I don't agree, this issue needs to be solved in zimit, not in readers.

The link is incorrectly rewritten (it is rewritten while it should not since it is not present inside the ZIM), period. This is particularly visible in kiwix-serve where the link is prefixed with kiwix-serve domain, port, ZIM name, ...

We cannot assume every reader will fix incoherence produced by scrapers (it already showed to fail in the past with the SW approach).

And I strongly discourage to decode the over-encoded querystring, this is only going to lead to other issues when it is not over-encoded but kiwix-js believe it is.

@Jaifroid
Copy link

Jaifroid commented May 31, 2024

@benoit74, I don't think we're in radical disagreement here! If there is a good solution to the boundary problem that can be found at the level of warc2zim, then that would be perfect. Either way would be good for readers like Kiwix Serve and KJS. They still have to check whether a link is an external link or not, because we must block opening external links inside the iframe for the very reasons you state... and more. For both Kiwix Serve and KJS, it would end up violating the sandbox. For KJS, it's even worse, as it leads to fatal CORS issues and the deactivation of the app's UI from which the app can't recover. So, I only do this because I'm forced to (and not just for zimit, but it's harder to determine in zimit)...🤔

Pre-requisites for a zimit2 solution here would be:

  1. Find a way to prevent Wombat from rewriting links that are not in the ZIM
  2. Do not URI-encode the querystring for external links (do not rewrite them at all, so they will work as-is).

I'm very happy to wait for whatever solution is adopted here before "fixing" that in KJS.

@Jaifroid
Copy link

@benoit74 As we've gone ahead and released Zimit2 ZIMs with overencoded querystrings on external links, I think I now have to handle this situation. We are preparing an upcoming release of the Browser Extension, so I'll (reluctantly) need to work around these querystrings in Zimit2 ZIMs only (see kiwix/kiwix-js#1258). I know you disagree with handling these things in readers (and I don't dissent from that). So, what I propose is that I mark the workaround code as a temporary workaround for periodic review in case a solution is found here, and link the code back to this issue. I hope that sounds reasonable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request question Further information is requested
Projects
None yet
Development

No branches or pull requests

2 participants