-
-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Dynamic URL rewriting: rewrite only when ZIM path exists #276
Comments
There is one interesting business case on mesquartierschinois where the facebook/twitter/... share buttons are dynamically generated by JS code, and hence dynamically rewritten. Currently the link is broken since these resources are obviously not present inside the ZIM. Rewriting these links properly would help to not have broken links. |
This is again the "boundary problem" (see #65 (comment)). The wider issue should be tackled upstream IMHO, but I remember @mgautierfr added a bit of code in libkiwix (see kiwix/libkiwix#1036 (review)) for turning off Wombat rewriting temporarily for the purposes of inspecting the non-rewritten link. That code could be revisited to make this a bit more intuitive for end-user? What I do in KJS is to check a link clicked by a user, and if it's not in the ZIM, assume it's an external link (or a missing resource, which from an end-user perspective is the same thing). If not in the ZIM, and the external link can easily be reconstructed (usually the case), then I throw up the "external link" dialogue box, which I think is probably a bit clearer for the end user (see screenshot below). If user clicks on that, they get taken to the original WordPress site at least (though in this case, it seems I need to decode the querystring!!! -- forgot to do that for zimit2!). |
I'm sorry but I don't agree, this issue needs to be solved in zimit, not in readers. The link is incorrectly rewritten (it is rewritten while it should not since it is not present inside the ZIM), period. This is particularly visible in kiwix-serve where the link is prefixed with kiwix-serve domain, port, ZIM name, ... We cannot assume every reader will fix incoherence produced by scrapers (it already showed to fail in the past with the SW approach). And I strongly discourage to decode the over-encoded querystring, this is only going to lead to other issues when it is not over-encoded but kiwix-js believe it is. |
@benoit74, I don't think we're in radical disagreement here! If there is a good solution to the boundary problem that can be found at the level of warc2zim, then that would be perfect. Either way would be good for readers like Kiwix Serve and KJS. They still have to check whether a link is an external link or not, because we must block opening external links inside the iframe for the very reasons you state... and more. For both Kiwix Serve and KJS, it would end up violating the sandbox. For KJS, it's even worse, as it leads to fatal CORS issues and the deactivation of the app's UI from which the app can't recover. So, I only do this because I'm forced to (and not just for zimit, but it's harder to determine in zimit)...🤔 Pre-requisites for a zimit2 solution here would be:
I'm very happy to wait for whatever solution is adopted here before "fixing" that in KJS. |
@benoit74 As we've gone ahead and released Zimit2 ZIMs with overencoded querystrings on external links, I think I now have to handle this situation. We are preparing an upcoming release of the Browser Extension, so I'll (reluctantly) need to work around these querystrings in Zimit2 ZIMs only (see kiwix/kiwix-js#1258). I know you disagree with handling these things in readers (and I don't dissent from that). So, what I propose is that I mark the workaround code as a temporary workaround for periodic review in case a solution is found here, and link the code back to this issue. I hope that sounds reasonable. |
In dynamic URL rewriting (in JS with wombat), all URLs are rewritten.
This is a fair assumption because in most cases the associated resources have also been automatically fetched at crawled time.
It also makes the things way simpler since we do not need to pass the list of existing ZIM entries to JS.
Question is: do we want to rewrite only existing URLs?
While this would allow less 404 client-side, it also comes with the big drawback that the browser will suddenly begin to fetch online resources because the URL has not been rewritten because ZIM entry was missing. The user will not even know/realize that, and might incur data costs (for instance). This will also make testing a warc2zim ZIM even harder, because it might looks like the ZIM is working, but indeed some resources have been fetched online and will hence not be available to all our users.
The text was updated successfully, but these errors were encountered: