
kiwix-serve and url encoding #775

Closed
mgautierfr opened this issue May 17, 2022 · 5 comments

mgautierfr commented May 17, 2022

In PR #764, @veloman-yunkan raised an important question: #764 (comment)
As the issue title suggests, it is about url encoding.

In libkiwix, we have a function to url-encode a string: https://github.com/kiwix/libkiwix/blob/master/src/tools/stringTools.cpp#L146-L230
This function takes a boolean parameter encodeReserved and behaves as follows:

  • keep alphanumeric chars as-is
  • keep chars in -_.!~*'() as-is
  • if encodeReserved is false: keep chars in ;,/?:@&=+$ as-is (otherwise encode them)
  • encode every other char
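For illustration, a minimal sketch of the behavior described above (not the actual libkiwix implementation, which lives in stringTools.cpp):

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Simplified sketch of the encoding rules listed above; names and details
// are illustrative, not the real libkiwix code.
std::string urlEncode(const std::string& value, bool encodeReserved) {
    const std::string mark = "-_.!~*'()";
    const std::string reserved = ";,/?:@&=+$";
    std::string result;
    for (unsigned char c : value) {
        if (std::isalnum(c) || mark.find(c) != std::string::npos) {
            result += c;                       // kept as-is
        } else if (!encodeReserved && reserved.find(c) != std::string::npos) {
            result += c;                       // reserved chars kept when allowed
        } else {
            char buf[4];
            std::snprintf(buf, sizeof(buf), "%%%02X", c);
            result += buf;                     // percent-encode everything else
        }
    }
    return result;
}
```

With this sketch, `urlEncode("A/Foo", false)` keeps the slash while `urlEncode("A/Foo", true)` yields `A%2FFoo`.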

However, this doesn't correspond to the standard specified in RFC 3986 (see section 2.2).
Reserved chars are split into two sets:

  • gen-delims = :/?#[]@
  • sub-delims = !$&'()*+,;=

As the names of the sets suggest, they are used to delimit components in the url. Readers (url parsers) must first split components using the delimiters and then url-decode each component.
Conversely, writers must url-encode the components and then compose them using delimiters. This makes the delimiter chars sometimes encoded, sometimes not.

From what I understand of the spec, we should always build urls this way:

std::string url = scheme + "://" + urlEncode(host);
for (auto& part : path_parts) {
  url += "/" + urlEncode(part);
}
char sep = '?';
for (auto& query_part : query) {
  url += sep + urlEncode(query_part.key) + "=" + urlEncode(query_part.value);
  sep = '&';
}

Here urlEncode encodes the reserved chars too. Of course, this is the simple version. A complete version should handle different schemes, authentication (user@host), anchors (#title), ...
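A self-contained version of that build loop might look like this (the function names, signatures and query representation are illustrative, not libkiwix API):

```cpp
#include <cctype>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in: encodes everything except RFC 3986 unreserved chars,
// i.e. the "urlEncode that encodes reserved chars" mentioned above.
std::string urlEncode(const std::string& value) {
    const std::string unreserved = "-_.~";
    std::string result;
    for (unsigned char c : value) {
        if (std::isalnum(c) || unreserved.find(c) != std::string::npos) {
            result += c;
        } else {
            char buf[4];
            std::snprintf(buf, sizeof(buf), "%%%02X", c);
            result += buf;
        }
    }
    return result;
}

std::string buildUrl(const std::string& scheme,
                     const std::string& host,
                     const std::vector<std::string>& pathParts,
                     const std::map<std::string, std::string>& query) {
    std::string url = scheme + "://" + urlEncode(host);
    for (const auto& part : pathParts)
        url += "/" + urlEncode(part);        // each component encoded, then joined
    char sep = '?';
    for (const auto& kv : query) {
        url += sep + urlEncode(kv.first) + "=" + urlEncode(kv.second);
        sep = '&';                           // '?' before the first pair, '&' after
    }
    return url;
}
```

Because each component is encoded before the delimiters are added, the delimiters themselves always keep their structural meaning.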

However, things are not so simple.
Per the zim spec, the url of an article (the path of the item) must NOT be url-encoded, so it can contain things like /, ?, & and #.
And this is where it gets funny:

  • / is the "normal" case: the server must interpret / as a component separator. libmicrohttpd lets us access the whole path (including /), so we can find an entry containing a /. We must not url-encode / because the browser interprets it when resolving relative paths into absolute ones (css styles, images, ...). By encoding / (A%2FFoo instead of A/Foo) we can still get the entry from kiwix-serve, but the styling is broken (a relative link img.png resolves to img.png instead of A/img.png).
  • ? and & are interpreted by libmicrohttpd as starting a query string. We can access the query keys/values, but we cannot recover the exact original string. So we must url-encode ? and & to prevent libmicrohttpd from interpreting them.
  • # is interpreted by the browser as the anchor delimiter. The anchor is not passed to the server, so we must url-encode it to keep it in the url.

So it seems we need several encoding functions, depending on the context:

  • encodeComponent(), which encodes "everything" (everything but "normal" chars).
  • encodeEntryPath(), which encodes everything but /.
  • An encodeUrlSafe(), which encodes only chars that could be misinterpreted by the html parser (") without changing how the url itself is parsed?
  • A helper method to build a url from its parts (buildUrl(string host, vector<string> path, map<string, string> query, ...))?
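As a sketch, the first two helpers could share a single percent-encoding core that differs only in which chars are kept; the names below are the hypothetical ones proposed above, not existing libkiwix functions:

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Shared core: percent-encode everything except alphanumerics and `keep`.
static std::string percentEncode(const std::string& value, const std::string& keep) {
    std::string result;
    for (unsigned char c : value) {
        if (std::isalnum(c) || keep.find(c) != std::string::npos) {
            result += c;
        } else {
            char buf[4];
            std::snprintf(buf, sizeof(buf), "%%%02X", c);
            result += buf;
        }
    }
    return result;
}

// Encode "everything" (only RFC 3986 unreserved chars pass through).
std::string encodeComponent(const std::string& s) {
    return percentEncode(s, "-_.~");
}

// Encode everything but /, so a zim entry path keeps its component
// structure while ?, & and # lose their special meaning.
std::string encodeEntryPath(const std::string& path) {
    return percentEncode(path, "-_.~/");
}
```

For example, a zim entry path `A/Foo?bar#baz` would come out as `A/Foo%3Fbar%23baz`: the slash survives for relative-link resolution while ? and # are neutralized.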

Different kinds of components may each allow specific separators to remain unencoded. However, it does not seem to be a problem to encode them anyway, so for simplicity we can url-encode "everything"; no need for a specific encodeFoo for each Foo.

Did I miss something?

A question for the scraper maintainers (@rgaudin @kelson42): if we want articles containing ?, & or # in their path, the links pointing to them must be properly url-encoded on the scraper side. I let you check that :)


rgaudin commented May 17, 2022

As long as libkiwix operates as a webserver would, I think scrapers would be only marginally impacted. We have two kinds of scrapers:

  • Built-from-scratch ones. These usually don't make use of query parameters and use plain (UTF-8) links and paths. Use of fragments is frequent in links but not in paths.
  • Generic ones. These tend to reuse links from the original websites, so query strings and fragments are to be expected. Paths are transformed so that none of those characters should be present. Paths are decoded (urllib.parse.unquote) UTF-8 strings. We'd generally use urllib.parse.quote for links, which is RFC 3986 compliant.


stale bot commented Aug 13, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@kelson42

I strongly believe the time has come to fix these kinds of encoding problems once and for all. @veloman-yunkan has already started, so I would suggest he tackles the problem as a whole. Automated testing is key here.


kelson42 commented Feb 2, 2023

@veloman-yunkan Any chance to get the final PR to close this ticket, based on main...robust_uri_encoding?

@veloman-yunkan

@kelson42 I think that the recent PRs #866, #870 and #890 have reasonably improved the situation. As noted in the description of #870, the solution proposed in this ticket looked overdesigned, and I can't justify spending further effort on it. We'd better close this ticket now; we can reopen it later if new use cases for URL encoding are discovered for which the current implementation falls short.
