
kiwix-serve and url encoding #775

Closed
mgautierfr opened this issue May 17, 2022 · 5 comments

mgautierfr commented May 17, 2022

In PR #764, @veloman-yunkan raised an important question: #764 (comment)
As the issue title suggests, it is about url encoding.

In libkiwix, we have a function to url-encode a string: https://github.com/kiwix/libkiwix/blob/master/src/tools/stringTools.cpp#L146-L230
This function takes a boolean parameter encodeReserved and behaves as follows:

  • keep alphanumeric chars as-is
  • keep chars in -_.!~*'() as-is
  • if encodeReserved is false: keep chars in ;,/?:@&=+$ as-is (otherwise encode them)
  • encode every other char
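For illustration, a minimal sketch of the behavior described above (not the actual libkiwix implementation, which lives in stringTools.cpp):

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Simplified sketch of the encoding rules listed above; names and details
// are illustrative, not the real libkiwix code.
std::string urlEncode(const std::string& value, bool encodeReserved) {
    const std::string mark = "-_.!~*'()";
    const std::string reserved = ";,/?:@&=+$";
    std::string result;
    for (unsigned char c : value) {
        if (std::isalnum(c) || mark.find(c) != std::string::npos) {
            result += c;                       // kept as-is
        } else if (!encodeReserved && reserved.find(c) != std::string::npos) {
            result += c;                       // reserved chars kept when allowed
        } else {
            char buf[4];
            std::snprintf(buf, sizeof(buf), "%%%02X", c);
            result += buf;                     // percent-encode everything else
        }
    }
    return result;
}
```

With this sketch, `urlEncode("A/Foo", false)` keeps the slash while `urlEncode("A/Foo", true)` yields `A%2FFoo`.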

However, this doesn't correspond to the standard specified in RFC 3986 (see section 2.2).
Reserved chars are split into two sets:

  • gen-delims = :/?#[]@
  • sub-delims = !$&'()*+,;=

As the names of the sets suggest, they are used to delimit components in the url. Readers (url parsers) must first split components using the delimiters and then url-decode each component.
Conversely, writers must url-encode the components and then compose them using delimiters. This makes the delimiter chars sometimes encoded, sometimes not.

From what I understand of the spec, we should always build urls this way:

std::string url = scheme + "://" + urlEncode(host);
for (auto& part : path_parts) {
  url += "/" + urlEncode(part);
}
char sep = '?';
for (auto& query_part : query) {
  url += sep + urlEncode(query_part.key) + "=" + urlEncode(query_part.value);
  sep = '&';
}

Here urlEncode encodes the reserved chars too. Of course, this is the simple version. A complete version should handle different schemes, authentication (user@host), anchors (#title), ...
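A self-contained version of that build loop might look like this (the function names, signatures and query representation are illustrative, not libkiwix API):

```cpp
#include <cctype>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in: encodes everything except RFC 3986 unreserved chars,
// i.e. the "urlEncode that encodes reserved chars" mentioned above.
std::string urlEncode(const std::string& value) {
    const std::string unreserved = "-_.~";
    std::string result;
    for (unsigned char c : value) {
        if (std::isalnum(c) || unreserved.find(c) != std::string::npos) {
            result += c;
        } else {
            char buf[4];
            std::snprintf(buf, sizeof(buf), "%%%02X", c);
            result += buf;
        }
    }
    return result;
}

std::string buildUrl(const std::string& scheme,
                     const std::string& host,
                     const std::vector<std::string>& pathParts,
                     const std::map<std::string, std::string>& query) {
    std::string url = scheme + "://" + urlEncode(host);
    for (const auto& part : pathParts)
        url += "/" + urlEncode(part);        // each component encoded, then joined
    char sep = '?';
    for (const auto& kv : query) {
        url += sep + urlEncode(kv.first) + "=" + urlEncode(kv.second);
        sep = '&';                           // '?' before the first pair, '&' after
    }
    return url;
}
```

Because each component is encoded before the delimiters are added, the delimiters themselves always keep their structural meaning.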

However, things are not so simple.
Per the zim spec, the url of an article (the path of the item) must NOT be url-encoded, so it can contain things like /, ?, & and #.
And this is where it gets funny:

  • / is the "normal" case: the server must interpret / as a component separator. libmicrohttpd lets us access the whole path (including /), so we can find an entry containing a /. We must not url-encode / because the browser interprets it when resolving relative paths into absolute ones (css styles, images, ...). By encoding / (A%2FFoo instead of A/Foo) we can still get the entry from kiwix-serve, but the styling is broken (a relative link img.png resolves to img.png instead of A/img.png).
  • ? and & are interpreted by libmicrohttpd as starting a query string. We can access the query keys/values, but we cannot recover the exact original string. So we must url-encode ? and & to prevent libmicrohttpd from interpreting them.
  • # is interpreted by the browser as the anchor delimiter. The anchor is not passed to the server, so we must url-encode it to keep it in the url.

So it seems we need several encoding functions, depending on the context:

  • encodeComponent(), which encodes "everything" (everything but "normal" chars).
  • encodeEntryPath(), which encodes everything but /.
  • An encodeUrlSafe(), which encodes only chars that could be misinterpreted by the html parser (") without changing how the url itself is parsed?
  • A helper method to build a url from its parts (buildUrl(string host, vector<string> path, map<string, string> query, ...))?
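As a sketch, the first two helpers could share a single percent-encoding core that differs only in which chars are kept; the names below are the hypothetical ones proposed above, not existing libkiwix functions:

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Shared core: percent-encode everything except alphanumerics and `keep`.
static std::string percentEncode(const std::string& value, const std::string& keep) {
    std::string result;
    for (unsigned char c : value) {
        if (std::isalnum(c) || keep.find(c) != std::string::npos) {
            result += c;
        } else {
            char buf[4];
            std::snprintf(buf, sizeof(buf), "%%%02X", c);
            result += buf;
        }
    }
    return result;
}

// Encode "everything" (only RFC 3986 unreserved chars pass through).
std::string encodeComponent(const std::string& s) {
    return percentEncode(s, "-_.~");
}

// Encode everything but /, so a zim entry path keeps its component
// structure while ?, & and # lose their special meaning.
std::string encodeEntryPath(const std::string& path) {
    return percentEncode(path, "-_.~/");
}
```

For example, a zim entry path `A/Foo?bar#baz` would come out as `A/Foo%3Fbar%23baz`: the slash survives for relative-link resolution while ? and # are neutralized.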

Different kinds of components may each allow specific separators to remain unencoded. However, it does not seem to be a problem to encode them anyway, so for simplicity we can url-encode "everything"; no need for a specific encodeFoo for each Foo.

Did I miss something?

A question for the scraper maintainers (@rgaudin @kelson42): if we want articles containing ?, & or # in their path, the links pointing to them must be properly url-encoded on the scraper side. I let you check that :)


rgaudin commented May 17, 2022

As long as libkiwix operates as a webserver would, I think scrapers would be only marginally impacted. We have two kinds of scrapers:

  • Built-from-scratch ones. These usually don't make use of query parameters and use plain (UTF-8) links and paths. Use of fragments is frequent in links but not in paths.
  • Generic ones. These tend to reuse links from the original websites, so query strings and fragments are to be expected. Paths are transformed so that none of those characters should be present. Paths are decoded (urllib.parse.unquote) UTF-8 strings. We'd generally use urllib.parse.quote for links, which is RFC 3986 compliant.


stale bot commented Aug 13, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.

@kelson42

I strongly believe the time has come to fix these kinds of encoding problems once and for all. @veloman-yunkan has already started, so I would suggest he tackles the problem as a whole. Automated testing is key here.


kelson42 commented Feb 2, 2023

@veloman-yunkan Any chance to get the final PR to close this ticket, based on main...robust_uri_encoding?

@veloman-yunkan

@kelson42 I think that the recent PRs #866, #870 and #890 have reasonably improved the situation. As noted in the description of #870, the solution proposed in this ticket looked overdesigned, and I can't justify spending further effort on it. We'd better close this ticket now; we can reopen it later if new use cases for URL encoding are discovered for which the current implementation falls short.
