kiwix-serve and url encoding #775
Comments
As long as libkiwix operates as a webserver would, I think scrapers would be marginally impacted. We have two kinds of scrapers:
This issue has been automatically marked as stale because it has not had recent activity. It will now be reviewed manually. Thank you for your contributions.
I strongly believe the time has come to fix this kind of encoding problem once and for all. @veloman-yunkan has already started, so I would suggest he tackles the problem as a whole. Automated testing is the key here.
@veloman-yunkan Any chance to get the final PR to close this ticket based on main...robust_uri_encoding?
@kelson42 I think that the recent PRs #866, #870 and #890 have reasonably improved the situation. As noted in the description of #870, the solution proposed in this ticket looked overdesigned and I can't justify spending further effort on it. We'd better close this ticket now. We can reopen it later if new use-cases for URL encoding are discovered for which the current implementation falls short.
On PR #764, @veloman-yunkan raised an important question: #764 (comment)
As the issue title suggests, it is about URL encoding.
In libkiwix, we have a function to urlEncode a string: https://github.com/kiwix/libkiwix/blob/master/src/tools/stringTools.cpp#L146-L230
This function has a boolean parameter `encodeReserved`, and the function does:
- keep chars in `-_.!~*'()` as they are
- if `encodeReserved` is false: keep chars in `;,/?:@&=+$` as they are (else encode them)

However, this doesn't correspond to the standard specified in RFC 3986 (see section 2.2).
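For reference, here is a rough, standalone sketch approximating the behaviour described above (my own approximation with a hypothetical name, not the actual libkiwix code linked earlier):

```cpp
// Rough sketch (NOT the real libkiwix implementation): alphanumerics and the
// "mark" chars are always kept, and the reserved chars are kept only when
// encodeReserved is false; everything else is percent-encoded.
#include <cctype>
#include <cstdio>
#include <string>

std::string urlEncodeSketch(const std::string& value, bool encodeReserved)
{
  static const std::string marks    = "-_.!~*'()";
  static const std::string reserved = ";,/?:@&=+$";
  std::string result;
  for (const unsigned char c : value) {
    const bool keep = std::isalnum(c)
                   || marks.find(static_cast<char>(c)) != std::string::npos
                   || (!encodeReserved
                       && reserved.find(static_cast<char>(c)) != std::string::npos);
    if (keep) {
      result += static_cast<char>(c);
    } else {
      char buf[4];
      std::snprintf(buf, sizeof(buf), "%%%02X", static_cast<unsigned int>(c));
      result += buf;  // percent-encode everything else
    }
  }
  return result;
}
```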
Reserved chars can be (at best) put into two sets:
- gen-delims: `:/?#[]@`
- sub-delims: `!$&'()*+,;=`
As the names of the sets suggest, they are used to delimit components in the URL. Readers (URL parsers) must first separate the components using the delimiters and then URL-decode each component. On the opposite side, writers must URL-encode the components and then compose them using the delimiters. This makes the delimiter chars sometimes encoded, sometimes not.
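To illustrate the writer side of that rule, here is a minimal, hypothetical sketch of an RFC 3986-style component encoder that percent-encodes every byte outside the unreserved set (the name `encodeComponentSketch` is mine, not an existing libkiwix function):

```cpp
// Percent-encode every byte outside the unreserved set (ALPHA / DIGIT /
// "-" / "." / "_" / "~"); the reader is expected to split on delimiters
// before decoding the individual components.
#include <cctype>
#include <cstdio>
#include <string>

std::string encodeComponentSketch(const std::string& component)
{
  std::string result;
  for (const unsigned char c : component) {
    if (std::isalnum(c) || c == '-' || c == '.' || c == '_' || c == '~') {
      result += static_cast<char>(c);  // unreserved: keep as-is
    } else {
      char buf[4];
      std::snprintf(buf, sizeof(buf), "%%%02X", static_cast<unsigned int>(c));
      result += buf;  // everything else, including the delimiters
    }
  }
  return result;
}
```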
From what I understand of the spec, we should always build URLs this way: compose the already-encoded components with the delimiters, with `urlEncode` encoding the reserved chars inside each component. Of course, this is the simple version. A complete version should handle different schemes, authentication (`user@host`), anchors (`#title`), .... However, everything is not so easy.
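Before getting to the complications, here is a rough illustration of that "simple version" (assumed names only; it reuses the hypothetical `encodeComponentSketch()` from the snippet above and ignores scheme selection, authentication and anchors):

```cpp
// Writer-side sketch: encode each component first, then compose with the
// delimiters. All names here are assumptions, not libkiwix API.
#include <map>
#include <string>
#include <vector>

std::string encodeComponentSketch(const std::string& component);  // see the sketch above

std::string buildUrlSketch(const std::string& host,
                           const std::vector<std::string>& path,
                           const std::map<std::string, std::string>& query)
{
  std::string url = "http://" + host;
  for (const auto& segment : path) {
    url += "/" + encodeComponentSketch(segment);  // encode, then add the delimiter
  }
  std::string sep = "?";
  for (const auto& kv : query) {
    url += sep + encodeComponentSketch(kv.first) + "=" + encodeComponentSketch(kv.second);
    sep = "&";
  }
  return url;
}

// Example: buildUrlSketch("example.org", {"A", "Foo bar"}, {{"k", "v&w"}})
//   returns "http://example.org/A/Foo%20bar?k=v%26w"
```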
From the zim spec, the URL of an article (the path of the item) must NOT be URL-encoded. So it can contain things like `/`, `?`/`&` and `#`. And this becomes funny:

- `/` is "normal". `/` must be interpreted by the server as a component separator. libmicrohttpd allows us to access the whole path (including `/`), so we can find an entry with a `/` in its path. We must not URL-encode `/`, as it is interpreted by the browser to build absolute paths from relative paths (CSS styles, images, ...). By encoding `/` (`A%2FFoo` instead of `A/Foo`) we can get the entry from kiwix-serve, but the styling is broken (the relative link `img.png` resolves to `img.png` instead of `A/img.png`).
- `?`/`&` are interpreted by libmicrohttpd as a query string. We can access the query keys/values, but we cannot get the exact string. So we must URL-encode `?`/`&` to prevent libmicrohttpd from interpreting them.
- `#` is interpreted by the browser as the anchor. The anchor is not passed to the server. So we must URL-encode it to have it in the URL.

So it seems we need several encoding functions, depending on the context:
- `encodeComponent()`, which encodes "everything" (everything but "normal" chars).
- `encodeEntryPath()`, which encodes everything but `/` (see the sketch below).
- `encodeUrlSafe()`, which encodes only the chars that can be misinterpreted by the HTML parser (`"`) but does not change the parsing of the URL itself?
- Or maybe a higher-level `buildUrl(string host, vector<string> path, map<string, string> query, ...)`?

It seems that different kinds of components may contain specific separators that are not encoded. However, it seems it is not a problem to encode them. So for simplicity we can URL-encode "everything"; there is no need for a specific `encodeFoo` for each `Foo`.

Did I miss something?
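As a purely illustrative sketch of the `encodeEntryPath()` idea (assumed name and behaviour, not an existing function): keep `/` and the unreserved chars, encode everything else:

```cpp
// Hypothetical encodeEntryPath(): encode everything that could confuse
// libmicrohttpd or the browser ('?', '&', '#', spaces, '%', ...) but leave
// '/' alone so relative links keep resolving against the path.
#include <cctype>
#include <cstdio>
#include <string>

std::string encodeEntryPathSketch(const std::string& entryPath)
{
  std::string result;
  for (const unsigned char c : entryPath) {
    if (std::isalnum(c) || c == '-' || c == '.' || c == '_' || c == '~' || c == '/') {
      result += static_cast<char>(c);  // unreserved chars and the path separator
    } else {
      char buf[4];
      std::snprintf(buf, sizeof(buf), "%%%02X", static_cast<unsigned int>(c));
      result += buf;                   // '?', '&', '#', spaces, etc.
    }
  }
  return result;
}

// Example: encodeEntryPathSketch("A/Foo?bar") == "A/Foo%3Fbar", so
// libmicrohttpd no longer sees "?bar" as a query string, while the browser
// still resolves a relative "img.png" against "A/".
```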
Question about scrapers (@rgaudin @kelson42): if we want to have articles containing `?`/`&`/`#` in their paths, the links pointing to them must be properly URL-encoded on the scraper side. I let you check that :)