I think host parsing is incorrect, might not follow spec, use ASCII / IDNA only #554
Comments
I haven’t looked the
It really depends how exactly you define “incorrect”.
The WHATWG URL specification has, separately:
https://url.spec.whatwg.org/#urls also states:
However I don’t know if this is accurate. That is, is there no Unicode string that matches the grammar but for which the algorithm emits a validation error or returns failure? Or, is there no Unicode string that doesn’t match the grammar but for which the algorithm returns a URL record and does not emit any validation error? This may be worth filing a spec issue. Even if this statement is accurate right now, it is rather fragile in the face of future spec changes. Also note that the parsing algorithm takes a base URL as an optional additional parameter, but the “valid URL string” concept is defined as independent of the presence or contents of that base URL.
Limited in order to fit what definition?
I'm sorry, but I did not (and don't) have any time to look into this further. Thank you for your extensive answer, though. I'll close this for now, but might reopen it in the future if I have something more concrete.
A few days ago I discovered that some strings are parsed successfully as Url, even though I'm quite certain they are invalid, such as "http://\"" (see seanmonstar/reqwest#668 via #552).
The problem
I traced the problem to hostname parsing. At the moment, any hostname character is accepted, except for a small blacklist. I believe hostnames must be limited to ASCII or IDNA-encodable characters, so input like the above should be an error.
To fix this, I changed the following line to use domain_to_ascii_strict instead. This strict variant fails on any character that is not encodable as ASCII/IDNA:
rust-url/src/host.rs
Line 83 in 7d2c9d6
Changing this does solve the issue I've linked before: seanmonstar/reqwest#668
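To make the idea of a strict host check concrete, here is a minimal sketch using only the standard library. The helper is_strict_ascii_host is hypothetical and is not the crate's domain_to_ascii_strict; it assumes a simple letters-digits-hyphen rule per label and does no IDNA processing, so it only accepts hosts that are already in ASCII form.

```rust
/// Hypothetical helper, NOT part of the `url` crate: accepts a host only if
/// every dot-separated label is non-empty, consists of ASCII letters, digits,
/// or hyphens, and does not start or end with a hyphen.
/// Assumption: the host is already ASCII (for example, Punycode-encoded).
fn is_strict_ascii_host(host: &str) -> bool {
    !host.is_empty()
        && host.split('.').all(|label| {
            !label.is_empty()
                && !label.starts_with('-')
                && !label.ends_with('-')
                && label
                    .bytes()
                    .all(|b| b.is_ascii_alphanumeric() || b == b'-')
        })
}
```

Under this rule a host consisting of a single double quote is rejected, while an ordinary name such as example.com is accepted. A host with a leading dot or an empty label is rejected as well.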
The catch
But there's a catch. This change makes a few of the included unit tests fail. I'm wondering whether these unit tests conform to the URL specification that is explicitly linked in the README of this crate. If they do not, I'd argue they should be fixed or removed.
Failing tests
First, the 'Domains with empty labels' case from the test data set fails:
rust-url/tests/urltestdata.json
Lines 3802 to 3847 in 7d2c9d6
This is probably related to the following:
Second, the leading dots (host) unit test fails:
rust-url/tests/unit.rs
Lines 421 to 429 in 7d2c9d6
The issues "Panic when parsing a . in file URLs" (#166) and "[idna] Preserve leading dots in host" (#337) are related.
Third, the host unit test fails:
rust-url/tests/unit.rs
Lines 203 to 237 in 7d2c9d6
Only lines 219 and 234 fail. This has to do with dots again.
I don't know whether these tests conform to the URL specification. Sadly, I don't have enough time right now to study this.
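The dot-related failures can be illustrated with a small sketch. It assumes, as the failing tests suggest, that a strict check rejects any empty label. The function empty_label_positions is hypothetical and only shows which labels come out empty when a host is split on dots.

```rust
/// Hypothetical helper, NOT part of the `url` crate: returns the zero-based
/// positions of empty labels after splitting the host on '.'.
/// A leading dot, a trailing dot, or two consecutive dots each produce one.
fn empty_label_positions(host: &str) -> Vec<usize> {
    host.split('.')
        .enumerate()
        .filter(|(_, label)| label.is_empty())
        .map(|(index, _)| index)
        .collect()
}
```

For .example.com the first label is empty, and for a..b the middle one is. A check that refuses empty labels would therefore reject both, even though the existing tests expect them to parse.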
Questions
Now I might be totally wrong on this, which is why I'm asking the url maintainers the following. Hopefully you can help me out.
Is "http://\"" incorrect?
Should Url::parse(...) fail for input that does not conform to the URL specification?
Based on the questions above, what would be the proper steps to take, if there are any?
Maybe a special case should be implemented for handling . and .. domains if a base is known (this would partially fix the unit tests)?
I've prepared a branch/PR with this change. It doesn't contain many changes because the questions need answering first. As expected, it currently fails because of these unit tests. Here it is: #555