
I think host parsing is incorrect, might not follow spec, use ASCII / IDNA only #554

Closed
timvisee opened this issue Oct 15, 2019 · 2 comments

timvisee commented Oct 15, 2019

A few days ago I discovered that some strings are parsed successfully as a Url even though I'm quite certain they are invalid, such as "http://\"". (see seanmonstar/reqwest#668 via #552)

The problem

I traced the problem to hostname parsing. At the moment, any hostname characters are accepted (except for a small blacklist). I believe hostnames must be limited to ASCII or IDNA-encodable characters, so such input should be a parse error.

To fix this, I changed the following line to use domain_to_ascii_strict instead.
This strict variant fails on any character that is not encodable as ASCII/IDNA:

let domain = idna::domain_to_ascii(&domain)?;

Changing this does solve the issue I linked above: seanmonstar/reqwest#668
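
For illustration, here is a minimal sketch of the difference between the two functions (assuming the idna crate as a direct dependency; the exact error type and printed output depend on the crate version):

    fn main() {
        // Lenient variant: processes the input without the STD3 ASCII rules,
        // so the quote character passes through; expected to return Ok.
        println!("{:?}", idna::domain_to_ascii("\""));
        // Strict variant: applies the STD3 ASCII rules, rejecting characters
        // not allowed in an ASCII/IDNA hostname; expected to return Err.
        println!("{:?}", idna::domain_to_ascii_strict("\""));
    }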

The catch

But there's a catch: this change makes a few of the included unit tests fail. I'm wondering whether these unit tests conform to the URL specification that is explicitly linked in this crate's README. If they do not, I'd argue they should be fixed or removed.

Failing tests

  • First, the 'Domains with empty labels' cases from the test data set fail:

    "Domains with empty labels",
    {
        "input": "http://./",
        "base": "about:blank",
        "href": "http://./",
        "origin": "http://.",
        "protocol": "http:",
        "username": "",
        "password": "",
        "host": ".",
        "hostname": ".",
        "port": "",
        "pathname": "/",
        "search": "",
        "hash": ""
    },
    {
        "input": "http://../",
        "base": "about:blank",
        "href": "http://../",
        "origin": "http://..",
        "protocol": "http:",
        "username": "",
        "password": "",
        "host": "..",
        "hostname": "..",
        "port": "",
        "pathname": "/",
        "search": "",
        "hash": ""
    },
    {
        "input": "http://0..0x300/",
        "base": "about:blank",
        "href": "http://0..0x300/",
        "origin": "http://0..0x300",
        "protocol": "http:",
        "username": "",
        "password": "",
        "host": "0..0x300",
        "hostname": "0..0x300",
        "port": "",
        "pathname": "/",
        "search": "",
        "hash": ""
    },

    This is probably related to the following:

  • Second, the leading dots (host) unit test fails:

    rust-url/tests/unit.rs, lines 421 to 429 at 7d2c9d6:

    #[test]
    // https://github.com/servo/rust-url/issues/166
    fn test_leading_dots() {
        assert_eq!(
            Host::parse(".org").unwrap(),
            Host::Domain(".org".to_owned())
        );
        assert_eq!(Url::parse("file://./foo").unwrap().domain(), Some("."));
    }

    Issues #166 (Panic when parsing a . in file URLs) and #337 ([idna] Preserve leading dots in host) are related.

  • Third, the host unit test fails:

    rust-url/tests/unit.rs, lines 203 to 237 at 7d2c9d6:

    #[test]
    fn host() {
        fn assert_host(input: &str, host: Host<&str>) {
            assert_eq!(Url::parse(input).unwrap().host(), Some(host));
        }
        assert_host("http://www.mozilla.org", Host::Domain("www.mozilla.org"));
        assert_host(
            "http://1.35.33.49",
            Host::Ipv4(Ipv4Addr::new(1, 35, 33, 49)),
        );
        assert_host(
            "http://[2001:0db8:85a3:08d3:1319:8a2e:0370:7344]",
            Host::Ipv6(Ipv6Addr::new(
                0x2001, 0x0db8, 0x85a3, 0x08d3, 0x1319, 0x8a2e, 0x0370, 0x7344,
            )),
        );
        assert_host("http://1.35.+33.49", Host::Domain("1.35.+33.49"));
        assert_host(
            "http://[::]",
            Host::Ipv6(Ipv6Addr::new(0, 0, 0, 0, 0, 0, 0, 0)),
        );
        assert_host(
            "http://[::1]",
            Host::Ipv6(Ipv6Addr::new(0, 0, 0, 0, 0, 0, 0, 1)),
        );
        assert_host(
            "http://0x1.0X23.0x21.061",
            Host::Ipv4(Ipv4Addr::new(1, 35, 33, 49)),
        );
        assert_host("http://0x1232131", Host::Ipv4(Ipv4Addr::new(1, 35, 33, 49)));
        assert_host("http://111", Host::Ipv4(Ipv4Addr::new(0, 0, 0, 111)));
        assert_host("http://2..2.3", Host::Domain("2..2.3"));
        assert!(Url::parse("http://42.0x1232131").is_err());
        assert!(Url::parse("http://192.168.0.257").is_err());
    }

    Only lines 219 and 234 fail (the "http://1.35.+33.49" and "http://2..2.3" assertions). These have to do with dots and other non-IDNA characters again; a consolidated sketch of the failing behavior follows this list.
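
Taken together, the failing assertions all pin down hosts that strict IDNA encoding would reject. Here is a minimal sketch of that behavior against the crate's public API (these assertions pass with the current lenient parsing and would start failing with domain_to_ascii_strict):

    use url::{Host, Url};

    fn main() {
        // A lone-dot host currently survives parsing ("Domains with empty labels").
        assert_eq!(Url::parse("http://./").unwrap().host_str(), Some("."));
        // Leading dots are preserved (test_leading_dots).
        assert_eq!(Host::parse(".org").unwrap(), Host::Domain(".org".to_owned()));
        // Empty labels inside a domain are accepted (host test, file line 234).
        assert_eq!(
            Url::parse("http://2..2.3").unwrap().host(),
            Some(Host::Domain("2..2.3"))
        );
    }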

I don't know whether these tests conform to the URL specification. Sadly, I don't have enough time right now to study this.

Questions

Now, I might be totally wrong about this, which is why I'm asking the url maintainers the following questions. Hopefully you can help me out.

  • Is my assumption correct that "http://\"" is invalid?
  • Should Url::parse(...) fail for input that does not conform to the URL specification?
  • Should hostnames be limited to ASCII and IDNA characters?
  • Are the linked unit tests faulty, based on the above?

Based on the questions above, what would be the proper steps to take, if any?
Maybe a special case should be implemented for handling . and .. domains when a base is known (this would partially fix the unit tests)?

I've prepared a branch/PR with this change. It doesn't contain many changes, because the questions above need answering first, and it currently fails due to these unit tests. Here it is: #555

SimonSapin (Member) commented

I haven’t looked at the "http://\"" case specifically, but I’ll try to answer your direct questions.

  • Is my assumption correct, is "http://\"" incorrect?

It really depends how exactly you define “incorrect”.

The WHATWG URL specification has, separately:

  • A concept of valid URL string, designating a set of Unicode strings defined as a pseudo-grammar written in English

  • A basic URL parser algorithm, defined as pseudo-code written in English. This algorithm takes a Unicode string and returns either a URL object or “failure”. The Url::parse function of this crate implements that algorithm and returns Result<Url, ParseError>, where ParseError represents “failure”.

    The parser may also emit any number of validation errors, even when it does not return “failure”. This crate implements ParseOptions::syntax_violation_callback as a way to see those; a short sketch follows this list. (Validation errors used to be called syntax violations in a previous version of the spec.)
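
For example, a minimal sketch of observing validation errors through that callback (the input here is illustrative; which violations are emitted varies by crate version):

    use std::cell::RefCell;
    use url::{SyntaxViolation, Url};

    fn main() {
        let violations = RefCell::new(Vec::<SyntaxViolation>::new());
        let url = Url::options()
            .syntax_violation_callback(Some(&|v| violations.borrow_mut().push(v)))
            .parse("https:////example.com/")
            .unwrap();
        // Parsing succeeded, yet validation errors may still have been reported.
        println!("{} -> {:?}", url, violations.borrow());
    }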

https://url.spec.whatwg.org/#urls also states:

A valid URL string defines what input would not trigger a validation error or failure when given to the URL parser. I.e., input that would be considered conforming or valid.

However I don’t know if this is accurate. That is, is there no Unicode string that matches the grammar but for which the algorithm emits a validation error or returns failure? Or, is there no Unicode string that doesn’t match the grammar but for which the algorithm returns a URL record and does not emit any validation error? This may be worth filing a spec issue. Even if this statement is accurate right now, it is rather fragile in the face of future spec changes.

Also note that the parsing algorithm takes a base URL as an optional additional parameter, but the “valid URL string” concept is defined as independent of the presence or contents of that base URL.
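
In this crate, that optional base URL corresponds to ParseOptions::base_url, or more conveniently Url::join; a minimal sketch:

    use url::Url;

    fn main() -> Result<(), url::ParseError> {
        let base = Url::parse("http://example.com/a/b")?;
        // Equivalent to running the basic URL parser on "../c" with `base`
        // as the base URL.
        let joined = base.join("../c")?;
        assert_eq!(joined.as_str(), "http://example.com/c");
        Ok(())
    }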

  • Should hostnames be limited to ASCII and IDNA characters?

Limited in order to fit what definition?

  • Are the linked unit tests faulty, based on the above?

urltestdata.json comes from https://github.com/web-platform-tests/wpt/tree/master/url, which is developed together with the spec. However our copy of it (and our implementation of the algorithm) is out of date and known not to match the current spec.

timvisee (Author) commented Dec 1, 2019

I'm sorry, but I did not (and still don't) have time to look into this further. Thank you for your extensive answer, though.

I'll close this for now, but I might reopen it in the future if I have something more concrete.
