
I think host parsing is incorrect, might not follow spec, use ASCII / IDNA only #554

Closed
timvisee opened this issue Oct 15, 2019 · 2 comments

timvisee commented Oct 15, 2019

A few days ago I discovered that some strings are parsed successfully as a Url even though I'm quite certain they are invalid, such as "http://\"". (see seanmonstar/reqwest#668 via #552)

The problem

I traced the problem to hostname parsing. At the moment, any hostname characters are accepted (except for a small blacklist). I believe hostnames must be limited to ASCII or IDNA-encodable characters, so such input should be a parse error.

To fix this, I changed the following line to use domain_to_ascii_strict instead.
This strict variant fails on any character that is not encodable as ASCII/IDNA:

let domain = idna::domain_to_ascii(&domain)?;

Changing this does solve the issue I linked above: seanmonstar/reqwest#668
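
For illustration, here is a minimal sketch of the difference between the two functions (assuming the idna crate as a direct dependency; the exact error type and printed output depend on the crate version):

    fn main() {
        // Lenient variant: processes the input without the STD3 ASCII rules,
        // so the quote character passes through; expected to return Ok.
        println!("{:?}", idna::domain_to_ascii("\""));
        // Strict variant: applies the STD3 ASCII rules, rejecting characters
        // not allowed in an ASCII/IDNA hostname; expected to return Err.
        println!("{:?}", idna::domain_to_ascii_strict("\""));
    }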

The catch

But there's a catch: this change makes a few of the included unit tests fail. I'm wondering whether these unit tests conform to the URL specification that is explicitly linked in this crate's README. If they do not, I'd argue they should be fixed or removed.

Failing tests

  • First, the 'Domains with empty labels' cases from the test data set fail:

    "Domains with empty labels",
    {
        "input": "http://./",
        "base": "about:blank",
        "href": "http://./",
        "origin": "http://.",
        "protocol": "http:",
        "username": "",
        "password": "",
        "host": ".",
        "hostname": ".",
        "port": "",
        "pathname": "/",
        "search": "",
        "hash": ""
    },
    {
        "input": "http://../",
        "base": "about:blank",
        "href": "http://../",
        "origin": "http://..",
        "protocol": "http:",
        "username": "",
        "password": "",
        "host": "..",
        "hostname": "..",
        "port": "",
        "pathname": "/",
        "search": "",
        "hash": ""
    },
    {
        "input": "http://0..0x300/",
        "base": "about:blank",
        "href": "http://0..0x300/",
        "origin": "http://0..0x300",
        "protocol": "http:",
        "username": "",
        "password": "",
        "host": "0..0x300",
        "hostname": "0..0x300",
        "port": "",
        "pathname": "/",
        "search": "",
        "hash": ""
    },

    This is probably related to the following:

  • Second, the leading dots (host) unit test fails:

    rust-url/tests/unit.rs, lines 421 to 429 at 7d2c9d6:

    #[test]
    // https://github.com/servo/rust-url/issues/166
    fn test_leading_dots() {
        assert_eq!(
            Host::parse(".org").unwrap(),
            Host::Domain(".org".to_owned())
        );
        assert_eq!(Url::parse("file://./foo").unwrap().domain(), Some("."));
    }

    Issues #166 (Panic when parsing a . in file URLs) and #337 ([idna] Preserve leading dots in host) are related.

  • Third, the host unit test fails:

    rust-url/tests/unit.rs, lines 203 to 237 at 7d2c9d6:

    #[test]
    fn host() {
        fn assert_host(input: &str, host: Host<&str>) {
            assert_eq!(Url::parse(input).unwrap().host(), Some(host));
        }
        assert_host("http://www.mozilla.org", Host::Domain("www.mozilla.org"));
        assert_host(
            "http://1.35.33.49",
            Host::Ipv4(Ipv4Addr::new(1, 35, 33, 49)),
        );
        assert_host(
            "http://[2001:0db8:85a3:08d3:1319:8a2e:0370:7344]",
            Host::Ipv6(Ipv6Addr::new(
                0x2001, 0x0db8, 0x85a3, 0x08d3, 0x1319, 0x8a2e, 0x0370, 0x7344,
            )),
        );
        assert_host("http://1.35.+33.49", Host::Domain("1.35.+33.49"));
        assert_host(
            "http://[::]",
            Host::Ipv6(Ipv6Addr::new(0, 0, 0, 0, 0, 0, 0, 0)),
        );
        assert_host(
            "http://[::1]",
            Host::Ipv6(Ipv6Addr::new(0, 0, 0, 0, 0, 0, 0, 1)),
        );
        assert_host(
            "http://0x1.0X23.0x21.061",
            Host::Ipv4(Ipv4Addr::new(1, 35, 33, 49)),
        );
        assert_host("http://0x1232131", Host::Ipv4(Ipv4Addr::new(1, 35, 33, 49)));
        assert_host("http://111", Host::Ipv4(Ipv4Addr::new(0, 0, 0, 111)));
        assert_host("http://2..2.3", Host::Domain("2..2.3"));
        assert!(Url::parse("http://42.0x1232131").is_err());
        assert!(Url::parse("http://192.168.0.257").is_err());
    }

    Only lines 219 and 234 fail (the "http://1.35.+33.49" and "http://2..2.3" assertions). These have to do with dots and other non-IDNA characters again; a consolidated sketch of the failing behavior follows this list.
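
Taken together, the failing assertions all pin down hosts that strict IDNA encoding would reject. Here is a minimal sketch of that behavior against the crate's public API (these assertions pass with the current lenient parsing and would start failing with domain_to_ascii_strict):

    use url::{Host, Url};

    fn main() {
        // A lone-dot host currently survives parsing ("Domains with empty labels").
        assert_eq!(Url::parse("http://./").unwrap().host_str(), Some("."));
        // Leading dots are preserved (test_leading_dots).
        assert_eq!(Host::parse(".org").unwrap(), Host::Domain(".org".to_owned()));
        // Empty labels inside a domain are accepted (host test, file line 234).
        assert_eq!(
            Url::parse("http://2..2.3").unwrap().host(),
            Some(Host::Domain("2..2.3"))
        );
    }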

I don't know whether these tests conform to the URL specification. Sadly, I don't have enough time right now to study this.

Questions

Now, I might be totally wrong about this, which is why I'm asking the url maintainers the following questions. Hopefully you can help me out.

  • Is my assumption correct that "http://\"" is invalid?
  • Should Url::parse(...) fail for input that does not conform to the URL specification?
  • Should hostnames be limited to ASCII and IDNA characters?
  • Are the linked unit tests faulty, based on the above?

Based on the questions above, what would be the proper steps to take, if any?
Maybe a special case should be implemented for handling . and .. domains when a base is known (this would partially fix the unit tests)?

I've prepared a branch/PR with this change. It doesn't contain many changes, because the questions above need answering first, and it currently fails due to these unit tests. Here it is: #555

SimonSapin (Member) commented

I haven’t looked at the "http://\"" case specifically, but I’ll try to answer your direct questions.

  • Is my assumption correct, is "http://\"" incorrect?

It really depends how exactly you define “incorrect”.

The WHATWG URL specification has, separately:

  • A concept of valid URL string, designating a set of Unicode strings defined as a pseudo-grammar written in English

  • A basic URL parser algorithm, defined as pseudo-code written in English. This algorithm takes a Unicode string and returns either a URL object or “failure”. The Url::parse function of this crate implements that algorithm and returns Result<Url, ParseError>, where ParseError represents “failure”.

    The parser may also emit any number of validation errors, even when it does not return “failure”. This crate implements ParseOptions::syntax_violation_callback as a way to see those; a short sketch follows this list. (Validation errors used to be called syntax violations in a previous version of the spec.)
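
For example, a minimal sketch of observing validation errors through that callback (the input here is illustrative; which violations are emitted varies by crate version):

    use std::cell::RefCell;
    use url::{SyntaxViolation, Url};

    fn main() {
        let violations = RefCell::new(Vec::<SyntaxViolation>::new());
        let url = Url::options()
            .syntax_violation_callback(Some(&|v| violations.borrow_mut().push(v)))
            .parse("https:////example.com/")
            .unwrap();
        // Parsing succeeded, yet validation errors may still have been reported.
        println!("{} -> {:?}", url, violations.borrow());
    }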

https://url.spec.whatwg.org/#urls also states:

A valid URL string defines what input would not trigger a validation error or failure when given to the URL parser. I.e., input that would be considered conforming or valid.

However I don’t know if this is accurate. That is, is there no Unicode string that matches the grammar but for which the algorithm emits a validation error or returns failure? Or, is there no Unicode string that doesn’t match the grammar but for which the algorithm returns a URL record and does not emit any validation error? This may be worth filing a spec issue. Even if this statement is accurate right now, it is rather fragile in the face of future spec changes.

Also note that the parsing algorithm takes a base URL as an optional additional parameter, but the “valid URL string” concept is defined as independent of the presence or contents of that base URL.
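
In this crate, that optional base URL corresponds to ParseOptions::base_url, or more conveniently Url::join; a minimal sketch:

    use url::Url;

    fn main() -> Result<(), url::ParseError> {
        let base = Url::parse("http://example.com/a/b")?;
        // Equivalent to running the basic URL parser on "../c" with `base`
        // as the base URL.
        let joined = base.join("../c")?;
        assert_eq!(joined.as_str(), "http://example.com/c");
        Ok(())
    }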

  • Should hostnames be limited to ASCII and IDNA characters?

Limited in order to fit what definition?

  • Are the linked unit tests faulty, based on the above?

urltestdata.json comes from https://github.com/web-platform-tests/wpt/tree/master/url, which is developed together with the spec. However our copy of it (and our implementation of the algorithm) is out of date and known not to match the current spec.

timvisee (Author) commented Dec 1, 2019

I'm sorry, but I did not (and still don't) have time to look into this further. Thank you for your extensive answer, though.

I'll close this for now, but I might reopen it in the future if I have something more concrete.
