can't parse urls starting with xn-- #438

Janpot · 2019-05-02T10:59:44Z

Can't seem to parse urls like http://xn--abc.com. This seems to work in browsers though.
I've been digging through the code and specs a bit.

It looks like tr46.toASCII returns an error. Digging further, it looks like it should implement this spec: https://www.unicode.org/reports/tr46/#Processing. But that seems to say:

Even if an error occurs, the conversion of the string is performed as much as is possible.

And it says

If the label starts with “xn--”:
Attempt to convert the rest of the label to Unicode according to Punycode [RFC3492]. If that conversion fails, record that there was an error, and continue with the next label. Otherwise replace the original label in the string by the results of the conversion.
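
To make the quoted xn-- step more concrete, here is a minimal sketch of the Punycode decoding it refers to, following the decoding procedure described in RFC 3492. The names `decodePunycode`, `adapt`, and `digitValue` are made up for this illustration and are not taken from whatwg-url or tr46; overflow checks are left out for brevity.

```javascript
// Minimal Punycode decoder sketch after RFC 3492 (decoding procedure).
// All function names here are hypothetical; overflow checks are omitted.
const BASE = 36, TMIN = 1, TMAX = 26, SKEW = 38, DAMP = 700;
const INITIAL_BIAS = 72, INITIAL_N = 128;

// Bias adaptation function from RFC 3492.
function adapt(delta, numPoints, firstTime) {
  delta = firstTime ? Math.floor(delta / DAMP) : Math.floor(delta / 2);
  delta += Math.floor(delta / numPoints);
  let k = 0;
  while (delta > Math.floor(((BASE - TMIN) * TMAX) / 2)) {
    delta = Math.floor(delta / (BASE - TMIN));
    k += BASE;
  }
  return k + Math.floor(((BASE - TMIN + 1) * delta) / (delta + SKEW));
}

// Map a code unit to its digit value (letters = 0..25, digits = 26..35), or -1.
function digitValue(code) {
  if (code >= 0x30 && code <= 0x39) return code - 0x30 + 26;
  if (code >= 0x41 && code <= 0x5a) return code - 0x41;
  if (code >= 0x61 && code <= 0x7a) return code - 0x61;
  return -1;
}

// Decode the part of a label that follows "xn--"; throws on ill-formed input.
function decodePunycode(input) {
  const output = [];
  const lastDelimiter = input.lastIndexOf("-");
  // Everything before the last hyphen is copied through as basic (ASCII) code points.
  for (let j = 0; j < lastDelimiter; j++) {
    const code = input.charCodeAt(j);
    if (code >= 0x80) throw new RangeError("non-basic code point in basic part");
    output.push(code);
  }
  let n = INITIAL_N, i = 0, bias = INITIAL_BIAS;
  let index = lastDelimiter > 0 ? lastDelimiter + 1 : 0;
  while (index < input.length) {
    const oldI = i;
    // Read one variable-length integer.
    for (let w = 1, k = BASE; ; k += BASE) {
      if (index >= input.length) throw new RangeError("truncated variable-length integer");
      const digit = digitValue(input.charCodeAt(index++));
      if (digit < 0) throw new RangeError("invalid Punycode digit");
      i += digit * w;
      const t = k <= bias ? TMIN : k >= bias + TMAX ? TMAX : k - bias;
      if (digit < t) break;
      w *= BASE - t;
    }
    const length = output.length + 1;
    bias = adapt(i - oldI, length, oldI === 0);
    n += Math.floor(i / length);
    i %= length;
    output.splice(i++, 0, n); // insert code point n at position i
  }
  return String.fromCodePoint(...output);
}

console.log(decodePunycode("nxa"));       // β
console.log(decodePunycode("bcher-kva")); // bücher
```

In this sketch, a label such as xn--x fails because its variable-length integer is cut short, and xn--ß fails because ß is not a Punycode digit. A label such as xn--a decodes without error (to U+0080), so any rejection of it would have to come from the validation that follows decoding rather than from Punycode itself.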

The url spec seems to dictate (https://url.spec.whatwg.org/#idna)

If result is a failure value, validation error, return failure.

I feel like this should be possible though; tr46 seems quite ambiguous as to what's recoverable and what's not.

I came across an example that renders and parses in the browser but seems to fail the parsing algorithm: http://xn--12cr4aua8bifvs3aljr6edb1al1vlg1a.blogspot.com (disclaimer: I am in no way connected to this url or the content of the site, it just passed by our systems)

In any case, I'm not super experienced in reading these specs, so take the previous with an appropriate grain of salt. It just seems strange to me that URLs can render in a browser but fail to parse according to the spec.

EDIT:

forgot to mention, when I say "I've been digging through the code" I'm talking about https://github.com/jsdom/whatwg-url. FWIW, the node.js native url parser seems to behave the same way.
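
For anyone who wants to reproduce the report, a small sketch along these lines can be run in Node.js or pasted into a browser console. `tryParse` is a hypothetical helper, not part of whatwg-url, and the results for the xn-- inputs depend on the runtime and its version, so no expected output is given for them.

```javascript
// Sketch: report whether the platform's WHATWG URL parser accepts an input.
// `tryParse` is a hypothetical helper name used only for this example.
function tryParse(input) {
  try {
    return new URL(input).href; // serialized URL on success
  } catch (err) {
    return null; // the URL constructor throws a TypeError on failure
  }
}

for (const input of ["http://example.com", "http://xn--abc.com", "http://xn--a.com"]) {
  console.log(input, "->", tryParse(input));
}
```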


annevk · 2019-05-02T12:25:14Z

Thank you for reporting this. Unfortunately, I suspect you're correct and this is not something we have adequately tested for thus far.

Interesting cases (Live URL Viewer links for comparison purposes):

https://xn--.com/ (jsdom has interesting behavior here)
https://xn--a.com/
https://xn--ß.com/ (this does seem to fail everywhere)

@macchiati I suspect this might require further adjustments to TR46 in due course, once we figure out the full details. Part of the problem here is that browsers have been slow on aligning with requirements in general and making the necessary adjustments.

Janpot · 2019-05-02T13:23:36Z

Digging a bit further, another corner case that seems related to this is that the host setter doesn't throw when set to an invalid value:

const x = new URL('http://example.com');
x.host = 'xn--a';
console.log(x.href);
// node: http://example.com/
// browser: http://xn--a/

While

const x = new URL('http://example.com');
x.href = 'http://xn--a/';
console.log(x.href);
// node: throws "Invalid URL: http://xn--a/"
// browser: http://xn--a/

Wouldn't it make sense to make all setters throw when the result is an invalid URL, not only the href setter? According to the spec, this behavior is only specified for the href setter. (Maybe this should be a separate issue?)
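
Until something like that exists, a throwing variant can be approximated in user code. The following is only a sketch under stated assumptions: `setHostOrThrow` and the sentinel host name are invented for this example, and the check relies on the setter leaving the URL untouched when it ignores the input.

```javascript
// Sketch of a host setter wrapper that throws instead of silently ignoring input.
// `setHostOrThrow` and SENTINEL are assumptions made for this example.
function setHostOrThrow(url, host) {
  const SENTINEL = "sentinel.invalid";
  const probe = new URL(url.href);
  probe.host = SENTINEL; // known-good value to compare against
  if (probe.hostname !== SENTINEL) {
    throw new TypeError("This URL cannot have its host changed");
  }
  probe.host = host; // left unchanged by the setter if `host` does not parse
  if (probe.hostname === SENTINEL && host.toLowerCase() !== SENTINEL) {
    throw new TypeError(`Invalid host: ${host}`);
  }
  url.host = host;
  return url;
}

const x = setHostOrThrow(new URL("http://example.com"), "example.org");
console.log(x.href); // http://example.org/
```

Because the probe goes through the same setter, the wrapper inherits whatever the runtime considers a valid host, so it will disagree across node and browsers for inputs like 'xn--a' in the same way the plain setter does.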

EDIT:

And here's another funny one:

const x = new URL('http://example.com');
x.host = 'xn--ß.com';
console.log(x.href);
// node: http://example.com/
// chrome: http://xn--%C3%9F.com/
// firefox: http://xn--xn---yna.com/ 
// safari: http://example.com/

Only Firefox seems to be consistent with itself:

const x = new URL('http://example.com');
x.host = 'xn--ß.com';
const y = new URL('http://xn--ß.com');
console.log(x.href === y.href);
// node: exception on line 3
// chrome: exception on line 3
// firefox: true
// safari: exception on line 3

annevk · 2019-05-02T14:17:11Z

That's how the host setter deals with incorrect input for largely historical reasons. The input ends up getting ignored. It should be functionally equivalent otherwise though.

Janpot · 2019-05-02T14:40:46Z

Ok I see. fwiw, I can't think of a situation where I want this to just ignore my input, rather than complain. But I guess that would be a breaking change by now.

annevk · 2019-05-02T14:54:35Z

Unfortunately it would be. There's been some talk about a dedicated host API, which will not have this problem.

zackw · 2019-10-10T17:06:16Z

I'd like to point out that the current rev of the IDNA RFC [IDNA2008] encourages applications that do DNS lookup to be liberal in what they accept, and in particular to "rely on the assumption that names that are present in the DNS are valid" except for specific cases which are known to cause "serious problems". In particular, note the text at the end of section 5.4:

For all other strings, the lookup application MUST rely on the
presence or absence of labels in the DNS to determine the validity of
those labels and the validity of the characters they contain. If
they are registered, they are presumed to be valid; if they are not,
their possible validity is not relevant.

where "all other strings" means "all strings that have passed the sequence of checks for 'serious problems' described in sections 5.3 and 5.4".

Here are some examples of URLs that I have personally observed in the wild (during my research, which involves Web crawling) to contain hostnames which are formally invalid per some RFC or other, but do not rise to the level of a 'serious problem', and which I think should probably be accepted by the URL standard, if only for interop's sake:

http://r2---sn-gvbxgn-tt1s.googlevideo.com/
http://r9---sn-i3b7sn7d.googlevideo.com/
http://lgbt_grani.livejournal.com/
http://www.mi-ru_mo.bbs.fc2.com/
http://-friction-.tumblr.com/
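
As a rough illustration of why these hostnames are formally invalid, the sketch below flags the label-level rules they break: characters outside letters, digits, and hyphen; leading or trailing hyphens; and hyphens in the third and fourth positions of a label that is not an xn-- label. The function names and rule descriptions are made up for this example and are a simplification, not a full implementation of any RFC.

```javascript
// Sketch: flag hostname labels that break common hostname rules.
// `classifyLabel`, `classifyHost`, and the rule names are hypothetical.
function classifyLabel(label) {
  const issues = [];
  if (/[^A-Za-z0-9-]/.test(label)) issues.push("non-LDH character");
  if (label.startsWith("-")) issues.push("leading hyphen");
  if (label.endsWith("-")) issues.push("trailing hyphen");
  // Hyphens in the 3rd and 4th positions are reserved for prefixes such as "xn--".
  if (label.slice(2, 4) === "--" && !label.toLowerCase().startsWith("xn--")) {
    issues.push("hyphens in positions 3 and 4");
  }
  return issues;
}

// Return an object mapping each offending label to the rules it breaks.
function classifyHost(host) {
  const report = {};
  for (const label of host.split(".")) {
    const issues = classifyLabel(label);
    if (issues.length > 0) report[label] = issues;
  }
  return report;
}

for (const host of [
  "r2---sn-gvbxgn-tt1s.googlevideo.com",
  "lgbt_grani.livejournal.com",
  "-friction-.tumblr.com",
]) {
  console.log(host, classifyHost(host));
}
```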

domenic · 2020-09-30T18:16:33Z

I wonder if we should consider enshrining browsers' "ASCII fast path", where they don't perform ToASCII on ASCII inputs. In https://bugs.chromium.org/p/chromium/issues/detail?id=724018 @annevk seemed to think that was a bad idea, but I'm not sure I fully understood the negative consequences of that direction.
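
For readers unfamiliar with the idea, the "ASCII fast path" can be sketched roughly as follows. This is an illustration only: `hostToAsciiWithFastPath` is a hypothetical name, the slow path borrows the platform's URL parser as a stand-in for a real ToASCII implementation, and actual browsers apply further checks (such as forbidden code points) that are not modeled here.

```javascript
// Sketch of the "ASCII fast path": all-ASCII hosts skip ToASCII entirely.
// `hostToAsciiWithFastPath` is hypothetical and omits checks real browsers perform.
function hostToAsciiWithFastPath(domain) {
  const isAscii = /^[\x00-\x7F]*$/.test(domain);
  if (isAscii) {
    // Fast path: only lowercase, so labels like "xn--a" are never Punycode-decoded.
    return domain.toLowerCase();
  }
  // Slow path: throws if the platform's parser rejects the domain.
  return new URL(`http://${domain}/`).hostname;
}

console.log(hostToAsciiWithFastPath("XN--A.COM")); // xn--a.com
console.log(hostToAsciiWithFastPath("β.com"));     // xn--nxa.com
```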

annevk · 2020-10-01T09:39:31Z

Yeah, I think that's probably needed given the number of existing systems that seem to rely on this to varying degrees. I think my concerns were mostly design-wise, that it seems somewhat bad to have a different set of restrictions on non-ASCII and ASCII input, e.g., with regards to length.

If it wasn't already the case it might also lead to certain security issues I suppose, as you can smuggle invalid xn-- sequences in that might trip up something downstream that is poorly configured. That browsers already allow this hopefully means the ecosystem is already robust against such surprises.

rmisev · 2020-10-01T09:43:19Z

Example: http://xn--a.xn--nxa/
According to the "ASCII fast path" it must pass. But I can't browse to this address. When entered in the address bar, Chrome converts it to http://xn--a.β/ and fails (because it treats http://xn--a.β/ as invalid). I think if Chrome converts it to Unicode form, then it must treat both as equivalent. But it actually doesn't:

new URL("http://xn--a.xn--nxa/") - succeeds
new URL("http://xn--a.β/") - throws

I think if a URL's ASCII form is valid, then its converted Unicode form must be valid too. And vice versa: if a URL is invalid in one form, it must be invalid in the other form too. Otherwise we get weird results. So I am opposed to the "ASCII fast path".
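
The consistency property described here can be checked mechanically. The sketch below uses two hypothetical helpers, `parseOrNull` and `parseConsistently`, to compare how the ASCII and Unicode spellings of a host parse; the result for the xn--a.β pair varies by runtime, so it is printed without an expected value.

```javascript
// Sketch: do two spellings of a URL parse to the same result (or both fail)?
// `parseOrNull` and `parseConsistently` are hypothetical helper names.
function parseOrNull(input) {
  try {
    return new URL(input).href;
  } catch {
    return null;
  }
}

function parseConsistently(asciiForm, unicodeForm) {
  return parseOrNull(asciiForm) === parseOrNull(unicodeForm);
}

console.log(parseConsistently("http://xn--nxa.com/", "http://β.com/")); // true
console.log(parseConsistently("http://xn--a.xn--nxa/", "http://xn--a.β/"));
```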

annevk · 2020-10-01T09:50:57Z

Well, but it does seem like Chrome (and similar for Firefox and Safari) attempts to browse to http://xn--a.xn--nxa/, right? Whereas for http://xn--a.β/ it does a search. So if you fiddled with your DNS to put something there it would probably work.

I agree that this leads to weirdly inconsistent rules though, so if we go down this path we should be very explicit about it and document these side effects.

rmisev · 2020-10-01T12:48:47Z

Well, but it does seem like Chrome (and similar for Firefox and Safari) attempts to browse to http://xn--a.xn--nxa/, right? Whereas for http://xn--a.β/ it does a search.

Yes, you are right. Anyway, converting a valid URL to an invalid one (even visually) is weird.

macchiati · 2020-10-01T18:20:03Z

A fix is being proposed for tr46. @markus Scherer <markus.icu@gmail.com>


markusicu · 2020-10-01T19:46:56Z

A fix is being proposed for tr46.

The fix proposed for UTS #46 is to detect "xn--" and "xn--ASCII-" as errors in a way that's equivalent with what IDNA2008 does. See https://www.unicode.org/L2/L2020/20240-utc165-properties-recs.pdf item F7 (on page 8).

More generally, the strategy is to report errors but still produce output (unless, for example, the Punycode string is ill-formed and thus not decodable), because different users/callers may ignore different types of errors.

domenic · 2020-10-01T19:52:05Z

Thanks @markusicu! However, what about xn-ASCII with no trailing dash? As seen in e.g. https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly94bi0tYS5jb20v&base=YWJvdXQ6Ymxhbms=, browsers treat these sorts of URLs as parseable. Will the updated UTS 46 also produce output for those?

markusicu · 2020-10-01T20:21:07Z

what about xn-ASCII with no trailing dash? ... Will the updated UTS 46 also produce output for those?

I assume you mean xn--ASCII with double hyphen. The difference is that the additional hyphen in xn--ASCII- separates the "basic characters" (ASCII) from the actual Punycode encoding, and that is empty in this kind of label, which means that just Punycode-decoding it returns the ASCII part and you have an alternate encoding of the same label. (Punycode does not fail. IDNA2008 fails a round-trip check.)

Without the additional hyphen the "ASCII" substring is not actually ASCII at all but it's all-non-ASCII Punycode.

I don't think that UTS #46 is missing anything for those.
https://www.unicode.org/reports/tr46/#ProcessingStepConvertValidate

It might be ill-formed Punycode, and the spec says to just record an error for that label. If it's well-formed, then the decoded string is subjected to validation, which in turn might record an error if there is a disallowed character or something else wrong.
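
The distinction between xn--ASCII- and xn--ASCII can be seen by splitting the part after the prefix at its last hyphen, which is how Punycode separates the basic characters from the encoded part. The helpers `splitPunycode` and `isAlternateAsciiEncoding` below are hypothetical names for a small sketch of that split; they do not decode anything.

```javascript
// Sketch: split the part after "xn--" into its basic (ASCII) part and encoded part.
// `splitPunycode` and `isAlternateAsciiEncoding` are hypothetical helper names.
function splitPunycode(rest) {
  const lastHyphen = rest.lastIndexOf("-");
  // No hyphen (or only a leading one): there is no basic part at all.
  if (lastHyphen <= 0) return { basic: "", encoded: rest };
  return { basic: rest.slice(0, lastHyphen), encoded: rest.slice(lastHyphen + 1) };
}

// True for labels of the form "xn--ASCII-": a basic part followed by an empty
// encoded part, which Punycode-decodes to the plain ASCII string.
function isAlternateAsciiEncoding(label) {
  if (!label.toLowerCase().startsWith("xn--")) return false;
  const { basic, encoded } = splitPunycode(label.slice(4));
  return basic !== "" && encoded === "";
}

console.log(splitPunycode("abc-"));      // { basic: 'abc', encoded: '' }
console.log(splitPunycode("bcher-kva")); // { basic: 'bcher', encoded: 'kva' }
console.log(splitPunycode("nxa"));       // { basic: '', encoded: 'nxa' }
```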

TimothyGu · 2021-05-17T19:33:04Z

This issue demonstrates a need for URLs such as xn--x.com to be preserved as xn--x.com, despite the Punycode decoding error. However, to prevent reparse bugs, we need to treat Unicode and validly-encoded ASCII versions of an invalid label the same way. In other words:

Both xn--a-ecp.ru and a⒈.ru should have the same parsing result (⒈ is a disallowed character according to UTS 46)
Both xn--a.xn--nxa and xn--a.β should have the same parsing result
It's unclear whether xn--é and xn--xn---epa should be allowed to have different parsing results (Double-encoded IDNA labels don't roundtrip #603)

I propose allowing ASCII labels with Punycode decoding errors to remain, but still forbid other types of UTS 46 error. So we have the following matrix:

| Domain | spec | Chrome | Firefox | Safari | proposal |
| --- | --- | --- | --- | --- | --- |
| xn--a-ecp.ru | fail | xn--a-ecp.ru | xn--a-ecp.ru | fail | fail |
| a⒈.ru | fail | fail | xn--a-q10i.ru | fail | fail |
| xn--a.xn--nxa | fail | xn--a.xn--nxa | xn--a.xn--nxa | xn--a.xn--nxa | xn--a.xn--nxa |
| xn--a.β | fail | fail | xn--a.xn--nxa | fail | xn--a.xn--nxa |
| xn--é | fail | fail | xn--xn---epa | fail | fail |
| xn--xn---epa | xn--xn---epa | xn--xn---epa | xn--xn---epa | xn--xn---epa | xn--xn---epa |

There's already precedent (Safari) for treating Punycode decoding errors differently from other UTS 46 failures, as one can see by comparing xn--a-ecp.ru against xn--a.xn--nxa. However, this also means we will need a UTS 46 modification to distinguish Punycode decoding errors from other types of errors.

One way to get this is adding an IgnoreInvalidPunycode boolean flag to UTS 46, and in Processing's xn-- step, change it to:

Attempt to convert the rest of the label to Unicode according to Punycode [RFC3492]. If that conversion fails, record that there was an error if the label contains non-ASCII characters or if IgnoreInvalidPunycode is false, and continue with the next label. Otherwise replace the original label in the string by the results of the conversion.
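
The proposed wording can be read as the following sketch. It is an illustration of the rule as stated above, not code from any UTS 46 implementation: `processXnLabel` is a hypothetical name, the decoder is passed in as a parameter, and `toyDecode` is a stand-in that knows a single label so the example stays short.

```javascript
// Sketch of the proposed xn-- step with the IgnoreInvalidPunycode flag.
// `processXnLabel` is hypothetical; `decode` is any Punycode decoder that
// throws on ill-formed input.
function processXnLabel(label, decode, ignoreInvalidPunycode) {
  try {
    // Decoding succeeded: replace the label with the decoded form.
    return { label: decode(label.slice(4)), error: false };
  } catch {
    const hasNonAscii = /[^\x00-\x7F]/.test(label);
    // Decoding failed: record an error only for non-ASCII labels or when the
    // flag is off; the original label is kept either way.
    return { label, error: hasNonAscii || !ignoreInvalidPunycode };
  }
}

// Toy decoder for demonstration only: it knows one label and rejects the rest.
function toyDecode(rest) {
  if (rest === "nxa") return "β";
  throw new RangeError("ill-formed Punycode");
}

console.log(processXnLabel("xn--nxa", toyDecode, true)); // { label: 'β', error: false }
console.log(processXnLabel("xn--x", toyDecode, true));   // { label: 'xn--x', error: false }
console.log(processXnLabel("xn--x", toyDecode, false));  // { label: 'xn--x', error: true }
console.log(processXnLabel("xn--ß", toyDecode, true));   // { label: 'xn--ß', error: true }
```

Passing the decoder in keeps the sketch independent of any particular Punycode library; validation of successfully decoded labels is a separate step and is not shown.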

annevk · 2023-01-10T12:10:34Z

@markusicu @macchiati thoughts on #438 (comment)? Especially for the xn--a.xn--nxa case where all implementations reportedly do the same thing, but UTS46 does not.

annevk · 2023-01-10T13:04:47Z

Hmm, judging from the results of https://wpt.fyi/results/url/toascii.window.html, it seems that only Chromium-based browsers still have a problem here, so maybe no change is needed. @foolip are you all planning on fixing those remaining failures?

foolip · 2023-01-11T07:23:16Z

@ricea can you make a judgment about these failures and the linked bugs?

ricea · 2023-01-12T03:32:54Z

@foolip I think we want to fix these.

I think the linked bug for these issues is slightly different, so I newly filed https://crbug.com/1406728

annevk · 2023-01-12T07:55:53Z

Thanks, I think that means we can close this out. @ricea your help and insights on #543 would be appreciated.

annevk added topic: idna topic: parser labels May 2, 2019

SimonSapin mentioned this issue Sep 10, 2019

URL crate is failing to parse these existing URLs servo/rust-url#489

Open

annevk mentioned this issue May 6, 2020

Verify domain is not empty after "domain to ASCII" #497

Merged

3 tasks

domenic mentioned this issue Sep 30, 2020

Refusing a mix of numeric-only and BIDI domains #543

Open

annevk mentioned this issue May 17, 2021

Editorial: Add note about when ToASCII = ASCII lowercase #598

Merged

sleevi mentioned this issue May 17, 2021

Double-encoded IDNA labels don't roundtrip #603

Closed

ghost mentioned this issue Nov 14, 2021

Consider switching to an Order Sorted Algebra model? alwinb/url-specification#13

Closed

annevk closed this as completed Jan 12, 2023

TimothyGu mentioned this issue Feb 23, 2023

IDNA: add a couple interesting ToASCII cases web-platform-tests/wpt#37907

Merged
