Initialize the IgnoreInvalidPunycode flag when calling UTS 46 #821

hsivonen · 2024-02-06T07:25:25Z

What is the issue with the URL Standard?

UTS 46 revision 31 added an IgnoreInvalidPunycode flag to its ToASCII and ToUnicode operations. The URL Standard should be explicit about the value of this flag when it calls into ToASCII or into ToUnicode.

hsivonen · 2024-03-01T12:37:06Z

AFAICT, the current behavior of Firefox and Safari would be consistent with setting this flag to false and Chrome’s behavior would be consistent with setting this flag to true.

Looking at how browsers comply with the existing spec, Safari seems to comply well, Firefox seems to comply except that it fails to enforce the bidi rule on LTR labels in a bidi domain name (i.e. Firefox enforces the bidi rule on a per-label basis), and Chrome’s behavior seems hard to explain from the spec.

These observations would support setting IgnoreInvalidPunycode to false. However, I’m missing some context of why the IgnoreInvalidPunycode flag was introduced in UTS 46. The rationale says it enables an ASCII fast path, but UTS 46 still requires validating xn-- labels that decode successfully as Punycode, so the flag does not, AFAICT, enable an ASCII fast path in general (and the “industry practice” evidently doesn’t cover Firefox and Safari).

@markusicu, @macchiati, can you share more context for the motivation of IgnoreInvalidPunycode and how you’d expect the URL Standard to set the flag?

macchiati · 2024-03-02T17:47:29Z

I can't remember off the top of my head; would have to look back at the development notes.


annevk · 2024-03-03T08:11:47Z

Yeah I don't understand this either. This was not part of our feedback to UTS46 last year (#744) and I would not want ASCII special casing of this sort.

socram8888 · 2024-11-26T20:17:57Z

I've been trying to figure out why my domain was not working on FF but did on Chrome, and found out about the IgnoreInvalidPunycode flag.

I'd encourage you to set it to true, as false will break domains that can be registered - see my xn--i29h.kz domain.

annevk · 2024-11-27T07:08:32Z

That domain also fails in Safari and in any conforming URL parser: https://jsdom.github.io/whatwg-url/#url=aHR0cHM6Ly94bi0taTI5aC5rei8=&base=YWJvdXQ6Ymxhbms=. There are certainly domains you can register or use as subdomain that won't end up working. It's not immediately clear to me that all of those necessarily should.

cc @markusicu

socram8888 · 2024-11-27T08:42:40Z

@annevk That website you've just given me kinda proves why IgnoreInvalidPunycode should be true.

If a URL were to have a Unicode 15.1 character such as \U0002EBF0, my Firefox ESR 128.0 would be unable to process it - not even in the punycoded form! https://jsdom.github.io/whatwg-url/#url=aHR0cDovL3huLS04ZzBuLmNvbS8=&base=YWJvdXQ6Ymxhbms=

And even more, if you try to use 🪉, the harp emoji in 16.0, it will not work on either: https://jsdom.github.io/whatwg-url/#url=aHR0cDovL3huLS1rMDloLmNvbS8=&base=YWJvdXQ6Ymxhbms=

Despite being actually valid according to IdnaMappingTable for 16.0.0:

1FA89         ; valid      ;      ; NV8    # 16.0 HARP

Why is that? Because the tr46 library @jsdom/whatwg-url uses implements UTS 46 with the IDNA table 15.1.0, while my Firefox ESR 128.0 supports only up to 15.0.0, with the latest being 16.0.0.

If IgnoreInvalidPunycode were true by default, as it is on Chrome, browsers would still prevent accessing via invalid or not-yet-supported Unicode characters that could introduce security problems due to homographic attacks and confusables, but would allow navigating just fine via the punycoded version.

In short, requiring software updates to use new DNS domains that are all valid per basic RFC 1034 seems like a bad idea with no obvious benefits to me.

rmisev · 2024-11-27T16:31:50Z

I don't think xn--i29h.kz is a good example for this topic. It contains valid punycode that decodes to the U+1FACD character. This character is disallowed according to IdnaMappingTable.txt. This means that the ToASCII will return a failure even if IgnoreInvalidPunycode is true.

socram8888 · 2024-11-27T16:41:34Z

@rmisev Fair enough. It's not valid yet, and won't be until Unicode 17 is released next year, but it will be, and regardless it's a valid RFC 1034 domain.

If you prefer, xn--k09h.com (0x1FA89), xn--q09h.com (0x1FA8F), xn--cvf.com (0x1B4E), xn--1ph.com (0x2427), xn--w78a.com (0xA7CB), etc. are all examples of domains that, while valid as of the current UTS 46 revision 33, cannot be used in any existing URL parsing implementation that uses IgnoreInvalidPunycode=false, because they all seemingly target revision 31 with an older IDNA table (vs. the new one at https://unicode.org/Public/idna/16.0.0/IdnaMappingTable.txt).

hsivonen · 2024-11-28T08:23:13Z

@socram8888 , as @rmisev already said, IgnoreInvalidPunycode isn't about what you are talking about. Your examples fail at the validation step and not at the Punycode decode step.
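The two stages distinguished above can be sketched in a few lines of Python. This is only an illustration, not the UTS 46 algorithm: `classify_label` is a hypothetical helper, the stage names are made up, and the "validation" here is a stand-in that merely checks whether the local Python build's Unicode database considers a code point unassigned (real UTS 46 validation consults IdnaMappingTable.txt and applies further checks). The point is that the flag can only matter in the branch where the Punycode decode itself fails.

```python
import unicodedata

ACE_PREFIX = "xn--"


def classify_label(label: str, ignore_invalid_punycode: bool = False):
    """Hypothetical helper: return (stage, decoded_text_or_None) for one label."""
    if not label.lower().startswith(ACE_PREFIX):
        return ("not-ace", None)
    try:
        decoded = label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    except UnicodeError:
        # The only branch the flag influences: the Punycode decode itself failed.
        stage = "kept-as-ascii" if ignore_invalid_punycode else "punycode-error"
        return (stage, None)
    # Stand-in for UTS 46 validation (an assumption, not the real rule set):
    # reject code points that this Python build reports as unassigned ("Cn").
    # The outcome depends on the Unicode version bundled with the interpreter.
    if any(unicodedata.category(ch) == "Cn" for ch in decoded):
        return ("validation-error", decoded)
    return ("passed-stand-in-validation", decoded)


for flag in (False, True):
    # "xn--i29h" decodes cleanly, so the flag does not change its outcome;
    # "xn--9" is a truncated Punycode string, so the flag does.
    print(flag, classify_label("xn--i29h", flag)[0], classify_label("xn--9", flag)[0])
```

With this sketch, `xn--i29h` decodes to U+1FACD regardless of the flag, and whatever happens next is decided by validation alone; only a label such as the truncated `xn--9` takes the branch that the flag controls.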

hsivonen · 2024-11-28T08:31:49Z

@socram8888 ,

As for the validation step: The validation step needs to prohibit unassigned code points when going from the Unicode form to the ASCII form, because their future normalization behavior isn't known and the operation depends on normalization.

If labels that are already in the ASCII form weren't subject to validation, round-trippability between the ASCII and Unicode forms wouldn't be stable.

This does have the side effect that domains from the future don't work in old software, but that's not too much of a practical problem in browsers due to the update cycle. (The update cycle isn't immediate. E.g. Firefox's IDNA is on Unicode 15.1 instead of 16.0 right now, but it's not unbounded waiting.)

In any case, you are asking for a design change to IDNA that doesn't belong in the URL Standard.

socram8888 · 2024-11-28T09:44:25Z

@hsivonen You are right. Reading it again, UTS 46 indeed requires a validation step that is not affected by the IgnoreInvalidPunycode flag, so even with that flag set, parsing said domain would still fail.

Chromium's URL parsing seems to just leave fully ASCII hostnames as is (https://github.com/chromium/chromium/blob/9df64a975a05e623c6f53e2e2a1936226b8dc42e/url/url_canon_host.cc#L467-L476 and https://github.com/chromium/chromium/blob/451e794a3a3abc8d999c4682da559ce1885af849/net/dns/dns_config_service_win.cc#L363-L373, for example).

When parsing a hostname, it fully allows instantiating URLs with totally invalid IDNAs, but will only display the valid and secure ones in decoded form. For example:

xn--espaa-rta.orca.pet, which is the hostname with proper NFKC normalization, is displayed on the navbar as españa.orca.pet.
xn--espana-0xd.orca.pet, which uses an invalid NFD normalization, is displayed as the original ASCII string.
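The difference between the two labels above can be checked with Python's built-in `punycode` codec and `unicodedata`. This is a minimal sketch: `decode_label` and `encode_label` are hypothetical helpers that only strip or add the `xn--` prefix, and NFC is used as the normalization form for the comparison. It also shows the round-trip issue: once the decomposed label is normalized, it re-encodes to the other ASCII label rather than back to itself.

```python
import unicodedata


def decode_label(ace_label: str) -> str:
    """Hypothetical helper: Punycode-decode a label that starts with 'xn--'."""
    return ace_label[4:].encode("ascii").decode("punycode")


def encode_label(text: str) -> str:
    """Hypothetical helper: Punycode-encode text and add the 'xn--' prefix."""
    return "xn--" + text.encode("punycode").decode("ascii")


composed = decode_label("xn--espaa-rta")     # uses precomposed U+00F1
decomposed = decode_label("xn--espana-0xd")  # uses 'n' + combining U+0303

# The first label is already in NFC; the second one is not.
composed_is_nfc = unicodedata.normalize("NFC", composed) == composed
decomposed_is_nfc = unicodedata.normalize("NFC", decomposed) == decomposed

# Normalizing the decomposed form and re-encoding yields the *other* label,
# so the ASCII form "xn--espana-0xd" does not survive a round trip.
round_tripped = encode_label(unicodedata.normalize("NFC", decomposed))
print(composed_is_nfc, decomposed_is_nfc, round_tripped)
```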

But regardless, it still allows instantiating both:

In my humble opinion, this is a perfect solution, as every single RFC 1034-conforming host is accessible, while making homoglyph attacks impossible (which is ultimately, I think, the whole point of validation). I don't fully comprehend why we have a URL standard that is unable to represent all DNS-conforming hostnames...

annevk · 2024-11-29T09:54:56Z

I dug into this a bit. We looked into an ASCII fast path in the past. They are bad: #309 (comment) (and also #267). xn-- prefixed host names have to conform to the rules of UTS46. If they don't, they'll be considered in error.

Also move UseSTD3ASCIIRules around in Unicode ToASCII to align with the UTS46 order. While this is not a change in behavior, this is not marked as editorial as UTS46 integration is somewhat significant and worth highlighting. Fixes #821.

socram8888 · 2024-11-29T10:18:11Z

@annevk But the problem is that UTS 46 is clearly another living standard, so there's no guarantee that what was invalid yesterday won't be valid tomorrow (as you noticed in #836 (comment)), and worse still, that what was valid yesterday won't be invalid tomorrow.

I understand validation changing as new issues are found when it comes to displaying it to the end user for confusables protection, but parsing shall IMO remain stable when dealing with ASCII-only domains...

annevk · 2024-11-29T11:20:09Z

Valid to invalid is indeed concerning, but given that UTS46 was created to counterbalance IDNA2008, which did exactly that, I'm not too worried about that happening. There are a couple of cases we still need to iron out around IDNA, but the relationship with UTS46 has been good and productive.

If there is some point where it becomes problematic we can always take stock then and determine appropriate next steps, including folding the algorithms that worked for us in directly.

And the validation in UTS46 is not really concerned with confusables. It's much lower-level. When to display Unicode to the end user is still mostly handled by proprietary algorithms, but I'm also rather suspect of that whole approach as you can have confusables within ASCII as well. Properly addressing phishing has to be done differently.


hsivonen added the topic: idna label Feb 6, 2024


valenting mentioned this issue Nov 27, 2024

IgnoreInvalidPunycode set to false breaks my domain servo/rust-url#1001

Closed

1 task

annevk mentioned this issue Nov 29, 2024

Set IDNA's IgnoreInvalidPunycode to false #843

Merged

annevk closed this as completed in #843 Nov 29, 2024

annevk mentioned this issue Dec 6, 2024

ContextJ (RFC 5892) is Security Theater #776

Closed
