IDNA / UTS #46 "should" requirements (Bidi and Joiners) #110
Interesting, yeah, I don't think that should be enforced. We should probably make this configurable in the IDNA standard. I noticed the line you quoted mentioned transitional processing. It seems Gecko is successfully shipping non-transitional processing these days. Perhaps Servo can do so too? And maybe the URL Standard should start requiring that? It's still rather contentious whether it's a good idea though...
What do other browsers do? CC @valenting
@SimonSapin all other browsers do transitional as far as I know. See https://bugzilla.mozilla.org/show_bug.cgi?id=1218179 and https://bugzilla.mozilla.org/show_bug.cgi?id=1255188 for details.
https://bugs.webkit.org/show_bug.cgi?id=144194

Reviewed by Darin Adler.

Source/WebCore:

Use uidna_nameToASCII instead of the deprecated uidna_IDNToASCII. It uses IDN2008 instead of IDN2003, and it uses UTS #46 when used with a UIDNA opened with uidna_openUTS46. This follows https://url.spec.whatwg.org/#concept-domain-to-ascii except we do not use Transitional_Processing to prevent homograph attacks on German domain names with "ß" and "ss" in them. These are now treated as separate domains. Firefox also doesn't use Transitional_Processing. Chrome and the current specification use Transitional_Processing, but whatwg/url#110 might change the spec.

In addition, http://unicode.org/reports/tr46/ says: "implementations are encouraged to apply the Bidi and ContextJ validity criteria"

Bidi checks prevent domain names with bidirectional text, such as Latin and Hebrew characters in the same domain. Chrome and Firefox do this.

ContextJ checks prevent code points such as U+200D, which is a zero-width joiner which users would not see when looking at the domain name. Firefox currently enables ContextJ checks and it is suggested by UTS #46, so we'll do it.

ContextO checks, which we do not use and neither does any other browser nor the spec, would fail if a domain contains code points such as U+30FB, which looks somewhat like a dot. We can investigate enabling these checks later.

Covered by new API tests and rebased LayoutTests. The new API tests verify that we do not use transitional processing, that we do apply the Bidi and ContextJ checks, but not ContextO checks.

* platform/URLParser.cpp:
(WebCore::URLParser::domainToASCII):
(WebCore::URLParser::internationalDomainNameTranscoder):
* platform/URLParser.h:
* platform/mac/WebCoreNSURLExtras.mm:
(WebCore::mapHostNameWithRange):

Tools:

* TestWebKitAPI/Tests/WebCore/URLParser.cpp:
(TestWebKitAPI::TEST_F):
Add some tests from http://unicode.org/faq/idn.html verifying that we follow UTS46's deviations from IDN2008.
Add some tests based on https://tools.ietf.org/html/rfc5893 verifying that we check for bidirectional text.
Add a test based on https://tools.ietf.org/html/rfc5892 verifying that we do not do ContextO checks.
Add a test for U+321D and U+321E which have particularly interesting punycode encodings. We match Firefox here now.
Also add a test from http://www.unicode.org/reports/tr46/#IDNAComparison verifying we are not using IDN2003.
We should consider importing all of http://www.unicode.org/Public/idna/9.0.0/IdnaTest.txt as URL domain tests.

LayoutTests:

* fast/encoding/idn-security.html: Move some characters with changed IDN encodings to inside the check for old ICU.
* fast/url/idna2003-expected.txt:
* fast/url/idna2008-expected.txt: Update expected results. We are now more compliant with IDN2008.

git-svn-id: http://svn.webkit.org/repository/webkit/trunk@208902 268f45cc-cd09-0410-ab3c-d52691b4dbfc
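The "ß" versus "ss" difference described in this commit message can be illustrated with the Python standard library alone. This is a rough sketch, not what WebKit or ICU does. It assumes that Python's built-in `idna` codec, which implements IDNA 2003, is an acceptable stand-in for Transitional_Processing in the "ß" case. The helper `encode_unmapped` is a hypothetical name: it Punycode-encodes each label as-is and skips the mapping, normalization, and validity steps that a real UTS #46 implementation performs.

```python
def to_ascii_idna2003(domain: str) -> str:
    """Encode with Python's built-in IDNA 2003 codec, which maps "ß" to "ss"."""
    return domain.encode("idna").decode("ascii")


def encode_unmapped(domain: str) -> str:
    """Hypothetical helper: Punycode-encode each non-ASCII label unchanged.

    No mapping, normalization, or validation is done, so this only shows
    what happens when "ß" is kept instead of being replaced by "ss".
    """
    labels = []
    for label in domain.split("."):
        if label.isascii():
            labels.append(label)
        else:
            labels.append("xn--" + label.encode("punycode").decode("ascii"))
    return ".".join(labels)


if __name__ == "__main__":
    # With IDNA 2003 style mapping both spellings collapse to one domain.
    print(to_ascii_idna2003("faß.de"))
    # Keeping "ß" yields a distinct ASCII form.
    print(encode_unmapped("faß.de"))
```

With the first function "faß.de" and "fass.de" become the same name, which is the behavior the commit moves away from. With the second they stay distinct.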
WebKit just switched to non-transitional processing and added tests verifying that we do Bidi checks and ContextJ checks. We don't do ContextO checks because nobody else does yet. See https://bugs.webkit.org/show_bug.cgi?id=144194
Just an update on user-agent support from a random user who happened to Googlewhack to this thread (edit: sorry, I seem to have gone a tad off topic):
In a quick check of development environments I've used in the last 24 hours (transparent support via each language's HTTP interfaces is untested):
There are more recent tickets that I filed: https://bugs.chromium.org/p/chromium/issues/detail?id=694157 and https://developer.microsoft.com/en-us/microsoft-edge/platform/issues/11009037/.
@annevk Thanks! I missed those while Googling.
See also #263 for issues with Python that may well exist in other implementations. Interoperability issues for everyone.
FWIW, I'm pretty sure CONTEXTJ must be false, otherwise 👩‍⚕️ (U+1F469 U+200D U+2695 U+FE0F) cannot be represented, whereas that works fine in user agents. (As seen in http://www.unicode.org/reports/tr46/tr46-18.html#Validity_Criteria.)
Hmm, maybe that's wrong. Safari definitely doesn't seem to do the same thing as Firefox for CONTEXTJ, though per https://trac.webkit.org/changeset/208902/webkit it should. @achristensen07, any insights? I kinda think we should allow CONTEXTJ if we allow emojis for subdomains. Banning a subset of emojis seems a little weird.
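To see why a ContextJ check conflicts with emoji sequences, here is a small sketch of the ZERO WIDTH JOINER rule from RFC 5892 Appendix A.2, which accepts U+200D only directly after a character whose Canonical_Combining_Class is Virama. The function name `zwj_rule_ok` is made up for illustration. The companion rule for U+200C (ZERO WIDTH NON-JOINER) is left out because it needs Joining_Type data that the Python standard library does not expose, so this is not a complete CheckJoiners implementation.

```python
import unicodedata

ZWJ = "\u200d"
VIRAMA_CCC = 9  # numeric Canonical_Combining_Class value for Virama


def zwj_rule_ok(label: str) -> bool:
    """Return True if every U+200D in the label directly follows a Virama."""
    for i, ch in enumerate(label):
        if ch != ZWJ:
            continue
        # A joiner at the start has no preceding character, so it fails.
        if i == 0 or unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
            return False
    return True


if __name__ == "__main__":
    # Devanagari KA + VIRAMA + ZWJ + SSA: the joiner follows a Virama.
    print(zwj_rule_ok("\u0915\u094d\u200d\u0937"))
    # Woman + ZWJ + staff of Aesculapius + VS16: the joiner follows an emoji.
    print(zwj_rule_ok("\U0001F469\u200d\u2695\ufe0f"))
```

Under this rule the emoji sequence is rejected because U+1F469 is not a Virama, which matches the concern raised in the comments above.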
http://www.unicode.org/reports/tr46/proposed.html#Processing has these now as input flags, so they're no longer "should" requirements. My limited testing shows CheckBidi should be true. For CheckJoiners the results are unclear. Input welcome.
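For readers unfamiliar with what CheckBidi implies, the following is a sketch of the six-condition Bidi Rule from RFC 5893 Section 2, applied only when the domain contains a right-to-left character. It assumes the input is already in Unicode form (labels decoded from Punycode) and relies on the Bidi classes from whatever Unicode version ships with the Python runtime. The function names are illustrative and not taken from any specification.

```python
import unicodedata

RTL_ALLOWED = {"R", "AL", "AN", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}
LTR_ALLOWED = {"L", "EN", "ES", "CS", "ET", "ON", "BN", "NSM"}


def label_passes_bidi_rule(label: str) -> bool:
    """Check one label against the six conditions of RFC 5893 Section 2."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    if not classes:
        return True  # empty labels are out of scope for this sketch
    # Find the last class, ignoring trailing nonspacing marks (NSM).
    end = len(classes)
    while end > 0 and classes[end - 1] == "NSM":
        end -= 1
    last = classes[end - 1] if end else None
    first = classes[0]
    if first in ("R", "AL"):  # condition 1, right-to-left label
        if any(c not in RTL_ALLOWED for c in classes):  # condition 2
            return False
        if last not in ("R", "AL", "EN", "AN"):  # condition 3
            return False
        return not ("EN" in classes and "AN" in classes)  # condition 4
    if first == "L":  # condition 1, left-to-right label
        if any(c not in LTR_ALLOWED for c in classes):  # condition 5
            return False
        return last in ("L", "EN")  # condition 6
    return False  # condition 1: first character must be L, R, or AL


def domain_passes_check_bidi(domain: str) -> bool:
    """Apply the Bidi Rule to every label, but only for Bidi domain names."""
    labels = domain.split(".")
    is_bidi_domain = any(
        unicodedata.bidirectional(ch) in ("R", "AL", "AN")
        for label in labels
        for ch in label
    )
    if not is_bidi_domain:
        return True
    return all(label_passes_bidi_rule(label) for label in labels)
```

A label that mixes Latin and Hebrew letters, such as "aא", fails condition 5, while an all-Hebrew label next to an all-Latin label passes.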
And in case it wasn't clear, Nontransitional_Processing has been used in the URL Standard since #239.
Fixes #53 and fixes #267 by no longer breaking on hyphens in the 3rd and 4th position of a domain label. Rejecting such labels is known to break YouTube: r3---sn-2gb7ln7k.googlevideo.com. This is done by setting the proposed CheckHyphens flag to false. Fixes #110 by clarifying that BIDI and CONTEXTJ checks are to be done by setting the proposed CheckBidi and CheckJoiners flags to true. Follow-up #313 is filed to remove the proposed bits once Unicode is updated.
Tests: web-platform-tests/wpt#5976.
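As a concrete illustration of the hyphen restriction discussed here, this sketch encodes the two hyphen-related validity criteria from UTS #46 (no U+002D in both the third and fourth positions, and no leading or trailing U+002D) behind a boolean parameter. The function name `label_hyphens_ok` and its `check_hyphens` parameter are hypothetical and only mirror the proposed CheckHyphens flag; no other validity criteria are checked.

```python
def label_hyphens_ok(label: str, check_hyphens: bool) -> bool:
    """Apply only the hyphen criteria, and only when check_hyphens is True."""
    if not check_hyphens:
        return True
    # Hyphens in both the third and fourth positions are rejected.
    if label[2:4] == "--":
        return False
    # A label may neither begin nor end with a hyphen.
    if label.startswith("-") or label.endswith("-"):
        return False
    return True


if __name__ == "__main__":
    host = "r3---sn-2gb7ln7k.googlevideo.com"
    for flag in (True, False):
        ok = all(label_hyphens_ok(label, flag) for label in host.split("."))
        print(flag, ok)
```

With `check_hyphens=True` the first label of the YouTube host is rejected. With `check_hyphens=False` every label passes, which is the behavior the change above adopts.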
It seems that most user agents enforce CheckJoiners if I set aside the more problematic emoji case. So I'll go with that.
https://url.spec.whatwg.org/#idna refers (through the “Unicode ToAscii” and “Unicode ToUnicode” algorithms) to http://www.unicode.org/reports/tr46/#Processing and relies on the error flag.
This in turn refers to Section 4.1 http://www.unicode.org/reports/tr46/#Validity_Criteria which has a series of “must” requirements. For example:
This section also has a subsection 4.1.2 http://www.unicode.org/reports/tr46/#Right_to_Left_Scripts
Note “should” (emphasis added) and “strongly recommended” rather than “must”.
If the URL Standard is to define interoperable algorithms, I think it needs to define which requirements in Section 4.1.2 set the error flag.
(Related: servo/rust-url#179)