Amend scanner to support astral characters in identifiers when parsing es6+ #32096
Are there plans to provide emitter for ES5 and below?
|
We don't downlevel the es5 Unicode set into an es3-compatible thing - we just issue Btw, the Unicode table I've added was captured with Unicode v11.0. IIRC 12 is out and 13 is in draft. I can update it to a newer version by using latest node, probably, but we should come up with an update policy for the es6+ Unicode table - we don't want to add a |
Would it be possible to replace "invalid" characters with an identifier representing their codepoint? E.g., |
Technically any number of escaping schemes are possible, but by substituting characters, it could affect the public API using those identifiers. For an internal |
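To make the trade-off above concrete, here is a minimal sketch of one possible substitution scheme. The function name `mangleAstral` and the `_uXXXXX` naming convention are hypothetical, invented purely for illustration; nothing in this PR implements them.

```javascript
// Hypothetical scheme: replace each astral (non-BMP) character in an
// identifier with "_u" followed by its code point in uppercase hex.
function mangleAstral(name) {
  let out = "";
  // for...of iterates by code point, so surrogate pairs stay together.
  for (const ch of name) {
    const cp = ch.codePointAt(0);
    out += cp > 0xFFFF ? "_u" + cp.toString(16).toUpperCase() : ch;
  }
  return out;
}

// U+1D4D0 followed by "lpha" becomes an ASCII-only name.
const mangled = mangleAstral("\u{1D4D0}lpha");

// The substituted name is indistinguishable from a hand-written identifier
// that already has that spelling, so two distinct names can collide.
const collides = mangleAstral("_u1D4D0lpha") === mangled;
```

The collision is the problem described above: once a name is rewritten, anything that refers to it by its original spelling (such as a public API consumer) no longer matches, and the new spelling may clash with an existing name.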
@weswigham oh, I see it now. Thank you for clarification. |
Did ES3 support Unicode escapes in identifiers? I can't remember anymore.

    var foo = "Foo";
    console.log(\u0066\u006F\u006F); // -> "Foo"

Not that it would be of much help for astral-plane characters, since it's limited to 4-digit codepoints anyway... 😕 |
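For comparison, a small sketch of how the escape forms behave in an ES2015+ engine. The helper `parses` is a hypothetical name introduced here; it compiles a source string with `new Function` and reports whether a `SyntaxError` was thrown. The assumption is a reasonably recent engine such as current Node.js.

```javascript
// 4-digit escapes spell out BMP characters, so this refers to `foo`.
var foo = "Foo";
const viaEscape = \u0066\u006F\u006F;

// Hypothetical helper: true if the source compiles, false on a SyntaxError.
function parses(source) {
  try {
    new Function(source);
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}

// ES2015 code point escapes can name an astral character (here U+1D4D0).
const braceForm = parses("var \\u{1D4D0} = 1;");

// Spelling the same character as two 4-digit surrogate escapes is expected
// to be rejected, because each escape must itself be a valid identifier
// character and a lone surrogate is not.
const surrogatePairForm = parses("var \\uD835\\uDCD0 = 1;");
```

So the 4-digit form is indeed of no help for astral characters; only the braced `\u{...}` form introduced in ES2015 reaches them.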
ping @rbuckton for a review again ❤️ |
@rbuckton any more feedback on this? |
Other than the small comment mentioned above, this looks good.
The build script could be simplified and made to not rely on the current Node.js/V8 + Unicode version:

    const ID_Start = require('unicode-12.1.0/Binary_Property/ID_Start/code-points.js');
    const ID_Continue = require('unicode-12.1.0/Binary_Property/ID_Continue/code-points.js');
    // ...then add the other needed characters, and write the two arrays to disk.

I'd also like to propose not explicitly including
|
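As a rough sketch of the "write the two arrays to disk" step, the snippet below packs a list of code points into a flat array of inclusive range pairs and looks a code point up with a binary search. The names `toRangePairs` and `inRangePairs`, and the flat `[start, end, start, end, ...]` layout, are assumptions made for illustration and are not taken from this PR. The sample input stands in for the arrays exported by the `unicode-12.1.0` package.

```javascript
// Pack code points into a flat [start, end, start, end, ...] array.
function toRangePairs(codePoints) {
  const sorted = [...new Set(codePoints)].sort((a, b) => a - b);
  const pairs = [];
  for (const cp of sorted) {
    const last = pairs.length - 1;
    if (last >= 1 && pairs[last] + 1 === cp) {
      pairs[last] = cp; // extend the current range
    } else {
      pairs.push(cp, cp); // start a new single-element range
    }
  }
  return pairs;
}

// Binary search over the range pairs.
function inRangePairs(cp, pairs) {
  let lo = 0;
  let hi = pairs.length / 2 - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (cp < pairs[2 * mid]) hi = mid - 1;
    else if (cp > pairs[2 * mid + 1]) lo = mid + 1;
    else return true;
  }
  return false;
}

// Stand-in input; a real build script would pass the package's arrays.
const samplePairs = toRangePairs([0x41, 0x42, 0x43, 0x61, 0x2118, 0x1D4D0, 0x1D4D1]);
```

Because the input is just an array of numbers, the same packing step works regardless of whether the code points come from a pinned Unicode data package or from some other source.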
Huge shoutout to the devs who were patient enough to tend to this. =) Kudos. 👍
Also a huge shoutout to IBM and EBCDIC for creating the Next Line character ( @mathiasbynens, have you ever encountered a legitimate use of Line and Paragraph Separators? ( They're like Proxies. Not saying they're useless, only that I still haven't found a practical use for one (that doesn't involve membranes or DOM-related shenanigans…) |
Would you champion removing note 3 from this section of the spec which very explicitly calls them out as included, then? It would very much seem to me that identifiers named ゜ (which is in the other start table) are, in fact, supported in chrome, at runtime. (Or at least over at https://jsconsole.com/ , I'm on mobile) As for not relying on the runtime... I think I prefer to use the table generated from the runtime. We'll get exactly the character set matched by a known v8 version (we don't intend to keep multiple copies around, but will probably periodically update the table when new major node releases occur). |
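For readers unfamiliar with the approach, a table "generated from the runtime" could be produced roughly as sketched below. This is not the script added in this PR; `collectCodePoints` is a hypothetical name, and the sketch assumes an engine that supports Unicode property escapes in regular expressions.

```javascript
// Ask the running engine which code points match a property escape.
function collectCodePoints(pattern) {
  const result = [];
  for (let cp = 0; cp <= 0x10FFFF; cp++) {
    if (pattern.test(String.fromCodePoint(cp))) result.push(cp);
  }
  return result;
}

const idStart = collectCodePoints(/^\p{ID_Start}$/u);

// In Node.js builds with ICU, this records which Unicode version the
// table reflects; it may be undefined in other environments.
const unicodeVersion =
  typeof process !== "undefined" && process.versions
    ? process.versions.unicode
    : undefined;
```

The result depends entirely on the engine that runs the script, which is the property being debated here: it matches a known V8 version exactly, but it also inherits whatever that version gets wrong.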
That non-normative note just calls out the fact that I'm simply recommending the removal of the hardcoded Unicode code points such as U+2118 which are /\p{ID_Start}/u.test('\u2118'); The only reason you'd want to hardcode those characters explicitly is if you find an engine whose |
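The check above can be extended to a handful of the characters in question. The sample code points below are ones I believe carry the `Other_ID_Start` and `Other_ID_Continue` properties; treat the lists as illustrative rather than complete, and note that an engine with a bug in this area could answer differently.

```javascript
// Sample code points believed to be in Other_ID_Start / Other_ID_Continue.
const otherIdStart = [0x2118, 0x212E, 0x309B, 0x309C];
const otherIdContinue = [0x00B7, 0x0387, 0x1369, 0x19DA];

// True if the engine's property escape already covers the code point.
const matches = (pattern, cp) => pattern.test(String.fromCodePoint(cp));

const startCovered = otherIdStart.every((cp) => matches(/^\p{ID_Start}$/u, cp));
const continueCovered = otherIdContinue.every((cp) => matches(/^\p{ID_Continue}$/u, cp));
```

If both flags come out `true`, listing those code points by hand next to `\p{ID_Start}` and `\p{ID_Continue}` adds nothing, which is the redundancy being pointed out.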
It's great that you're discovering bugs and filing them <3 Thank you! This (the fact that engines will always have bugs) is exactly why I would recommend against relying on any particular runtime's behavior. Your above statement seems to conflict with:
IMHO it's better to have a separate implementation (so that you'd spot bugs which you could then file) instead of relying on any one implementation to always be correct.
In my opinion making LS and PS LineTerminators was a mistake, but unfortunately we cannot go back and change it now. I don't really understand how this is on-topic though, since LS and PS are not |
Ah, ok. So it's just that the regex is redundant, alright. I was under the impression that the sub-classification group needed special handling (and was called out in the spec) because it wasn't part of the base |
Oops, sorry, never mind. 😅 I was midway through my reply when I got sidetracked browsing Wikipedia, reading up on the different ways to encode a newline. By the time I went back to finishing what I was writing, some topics must've been juggled in my head. 😞 Disregard, I am an idiot. 👍 (Also, can you believe some systems used |
Fixes #31963
I've made the scanner just code-point aware enough to handle passing astral glyphs into isIdentifierPart, but it still uses charCodeAt anywhere that it's not explicitly going to look for astral plane glyphs (ie, is only looking for CharacterCodes.newLine). In addition, I've added a new set of unicode identifier start/part arrays generated by the script I've added in the script folder.
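A minimal sketch of what "just code-point aware enough" can mean in practice, not the actual scanner code: read a full code point where an identifier character is expected, and advance by two UTF-16 units when it is astral. The functions `charSize`, `scanIdentifier`, `isIdentifierStart`, and `isIdentifierPart` below are simplified stand-ins written for this example, and they classify characters with regex property escapes rather than with the generated tables.

```javascript
// Astral code points occupy two UTF-16 code units, BMP ones occupy one.
function charSize(cp) {
  return cp >= 0x10000 ? 2 : 1;
}

function isIdentifierStart(cp) {
  return cp === 0x24 || cp === 0x5F || /^\p{ID_Start}$/u.test(String.fromCodePoint(cp));
}

function isIdentifierPart(cp) {
  return cp === 0x24 || cp === 0x5F || cp === 0x200C || cp === 0x200D ||
    /^\p{ID_Continue}$/u.test(String.fromCodePoint(cp));
}

// Returns the identifier starting at `pos`, or undefined if there is none.
function scanIdentifier(text, pos) {
  const start = pos;
  let cp = text.codePointAt(pos);
  if (cp === undefined || !isIdentifierStart(cp)) return undefined;
  pos += charSize(cp);
  while (pos < text.length) {
    cp = text.codePointAt(pos);
    if (!isIdentifierPart(cp)) break;
    pos += charSize(cp);
  }
  return text.slice(start, pos);
}

const source = "\u{1D4D0}bc = 1";
const scanned = scanIdentifier(source, 0);

// charCodeAt only sees the first half of the surrogate pair.
const firstUnit = source.charCodeAt(0);
```

Checks for single ASCII characters such as a newline can keep using `charCodeAt`, since a BMP code unit can never be mistaken for half of an astral character's meaning in that comparison.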