Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Amend scanner to support astral characters in identifiers when parsing es6+ #32096

Merged

Conversation

weswigham
Copy link
Member

@weswigham weswigham commented Jun 26, 2019

Fixes #31963

I've made the scanner just code-point aware enough to handle passing astral glyphs into isIdentifierPart, but it still uses charCodeAt anywhere that it's not explicitly going to look for astral plane glyphs (ie, is only looking for CharacterCodes.newLine). In addition, I've added a new set of unicode identifier start/part arrays generated by the script I've added in the script folder.

@IllusionMH
Copy link
Contributor

Are there plans to provide emitter for ES5 and below?

var 𝑚 = 4; in JS output will trigger Invalid character in IE11

@weswigham
Copy link
Member Author

We don't downlevel the es5 Unicode set into an es3 compatable thing - we just issue Invalid Character errors, and this es6+ Unicode support is the same way.

Btw, the Unicode table I've added was captured with Unicode v11.0. IIRC 12 is out and 13 is in draft. I can update it to a newer version by using latest node, probably, but we should come up with an update policy for the es6+ Unicode table - we don't want to add a unicode: 12 flag or the like just for this, I don't think.

@Alhadis
Copy link

Alhadis commented Jun 26, 2019

Would it be possible to replace "invalid" characters with an identifier representing their codepoint? E.g., 𝑚 would become U1D45A to match its codepoint representation: U+1D45A (sans plus).

@weswigham
Copy link
Member Author

Would it be possible to replace "invalid" characters with an identifier representing their codepoint?

Technically any number of escaping schemes are possible, but by substituting characters, it could affect the public API using those identifiers. For an internal const a rename is no big deal, but export it and suddenly that required renaming has big (undesirable) semantic meaning.

@IllusionMH
Copy link
Contributor

@weswigham oh, I see it now. Thank you for clarification.

@Alhadis
Copy link

Alhadis commented Jun 26, 2019

Did ES3 support Unicode escapes in identifiers? I can't remember anymore.

var foo = "Foo";
console.log( \u0066\u006F\u006F); // -> "Foo"

Not that it would be of much help for astral-plane characters, since it's limited to 4-digit codepoints anyway... 😕

@weswigham weswigham requested a review from rbuckton June 26, 2019 18:26
@weswigham
Copy link
Member Author

ping @rbuckton for a review again ❤️

@weswigham
Copy link
Member Author

@rbuckton any more feedback on this?

Copy link
Member

@rbuckton rbuckton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the small comment mentioned above, this looks good.

@mathiasbynens
Copy link

mathiasbynens commented Aug 23, 2019

The build script could be simplified and made to not rely on the current Node.js/V8 + Unicode version:

const ID_Start = require('unicode-12.1.0/Binary_Property/ID_Start/code-points.js');
const ID_Continue = require('unicode-12.1.0/Binary_Property/ID_Continue/code-points.js');

// ...then add the other needed characters, and write the two arrays to disk.

I'd also like to propose not explicitly including Other_ID_Start and Other_ID_Continue "for backwards compatibility" without evidence that it's necessary. Rationale: AFAIK no modern JavaScript engine supports this, no other popular JS tool supports it either, strongly suggesting that it's just not needed. Links to how other tooling implements this:

@Alhadis
Copy link

Alhadis commented Aug 23, 2019

Huge shoutout to the devs who were patient enough to tend to this. =) Kudos. 👍

I'd also like to propose not including Other_ID_Start and Other_ID_Continue "for backwards compatibility" without evidence that it's necessary.

Also a huge shoutout to IBM and EBCDIC for creating the Next Line character (U+0085), despite mainframe systems already supporting CR and LF at the time, but figuring the world needed a third line terminator anyway. 👍

@mathiasbynens, have you ever encountered a legitimate use of Line and Paragraph Separators? (LS/PS). I know they're only defined by Unicode for compatibility, but I'm curious if you've ever seen them used for anything practical...

They're like Proxies. Not saying they're useless, only that I still haven't found a practical use for one (that doesn't involve membranes or DOM-related shenanigans…)

@weswigham
Copy link
Member Author

I'd also like to propose not including Other_ID_Start and Other_ID_Continue "for backwards compatibility" without evidence that it's necessary. Rationale: AFAIK no modern JavaScript engine supports this, no other popular JS tool supports it either, strongly suggesting that it's just not needed.

Would you champion removing note 3 from this section of the spec which very explicitly calls them out as included, then? It would very much seem to me that identifiers named ゜ (which is in the other start table) are, in fact, supported in chrome, at runtime. (Or at least over at https://jsconsole.com/ , I'm on mobile)

As for not relying on the runtime... I think I prefer to use the table generated from the runtime. We'll get exactly the character set matched by a known v8 version (we don't intend to keep multiple copies around, but will probably periodically update the table when new major node releases occur).

@weswigham
Copy link
Member Author

Huge shoutout to the devs who were patient enough to tend to this. =) Kudos. 👍

We actually found bugs in browsers while working on follow-ups to this. Go figure.

@mathiasbynens
Copy link

Would you champion removing note 3 from this section of the spec which very explicitly calls them out as included, then?

That non-normative note just calls out the fact that ID_Start already includes Other_ID_Start per Unicode. I think it's fine to keep it.

I'm simply recommending the removal of the hardcoded Unicode code points such as U+2118 which are Other_ID_Start on the grounds that it's not necessary, since those are already covered by ID_Start:

/\p{ID_Start}/u.test('\u2118');

The only reason you'd want to hardcode those characters explicitly is if you find an engine whose \p{ID_Start} doesn't match them.

@mathiasbynens
Copy link

We actually found bugs in browsers while working on follow-ups to this. Go figure.

It's great that you're discovering bugs and filing them <3 Thank you!

This (the fact that engines will always have bugs) is exactly why I would recommend against relying on any particular runtime's behavior. Your above statement seems to conflict with:

As for not relying on the runtime... I think I prefer to use the table generated from the runtime. We'll get exactly the character set matched by a known v8 version (we don't intend to keep multiple copies around, but will probably periodically update the table when new major node releases occur).

IMHO it's better to have a separate implementation (so that you'd spot bugs which you could then file) instead of relying on any one implementation to always be correct.

@mathiasbynens, have you ever encountered a legitimate use of Line and Paragraph Separators? (LS/PS). I know they're only defined by Unicode for compatibility, but I'm curious if you've ever seen them used for anything practical...

In my opinion making LS and PS LineTerminators was a mistake, but unfortunately we cannot go back and change it now. I don't really understand how this is on-topic though, since LS and PS are not ID_Start/ID_Continue characters -- could you elaborate?

@weswigham
Copy link
Member Author

I'm simply recommending the removal of the hardcoded Unicode code points such as U+2118 which are Other_ID_Start on the grounds that it's not necessary, since those are already covered by ID_Start

Ah, ok. So it's just that the regex is redundant, alright. I was under the impression that the sub-classification group needed special handling (and was called out in the spec) because it wasn't part of the base ID_X property set. Should that be the case, I can simplify the regex in the script a tad, though it shouldn't change the generated table.

@Alhadis
Copy link

Alhadis commented Sep 2, 2019

I don't really understand how this is on-topic though, since LS and PS are not ID_Start/ID_Continue characters — could you elaborate?

Oops, sorry, never mind. 😅 I was midway through my reply when I got sidetracked browsing Wikipedia, reading up on the different ways to encode a newline. By the time I went back to finishing what I was writing, some topics must've been juggled in my head. 😞

Disregard, I am an idiot. 👍 (Also, can you believe some systems used LF+CR to encode newlines? Yuck…)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Identifiers with Astral Unicode characters are unsupported
5 participants