Unicode in input string is not handled #225
Trying to get an understanding of how the string encoding works but I think a
check should catch high/low surrogate pair code points, which are what we'd want to throw an error for if the
These ranges are from https://mathiasbynens.be/notes/javascript-encoding#surrogate-pairs
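A minimal sketch of the kind of check described here, assuming the goal is simply to detect UTF-16 surrogate code units in an input string. `containsSurrogates` is a hypothetical helper for illustration, not something this library provides; the ranges are the standard high (0xD800 to 0xDBFF) and low (0xDC00 to 0xDFFF) surrogate ranges.

```javascript
// Hypothetical helper, not part of the library: returns true if the string
// contains any UTF-16 surrogate code unit (high or low).
const SURROGATE_START = 0xd800; // first high surrogate
const SURROGATE_END = 0xdfff; // last low surrogate

function containsSurrogates(input) {
  for (let i = 0; i < input.length; i++) {
    // charCodeAt returns the 16-bit code unit at index i.
    const unit = input.charCodeAt(i);
    if (unit >= SURROGATE_START && unit <= SURROGATE_END) {
      return true;
    }
  }
  return false;
}
```

A caller could use this to reject astral symbols up front, for example by throwing when `containsSurrogates(source)` is true. Whether erroring is the right behaviour is exactly the open question in this thread.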
Wondering if there is a spec somewhere that says how to handle unicode characters actually. I.e. that says e.g.
I think because
Interestingly
So looks like
I think a fix for this could be changing the generated code to be
instead of
when the
This causes
With this change
becomes
{
  type: 'RegExp',
  body: {
    type: 'Char',
    value: '👍',
    kind: 'simple',
    symbol: '👍',
    codePoint: 128077
  },
  flags: 'u'
}
which is still different to
{
  type: 'RegExp',
  body: {
    type: 'Char',
    value: '\\u{1f44d}',
    kind: 'unicode',
    symbol: '👍',
    codePoint: 128077
  },
  flags: 'u'
}
but I think
Also, a bit unrelated, I think we should use
WDYT?
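For reference, both ASTs above agree on `symbol` and `codePoint` because the literal symbol and the `\u{1f44d}` escape denote the same code point. A small sketch using only built-in JavaScript (nothing from this library) illustrates this:

```javascript
// The literal symbol and the code point escape produce the same string.
const literal = '👍';
const escaped = '\u{1f44d}';
const sameString = literal === escaped;

// With the `u` flag, the escaped pattern matches the literal symbol as one unit.
const matchesWithU = /^\u{1f44d}$/u.test(literal);

// 0x1f44d in decimal is the codePoint value shown in both ASTs.
const codePoint = literal.codePointAt(0);
const roundTrip = String.fromCodePoint(codePoint) === literal;
```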
@tjenkinson thanks for the report and investigation, I think the change looks reasonable. @mathiasbynens, what are your thoughts on this? Also, yes, when
The
parses differently to
The first is becoming 2 chars, \ud83d and \udc4d.
I might try to detect any Unicode in the input string and error out if that's the case, but I'm wondering whether this lib can handle both of the above the same way, or maybe error?
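To illustrate where the two chars come from: JavaScript strings are sequences of UTF-16 code units, so index-based iteration sees an astral symbol as a surrogate pair, while iterator-based iteration sees a single code point. The helpers below are hypothetical and only use built-in string methods; they are a sketch of the difference, not the library's parsing logic.

```javascript
// Hypothetical helpers for illustration only.

// Index-based view: one entry per 16-bit code unit.
function toCodeUnits(str) {
  return Array.from({ length: str.length }, (_, i) => str.charCodeAt(i));
}

// Iterator-based view: the string iterator yields whole code points,
// so a valid surrogate pair becomes a single entry.
function toCodePoints(str) {
  return Array.from(str, (ch) => ch.codePointAt(0));
}

const units = toCodeUnits('👍').map((u) => u.toString(16)); // ['d83d', 'dc4d']
const points = toCodePoints('👍').map((p) => p.toString(16)); // ['1f44d']
```

A parser that walks the input by index would therefore need to recombine surrogate pairs itself, or walk by code point, to treat the literal symbol the same as the `\u{...}` escape.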