Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode in input string is not handled #225

Open
tjenkinson opened this issue Apr 13, 2021 · 5 comments
Open

Unicode in input string is not handled #225

tjenkinson opened this issue Apr 13, 2021 · 5 comments

Comments

@tjenkinson
Copy link
Contributor

/👍/u

parses differently to

/\u{1f44d}/u

The first is becoming 2 chars \ud83d and \udc4d.

I might try and detect any unicode in the input string and error out if that's the case, but wondering if this lib can handle both the above the same, or maybe error?

@tjenkinson
Copy link
Contributor Author

tjenkinson commented Apr 13, 2021

Trying to get an understanding of how the string encoding works but I think a

/[\uD800-\uDBFF\uDC00-\uDFFF]/

check should catch high/low surrogate pair code points, which are what we'd want to throw an error for if the u flag is present and it's not possible to switch them to the \u{x} representations.

These ranges are from https://mathiasbynens.be/notes/javascript-encoding#surrogate-pairs

@tjenkinson
Copy link
Contributor Author

tjenkinson commented Apr 13, 2021

Wondering if there is a spec somewhere that says how to handle unicode characters actually.

I.e. that says /👍/u should be translated to /\u{1f44d}/u. It does seem to be what Chrome does.

e.g.

/^𝌆+$/.test('𝌆') === true
/^𝌆+$/.test('𝌆𝌆') === false

I think because 𝌆 was converted to \ud834\udf06+

/^𝌆+$/.test('𝌆\udf06') === true
/^𝌆+$/u.test('𝌆𝌆') === true

Interestingly

/^\ud834\udf06+$/u.test('𝌆𝌆') === true

So looks like \ud834\udf06 gets converted to 𝌆 before the + is taken into consideration. regexp-tree actually handles this properly.

@tjenkinson
Copy link
Contributor Author

tjenkinson commented Apr 13, 2021

I think a fix for this could be changing the generated code to be

[/^[^*?+\[()\\|]/u, function() { return 'CHAR' }],

instead of

[/^[^*?+\[()\\|]/, function() { return 'CHAR' }],

when the u flag is enabled on the input. Not sure how to do that yet

This causes _match to return a surrogate pair as is instead of slicing it in two, and the cursor seems to be updated correctly because this._cursor += match.length handles it being 2.

With this change

/👍/u

becomes

{
  type: 'RegExp',
  body: {
    type: 'Char',
    value: '👍',
    kind: 'simple',
    symbol: '👍',
    codePoint: 128077
  },
  flags: 'u'
}

which is still different to /\u{1f44d}/u:

{
  type: 'RegExp',
  body: {
    type: 'Char',
    value: '\\u{1f44d}',
    kind: 'unicode',
    symbol: '👍',
    codePoint: 128077
  },
  flags: 'u'
}

but I think value and kind being different is ok because one of them is actually escaped, so they are a bit different, but importantly the codePoint is now the same and it remains a single Char.

Also a bit unrelated I think we should use charCodeAt instead of codePointAt when the u flag is not enabled, so that users don't need to polyfill codePointAt even if they're not using the u flag.

WDYT?

@DmitrySoshnikov
Copy link
Owner

@tjenkinson thanks for the report and investigation, I think the change looks reasonable. @mathiasbynens, what are your thoughts on this?

Also, yes, when u is not enabled, the charCodeAt might be a good alternative.

@mathiasbynens
Copy link
Contributor

The u flag indeed acts as an opt-in to using code points as the character boundary, instead of UCS-2/UTF-16 code units (without the u flag). I wrote about that here along with some examples: https://mathiasbynens.be/notes/es6-unicode-regex

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants