Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: allow identifiers to contain utf-8 characters #216

Merged
merged 3 commits into from
Jan 29, 2024
Merged

Conversation

calebdw
Copy link
Collaborator

@calebdw calebdw commented Jan 28, 2024

Checklist

  • All tests pass in CI
  • There are enough tests for the new fix/feature
  • Grammar rules have not been renamed unless absolutely necessary (0 rules renamed)
  • The conflicts section hasn't grown too much (0 new conflicts)
  • The parser size hasn't grown too much (master: 2727, PR: 2727)
    (check the value of STATE_COUNT in src/parser.c)

See #142 (comment) for more context: essentially the true definition of a legal identifier is consistent with what is currently listed on the docs: ^[a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]*$

However, from my (limited) understanding there is a bug of sorts when PHP is parsing multibyte characters and instead compares the individual bytes which in practice allows characters outside the given regex to be parsed as identifiers.

This PR updates the grammar and scanner to allow the full utf-8 range as valid identifiers (\u0080-\uffff). PHP may even parse characters higher than this range (I haven't tested it), but given that utf-8 is already technically illegal I don't see the need to try and support anything higher. Although it would be real easy to update the scanner to use the full uint32 returned by the lexer.

Closes #142, closes #171

common/scanner.h Outdated Show resolved Hide resolved
common/define-grammar.js Outdated Show resolved Hide resolved
@calebdw calebdw force-pushed the identifiers branch 2 times, most recently from 2342d83 to 10a54be Compare January 29, 2024 03:47
Copy link
Member

@amaanq amaanq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@amaanq amaanq merged commit 3854d1c into master Jan 29, 2024
4 checks passed
@calebdw calebdw deleted the identifiers branch January 29, 2024 13:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

variable name in chinese recognized as error Support any chars PHP accept for names and labels
2 participants