feat: allow identifiers to contain utf-8 characters #216

calebdw · 2024-01-28T05:41:04Z

Checklist

All tests pass in CI
There are enough tests for the new fix/feature
Grammar rules have not been renamed unless absolutely necessary (0 rules renamed)
The conflicts section hasn't grown too much (0 new conflicts)
The parser size hasn't grown too much (master: 2727, PR: 2727; check the value of STATE_COUNT in src/parser.c)

See #142 (comment) for more context. Essentially, the true definition of a legal identifier is consistent with what the docs currently list: ^[a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]*$

However, from my (limited) understanding, there is a bug of sorts: when PHP parses multibyte characters, it compares the individual bytes rather than the decoded characters, which in practice allows characters outside the given regex to be parsed as identifiers.
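To make the byte-versus-character distinction concrete, here is a minimal Python sketch (not PHP's actual lexer, and not code from this PR). It applies the documented character classes once to decoded characters and once to the raw UTF-8 bytes. The function names are made up for illustration, and the `^`/`$` anchors are replaced by `fullmatch`.

```python
import re

# Same character classes as the documented regex, applied two ways.
# Hypothetical helper names; this only illustrates the idea.
CHAR_PATTERN = re.compile(r"[a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]*")
BYTE_PATTERN = re.compile(rb"[a-zA-Z_\x80-\xff][a-zA-Z0-9_\x80-\xff]*")


def matches_as_characters(name: str) -> bool:
    """Treat \\x80-\\xff as the code points U+0080..U+00FF."""
    return CHAR_PATTERN.fullmatch(name) is not None


def matches_as_bytes(name: str) -> bool:
    """Treat \\x80-\\xff as raw byte values of the UTF-8 encoding."""
    return BYTE_PATTERN.fullmatch(name.encode("utf-8")) is not None


# U+03C0 (Greek pi) lies outside U+0080..U+00FF, but its UTF-8 encoding
# is the two bytes 0xCF 0x80, both of which fall inside 0x80..0xFF.
print(matches_as_characters("\u03c0"))  # False
print(matches_as_bytes("\u03c0"))       # True
```

Every byte of a multibyte UTF-8 sequence is at least 0x80, so a byte-wise comparison accepts any non-ASCII character, which matches the behaviour described above.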

This PR updates the grammar and scanner to allow the full UTF-8 range in identifiers (\u0080-\uffff). PHP may even parse characters above this range (I haven't tested it), but given that UTF-8 characters outside the documented regex are already technically illegal, I don't see the need to try to support anything higher. That said, it would be easy to update the scanner to use the full 32-bit value returned by the lexer.
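As a rough illustration of the accepted range, the check can be sketched as a predicate over decoded code points. This is a Python sketch under assumptions, not the C code in common/scanner.h. The function names are hypothetical, and the upper bound 0xFFFF simply mirrors the range quoted above.

```python
# Hypothetical sketch of the identifier check on decoded code points.
MAX_CODE_POINT = 0xFFFF  # assumption: upper bound taken from the PR text


def is_identifier_start(cp: int) -> bool:
    """ASCII letter, underscore, or any code point in U+0080..U+FFFF."""
    return (
        ord("a") <= cp <= ord("z")
        or ord("A") <= cp <= ord("Z")
        or cp == ord("_")
        or 0x80 <= cp <= MAX_CODE_POINT
    )


def is_identifier_part(cp: int) -> bool:
    """Anything allowed at the start, plus ASCII digits."""
    return is_identifier_start(cp) or ord("0") <= cp <= ord("9")


def is_identifier(name: str) -> bool:
    """True if every code point of a non-empty name is allowed."""
    if not name:
        return False
    code_points = [ord(ch) for ch in name]
    return is_identifier_start(code_points[0]) and all(
        is_identifier_part(cp) for cp in code_points[1:]
    )


print(is_identifier("\u65e5\u672c\u8a9e"))  # True: all within U+0080..U+FFFF
print(is_identifier("\U0001F600"))          # False: above U+FFFF
```

Raising `MAX_CODE_POINT` to 0x10FFFF in this sketch would admit every Unicode code point, which corresponds to the "full 32-bit value" option mentioned above.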

Closes #142, closes #171

common/scanner.h

common/define-grammar.js

amaanq

LGTM

calebdw requested review from amaanq and cfroystad January 28, 2024 05:41

calebdw force-pushed the identifiers branch from c7b3ea5 to 192e775 on January 28, 2024 05:42

amaanq reviewed Jan 29, 2024

common/scanner.h (outdated, resolved)

common/define-grammar.js (outdated, resolved)

calebdw force-pushed the identifiers branch 2 times, most recently from 2342d83 to 10a54be on January 29, 2024 03:47

feat: update grammar to use full utf-8 range

9726bce

calebdw force-pushed the identifiers branch from 10a54be to 3478b6f on January 29, 2024 04:17

calebdw added 2 commits January 28, 2024 22:31

chore(scanner): update String to store wchar

84e44a1

chore: generate

10dfcae

calebdw force-pushed the identifiers branch from 3478b6f to 10dfcae on January 29, 2024 04:32

amaanq approved these changes Jan 29, 2024

amaanq merged commit 3854d1c into master Jan 29, 2024
4 checks passed

calebdw deleted the identifiers branch January 29, 2024 13:01

rabbiveesh mentioned this pull request Feb 15, 2024

Non-ASCII identifiers are not recognised tree-sitter-perl/tree-sitter-perl#161

Closed
