Experimental support for unicode identifiers. #1407

Draft
wants to merge 20 commits into
base: master

WardBrian
Member

I know for a fact that this requires a few changes in stan-dev/stan's JSON data handler to recognize Unicode names, which is just one of several reasons this is a draft.

The basic overview:
OCaml strings should be treated mostly like arrays of bytes, and ocamllex handles its input as a sequence of bytes. We can define rules that recognize UTF-8-compatible byte sequences, and then validate them after the fact against Unicode Standard Annex #31: Unicode Identifiers (UAX31).
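The "recognize bytes first, validate afterwards" approach can be sketched outside the compiler. The following is a minimal Python illustration, not the stanc3 implementation: `is_valid_identifier` is a hypothetical helper, and Python's built-in `str.isidentifier()` stands in for the UAX31 check (Python's identifier rules are derived from UAX31, but may differ in detail from what this PR validates).

```python
def is_valid_identifier(raw: bytes) -> bool:
    """Hypothetical helper: accept raw bytes only if they are
    well-formed UTF-8 and form an identifier under Python's
    UAX31-derived rules (an assumption standing in for the PR's check)."""
    try:
        # Step 1: the "lexer" only saw bytes; decoding strictly confirms
        # that they are well-formed UTF-8.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # Step 2: validate the decoded text after the fact.
    return text.isidentifier()


theta_ok = is_valid_identifier("θ".encode("utf-8"))   # Greek letter
bad_bytes = is_valid_identifier(b"\xff\xfe")          # not UTF-8
bad_start = is_valid_identifier("1θ".encode("utf-8")) # starts with a digit
```

Keeping the two steps separate mirrors the description above: the byte-level rule stays simple, and all Unicode knowledge lives in the later validation pass.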

For most of the compiler we then pretend identifiers are just bytes, which is fine because we never do things like subslice variable names.

Finally, at output time, we already had string escaping (since #952), so most of the code-gen works fine. Recent C++ standards require that compilers support UTF-8 names based on the same UAX31 rules linked above, but older ones may not. For now it generates "universal character names", which seem to be the legacy mechanism for this, so older compilers will hopefully accept them.
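To make the universal-character-name idea concrete, here is a small Python sketch of the escaping step. It is an illustration only: `to_ucn` is a hypothetical name, and the actual code generator in this PR is written in OCaml and may differ. The sketch relies on the C++ spelling of universal character names, `\uXXXX` (four hex digits) and `\UXXXXXXXX` (eight hex digits).

```python
def to_ucn(name: str) -> str:
    """Hypothetical helper: rewrite non-ASCII code points in an
    identifier as C++ universal character names."""
    out = []
    for ch in name:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                 # ASCII passes through unchanged
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")     # short form: 4 hex digits
        else:
            out.append(f"\\U{cp:08X}")     # long form: 8 hex digits
    return "".join(out)


escaped = to_ucn("θ_lpdf")  # θ is U+03B8
```

Under these assumptions the emitted C++ source stays pure ASCII, which is the property that should keep older compilers happy.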

Submission Checklist

  • Run unit tests
  • Documentation
    • If a user-facing change was made, the documentation PR is here: TBD

Release notes

stanc3 now accepts a flag, --allow-unicode, which enables the use of non-ASCII characters in Stan files. All files are expected to be encoded in UTF-8.
This is experimental and may not work with older C++ compilers.

Copyright and Licensing

By submitting this pull request, the copyright holder is agreeing to
license the submitted work under the BSD 3-clause license (https://opensource.org/licenses/BSD-3-Clause)

@WardBrian WardBrian linked an issue Feb 15, 2024 that may be closed by this pull request

codecov bot commented Feb 16, 2024

Codecov Report

Attention: Patch coverage is 73.75%, with 21 lines in your changes missing coverage. Please review.

Project coverage is 89.76%. Comparing base (8bc6ba0) to head (49381b7).
Report is 3 commits behind head on master.

❗ Current head 49381b7 differs from pull request most recent head 450cad4. Consider uploading reports for the commit 450cad4 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1407      +/-   ##
==========================================
- Coverage   89.87%   89.76%   -0.12%     
==========================================
  Files          63       65       +2     
  Lines       10525    10597      +72     
==========================================
+ Hits         9459     9512      +53     
- Misses       1066     1085      +19     
Files Coverage Δ
src/frontend/Errors.ml 88.00% <100.00%> (ø)
src/stan_math_backend/Cpp.ml 88.91% <100.00%> (+0.11%) ⬆️
src/stan_math_backend/Cpp_Json.ml 100.00% <100.00%> (ø)
src/stanc/stanc.ml 82.14% <ø> (ø)
src/frontend/Identifiers.ml 91.66% <91.66%> (ø)
src/common/Unicode.ml 55.81% <55.81%> (ø)

... and 1 file with indirect coverage changes

@bob-carpenter
Contributor

Isn't there sub slicing when we peel off _lpdf suffixes in sampling statements?

@WardBrian
Member Author

> Isn't there sub slicing when we peel off _lpdf suffixes in sampling statements?

  1. Not quite, since we generally go from something without a suffix (Y ~ foo(...)) to the thing with the suffix (target += foo_lpdf(Y | ...)), so even there it is a concatenation problem more than a subslicing.
  2. Even in situations where we do need to do it, I think that should be fine, since the cut point falls somewhere we require ASCII characters to appear. So even for θ_lpdf, the cut point in the string of bytes is still just 5 bytes from the end.

But, that is definitely one area of this that would need much more testing before it could be merged.
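The byte-level argument in point 2 can be checked with a short Python sketch. This is an illustration under stated assumptions, not code from the PR: `strip_lpdf` is a hypothetical helper. It relies on a property of UTF-8, namely that every byte of a multi-byte character is 0x80 or above, so an ASCII suffix such as `_lpdf` can never overlap the encoding of a non-ASCII character.

```python
SUFFIX = b"_lpdf"  # pure ASCII, so exactly 5 bytes


def strip_lpdf(name: bytes) -> bytes:
    """Hypothetical helper: drop a trailing _lpdf working purely on bytes."""
    if name.endswith(SUFFIX):
        return name[: -len(SUFFIX)]
    return name


raw = "θ_lpdf".encode("utf-8")  # θ encodes to 2 bytes, then 5 ASCII bytes
stem = strip_lpdf(raw)          # cut 5 bytes from the end
rebuilt = stem + SUFFIX         # the concatenation direction from point 1
```

Both directions stay well-formed here: `stem` still decodes as UTF-8, and appending the suffix to it reproduces the original bytes.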

@WardBrian WardBrian mentioned this pull request Feb 27, 2024
3 tasks
@WardBrian WardBrian force-pushed the feature/unicode-identifiers branch 2 times, most recently from e2bf53a to 49381b7 Compare March 7, 2024 19:38
@WardBrian WardBrian force-pushed the feature/unicode-identifiers branch from 49381b7 to 2865cca Compare March 19, 2024 16:08
@WardBrian
Member Author

There is some prior art now, in the OCaml compiler itself:
ocaml/ocaml#12664

They define an explicit set of characters they allow, which sidesteps a lot of the issues here (no need for things like UUSeg, etc): https://github.com/ocaml/ocaml/blob/6c298db0e356d0e04dd45acf6684f693f8baa7db/utils/misc.ml#L265-L272

Development

Successfully merging this pull request may close these issues.

feature request: unicode in source
2 participants