Unicode Normalization of Identifiers #2379

Open
8 of 10 tasks
tamaroning opened this issue Jul 5, 2023 · 0 comments

Related to #2287

All identifiers must be normalized to NFC (Normalization Form C).
Normalization makes strings that consist of different code points but are canonically equivalent (practically the same) compare as equal.
This is useful when a program is linked against other Rust libraries.

NFC Normalization takes two steps:

  1. Some characters are decomposed into one or more characters. (The result is called the NFD form.)
  2. Then, some sequences of characters are composed into a single character. (The result is called the NFC form.)
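
The two steps above can be seen in a minimal sketch. It uses Python's standard `unicodedata` module purely for illustration and is not this project's implementation; the identifier `café` and the helper `codepoints` are made up for the example.

```python
import unicodedata


def codepoints(s: str) -> str:
    """Render a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in s)


# Two spellings of the same identifier "café":
precomposed = "caf\u00e9"   # é as a single code point (U+00E9)
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT (U+0301)

# Step 1: canonical decomposition (NFD).
nfd = unicodedata.normalize("NFD", precomposed)
# Step 2: canonical composition of the decomposed form gives NFC.
nfc = unicodedata.normalize("NFC", nfd)

print(codepoints(nfd))  # U+0063 U+0061 U+0066 U+0065 U+0301
print(codepoints(nfc))  # U+0063 U+0061 U+0066 U+00E9

# The raw strings differ, but their NFC forms are identical.
print(precomposed == decomposed)  # False
print(unicodedata.normalize("NFC", precomposed)
      == unicodedata.normalize("NFC", decomposed))  # True
```

Without normalization, the two spellings would be treated as two different identifiers even though they render the same.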

For example,
https://unicode.org/reports/tr15/images/UAX15-NormFig4.jpg

Implementation

Unicode data file parser
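
As a rough sketch of what such a parser has to extract, the snippet below reads a few lines in the semicolon-separated `UnicodeData.txt` format and keeps the canonical combining class (field 3) and the decomposition mapping (field 5). The names `CharRecord`, `SAMPLE`, and `parse_unicode_data` are hypothetical, Python is used only for illustration, and the sketch assumes plain per-character lines (it does not handle the `First`/`Last` range entries of the real file).

```python
from typing import Dict, List, NamedTuple, Optional


class CharRecord(NamedTuple):
    combining_class: int
    decomposition: List[int]    # empty if the character has no mapping
    compat_tag: Optional[str]   # e.g. "<fraction>"; None for canonical mappings


# A few sample lines in the UnicodeData.txt format.
SAMPLE = """\
00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;
00C5;LATIN CAPITAL LETTER A WITH RING ABOVE;Lu;0;L;0041 030A;;;;N;LATIN CAPITAL LETTER A RING;;;00E5;
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
212B;ANGSTROM SIGN;Lu;0;L;00C5;;;;N;ANGSTROM UNIT;;;00E5;
"""


def parse_unicode_data(text: str) -> Dict[int, CharRecord]:
    """Map each code point to its combining class and decomposition."""
    table: Dict[int, CharRecord] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)
        combining_class = int(fields[3])
        parts = fields[5].split()
        tag = None
        # A leading <tag> marks a compatibility mapping, which NFC/NFD ignore.
        if parts and parts[0].startswith("<"):
            tag = parts.pop(0)
        mapping = [int(p, 16) for p in parts]
        table[code_point] = CharRecord(combining_class, mapping, tag)
    return table


table = parse_unicode_data(SAMPLE)
print(table[0x00C5])
```

Only mappings without a `<tag>` are canonical, so a table built for NFC would skip entries such as U+00BD.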

Decomposition
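
A minimal sketch of the decomposition step, assuming the mapping and combining-class tables are available; here Python's `unicodedata` stands in for them. The function names are hypothetical and this is not this project's code. It applies canonical mappings recursively, handles Hangul syllables arithmetically, and then sorts each run of combining marks by combining class (canonical ordering).

```python
import unicodedata
from typing import List

# Constants for algorithmic Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588
S_COUNT = 19 * N_COUNT        # 11172


def decompose_char(ch: str, out: List[str]) -> None:
    """Append the full canonical decomposition of one character to out."""
    s_index = ord(ch) - S_BASE
    if 0 <= s_index < S_COUNT:
        out.append(chr(L_BASE + s_index // N_COUNT))
        out.append(chr(V_BASE + (s_index % N_COUNT) // T_COUNT))
        t_index = s_index % T_COUNT
        if t_index:
            out.append(chr(T_BASE + t_index))
        return
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        out.append(ch)  # no mapping, or a compatibility-only mapping
        return
    for hex_cp in mapping.split():
        decompose_char(chr(int(hex_cp, 16)), out)


def canonical_reorder(chars: List[str]) -> None:
    """Stable-sort each run of non-starters by combining class, in place."""
    i = 0
    while i < len(chars):
        if unicodedata.combining(chars[i]) == 0:
            i += 1
            continue
        j = i
        while j < len(chars) and unicodedata.combining(chars[j]) != 0:
            j += 1
        chars[i:j] = sorted(chars[i:j], key=unicodedata.combining)
        i = j


def to_nfd(s: str) -> str:
    out: List[str] = []
    for ch in s:
        decompose_char(ch, out)
    canonical_reorder(out)
    return "".join(out)


print(to_nfd("\u212b") == "A\u030a")  # True: ANGSTROM SIGN -> A + ring above
```

The sort must be stable so that marks with the same combining class keep their original relative order.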

Composition

Optimization

  • quick check
  • use trie for composition mapping table
    • but the current implementation is not so slow
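
To illustrate the quick-check idea, here is a minimal sketch that skips the full decompose-and-compose pass when the input is already known to be in NFC. It uses Python's `unicodedata.is_normalized` (available since Python 3.8) as a stand-in for a quick-check table; `normalize_identifier` is a hypothetical name, not a function from this project.

```python
import unicodedata


def normalize_identifier(ident: str) -> str:
    """Return ident in NFC, avoiding work for already-normalized input."""
    # Fast path 1: ASCII-only strings are always in NFC.
    if ident.isascii():
        return ident
    # Fast path 2: ask whether the string is already in NFC.
    if unicodedata.is_normalized("NFC", ident):
        return ident
    # Slow path: run the full normalization.
    return unicodedata.normalize("NFC", ident)


print(normalize_identifier("cafe\u0301") == "caf\u00e9")  # True
```

Most identifiers in real source code are ASCII or already composed, so the slow path should be rare.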

Tests

Many test cases are needed!
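
As a starting point, a small table-driven sketch of such tests is shown below. The cases are hand-picked (precomposed and decomposed Latin, a singleton, Hangul, mark reordering), and `CASES` and `failures` are hypothetical names. The sketch runs against Python's `unicodedata` only to show the shape; a real suite would call the implementation under test. The Unicode Character Database also ships an official conformance file, `NormalizationTest.txt`.

```python
import unicodedata
from typing import Callable, List, Tuple

# (input, expected NFC) pairs.
CASES: List[Tuple[str, str]] = [
    ("abc", "abc"),                      # ASCII is unchanged
    ("\u00e9", "\u00e9"),                # already composed
    ("e\u0301", "\u00e9"),               # e + acute composes
    ("\u212b", "\u00c5"),                # singleton: ANGSTROM SIGN
    ("A\u030a", "\u00c5"),               # A + ring above composes
    ("\u1100\u1161", "\uac00"),          # Hangul jamo compose to a syllable
    ("a\u0301\u0323", "\u1ea1\u0301"),   # marks are reordered before composing
    ("\u1e9b\u0323", "\u1e9b\u0323"),    # stays as is after decompose/compose
]


def failures(normalize: Callable[[str], str]) -> List[Tuple[str, str, str]]:
    """Return (input, expected, actual) for every case that does not match."""
    results = []
    for source, expected in CASES:
        actual = normalize(source)
        if actual != expected:
            results.append((source, expected, actual))
    return results


print(failures(lambda s: unicodedata.normalize("NFC", s)))  # []
```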

Ref
