Unicode Normalization of Identifiers #2379

tamaroning · 2023-07-05T06:05:45Z

Related to #2287

All identifiers must be normalized to NFC (Normalization Form C).
Normalization makes strings that consist of different code points, but are practically the same, compare as equal.
This is useful when a program is linked against other Rust libraries.
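
To illustrate why this matters, here is a minimal sketch in Python using the standard `unicodedata` module. The helper name `normalize_identifier` is hypothetical and is not part of the compiler; it only stands in for whatever routine performs the normalization.

```python
import unicodedata

# Two spellings of the same identifier "café".
precomposed = "caf\u00e9"   # é as the single code point U+00E9
decomposed = "cafe\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT


def normalize_identifier(name):
    """Return the NFC form of an identifier (hypothetical helper)."""
    return unicodedata.normalize("NFC", name)


# The raw strings differ, so comparing them code point by code point fails.
print(precomposed == decomposed)  # False

# After normalization both spellings name the same identifier.
print(normalize_identifier(precomposed) == normalize_identifier(decomposed))  # True
```

Without normalization, two crates that spell the same name differently would end up with two distinct symbols.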

NFC normalization takes two steps:

Characters that have a canonical decomposition are decomposed into one or more characters, and combining marks are put into canonical order. (The result is the NFD form.)
Then, certain sequences of characters are composed into a single character. (The result is the NFC form.)

For example,
https://unicode.org/reports/tr15/images/UAX15-NormFig4.jpg
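
The two steps can also be observed directly. The following sketch prints the NFD and NFC forms of two inputs as code point lists; `code_points` and `show_steps` are hypothetical helper names, and Python's `unicodedata` is used only as a stand-in for the real implementation.

```python
import unicodedata


def code_points(s):
    """Render a string as a list of U+XXXX labels."""
    return [f"U+{ord(c):04X}" for c in s]


def show_steps(s):
    """Return (NFD code points, NFC code points) for s."""
    nfd = unicodedata.normalize("NFD", s)   # step 1: decompose and reorder
    nfc = unicodedata.normalize("NFC", s)   # step 2: compose again
    return code_points(nfd), code_points(nfc)


# U+212B ANGSTROM SIGN decomposes to A + combining ring above,
# then composes to U+00C5 (not back to U+212B).
print(show_steps("\u212b"))
# (['U+0041', 'U+030A'], ['U+00C5'])

# q + dot above + dot below: the marks are reordered by combining class,
# and nothing composes because there is no precomposed character.
print(show_steps("q\u0307\u0323"))
# (['U+0071', 'U+0323', 'U+0307'], ['U+0071', 'U+0323', 'U+0307'])
```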

Implementation

Unicode data file parser
Decomposition
Composition
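
As a rough illustration of the decomposition part, the sketch below implements canonical decomposition and canonical ordering (that is, NFD) by hand. It is written under several assumptions: Python's `unicodedata.decomposition` and `unicodedata.combining` stand in for tables that would be generated from the Unicode data files, all function names are hypothetical, and the composition step is left out. Hangul syllables are decomposed arithmetically, as described in the Unicode Standard.

```python
import unicodedata

# Constants for algorithmic Hangul syllable decomposition.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT      # 588
S_COUNT = L_COUNT * N_COUNT      # 11172


def decompose_char(cp, out):
    """Append the full canonical decomposition of code point cp to out."""
    s_index = cp - S_BASE
    if 0 <= s_index < S_COUNT:
        # Hangul syllable: split into leading consonant, vowel, optional trailing consonant.
        out.append(L_BASE + s_index // N_COUNT)
        out.append(V_BASE + (s_index % N_COUNT) // T_COUNT)
        t_index = s_index % T_COUNT
        if t_index:
            out.append(T_BASE + t_index)
        return
    mapping = unicodedata.decomposition(chr(cp))
    # An empty mapping means "no decomposition"; a leading "<tag>" marks a
    # compatibility mapping, which NFD and NFC must not apply.
    if not mapping or mapping.startswith("<"):
        out.append(cp)
        return
    for part in mapping.split():
        decompose_char(int(part, 16), out)   # mappings can nest


def canonical_order(cps):
    """Stably sort each run of non-starters by canonical combining class."""
    i = 0
    while i < len(cps):
        if unicodedata.combining(chr(cps[i])) == 0:
            i += 1
            continue
        j = i
        while j < len(cps) and unicodedata.combining(chr(cps[j])) != 0:
            j += 1
        cps[i:j] = sorted(cps[i:j], key=lambda c: unicodedata.combining(chr(c)))
        i = j


def nfd(s):
    """Return the NFD form of s (decomposition followed by reordering)."""
    out = []
    for ch in s:
        decompose_char(ord(ch), out)
    canonical_order(out)
    return "".join(map(chr, out))


print(nfd("\u00e9") == "e\u0301")  # True
```

The composition step would then walk the NFD output and recombine pairs found in a composition table, skipping the excluded characters.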

Optimization

Quick check: skip normalization when the input is already in NFC
Use a trie for the composition mapping table
- but the current implementation is not that slow
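
The idea behind the quick check can be sketched as follows. This is only an illustration, assuming Python 3.8 or later, where `unicodedata.is_normalized` is available; the function name `normalize_with_quick_check` is hypothetical. Most identifiers are plain ASCII, so the fast paths return the input untouched and the full decompose and compose pass runs only when needed.

```python
import unicodedata


def normalize_with_quick_check(name):
    """Return the NFC form of name, avoiding work when it is already NFC."""
    # Fast path 1: ASCII text is always in NFC.
    if name.isascii():
        return name
    # Fast path 2: ask whether the string is already normalized.
    if unicodedata.is_normalized("NFC", name):
        return name
    # Slow path: full normalization.
    return unicodedata.normalize("NFC", name)


print(normalize_with_quick_check("cafe\u0301") == "caf\u00e9")  # True
```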

Tests

Many test cases are needed!
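
A table-driven layout is one possible way to organize such test cases. The sketch below is an assumption about how they could look, not the project's actual test suite: `to_nfc` is a hypothetical stand-in for the function under test (here backed by Python's `unicodedata`), and the cases cover plain ASCII, a simple composition, a singleton, Hangul, mark reordering, and a composition exclusion.

```python
import unicodedata

# (input, expected NFC form)
CASES = [
    ("abc", "abc"),                          # ASCII is unchanged
    ("e\u0301", "\u00e9"),                   # e + combining acute composes to é
    ("\u212b", "\u00c5"),                    # ANGSTROM SIGN maps to Å (singleton)
    ("\u1100\u1161\u11a8", "\uac01"),        # Hangul jamo compose to one syllable
    ("q\u0307\u0323", "q\u0323\u0307"),      # combining marks are reordered
    ("\u0958", "\u0915\u093c"),              # composition exclusion stays decomposed
]


def to_nfc(s):
    """Stand-in for the implementation under test (hypothetical)."""
    return unicodedata.normalize("NFC", s)


def run_cases():
    """Return a list of human-readable failure messages (empty on success)."""
    failures = []
    for source, expected in CASES:
        actual = to_nfc(source)
        if actual != expected:
            failures.append(f"{source!r}: expected {expected!r}, got {actual!r}")
        # Normalization must be idempotent.
        if to_nfc(actual) != actual:
            failures.append(f"{source!r}: result is not stable under NFC")
    return failures


print(run_cases())  # []
```

For broader coverage, the Unicode Character Database also ships a conformance file, NormalizationTest.txt, that could be used as input for the same kind of loop.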

Ref

https://doc.rust-lang.org/reference/identifiers.html#normalization
https://unicode.org/reports/tr15/
Reference implementation: https://www.w3.org/International/charlint/

tamaroning mentioned this issue Jul 4, 2023: Unicode support #2287 (Open, 15 tasks)
tamaroning mentioned this issue Jul 14, 2023: Add header file generator for Unicode normalization and alphanumeric check #2425 (Merged)
tamaroning mentioned this issue Jul 24, 2023: Normalize Hangul #2467 (Merged)
tamaroning mentioned this issue Aug 6, 2023: Add type annotation to make-rust-unicode-data.py #2529 (Merged)
tamaroning mentioned this issue Dec 2, 2023: Implement Quick Check #2769 (Merged)
