proposal: cmd/go: allow unicode in module paths and file names #67562

Open
CosmicToast opened this issue May 21, 2024 · 2 comments

Comments

@CosmicToast

Proposal Details

mod/module's spotty Unicode support is noted in module.go and has caused issues in the wild.
The challenges involved are as follows:

  1. There is a possibility of abuse via look-alike characters. (@rsc explicitly called this out as a concern, so the proposal addresses it directly by encoding all Unicode characters.)
  2. Current code presumes exactly two case-folded states (upper and lower case), while Unicode has sets of more than two runes that all case-fold to one another.
  3. Complete Unicode support requires supporting Unicode marks, which introduces the additional problem of equivalent representations.

This proposal seeks to address all of these in a backwards- and forwards-compatible manner.
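
To make challenge 2 concrete, here is a minimal sketch that lists the runes that case-fold to one another, using the standard library's unicode.SimpleFold. The helper name foldOrbit is hypothetical and is not part of mod/module.

```go
package main

import (
	"fmt"
	"sort"
	"unicode"
)

// foldOrbit (hypothetical helper) returns every rune that case-folds to r,
// including r itself, sorted numerically. unicode.SimpleFold walks the set
// as a cycle, so we iterate until we are back at the starting rune.
func foldOrbit(r rune) []rune {
	orbit := []rune{r}
	for next := unicode.SimpleFold(r); next != r; next = unicode.SimpleFold(next) {
		orbit = append(orbit, next)
	}
	sort.Slice(orbit, func(i, j int) bool { return orbit[i] < orbit[j] })
	return orbit
}

func main() {
	// 'K' shares its set with 'k' and U+212A (KELVIN SIGN), so the set has
	// three members; '§' has no case variants and forms a set of one.
	for _, r := range []rune{'a', 'K', '\u00E9', '§'} {
		orbit := foldOrbit(r)
		fmt.Printf("%q: %d rune(s) %q\n", r, len(orbit), orbit)
	}
}
```

Any escaping scheme that only distinguishes "upper" from "lower" cannot tell the three members of the 'K' set apart, which is the gap the proposal targets.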

The proposal:

  1. unicode implements the Normalization Form C (NFC) algorithm.^1
  2. From here on, the "lowest rune" of a case-folding set is the numerically lowest rune in the set, unless the set contains ASCII runes, in which case it is the numerically highest ASCII rune in the set. The exception exists for the sake of compatibility with the current lowercase default.
  3. The escaping process now works as follows: ^3
    1. The entire string is NFC-normalized. This creates a canonical string that is then operated on.
    2. Every rune that is part of a case folding set is transformed as follows:
      1. If the rune is already the lowest rune in its case folding set, it is unchanged.
      2. If there are only 2 runes in said set, the escaped form is ! followed by the lowest rune.
      3. If there are more than 2 runes in said set, the escaped form is !, a delimiter character, the hexadecimal position of the rune within the numerically ordered set, a second delimiter character, and finally the lowest rune in the set. The choice of delimiter is not relevant, though it must not be part of any case folding set, nor be problematic punctuation. For an example, see ^2.
    3. Every remaining non-ASCII rune is encoded as !, a delimiter character, a bytewise hexadecimal representation of the rune, and a closing delimiter character. The choice of delimiter is not relevant, though it must not be problematic punctuation or the lowest rune in any case folding set.
  4. modPathOK now allows Unicode code points >= RuneSelf (that is, non-ASCII runes).
  5. fileNameOK now allows Unicode code points >= RuneSelf (that is, non-ASCII runes). ^4
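
The building blocks of steps 2 and 3 can be sketched roughly as below. This is one reading of the rule text, not an implementation of mod/module: the names foldSet, lowestRune, position, and hexEscape are hypothetical, the `_` delimiter is borrowed from the worked example in ^3, "bytewise hexadecimal" is assumed to mean the code point in hex, positions are assumed to be zero-based, and NFC normalization (step 3.1) is left out because the standard library does not provide it.

```go
package main

import (
	"fmt"
	"sort"
	"unicode"
	"unicode/utf8"
)

// foldSet returns every rune that case-folds to r (r included), sorted numerically.
func foldSet(r rune) []rune {
	set := []rune{r}
	for next := unicode.SimpleFold(r); next != r; next = unicode.SimpleFold(next) {
		set = append(set, next)
	}
	sort.Slice(set, func(i, j int) bool { return set[i] < set[j] })
	return set
}

// lowestRune applies the proposal's definition to a sorted fold set:
// the numerically lowest rune, unless the set has ASCII members, in which
// case the numerically highest ASCII member is chosen.
func lowestRune(set []rune) rune {
	low := set[0]
	for _, r := range set {
		if r < utf8.RuneSelf {
			low = r // input is sorted, so the last ASCII rune seen is the highest
		}
	}
	return low
}

// position reports r's zero-based index within its sorted fold set
// (assumption: the "order of the rune in the set" counts from zero).
func position(r rune) int {
	set := foldSet(r)
	return sort.Search(len(set), func(i int) bool { return set[i] >= r })
}

// hexEscape sketches step 3.3 for a non-ASCII rune: '!', a delimiter,
// the code point in hex, and a closing delimiter (assumed to be '_').
func hexEscape(r rune) string {
	return fmt.Sprintf("!_%X_", r)
}

func main() {
	for _, r := range []rune{'A', 'k', '\u212A', '\u00C9'} {
		set := foldSet(r)
		fmt.Printf("%q: set=%q lowest=%q position=%d\n", r, set, lowestRune(set), position(r))
	}
	fmt.Println(hexEscape('§')) // prints !_A7_
}
```

Under this reading the ASCII exception keeps 'a' as the representative of {A, a}, while a purely non-ASCII pair such as {U+00C9, U+00E9} is represented by its numerically lower member.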

This proposal makes an effort to be "forward-compatible" in the sense that any future improvements may be layered on top. The existing delimiters will retain their meanings, but may be marked as deprecated. New delimiters can then be introduced with new meanings.

Weak points of the proposal (will be updated if any additional concerns are raised):

  1. Lossless unescaping depends on every newly introduced Unicode character that joins an existing case-fold-equivalence set having the highest value in that set (r > any other r in the set), so that the positions of existing members stay stable. This can be bypassed by dropping this part of the proposal, as case-equivalence ceases to be a concern under this escaping mode.
  2. It's a relatively "verbose" encoding (despite efforts to minimize this using compacted canonical representations). A simpler proposal is certainly possible, depending on whether the potential disadvantages are deemed acceptable.

^1: NFC is appropriate here, as a canonical equivalent form is desired. C is used rather than D to minimize payload size in subsequent parts. NFD may be used as well. NFKC and NFKD would not be appropriate.
^2: For example, let's pretend a, b, c, and d all case-fold into each other. The representation for c may be, for example, !.2.a.
^3: A complete example; the following two comma-separated lists are the original and escaped forms respectively:
a, U+212A ('K' for Kelvin), U+00E9 (é), U+0045 U+0301 (É), A
a, !|2|!_004B_ (K (U+004B) + 2 = K for Kelvin), !_E9_, !!_E9_, !a
^4: This should be fine as, to my knowledge, all problematic punctuation is below RuneSelf. If this is wrong, this will be adjusted. Notably, if the rest of the proposal is not of interest, at least this part should be addressed. I've seen § used in the wild in file names in test cases, and making those usable via Go modules is low-hanging fruit.
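
As a rough illustration of what ^4 asks for, the sketch below shows a relaxed per-rune file name check. Everything here is hypothetical: fileRuneOK and asciiAllowed are stand-ins rather than the code in module.go, the ASCII allow-list is only an approximation of the current policy, and restricting non-ASCII runes to unicode.IsPrint is an assumption the proposal does not make.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// asciiAllowed is a hypothetical stand-in for the ASCII punctuation policy;
// the authoritative list lives in module.go and may differ.
const asciiAllowed = "!#$%&()+,-.=@[]^_{}~ "

// fileRuneOK is a hypothetical relaxed check: ASCII stays on a restrictive
// allow-list, while printable non-ASCII runes are accepted (assumption).
func fileRuneOK(r rune) bool {
	if r < utf8.RuneSelf {
		return unicode.IsLetter(r) || unicode.IsDigit(r) || strings.ContainsRune(asciiAllowed, r)
	}
	return unicode.IsPrint(r)
}

func main() {
	for _, r := range []rune{'a', '§', '*', '\u0085'} {
		fmt.Printf("%U ok=%v\n", r, fileRuneOK(r))
	}
}
```

With a check along these lines, a test file named with § would pass, while shell-special ASCII characters such as * stay rejected.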

P.S. I'm drafting this late at night, and intend to update this based on feedback. In case ^4 is relevant, I can create a separate proposal.

@gopherbot added this to the Proposal milestone on May 21, 2024
@seankhliao
Member

Duplicate of #45549

@seankhliao marked this as a duplicate of #45549 on May 21, 2024
@seankhliao closed this as not planned on May 21, 2024
@rsc reopened this on May 23, 2024
@rsc
Contributor

rsc commented May 23, 2024

Reopening because it is a proposal. #45549 would need a proposal if we wanted to move forward with it, so might as well be this one.

@seankhliao changed the title from "proposal: mod/module: allow unicode in module paths and file names" to "proposal: cmd/go: allow unicode in module paths and file names" on May 23, 2024
@ianlancetaylor moved this to Incoming in Proposals on May 23, 2024