proposal: cmd/go: allow unicode in module paths and file names #67562

CosmicToast · 2024-05-21T21:46:44Z

Proposal Details

mod/module's spotty Unicode support is noted in module.go and has caused issues in the wild.
The challenges involved are as follows:

Look-alike characters open the door to abuse. (@rsc explicitly called this out as a concern, so this proposal addresses it by encoding all Unicode characters.)
Current code presumes exactly two case-folded states (upper and lower case), while Unicode has sets of more than two runes that all case-fold to one another.
Complete Unicode support would require supporting Unicode marks, which introduces the additional problem of equivalent representations.
This proposal seeks to address all of these in a backward- and forward-compatible manner.
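
To make the second and third challenges concrete, here is a minimal Go sketch using only the standard library. `foldOrbit` is a hypothetical helper name, not part of any existing package; it walks `unicode.SimpleFold` to collect every rune that case-folds to the input. The second half shows two canonically equivalent spellings of é that compare unequal as byte strings.

```go
package main

import (
	"fmt"
	"unicode"
)

// foldOrbit (hypothetical helper) returns r followed by every other rune
// that unicode.SimpleFold reaches from it before cycling back to r.
func foldOrbit(r rune) []rune {
	orbit := []rune{r}
	for c := unicode.SimpleFold(r); c != r; c = unicode.SimpleFold(c) {
		orbit = append(orbit, c)
	}
	return orbit
}

func main() {
	// 'K' has three case-fold-equivalent runes: K, k, and the Kelvin sign.
	fmt.Printf("%U\n", foldOrbit('K')) // [U+004B U+006B U+212A]

	precomposed := "\u00e9" // é as a single code point
	decomposed := "e\u0301" // e followed by a combining acute accent
	// Same rendered text, different bytes: prints "false 2 3".
	fmt.Println(precomposed == decomposed, len(precomposed), len(decomposed))
}
```

Without normalization, the two spellings in the second half would be treated as different paths even though they render identically.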

The proposal:

unicode implements the Normalization Form C (NFC) algorithm.^1
From here on, the "lowest rune" in a case-folding set is the numerically lowest rune in the set, unless that rune is in the ASCII range, in which case it is the numerically highest ASCII rune in the set. The exception exists for the sake of compatibility with the current lowercase default.
The escaping process now works as follows: ^3
1. The entire string is NFC-normalized. This creates a canonical string that is then operated on.
2. Every rune that is part of a case folding set is transformed as follows:
  1. If the rune is already the lowest rune in its case folding set, it is unchanged.
  2. If there are only 2 runes in said set, the escaped form is ! followed by the lowest rune.
  3. If there are more than 2 runes in said set, the escaped form is !, followed by a delimiter character (the choice of character is not relevant, though it must not be part of any case folding set, nor be problematic punctuation), followed by the hexadecimal representation of the order of the rune in the set, numerically, followed by a delimiter character, followed by the lowest rune in the set. For an example, see ^2.
3. Every remaining non-ASCII rune is encoded as a !, followed by a delimiter character (the choice of character is not relevant, though it must not be problematic punctuation or the lowest rune in any case folding set), followed by a bytewise hexadecimal representation of the rune, followed by a delimiter character.
modPathOK now allows Unicode codepoints > RuneSelf.
fileNameOK now allows Unicode codepoints > RuneSelf. ^4
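
The "lowest rune" rule and escaping steps 2 and 3 can be sketched in Go as follows. This is one literal reading of the rules above, not a reference implementation, and it makes several assumptions: the NFC pass of step 1 is omitted (it is not available in the standard library), the delimiters `|` and `_` are borrowed from the example in ^3, hex digits are unpadded, "bytewise hexadecimal" is taken to mean the code point value, and input validation is skipped. All function names (`foldSet`, `lowest`, `plain`, `escapeRune`, `escape`) are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
	"unicode/utf8"
)

// foldSet returns every rune that simple-case-folds to r, sorted ascending.
func foldSet(r rune) []rune {
	set := []rune{r}
	for c := unicode.SimpleFold(r); c != r; c = unicode.SimpleFold(c) {
		set = append(set, c)
	}
	sort.Slice(set, func(i, j int) bool { return set[i] < set[j] })
	return set
}

// lowest applies the "lowest rune" rule to a sorted fold set: the smallest
// rune, unless it is ASCII, in which case the largest ASCII member.
func lowest(set []rune) rune {
	low := set[0]
	for _, c := range set {
		if c < utf8.RuneSelf {
			low = c // set is sorted, so the last ASCII member wins
		}
	}
	return low
}

// plain emits ASCII as-is and anything else in the assumed step 3 form.
func plain(r rune) string {
	if r < utf8.RuneSelf {
		return string(r)
	}
	return fmt.Sprintf("!_%X_", r)
}

// escapeRune applies steps 2.1 to 2.3, then step 3 to whatever remains.
func escapeRune(r rune) string {
	set := foldSet(r)
	low := lowest(set)
	switch {
	case r == low: // step 2.1 (also covers runes with no case variants)
		return plain(r)
	case len(set) == 2: // step 2.2
		return "!" + plain(low)
	default: // step 2.3: 0-based position of r in the sorted set
		idx := sort.Search(len(set), func(i int) bool { return set[i] >= r })
		return fmt.Sprintf("!|%X|%s", idx, plain(low))
	}
}

func escape(s string) string {
	var b strings.Builder
	for _, r := range s {
		b.WriteString(escapeRune(r))
	}
	return b.String()
}

func main() {
	fmt.Println(escape("aA\u212A§")) // a!a!|2|k!_A7_
}
```

Note that for a set with no ASCII member, such as {U+00C9, U+00E9}, the definition as written selects the uppercase U+00C9 as the lowest rune, so the output of this sketch for such pairs may differ from the lowercase-first forms shown in ^3.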

This proposal makes an effort to be "forward-compatible" in the sense that any future improvements may be layered on top. The existing delimiters will retain their meanings, but may be marked as deprecated. New delimiters can then be introduced with new meanings.

Weak points of the proposal (will be updated if any additional concerns are raised):

Unescaping without any information loss depends on any newly introduced Unicode characters that join an existing case-fold-equivalence set being the highest value (r > any other r in the set). This can be bypassed by dropping this part of the proposal, as case-equivalence ceases to be a concern under this escaping mode.
It's a relatively "verbose" encoding (despite efforts to minimize this using compacted canonical representations). A simpler proposal is certainly possible, depending on whether the potential disadvantages are deemed acceptable.
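
The first weak point can be illustrated with the pretend letters from ^2. The sketch below is a toy model, not real Unicode data: `orderIn` is a hypothetical helper that computes the 0-based position step 2.3 would store, and the two sets stand in for a case-folding set before and after a new member is added below an existing one.

```go
package main

import (
	"fmt"
	"sort"
)

// orderIn (hypothetical helper) returns the 0-based position of r in the
// numerically sorted set, or -1 if r is not a member.
func orderIn(set []rune, r rune) int {
	sorted := append([]rune(nil), set...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	for i, c := range sorted {
		if c == r {
			return i
		}
	}
	return -1
}

func main() {
	before := []rune{'a', 'b', 'd'}     // pretend fold set today
	after := []rune{'a', 'b', 'c', 'd'} // 'c' joins later, below 'd'
	// The stored order of 'd' shifts from 2 to 3, so an escape written
	// against the old set would decode to 'c' against the new one.
	fmt.Println(orderIn(before, 'd'), orderIn(after, 'd')) // 2 3
}
```

If new members only ever sort above the existing ones, the positions of the existing members stay fixed, which is the condition the weak point describes.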

^1: NFC is appropriate here, as a canonical equivalent form is desired. C is used rather than D to minimize payload size in subsequent parts. NFD may be used as well. NFKC and NFKD would not be appropriate.
^2: For example, let's pretend a, b, c, and d all case-fold into each other. The representation for c may be, for example, !.2.a.
^3: A complete example; the two following ,-separated strings are the original and escaped example forms respectively:
a, U+212A ('K' for Kelvin), U+00E9 (é), U+0045 U+0301 (É), A
a, !|2|!_004B_ (K (U+004B) + 2 = K for Kelvin), !_E9_, !!_E9_, !a
^4: This should be fine as, to my knowledge, all problematic punctuation is below RuneSelf. If this is wrong, this will be adjusted. Notably, if the rest of the proposal is not of interest, at least this part should be addressed. I've seen § used in the wild in file names in test cases, and it's low-hanging fruit to get those usable via Go modules.
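
As a quick check on why § is a comparatively easy case, the following Go sketch classifies it with standard library calls only. `classify` is a hypothetical helper; it reports whether a rune is outside ASCII, whether it is a letter, and whether it has any case variants.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// classify (hypothetical helper) reports three properties of r:
// outside the ASCII range, a Unicode letter, and free of case variants.
func classify(r rune) (nonASCII, letter, caseless bool) {
	return r >= utf8.RuneSelf, unicode.IsLetter(r), unicode.SimpleFold(r) == r
}

func main() {
	nonASCII, letter, caseless := classify('§') // U+00A7
	fmt.Println(nonASCII, letter, caseless)    // true false true
}
```

Because § has no case variants, the case-folding steps of the escaping process do not affect it; only the character-class check decides whether it is accepted, and any rule that admits just letters above RuneSelf would reject it.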

P.S. I'm drafting this late at night, and intend to update this based on feedback. In case ^4 is relevant, I can create a separate proposal.

seankhliao · 2024-05-21T22:03:21Z

Duplicate of #45549

rsc · 2024-05-23T16:41:32Z

Reopening because it is a proposal. #45549 would need a proposal if we wanted to move forward with it, so might as well be this one.

CosmicToast added the Proposal label May 21, 2024

gopherbot added this to the Proposal milestone May 21, 2024

seankhliao marked this as a duplicate of #45549 May 21, 2024

seankhliao closed this as not planned May 21, 2024

rsc reopened this May 23, 2024

seankhliao changed the title ~~proposal: mod/module: allow unicode in module paths and file names~~ proposal: cmd/go: allow unicode in module paths and file names May 23, 2024

seankhliao added the modules label May 23, 2024

ianlancetaylor added this to Proposals May 23, 2024

ianlancetaylor moved this to Incoming in Proposals May 23, 2024

matloob mentioned this issue Jun 11, 2024

cmd/go: revisit allowed set of characters in module, import, and file paths #45549

Closed

seankhliao mentioned this issue Dec 15, 2024

proposal: embed: * pattern should support filenames with single quote #70852

Closed

seankhliao added the GoCommand cmd/go label Feb 22, 2025

dmitshur mentioned this issue Apr 1, 2025

cmd/go, testing: consider shrinking set of characters allowed/passed through in subtest names #73116

Open
