Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify “letters” in spec #682

Open
toastal opened this issue Jan 19, 2024 · 10 comments
Open

Clarify “letters” in spec #682

toastal opened this issue Jan 19, 2024 · 10 comments

Comments

@toastal
Copy link
Contributor

toastal commented Jan 19, 2024

Must contain only letters and digits, optionally separated by hyphens
Must begin with a letter or digit

Are uppercase letters allowed? Is lower é a letter? Is lower α? Is ? Is ? Is 🍉?

acceptedChars = Parsing.Combinators.choice [ Parsing.String.Basic.lower, Parsing.String.Basic.digit ] <|> Parsing.fail charErr

https://pursuit.purescript.org/packages/purescript-parsing/10.2.0/docs/Parsing.String.Basic#v:lower

I’m assuming using lower allows anything from a bicameral script (just like how PureScript identifiers work). Asking because I got really frustrated by OCaml+OPAM’s limitations to ASCII only for packages/identifiers/modules, & this spec isn’t clear on what ’letter’ means.

@thomashoneyman
Copy link
Member

thomashoneyman commented Jan 19, 2024

cc: @f-f on this, as I don't personally have an opinion. As far as I know the intention is to accept lower-cased Unicode characters in UTF-8, minus some special characters, but perhaps it was meant to be ASCII all along.

@thomashoneyman
Copy link
Member

There are some explicit test cases here:

invalid :: Array (Tuple String String)
invalid = do
let startErr = "Package name should start with a lower case char or a digit"
let midErr = "Package name can contain lower case chars, digits and non-consecutive dashes"
let prefixErr = "Package names should not begin with 'purescript-'"
let endErr = "Package name should end with a lower case char or digit"
let manyDashesErr = "Package names cannot contain consecutive dashes"
let expectedEOF = "Expected EOF"
[ "-a" /\ startErr
, "double--dash" /\ manyDashesErr
, "BIGLETTERS" /\ startErr
, "some space" /\ midErr
, "a-" /\ endErr
, "" /\ startErr
, "🍝" /\ startErr
, "abc-🍝-abc" /\ expectedEOF
, "purescript-aff" /\ prefixErr
]

@toastal
Copy link
Contributor Author

toastal commented Jan 19, 2024

I hope there isn’t like a filesystem limitation. It would be awesome if the support was there so you could have résumé-builder or β-reduction.

@toastal
Copy link
Contributor Author

toastal commented Jan 19, 2024

Regardless, I think there is room to spell out the English in the spec a bit more clearly. …And add some clarifying test cases like hammer-of-þórr.

@toastal
Copy link
Contributor Author

toastal commented Jan 19, 2024

Actually I think there would be something important to bring up: Unicode confusion. This is Latin a, this is Cyrillic а. These can be used in certain attack vectors that haven’t yet been looked at.

If I were to suggest a solution I would look at the (now defunct sad) ocaml-m17n.

Can't look-alike characters like a and а be confusing?

They can.

However, m17n issues a warning if more than one script is used in an identifier, hopefully handling most of the confusing cases. You can still use several scripts if you separate them by underscores, e.g. show_色.

Additionally, m17n issues a warning if any two identifiers look alike enough to be visually confusable.

Which in this case would mean each kebab-separated ‘word’ can only be of one script type--which is now adding quite a bit of complexity with the trade-off being safety.

@JordanMartinez
Copy link
Contributor

However, m17n issues a warning if more than one script is used in an identifier, hopefully handling most of the confusing cases. You can still use several scripts if you separate them by underscores, e.g. show_色.
Additionally, m17n issues a warning if any two identifiers look alike enough to be visually confusable.

So, the proposed goal here would be to

  1. allow users to use non-ascii characters in a package name
  2. prevent users from installing a wrong/malicious package due to using lookalike characters

I'd also guess that another goal is to make it possible to type such a package name for all users. For example, a user with a Latin keyboard may not know how to type a CJK character whereas the reverse is often true

There also seems to be a spectrum for how to deal with this issue:

  1. make this impossible via a whitelist: prevent users from using non-ascii characters
  2. make this impossible via a blacklist: ban lookalike characters from being used, such that only one of such character sets is ever permitted (e.g. Latin a is used and all other as like Cyrillic a is banned)
  3. make this unlikely by warning the user: if multiple scripts are detected, notify the user and force the user to accept accept it before continuing
  4. make this impossible by forcing the user to opt-in to some additional non-textual identifier. For example, multiple script usage is allowed but such usages must be prefixed by ascii digits that correspond to a hash of the name. Thus, you could have resume-builder and 13452#résumé-builder. The usage of 13452 would clearly indicate whether or not the package is the intended one or not.

@f-f
Copy link
Member

f-f commented Jan 19, 2024

There are several barrels of worms that would be cracked open by allowing the whole of unicode in these names - no one can type them directly, a series of attack vectors, etc.

The original intent was to only allow ASCII, and I strongly feel we should stick to that since I don't believe the upside of more varied package names is worth the hassle here.

@toastal
Copy link
Contributor Author

toastal commented Jan 20, 2024

@f-f @JordanMartinez I’m happy y’all got back since it does seem this is a can of worms indeed & should be clearer in the spec docs.

Personally, I like the idea of losing American-centrism (the ‘A’ in ASCII)--even if it still limited to bicameral scripts. Deburring résumé-builder to resume-bulder kinda sucks & leaves non-English speakers having to make compromises on package names, which greatly favors English package names since English, the language, likes to pretend all the accents don’t matter since our vowels are an absolute mess with or without accents. Has anyone tested if you can currently push multiple languages? If so, does there need to be a bigger issue raised?

That said, I can understand favoring the simplicity (at least for now). In the case of OPAM, apparently can only upload packages to the registry named in ASCII, however, this shouldn’t be a limitation on OCaml package names that don’t plan to be added to the main public registry (tho in practice dune needs it to be a valid module name, & OCaml doesn’t non-ASCII module names). I would like clarification that I shouldn’t have an issue naming a project pokémon-center or déjà-vu-café or prelüde even if it were not to be uploaded? Also if a user/org ran a private registry could they hypothetically relax those rules (say they are based in Ukraine & prefer Cyrillic package names as their PureScript identifiers are currently supported in Cyrillic)?

Which goes out even further on a tangent: is PureScript Registry project even capable of allowing private registries? Are there mirrors for the public registry for resilience like in the Perl community? (I think about this a lot after seeing various GitHub & SourceHut outages).

@f-f
Copy link
Member

f-f commented Jan 20, 2024

leaves non-English speakers having to make compromises on package names

I'm not a native English speaker - Italian living in Finland, so I get to use all the various äåöèéàòùì on a regular basis - and no, not having characters outside of ASCII available for package names does not really bother me, which makes me think this is not such a universal issue.
FWIW the PureScript compiler does not accept non-ASCII characters in module names, so that'd be the place to look at first.

I would like clarification that I shouldn’t have an issue naming a project pokémon-center or déjà-vu-café or prelüde even if it were not to be uploaded?

Spago uses the same package-name parsing code as the Registry, so you wouldn't be able to build your project with it. Other things might work I guess?

is PureScript Registry project even capable of allowing private registries? Are there mirrors for the public registry for resilience like in the Perl community? (I think about this a lot after seeing various GitHub & SourceHut outages).

Private registries and mirrors are not the same thing - Spago allows for private package sets because everyone needs custom packages at some point, but we purposefully did not choose to allow for private registries to avoid incentivising fragmentation. E.g. I'm not a fan of the OCaml situation, where every big company has their own compiler and package ecosystem.

Mirrors of the official registry instead have been accounted for, and are on the roadmap. I do not recall if the spec contains any mention of this yet, but the original RFC calls this "storage backends"

@toastal
Copy link
Contributor Author

toastal commented Jan 21, 2024

Sounds like we at least have some new test cases + descriptions to add regardless 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants