Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix unhandled cases for exotic idents #6658

Closed
wants to merge 13 commits into from

Conversation

cometkim
Copy link
Member

@cometkim cometkim commented Feb 27, 2024

Changes

Note this is an important feature for Gentype(#6196) and Library Mode(#6210), which are assumed to be used by users on the JS side.

Rationale

In OCaml AST, there are two separate identifier classes Lident(lowercase ident) and Uident(uppercase ident), depending on whether the first letter of the identifier is lowercase or uppercase.

This is useful internally to distinguish whether the identifier is for a module or a variable/type binding.

e.g. https://github.com/rescript-lang/rescript-compiler/blob/17d383f/jscomp/ml/path.ml#L87-L91

let is_uident s =
  assert (s <> "");
  match s.[0] with
  | 'A'..'Z' -> true
  | _ -> false

ReScript supports exotic labels for defining bindings starting with uppercase letters. The ReScript parser removes the surrounding wrapper when parsing exotic labels and forces it to be a Lident node.

let \"Upper" = None

So this produces Lident(name = "Upper") node in OCaml AST.

However, the OCaml type system uses the previously mentioned is_uident for path analysis. As a result, exotic idents that start with an uppercase letter are unintentionally treated as modules internally. These are cases OCaml does not assume, and it ends up raising invariant errors deep in the runtime. (#6539)

OCaml itself has no knowledge of exotic ident, so it cannot determine whether the original declaration was exotic or not. For compatibility reasons, the AST cannot be changed to support it.

In the compiler, knowledge of exotic idents existed only in the printer. Here again, since there is no information passed from the grammar, some heavy logic (e.g. printIdentLike) is needed everywhere to recheck whether a label should be treated as exotic or not.

This change solves the problem by introducing several assumptions.

  • When parsing exotic idents, the parser passes the surrounding chars as-is. So OCaml will treat it as a Lident too (because '\' <> A .. Z), and we can easily recheck whether an ident name is treated as exotic or not by checking whether the first letter is the backslash.

  • Extoic uident is explicitly not supported (e.g. module \"lowercase") . There is no way to distinguish between uppercase lident and lowercase uident without breaking change on AST.

  • It's not the parser's job to provide "pretty labels", so we'll have to do that elsewhere where we need it. For example, the printer normalizes names using the ext_ident.unwrap_exotic utility where necessary.

@cometkim cometkim force-pushed the exotic-ident branch 5 times, most recently from 406746c to c582cf4 Compare March 26, 2024 18:19
@cometkim

This comment was marked as resolved.

@cometkim cometkim mentioned this pull request Apr 2, 2024
6 tasks
@cometkim cometkim changed the title properly handle exotic ident Fix unhandled cases for exotic idents Apr 2, 2024
@cometkim cometkim marked this pull request as ready for review April 2, 2024 13:11
@cometkim
Copy link
Member Author

cometkim commented Apr 2, 2024

@zth @cristianoc can you review this? these changes have a quite broad impact on other modules and may need to be understood by other compiler devs.

let startPos = position scanner in
let startOff = scanner.offset in
let closed = ref false in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really necessary to implement with closed? It seems like I only need to wrap \" and " at the end of my existing implementation if it's not an infix operator.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes a path to skip when a token seq like \\\"\n is entered.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't \\\"\n catched by this existing line?

| '\n' | '\r' ->
      (* line break *)
      let endPos = position scanner in
      scanner.err ~startPos ~endPos
        (Diagnostics.message "A quoted identifier can't contain line breaks.");
      next scanner

Copy link
Member Author

@cometkim cometkim Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then it will possibly be unsafe to substring

It could obviously be replaced with unwrap_ident, but it has the same kind of assertion exists in it. I did it this way because it is simpler. There is no reason for the scanner to be more inefficient.

@zth
Copy link
Collaborator

zth commented Apr 29, 2024

I don't have bandwidth to review this right now, sorry for keeping you waiting. @cristianoc you got time to review this?

@cometkim cometkim force-pushed the exotic-ident branch 2 times, most recently from 7f35700 to 5d38f55 Compare May 23, 2024 21:35
@zth
Copy link
Collaborator

zth commented May 25, 2024

cc @cristianoc we need your eyes on this.

@cometkim cometkim force-pushed the exotic-ident branch 3 times, most recently from cbc01bd to 81f544b Compare May 25, 2024 10:30
@cometkim
Copy link
Member Author

No worry, I just requested his review in person

cometkim added a commit to cometkim/rescript that referenced this pull request May 25, 2024
Duplicates rescript-lang#6658, but with much smaller changes

As follows @cristianoc review, I tried to reduce possible affected surface to only exotic uident.
cometkim added a commit to cometkim/rescript that referenced this pull request May 25, 2024
Duplicates rescript-lang#6658, but with much smaller changes

Reflecting @cristianoc ' review, I tried to reduce possible affected surface to only exotic uident.
cometkim added a commit to cometkim/rescript that referenced this pull request May 25, 2024
Duplicates rescript-lang#6658, but with much smaller changes

Reflecting on @cristianoc ' review, I tried to reduce possible affected surface to only exotic uident.
@cometkim
Copy link
Member Author

Close this in favor of #6777

But avoiding to use printerIdentLike is still can be an additional optimization (which improve parser's speed about 5~10%)
Maybe I'll introduce it in another PR

@cometkim cometkim closed this May 25, 2024
cristianoc pushed a commit that referenced this pull request May 25, 2024
Duplicates #6658, but with much smaller changes

Reflecting on @cristianoc ' review, I tried to reduce possible affected surface to only exotic uident.
@cometkim cometkim deleted the exotic-ident branch May 25, 2024 15:01
@cometkim cometkim mentioned this pull request Jun 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unexpected compiler error occurs on PascalCase type names
3 participants