Improve diagnostics for non XID characters in identifiers #86102
We probably should rework the parser to accept these as identifiers, and reject them in a later pass.
@estebank It's not that simple; bear in mind that this includes other punctuation characters, whitespace, and such. A lot of these things are not identifier-like. We may need to figure out a heuristic for "identifier-like", or we can just show a better error for all cases.
@rustbot claim
That's fair. I left out that the recovery should be sane and only be done for things that could be an identifier, but completely agree.
That is something that applies to all errors 🌲
Note that it is possible to use Unicode general categories to tell "could be an identifier" apart from "probably not", but it's still iffy.
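For illustration only, here is a minimal sketch of what a general-category filter could look like, assuming the unicode-general-category crate (the crate choice and the variant names are assumptions, not what rustc actually does):

```rust
// Hypothetical "identifier-like" filter based on Unicode general categories:
// letters and numbers count as plausible identifier material, while symbols,
// punctuation, and whitespace do not.
use unicode_general_category::{get_general_category, GeneralCategory};

fn looks_identifier_like(c: char) -> bool {
    matches!(
        get_general_category(c),
        GeneralCategory::UppercaseLetter
            | GeneralCategory::LowercaseLetter
            | GeneralCategory::TitlecaseLetter
            | GeneralCategory::ModifierLetter
            | GeneralCategory::OtherLetter
            | GeneralCategory::DecimalNumber
            | GeneralCategory::LetterNumber
    )
}
```

Such a filter would still misclassify some inputs, which is why it only steers the wording of a diagnostic rather than changing what the lexer accepts.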
How does the
@jeanlucthumm just ask questions here, I'll help
@Manishearth Would it be wrong to say that anything we don't explicitly call punctuation in the lexer is just identifier-like by default? Because we have a well defined set of what we use as punctuation, e.g.

Also, the only other case I can think of where you would use non-XID characters that were also not intended to be part of an identifier is if you accidentally typoed them in a place where identifiers cannot semantically be. But in the off chance this happens, the error message would still be more than enough to let you know you have a random typo, and I don't think having the compiler label the character as a potential identifier would confuse the user enough to prevent them from fixing it.

If we don't consider that wrong, then a satisfactory solution would be to link them to the formal identifier definition.
I think that's not a good assumption, because special characters often get copied in by accident; intentional use isn't the only thing. Curly quotes being copied in (because Word transforms them) is common. We may even at some point expand the lexer to allow numbers in other writing systems!

I'm unsure if we should link to the formal definition here (@estebank, thoughts?), but I think just being clearer that that character isn't allowed in identifiers is enough? A decent heuristic for "identifier-like" might be filtering by
If we filter by

Also, what if a user puts a special punctuation character like the curly quotes in the middle of an identifier? Then the correct error would indeed be "You can't use curly quotes in an identifier", but the filtering by

But while the filtering will still lead to wrong error messages, I think I agree with you that it will be fewer than treating everything as identifier-like, because it's much less likely that e.g. punctuation characters will appear inside of identifiers and letter characters outside of identifiers. I think the core problem is that you can't use information from Unicode to make a statement like

Also, on second thought, I agree with not linking the formal definition, in the spirit of keeping things user friendly. I'll get started by going through
Yeah, precisely. This is kinda nuanced. I think a first pass can just improve the error message to say "only certain kinds of characters are allowed in identifiers", perhaps linking to the RFC. Idk. But yeah, if you want to look at figuring out a smarter heuristic that would be great, and it's not a huge deal if we get it wrong, since it's a diagnostic and we can iterate on it.
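As a purely illustrative reading of that first pass, the reporting could branch on such a heuristic; the function and message texts below are hypothetical, not rustc's actual diagnostics:

```rust
// Hypothetical message selection: keep the existing unknown-token error, but
// add the identifier hint only when the heuristic says the character was
// plausibly meant as part of an identifier.
fn unknown_char_message(c: char, identifier_like: impl Fn(char) -> bool) -> String {
    if identifier_like(c) {
        format!(
            "unknown start of token: `{c}`; only certain kinds of characters \
             are allowed in identifiers"
        )
    } else {
        format!("unknown start of token: `{c}`")
    }
}
```

Getting the heuristic wrong only changes which hint is shown, which is what makes it cheap to iterate on.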
I ran into a very problematic category
The symbol
To solve the issue of

In the meantime, I have a proof of concept in the

But I also have a few questions:
I assume this is because I added an external dependency. Is there something special you have to do for rustc every time you modify the Cargo.toml file?
@jeanlucthumm You should add that to PERMITTED_DEPENDENCIES in rust\src\tools\tidy\src\deps.rs
@jeanlucthumm Yeah, so emoji are kinda scattered everywhere; it's hard to tell with the block/etc. property. You're looking for

I think delaying the diagnostic would be very hard, since this stuff has to be rejected at lex time. I think it's totally fine to treat all

As far as the tidy check goes, yes, that's to prevent additional deps from being pulled in.
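The property being referred to is cut off above; purely as an illustration, a property-based check (rather than a block/range check) might look like this, assuming the unic-emoji-char crate, which is not necessarily what was meant here:

```rust
// Emoji are scattered across many blocks, so a property lookup is more robust
// than checking code point ranges.
fn is_emoji_like(c: char) -> bool {
    // ASCII digits, `#`, and `*` also carry the Emoji property (they can start
    // keycap sequences), so restrict the check to non-ASCII characters.
    !c.is_ascii() && unic_emoji_char::is_emoji(c)
}
```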
A few things:

So I'm basically done with the code changes I wanted to make, but there are quite a few broken UI tests as a result. Before I go through and fix them, I just wanted to get a second opinion on the error messages:
The current heuristic is to print the help message about identifiers if both of the following are true:
I figured point 2 would be a good addition since we only suggest punctuation characters as substitutions, but given your point that some of these could also be considered emoji, we can change that back if you'd like.

I'm running into issues trying to minimize dependencies. For the unicode segmentation I picked
And the latter has no dependencies but tidy complained that it has a bad license:
Yeah, I'd change it back. I would go for unicode-general-category.

@wezm are you open to relicensing it?

@rust-lang/compiler I believe Apache-only crates are allowed if they're dependencies of the compiler but not the stdlib, yes? So we can opt out with:

rust/src/tools/tidy/src/deps.rs, lines 24 to 27 in ac8c3bf
@jeanlucthumm, can you make the lexer give back a new placeholder identifier? I would love it if we could get rid of the "expected pattern, found

You could also tweak the output to be closer to
If we have graceful recovery, we might also want to handle things like

    fn 🍮() -> i32 { 4 }
    let my_🦀 = 🍮() ➖ 🍮(); // `: i32 = 0`

where we would hopefully only have the lexer errors and nothing else.
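A minimal sketch of the placeholder-identifier recovery being discussed, with simplified stand-in types rather than rustc's real lexer and parser structures:

```rust
// The lexer records an error for the offending character but still hands the
// parser an identifier-shaped token, so parsing of patterns, fn names, etc.
// continues and only the lexer error is reported.
enum TokenKind {
    Ident(String),
    // ... other kinds elided
}

fn recover_invalid_ident_char(c: char, errors: &mut Vec<String>) -> TokenKind {
    errors.push(format!("character `{c}` is not valid in identifiers"));
    // Treat the character as if it were an identifier; a later pass rejects
    // it instead of the parser tripping over an unknown token.
    TokenKind::Ident(c.to_string())
}
```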
I suspect a more general solution is a one-codepoint (or grapheme, if we care about that) "error" token that would cause the parser to give up halfway through parsing.
Yeah sorry, had to deep dive
Which just skips over unknown tokens as if they were whitespace to be ignored. This is where we would instead return a new error

Then in

Currently doing these changes. This only solves the issue for patterns, though. This exception would have to be hardcoded in all parsing paths, like function identifiers from @estebank's example. Will go through them.
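For concreteness, the change being described might look roughly like this (made-up names, not the real rustc_lexer code):

```rust
// Instead of falling through to "skip it like whitespace", an unrecognized
// character gets its own token kind so later stages can report it and recover
// at the right place.
enum RawTokenKind {
    Whitespace,
    InvalidChar(char),
    // ... real lexers have many more kinds
}

fn classify_unknown(c: char) -> RawTokenKind {
    if c.is_whitespace() {
        RawTokenKind::Whitespace
    } else {
        RawTokenKind::InvalidChar(c)
    }
}
```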
Hmm, what happens when you add a new
Hit a particularly busy patch of life at the moment, should be free again in a month. Unassigning for now, and if no one has claimed this by then I will work on it again. |
I checked to see how hard it would be to accept emojis using
We should also use the confusables table to take whitespace and token-like unicode chars (like the Greek question mark), error on them in the lexer, and translate them all to their valid ASCII lookalike token. The confusables table we use isn't exhaustive, but it should be enough to improve the situation with things like smart quotes. Today we recover somewhat, but we have too many knock-down errors afterwards. This and the change above to the lexer would be a net user experience improvement. Now, if someone tries to use a smart quote, for example, as an identifier... 🤷♀️
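A tiny sketch of the confusables idea; the real table is much larger, and these entries are only examples:

```rust
// Map a handful of common lookalike characters to the ASCII token they were
// probably meant to be, so the lexer can error on them and then continue as
// if the intended token had been written.
fn ascii_lookalike(c: char) -> Option<char> {
    Some(match c {
        '\u{037E}' => ';',               // Greek question mark
        '\u{201C}' | '\u{201D}' => '"',  // curly double quotes
        '\u{2018}' | '\u{2019}' => '\'', // curly single quotes
        '\u{2212}' => '-',               // Unicode minus sign
        _ => return None,
    })
}
```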
Opened #88781 after cleaning up the code I linked earlier. It only handles identifiers; namely, things that look like emoji are treated as XID_Start, and emoji + ZWJ are considered XID_Continue and rejected after lexing.

Edit: BTW, the output for the earlier example with this PR:
Edit 2: it now handles confusables:
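For reference, a sketch of the two-phase treatment the PR describes: accept emoji as identifier characters while lexing, then reject the whole identifier afterwards. The crate choices (unicode-xid, unic-emoji-char) are assumptions, and this is not the PR's actual code:

```rust
use unicode_xid::UnicodeXID;

// During lexing, emoji count as identifier characters so they are grouped
// into a single identifier token instead of producing an unknown-token error
// per code point.
fn ident_start(c: char) -> bool {
    c.is_xid_start() || (!c.is_ascii() && unic_emoji_char::is_emoji(c))
}

fn ident_continue(c: char) -> bool {
    c.is_xid_continue()
        || (!c.is_ascii() && unic_emoji_char::is_emoji(c))
        || c == '\u{200D}' // zero-width joiner, used inside emoji sequences
}

// After lexing, an identifier containing emoji is rejected with one targeted
// error instead of a cascade of knock-down parse errors.
fn check_ident(ident: &str) -> Result<(), String> {
    if ident.chars().any(|c| !c.is_ascii() && unic_emoji_char::is_emoji(c)) {
        Err(format!("identifiers cannot contain emoji: `{ident}`"))
    } else {
        Ok(())
    }
}
```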
Tokenize emoji as if they were valid identifiers
In the lexer, consider emojis to be valid identifiers and reject them later to avoid knock-down parse errors. Partially addresses rust-lang#86102.
@Manishearth can you verify what might be the outstanding work to close this ticket?
@estebank I think we're good!
errors with
For non-ASCII characters, we should perhaps error with something better. I don't know what the error text should be (cc @estebank), because it's not just emoji, and there's no easy way to define "XID characters" without just linking to the spec, which seems bad. Maybe we can link to the reference?
Tagging as easy since the implementation isn't tricky, but we will probably need to figure out a good error message.
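For context on what "XID characters" means in practice: an identifier must start with an XID_Start character (or an underscore) and continue with XID_Continue characters, which the unicode-xid crate exposes directly. A sketch for illustration, not the lexer's actual code:

```rust
use unicode_xid::UnicodeXID;

// The same shape of rule the lexer enforces: `_` or an XID_Start character
// first, XID_Continue characters after that.
fn is_valid_identifier(s: &str) -> bool {
    let mut chars = s.chars();
    match chars.next() {
        Some(c) if c == '_' || c.is_xid_start() => {}
        _ => return false,
    }
    chars.all(|c| c.is_xid_continue())
}
```

Under this rule `naïve` is a valid identifier while `🦀` is not, which is the distinction the error message needs to convey without sending the user to the spec.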