Detect confusing unicode characters and show the alternative... #29837

Merged (1 commit) on Nov 17, 2015

wafflespeanut
Contributor

fixes #25957

@rust-highfive
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @nikomatsakis (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. The way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@wafflespeanut
Contributor Author

@Manishearth So, here's some rough artwork for you :)

"rough", because I haven't checked it yet (just wanted to show the progress). I suppose we should also add tests for this?

@wafflespeanut
Contributor Author

Ah! limited to 100 chars (I thought it shared Servo's 120 chars limit)...

@@ -0,0 +1,156 @@
const ASCII_ARRAY: &'static [(char, &'static str)] = &[('_', "Low Line"), ('-', "Hyphen-Minus"),
Member

This needs the MPL header

@Manishearth
Member

Also, yes, there should be some parse-fail tests for this.

@wafflespeanut wafflespeanut force-pushed the unicode_chars branch 3 times, most recently from d5a4945 to 25a86fa Compare November 15, 2015 07:31
('}', "Right Curly Brace"),
('*', "Asterisk"),
('/', "Slash"),
('\\', "Back Slash"),
Member

single word

@wafflespeanut
Contributor Author

@Manishearth I've made the changes, going for a variation of what you'd suggested (our checking method takes the StringReader and emits diagnostics depending on the situation). I also need some clarification regarding the test. It checks for an error that rustc was already emitting before this change; I've only added the help message alongside it. Is the test really necessary, or should we test this in a different way?


fn main() {
let y = 0;
//~^ ERROR unknown start of token: \u{37e}
Member

put the help message here too

compile-fail tests don't require all helps and notes to be listed, but if you do list a help or note, and the program fails to emit it, the test will fail.
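
The rule described above can be modeled in a few lines. This is a minimal sketch, not the test harness's actual implementation: the `Kind`, `Diag`, and `annotations_satisfied` names are hypothetical, and the requirement that every emitted error be listed is an added assumption that the comment does not state.

```rust
/// Diagnostic kinds, mirroring the ERROR / HELP / NOTE annotation labels.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kind {
    Error,
    Help,
    Note,
}

/// A diagnostic reduced to its kind and message text (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Diag {
    kind: Kind,
    msg: String,
}

fn diag(kind: Kind, msg: &str) -> Diag {
    Diag { kind, msg: msg.to_string() }
}

/// Returns true when the test would pass under the simplified rule:
/// 1. every listed annotation must match some emitted diagnostic, and
/// 2. (assumption) every emitted error must be listed.
/// Emitted helps and notes that are not listed are simply ignored.
fn annotations_satisfied(listed: &[Diag], emitted: &[Diag]) -> bool {
    let matches = |want: &Diag, got: &Diag| {
        got.kind == want.kind && got.msg.contains(want.msg.as_str())
    };
    let all_listed_emitted = listed
        .iter()
        .all(|want| emitted.iter().any(|got| matches(want, got)));
    let all_errors_listed = emitted
        .iter()
        .filter(|got| got.kind == Kind::Error)
        .all(|got| listed.iter().any(|want| matches(want, got)));
    all_listed_emitted && all_errors_listed
}
```

Under this model, listing only the error passes, listing the error plus the emitted help passes, and listing a help that the compiler never emits fails.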

@wafflespeanut
Contributor Author

@Manishearth r?

.map(|idx| {
    let (_, u_name, ascii_char) = UNICODE_ARRAY[idx];
    let span = make_span(reader.last_pos, reader.pos);
    match ASCII_ARRAY.iter().position(|&(c, _)| c == ascii_char) {
Member

use .find, not .position.

@Manishearth
Member

LGTM, small nits involving style.

For future reference, you should rarely be indexing arrays and the like in Rust. Most of the time you should use iterators (.position+indexing doesn't count 😄 ), and iterators are safer in that they can't cause additional panics due to out-of-bounds indexing.
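
To make the difference concrete, here is a small sketch of both lookup styles. The table is a three-entry illustrative subset rather than the PR's full `ASCII_ARRAY`, and the two function names are hypothetical.

```rust
/// Illustrative subset of an ASCII name table (not the PR's full list).
const ASCII_ARRAY: &[(char, &str)] = &[
    ('_', "Low Line"),
    ('-', "Hyphen-Minus"),
    (';', "Semicolon"),
];

/// Index-based lookup: `position` yields an index, which then has to be
/// fed back into an indexing expression on the same slice.
fn name_by_position(ascii_char: char) -> Option<&'static str> {
    ASCII_ARRAY
        .iter()
        .position(|&(c, _)| c == ascii_char)
        .map(|idx| ASCII_ARRAY[idx].1)
}

/// Iterator-based lookup: `find` hands back the matching element directly,
/// so no indexing expression (and no bounds check that could panic) is needed.
fn name_by_find(ascii_char: char) -> Option<&'static str> {
    ASCII_ARRAY
        .iter()
        .find(|&&(c, _)| c == ascii_char)
        .map(|&(_, name)| name)
}
```

Both functions return the same results; the `find` version just never mentions an index.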

@nikomatsakis
Contributor

This patch looks pretty decent. I second @Manishearth's suggestions. Also, note that there is a tidy error because some of the lines in the parse-fail test are more than 100 characters. You can add a comment like // ignore-tidy-linelength on that file.
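
As a rough illustration of what such a check does, the sketch below flags lines over 100 characters unless the file carries the opt-out comment. It is a simplified model and not the real tidy script: `MAX_WIDTH` and `overlong_lines` are hypothetical names, and counting by `char` rather than by byte is an assumption.

```rust
/// Maximum line width, taken from the 100-character limit mentioned above.
const MAX_WIDTH: usize = 100;

/// Returns the 1-based numbers of lines longer than `MAX_WIDTH` characters,
/// or an empty list when the file opts out with the ignore comment.
fn overlong_lines(contents: &str) -> Vec<usize> {
    if contents.contains("// ignore-tidy-linelength") {
        return Vec::new();
    }
    contents
        .lines()
        .enumerate()
        .filter(|(_, line)| line.chars().count() > MAX_WIDTH)
        .map(|(i, _)| i + 1)
        .collect()
}
```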

@wafflespeanut
Contributor Author

@nikomatsakis @Manishearth Agreed, thanks! (and done). r?

@@ -174,6 +174,9 @@ impl SpanHandler {
        self.handler.emit(Some((&self.cm, sp)), msg, Bug);
        panic!(ExplicitBug);
    }
    pub fn span_bug_no_panic(&self, sp: Span, msg: &str) {
        self.handler.emit(Some((&self.cm, sp)), msg, Bug);
    }
Member

I forgot to mention, can we add self.handler.bump_err_count(); here too?
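
A plausible reading of this request is that a non-panicking bug report should still mark the compilation as failed. The sketch below shows that idea with a hypothetical stand-in `Handler`: spans and the code map are left out, and only the method names `bump_err_count` and `span_bug_no_panic` come from the thread.

```rust
use std::cell::{Cell, RefCell};

/// Hypothetical stand-in for the diagnostic handler; it only records
/// messages and counts errors.
struct Handler {
    err_count: Cell<usize>,
    emitted: RefCell<Vec<String>>,
}

impl Handler {
    fn new() -> Self {
        Handler { err_count: Cell::new(0), emitted: RefCell::new(Vec::new()) }
    }

    /// Record a message (the real handler would print it with a span).
    fn emit(&self, msg: &str) {
        self.emitted.borrow_mut().push(msg.to_string());
    }

    fn bump_err_count(&self) {
        self.err_count.set(self.err_count.get() + 1);
    }

    fn has_errors(&self) -> bool {
        self.err_count.get() > 0
    }

    /// Report an internal bug without panicking. Bumping the counter means
    /// later "did anything go wrong?" checks still see a failure.
    fn span_bug_no_panic(&self, msg: &str) {
        self.emit(msg);
        self.bump_err_count();
    }
}
```

Without the `bump_err_count` call, the message would be recorded but `has_errors` would still report `false`.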

@Manishearth
Member

@bors r+

thanks!

@bors
Contributor

bors commented Nov 17, 2015

📌 Commit 7f63c7c has been approved by Manishearth

@wafflespeanut
Contributor Author

@Manishearth Thank you! :)

@Havvy
Contributor

Havvy commented Nov 17, 2015

❤️ 💓 ❤️

bors added a commit that referenced this pull request Nov 17, 2015
@bors
Contributor

bors commented Nov 17, 2015

⌛ Testing commit 7f63c7c with merge 1b26148...

@bors bors merged commit 7f63c7c into rust-lang:master Nov 17, 2015
@Manishearth
Member

😀 Congrats on your first PR!

@wafflespeanut wafflespeanut deleted the unicode_chars branch November 17, 2015 09:52
@brson brson added the relnotes Marks issues that should be documented in the release notes of the next release. label Nov 17, 2015
@brson
Contributor

brson commented Nov 17, 2015

Nice polish.

@huonw
Member

huonw commented Nov 17, 2015

Is there a reason this doesn't include U+201C LEFT DOUBLE QUOTATION MARK and U+201D RIGHT DOUBLE QUOTATION MARK as possible substitutions for "? (If not, I can submit a patch to add them.)

@Manishearth
Member

No reason. It contains the single quotes. I did Ctrl-F for those, but didn't bother to check the double quotes.

Go ahead!

@wafflespeanut
Contributor Author

It's again worth mentioning that this still doesn't have all the substitutions - only the printable ones from http://www.unicode.org/Public/security/revision-06/confusables.txt. So, feel free to add more :)

@huonw
Member

huonw commented Nov 18, 2015

Oh, I think I see why QUOTATION MARK was missed: in confusables.txt, things (including QUOTATION MARK itself) are considered confusable with the two-character sequence APOSTROPHE, APOSTROPHE rather than with ".
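
For illustration, this is roughly how entries for the two curly double quotes could sit next to the Greek question mark in a table of the shape the PR uses (confusable character, Unicode name, ASCII look-alike). It is a sketch under assumptions: both tables are tiny subsets, `confusable_help` is a hypothetical function, and the message wording is illustrative rather than quoted from the compiler.

```rust
/// (confusable character, its Unicode name, ASCII character it resembles).
/// A three-entry illustrative subset; the real table is much longer.
const UNICODE_ARRAY: &[(char, &str, char)] = &[
    ('\u{37e}', "Greek Question Mark", ';'),
    ('\u{201c}', "Left Double Quotation Mark", '"'),
    ('\u{201d}', "Right Double Quotation Mark", '"'),
];

/// Names for the ASCII characters used in the third column above.
const ASCII_ARRAY: &[(char, &str)] = &[(';', "Semicolon"), ('"', "Quotation Mark")];

/// Builds a help message for `ch` if it is a known confusable, else `None`.
fn confusable_help(ch: char) -> Option<String> {
    let &(_, u_name, ascii_char) = UNICODE_ARRAY.iter().find(|&&(c, _, _)| c == ch)?;
    let &(_, ascii_name) = ASCII_ARRAY.iter().find(|&&(c, _)| c == ascii_char)?;
    Some(format!(
        "unicode character '{}' ({}) looks much like '{}' ({}), but it's not",
        ch, u_name, ascii_char, ascii_name
    ))
}
```

Mapping both curly quotes to a single `"` is a deliberate simplification here: a one-character suggestion is easier to act on than the two-apostrophe sequence.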

@Manishearth
Member

The universe is starting to hit and appreciate this feature https://twitter.com/joeranweiler/status/678691374292590593 :D

bors added a commit that referenced this pull request May 3, 2016
Add more aliases for Unicode confusable chars

Building upon #29837, this PR:

* added aliases for space characters,
* distinguished square brackets from parens, and
* added common CJK punctuation characters as aliases.

This will especially help CJK users who may have forgotten to switch off IME when coding.
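
The IME scenario can be sketched as a scan over one source line that reports every confusable character together with the ASCII character it probably stands for. The entries below are illustrative picks from the three groups listed above and are not copied from the follow-up PR; `ALIASES` and `find_confusables` are hypothetical names.

```rust
/// (confusable character, suggested ASCII replacement); illustrative subset.
const ALIASES: &[(char, char)] = &[
    ('\u{3000}', ' '), // IDEOGRAPHIC SPACE
    ('\u{ff08}', '('), // FULLWIDTH LEFT PARENTHESIS
    ('\u{3010}', '['), // LEFT BLACK LENTICULAR BRACKET
    ('\u{ff1b}', ';'), // FULLWIDTH SEMICOLON
    ('\u{3002}', '.'), // IDEOGRAPHIC FULL STOP
];

/// Returns (character offset, confusable char, suggested ASCII char)
/// for every confusable character found in `line`.
fn find_confusables(line: &str) -> Vec<(usize, char, char)> {
    line.chars()
        .enumerate()
        .filter_map(|(i, ch)| {
            ALIASES
                .iter()
                .find(|&&(u, _)| u == ch)
                .map(|&(_, ascii)| (i, ch, ascii))
        })
        .collect()
}
```

A line typed with the IME still on, such as `let x = 1` followed by a fullwidth semicolon, yields a single hit pointing at the last character with `;` as the suggestion.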
bors added a commit that referenced this pull request May 5, 2016
Add more aliases for Unicode confusable chars
nnethercote added a commit to nnethercote/rust that referenced this pull request Dec 14, 2023
It's unclear why this is used here. All entries in the third column of
`UNICODE_ARRAY` are covered by `ASCII_ARRAY`, so if the lookup fails
it's a genuine compiler bug. It was added way back in rust-lang#29837, for no
clear reason.

This commit changes it to `span_bug`, which is more typical.
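
The invariant this commit message relies on (every character in the third column of `UNICODE_ARRAY` also appears in `ASCII_ARRAY`) can be checked mechanically. A minimal sketch, assuming tables of the shape shown earlier in the thread; the entries are an illustrative subset and `uncovered_ascii_chars` is a hypothetical helper, not something from the compiler.

```rust
/// (confusable character, Unicode name, ASCII look-alike); illustrative subset.
const UNICODE_ARRAY: &[(char, &str, char)] = &[
    ('\u{37e}', "Greek Question Mark", ';'),
    ('\u{2010}', "Hyphen", '-'),
];

/// (ASCII character, name); illustrative subset.
const ASCII_ARRAY: &[(char, &str)] = &[(';', "Semicolon"), ('-', "Hyphen-Minus")];

/// Returns the third-column characters of `unicode` that have no entry in
/// `ascii`. An empty result means a lookup can never fail for these tables.
fn uncovered_ascii_chars(unicode: &[(char, &str, char)], ascii: &[(char, &str)]) -> Vec<char> {
    unicode
        .iter()
        .map(|&(_, _, ascii_char)| ascii_char)
        .filter(|ascii_char| !ascii.iter().any(|&(c, _)| c == *ascii_char))
        .collect()
}
```

If the result is empty for the real tables, a failed lookup can only mean the tables were edited inconsistently, which is the kind of condition `span_bug` is meant for.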
Successfully merging this pull request may close these issues.

Better Error Message When Parsing Greek Question Mark (and similar confusing characters)
8 participants