Detect confusing unicode characters and show the alternative... #29837

wafflespeanut · 2015-11-14T21:30:45Z

rust-highfive · 2015-11-14T21:31:01Z

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @nikomatsakis (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. The way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

wafflespeanut · 2015-11-14T21:36:34Z

@Manishearth So, here's some rough artwork for you :)

"rough", because I haven't checked it yet (just wanted to show the progress). I suppose we should also add tests for this?

wafflespeanut · 2015-11-14T21:37:52Z

Ah! limited to 100 chars (I thought it shared Servo's 120 chars limit)...

Manishearth · 2015-11-14T21:38:56Z

src/libsyntax/parse/lexer/unicode_chars.rs

@@ -0,0 +1,156 @@
+const ASCII_ARRAY: &'static [(char, &'static str)] = &[('_', "Low Line"), ('-', "Hyphen-Minus"),


This needs the MPL header

Manishearth · 2015-11-14T23:20:26Z

Also, yes, there should be some parse-fail tests for this.

Manishearth · 2015-11-15T07:32:30Z

src/libsyntax/parse/lexer/unicode_chars.rs

+    ('}', "Right Curly Brace"),
+    ('*', "Asterisk"),
+    ('/', "Slash"),
+    ('\\', "Back Slash"),


single word

wafflespeanut · 2015-11-15T13:05:05Z

@Manishearth I've made the changes, and I've gone for a variation of what you'd suggested (I've used StringReader in our checking method, which emits stuff based on the situation). Also, I need some clarification regarding the test. In the test, I'm checking for an error that rustc was already emitting before the change. Since I've only added the help comment along with it, is the test really necessary? Or should we test this in a different way?

Manishearth · 2015-11-15T19:54:19Z

src/test/parse-fail/unicode-chars.rs

+
+fn main() {
+    let y = 0;
+    //~^ ERROR unknown start of token: \u{37e}


put the help message here too

compile-fail tests don't require all helps and notes to be listed, but if you do list a help or note, and the program fails to emit it, the test will fail.

wafflespeanut · 2015-11-16T05:11:46Z

@Manishearth r?

Manishearth · 2015-11-16T08:06:17Z

src/libsyntax/parse/lexer/unicode_chars.rs

+    .map(|idx| {
+        let (_, u_name, ascii_char) = UNICODE_ARRAY[idx];
+        let span = make_span(reader.last_pos, reader.pos);
+        match ASCII_ARRAY.iter().position(|&(c, _)| c == ascii_char) {


use .find, not .position.

Manishearth · 2015-11-16T08:10:52Z

LGTM, small nits involving style.

For future reference, you should rarely be indexing arrays and the like in Rust. Most of the time you should use iterators (.position+indexing doesn't count 😄 ), and iterators are safer in that they can't cause additional panics due to out-of-bounds indexing.
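As a rough illustration of the .find versus .position point above, here is a minimal sketch. It assumes a trimmed, hypothetical copy of ASCII_ARRAY (the table in the PR is much longer), and the function names `name_by_position` and `name_by_find` are made up for this example.

```rust
// Hypothetical miniature of the lookup table; entries copied from the diff above.
const ASCII_ARRAY: &[(char, &str)] = &[
    ('_', "Low Line"),
    ('-', "Hyphen-Minus"),
    ('*', "Asterisk"),
    ('/', "Slash"),
];

/// Index-based lookup: `.position` yields an index, which is then used to
/// index back into the array (the pattern the review asks to avoid).
fn name_by_position(ch: char) -> Option<&'static str> {
    ASCII_ARRAY
        .iter()
        .position(|&(c, _)| c == ch)
        .map(|idx| ASCII_ARRAY[idx].1)
}

/// Iterator-based lookup: `.find` hands back the matching entry directly,
/// so there is no separate indexing step that could go out of bounds.
fn name_by_find(ch: char) -> Option<&'static str> {
    ASCII_ARRAY
        .iter()
        .find(|&&(c, _)| c == ch)
        .map(|&(_, name)| name)
}

fn main() {
    // Both approaches agree; the second simply never touches an index.
    assert_eq!(name_by_position('-'), name_by_find('-'));
    assert_eq!(name_by_find('/'), Some("Slash"));
    assert_eq!(name_by_find('?'), None);
}
```

Both functions return the same results; the .find version just removes the intermediate index, which is the point of the review comment.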

nikomatsakis · 2015-11-16T22:46:45Z

This patch looks pretty decent. I second @Manishearth's suggestions. Also, note that there is a tidy error because some of the lines in the parse-fail test are more than 100 characters. You can add a comment like // ignore-tidy-linelength on that file.

wafflespeanut · 2015-11-17T06:31:05Z

@nikomatsakis @Manishearth Agreed, thanks! (and done). r?

Manishearth · 2015-11-17T06:34:23Z

src/libsyntax/diagnostic.rs

@@ -174,6 +174,9 @@ impl SpanHandler {
        self.handler.emit(Some((&self.cm, sp)), msg, Bug);
        panic!(ExplicitBug);
    }
+    pub fn span_bug_no_panic(&self, sp: Span, msg: &str) {
+        self.handler.emit(Some((&self.cm, sp)), msg, Bug);
+    }


I forgot to mention, can we add self.handler.bump_err_count(); here too?

Manishearth · 2015-11-17T07:16:51Z

@bors r+

thanks!

bors · 2015-11-17T07:16:52Z

📌 Commit 7f63c7c has been approved by Manishearth

wafflespeanut · 2015-11-17T07:18:38Z

@Manishearth Thank you! :)

Havvy · 2015-11-17T07:21:27Z

❤️ 💓 ❤️

fixes #25957

bors · 2015-11-17T07:42:04Z

⌛ Testing commit 7f63c7c with merge 1b26148...

bors · 2015-11-17T09:31:09Z

☀️ Test successful - auto-linux-32-nopt-t, auto-linux-32-opt, auto-linux-64-debug-opt, auto-linux-64-nopt-t, auto-linux-64-opt, auto-linux-64-x-android-t, auto-linux-cross-opt, auto-linux-musl-64-opt, auto-mac-32-opt, auto-mac-64-nopt-t, auto-mac-64-opt, auto-win-gnu-32-nopt-t, auto-win-gnu-32-opt, auto-win-gnu-64-nopt-t, auto-win-gnu-64-opt, auto-win-msvc-32-opt, auto-win-msvc-64-opt

Manishearth · 2015-11-17T09:35:21Z

😀 Congrats on your first PR!

brson · 2015-11-17T20:34:42Z

Nice polish.

huonw · 2015-11-17T23:13:43Z

Is there a reason this doesn't include U+201C LEFT DOUBLE QUOTATION MARK “ and U+201D RIGHT DOUBLE QUOTATION MARK ” as possible substitutions for "? (If not, I can submit a patch to add them.)

Manishearth · 2015-11-17T23:15:26Z

No reason. It contains the single quotes. I did Ctrl-F for those, but didn't bother to check the double quotes.

Go ahead!

wafflespeanut · 2015-11-18T04:48:01Z

It's again worth mentioning that this still doesn't have all the substitutions - only the printable ones from http://www.unicode.org/Public/security/revision-06/confusables.txt. So, feel free to add more :)

huonw · 2015-11-18T06:09:14Z

Oh, I think I see why QUOTATION MARK was missed: things (including QUOTATION MARK itself) are considered confusable with the two-character sequence APOSTROPHE, APOSTROPHE rather than with ".

cc #29837 (comment)

Manishearth · 2015-12-20T22:08:08Z

The universe is starting to hit and appreciate this feature https://twitter.com/joeranweiler/status/678691374292590593 :D

Add more aliases for Unicode confusable chars

Building upon #29837, this PR:

* added aliases for space characters,
* distinguished square brackets from parens, and
* added common CJK punctuation characters as aliases.

This will especially help CJK users who may have forgotten to switch off IME when coding.

It's unclear why this is used here. All entries in the third column of `UNICODE_ARRAY` are covered by `ASCII_ARRAY`, so if the lookup fails it's a genuine compiler bug. It was added way back in rust-lang#29837, for no clear reason. This commit changes it to `span_bug`, which is more typical.
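The invariant described above (every third-column entry of `UNICODE_ARRAY` has a matching entry in `ASCII_ARRAY`) can be sketched roughly as follows. This is an illustration under assumptions, not the compiler's code: the tables are trimmed to three hypothetical rows, `suggest` is a made-up helper, the message wording is invented, and a plain `expect` stands in for the compiler's bug-reporting path.

```rust
// Hypothetical, heavily trimmed tables.
// UNICODE_ARRAY rows: (confusable char, its Unicode name, ASCII look-alike).
const UNICODE_ARRAY: &[(char, &str, char)] = &[
    ('\u{37e}', "Greek Question Mark", ';'),
    ('\u{2212}', "Minus Sign", '-'),
    ('\u{201c}', "Left Double Quotation Mark", '"'),
];

// ASCII_ARRAY rows: (ASCII char, its name).
const ASCII_ARRAY: &[(char, &str)] = &[
    (';', "Semicolon"),
    ('-', "Hyphen-Minus"),
    ('"', "Quotation Mark"),
];

/// Returns a help message for a confusable character, or `None` if the
/// character is not in the table. The message wording is illustrative only.
fn suggest(ch: char) -> Option<String> {
    let &(_, u_name, ascii_char) = UNICODE_ARRAY.iter().find(|&&(c, _, _)| c == ch)?;
    // If this second lookup fails, the tables themselves are inconsistent,
    // which is a bug in the tables rather than a user error.
    let &(_, ascii_name) = ASCII_ARRAY
        .iter()
        .find(|&&(c, _)| c == ascii_char)
        .expect("every ASCII look-alike should have a name entry");
    Some(format!(
        "'{}' ({}) looks like '{}' ({})",
        ch, u_name, ascii_char, ascii_name
    ))
}

fn main() {
    // The coverage invariant: every look-alike in the third column is named.
    assert!(UNICODE_ARRAY
        .iter()
        .all(|&(_, _, a)| ASCII_ARRAY.iter().any(|&(c, _)| c == a)));

    assert_eq!(suggest('a'), None);
    println!("{:?}", suggest('\u{37e}'));
}
```

Because the coverage check holds for the tables, the `expect` branch is unreachable for any input, which is why a failed lookup there would indicate a bug rather than bad user input.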

rust-highfive assigned nikomatsakis Nov 14, 2015

Manishearth reviewed Nov 14, 2015

wafflespeanut force-pushed the unicode_chars branch 3 times, most recently from d5a4945 to 25a86fa Compare November 15, 2015 07:31

Manishearth reviewed Nov 15, 2015

wafflespeanut force-pushed the unicode_chars branch from 25a86fa to 39e6bfa Compare November 15, 2015 12:58

Manishearth reviewed Nov 15, 2015

wafflespeanut force-pushed the unicode_chars branch from 39e6bfa to 56647c3 Compare November 16, 2015 05:10

Manishearth reviewed Nov 16, 2015

wafflespeanut force-pushed the unicode_chars branch from 56647c3 to c2c416c Compare November 17, 2015 06:30

Manishearth reviewed Nov 17, 2015

Detect confusing unicode characters and show the alternative

7f63c7c

wafflespeanut force-pushed the unicode_chars branch from c2c416c to 7f63c7c Compare November 17, 2015 07:05

bors added a commit that referenced this pull request Nov 17, 2015

Auto merge of #29837 - Wafflespeanut:unicode_chars, r=Manishearth

1b26148

fixes #25957

bors merged commit 7f63c7c into rust-lang:master Nov 17, 2015

wafflespeanut deleted the unicode_chars branch November 17, 2015 09:52

brson added the relnotes label Nov 17, 2015

huonw mentioned this pull request Nov 17, 2015

Add some unicode aliases for ". #29902

Merged

bors added a commit that referenced this pull request Nov 18, 2015

Auto merge of #29902 - huonw:smart-quotes, r=alexcrichton

28f6b88

cc #29837 (comment)

xen0n mentioned this pull request Apr 21, 2016

Add more aliases for Unicode confusable chars #33128

Merged

thiagoarrais mentioned this pull request Oct 26, 2021

Check for "almost quote" hedyorg/hedy#1100

Closed
