Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warn on Fullwidth Exclamation Mark (U+FF01) in comment #134810

Open
dtolnay opened this issue Dec 27, 2024 · 8 comments
Open

Warn on Fullwidth Exclamation Mark (U+FF01) in comment #134810

dtolnay opened this issue Dec 27, 2024 · 8 comments
Labels
A-lints Area: Lints (warnings about flaws in source code) such as unused_mut. A-Unicode Area: Unicode C-feature-request Category: A feature request, i.e: not implemented / a PR. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue.

Comments

@dtolnay
Copy link
Member

dtolnay commented Dec 27, 2024

//! A
//!B
//! C

For the above Rust source code, rustdoc produces documentation containing only "A C". The line containing "B" is only a comment. The comment starts with https://www.compart.com/en/unicode/U+FF01.

I caught this in a real-world documentation PR from a Chinese contributor. See #134241 (review).

In GitHub, the distinction is practically invisible.

In my text editor, it is more obvious.

Would it be reasonable for rustc and/or rustdoc to have reported a lint on such code?

@dtolnay dtolnay added A-lints Area: Lints (warnings about flaws in source code) such as unused_mut. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. labels Dec 27, 2024
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Dec 27, 2024
@dtolnay
Copy link
Member Author

dtolnay commented Dec 27, 2024

@workingjubilee
Copy link
Member

workingjubilee commented Dec 27, 2024

Yes, we should either treat the characters identically (normalizing both to match //!) or lint on this. This character is emitted because someone starts writing something like

//! それから

and then notices they're writing using a CJK input method which will preferentially emit fullwidth characters, even when using "English" punctuation inputs. They then switch linguistic gears (or at least, those of their HID/IME), back up, and write,

//! Therefore,

If they don't delete the entire line, they leave behind the original fullwidth character.

I'm not aware of a fullwidth /, or at least, common input methods don't use it, so the original comment token is left intact.

@workingjubilee workingjubilee added A-Unicode Area: Unicode T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue. labels Dec 27, 2024
@est31
Copy link
Member

est31 commented Dec 27, 2024

We do error for the fullwidth exclamation mark if we encounter it in normal Rust code:

fn foo() -> ! {
    let v = „hi“;
}
output (expand)
error: unknown start of token: \u{ff01}
 --> src/lib.rs:1:13
  |
1 | fn foo() -> ! {
  |             ^^
  |
help: Unicode character '!' (Fullwidth Exclamation Mark) looks like '!' (Exclamation Mark), but it is not
  |
1 | fn foo() -> ! {
  |             ~

error: unknown start of token: \u{201e}
 --> src/lib.rs:2:13
  |
2 |     let v = „hi“;
  |             ^

error: unknown start of token: \u{201c}
 --> src/lib.rs:2:16
  |
2 |     let v = „hi“;
  |                ^
  |
help: Unicode character '“' (Left Double Quotation Mark) looks like '"' (Quotation Mark), but it is not
  |
2 |     let v = „hi";
  |                ~

error[E0425]: cannot find value `hi` in this scope
 --> src/lib.rs:2:14
  |
2 |     let v = „hi“;
  |              ^^ not found in this scope

error[E0308]: mismatched types
 --> src/lib.rs:1:13
  |
1 | fn foo() -> ! {
  |    ---      ^^ expected `!`, found `()`
  |    |
  |    implicitly returns `()` as its body has no tail or `return` expression
  |
  = note:   expected type `!`
          found unit type `()`
But there is no error or lint here. I don't see any existing unicode lint have this in its scope (say the ones [here](https://doc.rust-lang.org/stable/nightly-rustc/src/rustc_lint/non_ascii_idents.rs.html)).

We could maybe make a specialized lint like how there is text_direction_codepoint_in_comment.

This also gives an error btw:

fn foo() {
    /∗∗/
}
output (expand)
error: unknown start of token: \u{2217}
 --> src/lib.rs:2:6
  |
2 |     /∗∗/
  |      ^^
  |
  = note: character appears once more
help: Unicode character '∗' (Asterisk Operator) looks like '*' (Asterisk), but it is not
  |
2 |     /**/
  |      ~~

This doesn't lint either and on a first glance it looks like hello isn't commented out (but it is).

/*
some comment here
∗/
fn hello() {}
/*
some comment here
*/

There is no lint here either (I don't have a chinese keyboard, maybe one can hit this when trying for `?):

/*!
```
assert_eq!(true, false);
```
*/

or here (less visible and occurs once in the wild):

/*!
ˋˋˋ
assert_eq!(true, false);
ˋˋˋ
*/

/ has a bunch of confusables too, so this is not a doc comment:

//⁄ hello
fn foo() {}

Maybe one could make a confusables_in_comments lint? Maybe doc_comment_confusables?

@workingjubilee
Copy link
Member

workingjubilee commented Dec 27, 2024

There is no lint here either (I don't have a chinese keyboard, maybe one can hit this when trying for `?):

While specialty keyboards for these languages do exist, using them is actually... er... a specialization? Like steganography keyboards are to English. These input methods are mostly software-driven, so people often use perfectly ordinary QWERTY keyboards, and the "switch input method" keys that are sometimes present (in e.g. the 109 key layout) can be replicated by chorded commands using some combination of the Shift, Ctrl, Alt, and Fn keys.

@jieyouxu
Copy link
Member

jieyouxu commented Dec 27, 2024

So while unlikely, doesn't doing so technically warn/lint on "valid" comments? E.g. what if I happen to intentionally write a non-ASCII exclamation mark? Likely more unsual (but putting on an adversarial user's hat)

//难道正常的感叹号就不存在吗?当然有
//!左边这个不就是吗?

(The punctuation being formatted like this is not usual, but it is "technically" valid)

I'd be personally in favor of adding such a warning, just thinking if it can be broken (specifically false positives).

@Noratrieb Noratrieb removed the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Dec 27, 2024
@Noratrieb
Copy link
Member

"proper" formatting of comments puts a space after the //, which resolves the problem. I don't think this can have false positives, at worst it will complain about your comment formatting and you can fix it

@workingjubilee
Copy link
Member

workingjubilee commented Dec 27, 2024

Yes, this would become something like a nonstandard_style lint for comments: we deliberately adopt the opinion that you should format your comment differently if that case would occur.

Do we wish to try for rustfmt support for this? Can we?

@est31
Copy link
Member

est31 commented Dec 28, 2024

something like a nonstandard_style lint for comments

There is precedent of style lints protecting from bugs/misleading code: lints from the nonstandard_style group give readers clarity what FOOBAR => ... and foobar => ... mean in a match statement for example. In fact, these lints are the only thing saving people from typos here: if you type FOBAR (new binding) where you meant FOOBAR (existing constant), you only get a non_snake_case lint.

However, there style lints aren't giving a clear signal that there might be a potential bug hiding.

So that's why I think there should be two new lints: confusables_in_comments and doc_comment_confusables, where confusables of syntax with semantics in (doc)comments are linted for. This is a more restricted amount of code and we can clearly state what the syntax can be confused for.

Right now rustdoc lints if there is [foobar] in a doc comment and it can't be resolved, but we don't say anything if the fullwidth twin is present [foobar]. People might have wanted to create a link. And if they didn't want to, they can still use the various methods to escape it.

Stuff like that or stuff like ∗/ in block comments isn't well caught by the style umbrella.

Do we wish to try for rustfmt support for this? Can we?

IMO converting //foo to // foo everywhere should not be part of rustc, but of rustfmt. It's not a bad idea to do though.

Technically, I suppose in 2027 one could put it into a new style edition.

Practically, there is a lot of code out there that uses // without spaces, and I'm not sure that the style team will want a change that has so much diff.

I think one can pursue both projects (rustfmt changes plus confusables lints), they don't preclude each other.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-lints Area: Lints (warnings about flaws in source code) such as unused_mut. A-Unicode Area: Unicode C-feature-request Category: A feature request, i.e: not implemented / a PR. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-rustdoc Relevant to the rustdoc team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

6 participants