Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Debug printing of combining characters is wrong #41922

Closed
clarfonthey opened this issue May 11, 2017 · 10 comments
Closed

Debug printing of combining characters is wrong #41922

clarfonthey opened this issue May 11, 2017 · 10 comments
Labels
A-Unicode Area: Unicode C-bug Category: This is a bug. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@clarfonthey
Copy link
Contributor

clarfonthey commented May 11, 2017

Minimal example:

fn main() {
    let s = "e\u{301}";
    println!("str: {:?}", s);
    println!("bytes: {:?}", s.chars().collect::<Vec<_>>());
}

(playground link)

Expected output is either:

str: "é"
bytes: ['e', '\u{301}']

Or:

str: "é"
bytes: ['e', '◌́']

Actual output:

str: "é"
bytes: ['e', '́']

Note that the combining accent prints over the single quote. This is confusing and shouldn't happen.

@clarfonthey
Copy link
Contributor Author

clarfonthey commented May 11, 2017

cc @tbu- who made the change to debug printing and @alexcrichton who approved it

@tbu-
Copy link
Contributor

tbu- commented May 11, 2017

Python seems to do the same thing.

>>> '\u0301'
'́'

@tbu-
Copy link
Contributor

tbu- commented May 11, 2017

@clarcharr That is, do you know some implementation we could copy?

@clarfonthey
Copy link
Contributor Author

clarfonthey commented May 11, 2017

@tbu- not that I can think of; the current way seems wrong, though. Perhaps we could just check if a character is within the combining character range?

I found this and it probably could help: http://stackoverflow.com/a/17052803

Perhaps we could make a similar script?

@clarfonthey
Copy link
Contributor Author

Also got some help on Twitter for this:

https://twitter.com/FakeUnicode/status/862798986238873601

@Mark-Simulacrum Mark-Simulacrum added A-Unicode Area: Unicode T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Jun 22, 2017
@Mark-Simulacrum Mark-Simulacrum added the C-bug Category: This is a bug. label Jul 27, 2017
@behnam
Copy link
Contributor

behnam commented Aug 11, 2017

I would not consider this a bug, as it's common behavior to not touch or change Unicode characters when printed out to stdout or a file, specially when it's for debug mode.

One reason to not do this is the fact that it can easily mislead the user. Let's say I got the output and copy-pasted the output in a Unicode decoder, to see what character we have in the spot. I will see two codepoints in the decoder, one of which had not existed in the original string.

So, IMHO, there are pros in doing so, specially nicer-looking output, but the main con being the Debug output not telling you the truth, which is very unfortunate, specially since there will be almost no work around it! I think it's better to keep these fancy features for the high-level parts of a stack, like Display, instead of Debug.

If Rust wants to do anything special about these characters, the filter would be GC=Mn (Nonspacing_Mark). But, it should be noted that this would mean the result would depend on the Unicode version of the compiler, and newly assigned characters won't get the special treatment until the internal Unicode data of Rust gets updated.

That said, I think we also need to take a look at what other modern Unicode-savvy languages, like Swift, are doing in this area, before making a decision.

@varkor
Copy link
Member

varkor commented Mar 22, 2018

For reference, in Swift:

let str = "e\u{301}";
// Array of unicode scalars, equivalent to Rust's chars
print("\(Array(str.unicodeScalars))"); // ["e", "\u{0301}"]
// Array of unicode scalars converted into strings
print("\(Array(str.unicodeScalars).map({ String.init($0) }))"); // ["e", "́"]

Swift opts to print code points for unicode scalars (but when converted to strings they display as in Rust).
This seems like reasonable behaviour (@clarcharr's first suggestion).

@varkor
Copy link
Member

varkor commented Mar 22, 2018

This seems to have been deliberately changed to the current output as a result of #24588.

@clarfonthey
Copy link
Contributor Author

I still just think that checking if the character is combining and then escaping if it's by itself is the best option.

@varkor
Copy link
Member

varkor commented Mar 22, 2018

Oh, I see: you already mentioned the earlier change! I agree: this would make sense for combining characters. The range described on Wikipedia should probably be sufficient?

bors added a commit that referenced this issue May 21, 2018
Escape combining characters in char::Debug

Although combining characters are technically printable, they make little sense to print on their own with `Debug`: it'd be better to escape them like non-printable characters.

This is a breaking change, but I imagine the fact `escape_debug` is rare and almost certainly primarily used for debugging that this is an acceptable change.
Resolves #41922.

r? @alexcrichton
cc @clarcharr
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-Unicode Area: Unicode C-bug Category: This is a bug. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

5 participants