Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escape fewer Unicode codepoints in Debug impl of str #34485

Merged
merged 5 commits into from
Jul 28, 2016

Conversation

tbu-
Copy link
Contributor

@tbu- tbu- commented Jun 26, 2016

Use the same procedure as Python to determine whether a character is
printable, described in PEP 3138. In particular, this means that the
following character classes are escaped:

  • Cc (Other, Control)
  • Cf (Other, Format)
  • Cs (Other, Surrogate), even though they can't appear in Rust strings
  • Co (Other, Private Use)
  • Cn (Other, Not Assigned)
  • Zl (Separator, Line)
  • Zp (Separator, Paragraph)
  • Zs (Separator, Space), except for the ASCII space ' ' 0x20

This allows for user-friendly inspection of strings that are not
English (e.g. compare "\u{e9}\u{e8}\u{ea}" to "éèê").

Fixes #34318.
CC #34422.

@rust-highfive
Copy link
Collaborator

r? @brson

(rust_highfive has picked a reviewer for you, use r? to override)

@sfackler
Copy link
Member

Looks like run-pass/ifmt.rs is failing on travis.

@ollie27
Copy link
Member

ollie27 commented Jun 26, 2016

Is this changing char::escape_default? That would be a breaking change.

@tbu-
Copy link
Contributor Author

tbu- commented Jun 26, 2016

It is changing that function. Why is it a breaking change?

@ollie27
Copy link
Member

ollie27 commented Jun 26, 2016

It's a stable function and this will break people's code which relies on the current behaviour.

@steveklabnik
Copy link
Member

/cc @rust-lang/libs and @rust-lang/lang

On Jun 26, 2016, 18:04 -0400, Oliver Middletonnotifications@github.com, wrote:

It's a stable function and this will break people's code which relies on the current behaviour.


You are receiving this because you are subscribed to this thread.
Reply to this email directly,view it on GitHub(#34485 (comment)), ormute the thread(https://github.com/notifications/unsubscribe/AABsijjdWpFL-HoADL8BmwCH9hBOw61Qks5qPveAgaJpZM4I-joR).

@ranma42
Copy link
Contributor

ranma42 commented Jun 27, 2016

The documentation explicitly states that any character that is not in the printable ASCII range 0x20 .. 0x7e will be escaped (some characters with ad-hoc sequences, the other ones with hexadecimal Unicode escapes), so this change would at the very least need to change the documentation of the function to match the new behaviour.
This is probably undesirable as the API is stable and the current specification guarantees that the output is only composed of ASCII characters, while the new implementation has much looser and complex guarantees (for example, it would be hard to provide an accurate specification without listing the Unicode character classes that are not escaped).

@BurntSushi
Copy link
Member

@ranma42 Agreed. I don't think we can change this. It's very clearly changing the contract of the function.

@tbu-
Copy link
Contributor Author

tbu- commented Jun 27, 2016

@ranma42 OK, assume for now that we don't change that function, but only the Debug implementation of str.

@eddyb
Copy link
Member

eddyb commented Jun 27, 2016

Didn't these exact conversations happen before? Was a previous attempt abandoned?

@tbu-
Copy link
Contributor Author

tbu- commented Jun 27, 2016

I don't know, I don't remember any.

@eddyb
Copy link
Member

eddyb commented Jun 27, 2016

I can't find it. Might have been char instead. Don't mind me.

@ranma42
Copy link
Contributor

ranma42 commented Jun 27, 2016

@tbu- , yes, I think that should work. We might also want to expose it as a function on char (escape_nonprintable?), but I am unsure if this should be done directly or if we should go for an RFC to discuss the details (which Debug impls should escape non printable chars and which ones should escape everything? should we expose the API on char? under what name? are there other conventions besides the one established by Python that might be worth evaluating?).

@brson brson added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. I-nominated labels Jun 27, 2016
@brson
Copy link
Contributor

brson commented Jun 27, 2016

Not sure what I think of this.

The medium that Debug writes to is a byte sequence with no guarantees that it supports Unicode. Certainly most places the output would end up would speak UTF-8 but definitely not all. I don't recall the motivation for escaping Debug so aggressively, but it seems like the possibility of terminals not understanding it must be one reason.

@tbu-
Copy link
Contributor Author

tbu- commented Jun 28, 2016

Actually, the output of Debug is statically guaranteed to be UTF-8.

I would guess that the reason for this is that the people implementing it didn't need non-ASCII characters, and I mean if you don't need them they're just a nuisance. But if you're implementing a non-English program, then it basically makes the Debug implementation useless (see the linked issue in which @liigo shows the unreadable "file not found" message in Chinese.

If you write to a device that doesn't support UTF-8, you should just escape these characters later, when writing to said device -- like the ascii function in Python. The other way around doesn't work.

@ranma42
Copy link
Contributor

ranma42 commented Jun 28, 2016

But if you're implementing a non-English program, then it basically makes the Debug implementation useless

A possible objection is that

Debug should format the output in a programmer-facing, debugging context.

which seems to imply that it should not be exposed to the users, but rather to tools or developers.
For example, one could say that formatting errors using Debug is not correct if they are going to be shown to the user.

@tbu-
Copy link
Contributor Author

tbu- commented Jun 28, 2016

@ranma42 The linked example for a programmer, @liigo.

@ranma42
Copy link
Contributor

ranma42 commented Jun 28, 2016

@tbu- not really, in that case he is the user of the rustc program.

@tbu-
Copy link
Contributor Author

tbu- commented Jun 29, 2016

@ranma42 It's a runtime error provided by the operating system, encountered while programming.

EDIT: Also, you could probably look into the PEP, they also give a longer motivation in there.

@liigo
Copy link
Contributor

liigo commented Jun 29, 2016

Debug should format the output in a programmer-facing, debugging context.

I'd rather they don't do these (mostly useless) format/escape for me (a programmer). They (self-righteous people) thought these format may help programmers in debugging context, but it's not true. In some times, non-English programmers got unreadable formatted output, worse than Display output. I don't use very-old output devices, which maybe have problems to display Unicode characters, which really need escape.

@tbu- Maybe we can't change escape_default for compatible reasons, but changing impl Debug for str is just implementation details IMHO.

@ranma42
Copy link
Contributor

ranma42 commented Jun 29, 2016

@tbu- It is a runtime error provided by the operating system, encountered by rustc and dumped on the console in a bad way. What I am saying is that it is not obvious to me that the correct way to display it is using Debug instead of Display.

@liigo
Copy link
Contributor

liigo commented Jun 29, 2016

@brson: it seems like the possibility of terminals not understanding it must be one reason.

If this was the reason, we should also implement println! to escape Unicode characters.

@liigo
Copy link
Contributor

liigo commented Jun 29, 2016

@ranma42: to display it is using Debug instead of Display

That'd be a big breaking change. Not all Debug types implement Display.

@ranma42
Copy link
Contributor

ranma42 commented Jun 29, 2016

@liigo Yes, that would be a major breaking change (it would change the constraints on the Result type so that E: fmt:Display).

@ranma42
Copy link
Contributor

ranma42 commented Jun 29, 2016

To me, the major advantage of the current implementation of Debug for str (always escaping non-ASCII) is that it allows to distinguish between different strings that have the same visual appearance. This applies in the same way to whitespace ('\t' vs spaces), Unicode combining characters and control characters... and maybe more?

Of course this does not mean that it should be used for everything. Specifically, I would only use Debug when the representation of the data is relevant, not when it just needs to be shown to the user. Basically this reasoning is my understanding of the documentation of Debug and Display (programmer-facing vs user-facing). This can of course be incorrect, but in that case I would also ask for an improvement in the documentation ;)

#34318 shows an example where using Debug looks inappropriate (an OS-provided localised error). A partial solution that would have limited impact on other things would be to avoid escaping only the OS localised errors here. I would not even call this a breaking change, given that the output is already not explicitly stabilised (it obviously depends on the current locale and most likely on the operating system + version).

Even though Rust does not (yet) have its own localised error messages, it would not be hard to imagine the same issue affecting other types of output, so it might be a good idea to think of a more general solution to ensure a way forward in this direction.

@tbu-
Copy link
Contributor Author

tbu- commented Jun 29, 2016

@ranma42 If you want to see the exact code points, why only make an exception for English? That's very English-centric. :) EDIT: Imagine the Debug implementation showed an escape sequence for every latin character instead of the actual character. That would be quite a pain to program with, right? That's what it's like if the program deals with, say Chinese strings.

We should probably provide a function that does the same as Debug today, like Python has: ascii. This basically solves the problem. If you need a Debug string like today, you can just put the Debug output into said function and you receive what you wanted.

@liigo
Copy link
Contributor

liigo commented Jun 29, 2016

To me, the major advantage of the current implementation of Debug for str (always escaping non-ASCII) is that it allows to distinguish between different strings that have the same visual appearance.

This is not advantage for non-ASCII text. It just makes unreadable noise (\u{xxxx}...). If someone do think this is a major advantage, please let println! have this advantage too.

@tbu-
Copy link
Contributor Author

tbu- commented Jul 27, 2016

@brson Could you @bors r-, there's another test failure in the WTF-8 code.

@alexcrichton
Copy link
Member

@bors: r-

@tbu- tbu- force-pushed the pr_unicode_debug_str branch from f9bf85d to 3d09b4a Compare July 28, 2016 00:21
@tbu-
Copy link
Contributor Author

tbu- commented Jul 28, 2016

@alexcrichton It should be fixed now.

@alexcrichton
Copy link
Member

@bors: r+ 3d09b4a

@alexcrichton alexcrichton added the relnotes Marks issues that should be documented in the release notes of the next release. label Jul 28, 2016
@bors
Copy link
Contributor

bors commented Jul 28, 2016

⌛ Testing commit 3d09b4a with merge d1df3fe...

bors added a commit that referenced this pull request Jul 28, 2016
Escape fewer Unicode codepoints in `Debug` impl of `str`

Use the same procedure as Python to determine whether a character is
printable, described in [PEP 3138]. In particular, this means that the
following character classes are escaped:

- Cc (Other, Control)
- Cf (Other, Format)
- Cs (Other, Surrogate), even though they can't appear in Rust strings
- Co (Other, Private Use)
- Cn (Other, Not Assigned)
- Zl (Separator, Line)
- Zp (Separator, Paragraph)
- Zs (Separator, Space), except for the ASCII space `' '` `0x20`

This allows for user-friendly inspection of strings that are not
English (e.g. compare `"\u{e9}\u{e8}\u{ea}"` to `"éèê"`).

Fixes #34318.
CC #34422.

[PEP 3138]: https://www.python.org/dev/peps/pep-3138/
@bors bors merged commit 3d09b4a into rust-lang:master Jul 28, 2016
SimonSapin added a commit to SimonSapin/rust-std-candidates that referenced this pull request Aug 2, 2016
radix pushed a commit to radix/string-wrapper that referenced this pull request Jan 13, 2017
@SimonSapin
Copy link
Contributor

I’m very late to say this, but this adds 2102 bytes of static data to libcore, whereas previously all large Unicode tables were in the std_unicode crate that #![no_std] programs could opt not to use. 2 KB may not seem like much, but it’s significant when programming a micro-controller that has 16 KB of flash memory.

@aturon
Copy link
Member

aturon commented Feb 3, 2017

@SimonSapin I opened #39492 for this issue, and to propose a general policy.

@ariasuni
Copy link
Contributor

I’d like to know what the code do precisely (what are SINGLETONS0U, SINGLETONS0L, NORMAL0 and check()) so I could try to optimize it, at least in size. Also, this code should to belong to libstd_unicode, right?

@SimonSapin
Copy link
Contributor

@ariasuni The commit message and PR message give a list of Unicode categories of characters that are escape, but yes if it’s not already this list should also be in some doc-comment in the code.

I agree that I’d prefer to have these tables in libstd_unicode, but libcore still needs some impl of Debug for str. There can only be one such impl in a program, I don’t know if it’s possible to make it do something different based on whether libstd_unicode is available.

@tbu-
Copy link
Contributor Author

tbu- commented Jun 23, 2017

@ariasuni The high-level view is that you have to store 0x110000 booleans, one for each Unicode code point, which specify whether the given code point should be printed or not (according to the description in the first post of this PR).

The low-level view of this particular implementation seems to have changed since I have implemented it, you can find some notes on the new one in 44bcd26.

@ariasuni
Copy link
Contributor

ariasuni commented Jun 25, 2017

@SimonSapin We can split libstd_unicode in unicode-casing and unicode-categories, and use only the latter to implement is_printable(). What do you think about it?

@SimonSapin
Copy link
Contributor

@ariasuni even if we did that, that doesn’t solve the situation that a #[no_std] program needs a Debug for str impl and ideally should also be able to opt out of embedding large Unicode tables. If we move is_printable to a separate crate but make libcore depend on that crate, we’ve moved things around without actually changing anything for Rust users. If we add a Cargo feature to disable these including tables (#39492), it doesn’t change much that it’s a separate crate or not.

@ariasuni
Copy link
Contributor

ariasuni commented Jun 25, 2017

Sorry, I probably overthought it. The idea is to put the code for Unicode categories elsewhere so that we can use it in is_printable() even when using #[no_std] (right now without waiting for the Cargo feature), and this would reduce the code size for those who don’t. Not sure about the stability implications of moving code around thought.

@SimonSapin
Copy link
Contributor

I don’t understand how this would reduce code size, I may be missing something.

@ariasuni
Copy link
Contributor

ariasuni commented Jun 25, 2017

If I understand correctly, is_printable() is using Unicode categories data in char_private.rs, instead of the data in libstd_unicode used by is_control() is_alphabetic(), is_numeric(), etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
relnotes Marks issues that should be documented in the release notes of the next release. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.