
Validity of char #74

Closed
RalfJung opened this issue Jan 10, 2019 · 15 comments
Labels
A-validity Topic: Related to validity invariants

Comments

@RalfJung
Member

Discussing the validity invariant of the char type.

The "obvious" choice is that it must be a valid Unicode code point, and must not contain any uninitialized bits.

However, a possible issue with this choice is that this means we will have to extend the set of valid bit patterns whenever new code points get added to Unicode. Is that a problem, e.g. when old and new code interact? On first glance it seems like this will only make fewer programs have UB. (@nikomatsakis I think this is related to your "future proofing" concern that you raised elsewhere. Here might be a good place to discuss it with a concrete example.)

@RalfJung RalfJung added the "active discussion topic" and "A-validity (Topic: Related to validity invariants)" labels on Jan 10, 2019
@nikomatsakis
Contributor

Indeed, good point @RalfJung -- you can certainly imagine some unsafe code that takes advantage of existing unicode definitions and is later invalidated.

@RalfJung
Member Author

Could you give an example? Reducing UB on its own can never cause a problem.

@RalfJung
Member Author

RalfJung commented Jan 11, 2019

Oh, I think I see what you mean. We could imagine unsafe code establishing an isomorphism between Result<char, u16> and u32 by manually using the "niche" in char. If we ever relax the validity invariant of char, then no existing UB-free program will suddenly have UB -- but now there exist valid values for char for which this isomorphism misbehaves.
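Such an isomorphism could be sketched as follows (a hypothetical illustration of the packing described above; the encoding scheme and the function names pack/unpack are my own invention, not anything in std):

```rust
// Hypothetical packing that exploits char's niche: since no char is ever
// above 0x10FFFF, the values 0x110000..=0x11FFFF can encode the Err(u16)
// variant. This is exactly the kind of code that would misbehave if the
// validity invariant of char were ever relaxed.
fn pack(v: Result<char, u16>) -> u32 {
    match v {
        Ok(c) => c as u32,              // valid scalar values: 0..=0x10FFFF
        Err(e) => 0x11_0000 + e as u32, // the "niche" above char's range
    }
}

fn unpack(bits: u32) -> Result<char, u16> {
    if bits <= 0x10_FFFF {
        // Only reached for values that pack() produced from a valid char.
        Ok(char::from_u32(bits).unwrap())
    } else {
        Err((bits - 0x11_0000) as u16)
    }
}
```

If valid chars above 0x10FFFF ever existed, pack would map some Ok(c) and some Err(e) to the same u32, and unpack would silently return the wrong variant.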

I propose the following solution to this problem: we define the validity invariant for char to be implementation-defined. The invariant is something like: a 32-bit integer is a valid char if it is in the range 0x0000..=N and not in the range 0xD800..=0xDFFF. N is implementation-defined, but at least 0x10FFFF. A portable unsafe program has to prove that it behaves safely for any value of N. This way, it cannot use the "niche" at the top end of char to pack other data in there.

This is a lot like how I am thinking about layout: given a type like struct Foo { f1: u8, f2: u32, f3: u8 }, the unsafe program has to behave safely for any possible choice of offsets of the three fields.
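The layout analogy can be made concrete (a sketch using std::mem::offset_of!, stable since Rust 1.77; with the default repr the actual offsets are unspecified and may differ between compiler versions):

```rust
use std::mem::{align_of, offset_of};

struct Foo {
    f1: u8,
    f2: u32,
    f3: u8,
}

fn main() {
    // The compiler is free to reorder these fields; portable unsafe code
    // must behave correctly for any layout the compiler might pick.
    println!("f1 at byte {}", offset_of!(Foo, f1));
    println!("f2 at byte {}", offset_of!(Foo, f2));
    println!("f3 at byte {}", offset_of!(Foo, f3));
    // About the only thing that is guaranteed here is alignment:
    // f2 is a u32, so its offset is a multiple of align_of::<u32>().
    assert_eq!(offset_of!(Foo, f2) % align_of::<u32>(), 0);
}
```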

@nikomatsakis
Contributor

@RalfJung yes that is precisely what I meant. And I agree that is probably a good definition for a char "validity invariant".

@hanna-kruppe

hanna-kruppe commented Jan 12, 2019

I don't think the validity invariant of char should change when new Unicode standards introduce new code points. It has been laid down for over 15 years that (due to UTF-16 limitations) the space of possible code points goes up to U+10FFFF and no further (see e.g. §2.4 in https://www.unicode.org/versions/Unicode11.0.0/ch02.pdf), regardless of how many of those are assigned at the moment. While it's conceivable that some future standard deviates from this, that seems like it would have more far-reaching consequences (killing off the possibility of UTF-16-based processing, changing the definition of "valid UTF-8") than just breaking some hypothetical unsafe Rust code. I expect that if it ever happens, today's languages will largely have to adopt separate new data types for the new "UTF-16-incompatible Unicode" text representation, just as many languages predating Unicode have separate "single-byte" and "multi-byte/Unicode" string types.

@nikomatsakis
Contributor

So, it seems like we should say that a char is .. what? Just any valid u32? I would be in favor of that, personally. =)

(There does seem to perhaps be a safety invariant that a char must represent a valid unicode codepoint.)

@Amanieu
Member

Amanieu commented Jan 31, 2019

Note that we currently emit LLVM metadata which asserts that the range of a char value is always in [0, 0x10FFFF].

@RalfJung
Member Author

RalfJung commented Jan 31, 2019

So, it seems like we should say that a char is .. what? Just any valid u32? I would be in favor of that, personally. =)

Not sure how you got that out of the prior discussion.^^

From what @rkruppe said, I'd conclude the invariant is "a u32 inside 0..=0x10FFFF and not inside 0xD800..=0xDFFF".
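That invariant is already observable on stable Rust: char::from_u32 rejects exactly those values, and the compiler uses the invalid bit patterns as a niche (a quick demonstration, not part of the discussion above):

```rust
fn main() {
    // from_u32 returns Some exactly for 0..=0x10FFFF minus the surrogates:
    assert!(char::from_u32(0xD7FF).is_some()); // last value before surrogates
    assert!(char::from_u32(0xD800).is_none()); // first surrogate
    assert!(char::from_u32(0xDFFF).is_none()); // last surrogate
    assert!(char::from_u32(0xE000).is_some()); // first value after surrogates
    assert!(char::from_u32(0x10FFFF).is_some()); // largest scalar value
    assert!(char::from_u32(0x110000).is_none()); // out of range
    // The invalid patterns serve as a niche: Option<char> needs no extra byte.
    assert_eq!(std::mem::size_of::<Option<char>>(), 4);
}
```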

@CAD97

CAD97 commented Jan 31, 2019

The relevant definitions from the Unicode Glossary:

Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters. See code point type.

Assigned Character.  A code point that is assigned to an abstract character. This refers to graphic, format, control, and private-use characters that have been encoded in the Unicode Standard. (See Section 2.4, Code Points and Characters.)

Designated Code Point. Any code point that has either been assigned to an abstract character (assigned characters) or that has otherwise been given a normative function by the standard (surrogate code points and noncharacters). This definition excludes reserved code points. Also known as assigned code point. (See Section 2.4 Code Points and Characters.)

Reserved Code Point. Any code point of the Unicode Standard that is reserved for future assignment. Also known as an unassigned code point. (See definition D15 in Section 3.4, Characters and Encoding, and Section 2.4, Code Points and Characters.)

Surrogate Code Point. A Unicode code point in the range U+D800..U+DFFF. Reserved for use by UTF-16, where a pair of surrogate code units (a high surrogate followed by a low surrogate) “stand in” for a supplementary code point.

@CAD97

CAD97 commented Feb 1, 2019

Restricting char to 0..=0x10FFFF is consistent with the standard's definition of a Code Point and makes logical sense. Restricting it beyond "0..=0x10FFFF and not 0xD800..=0xDFFF" would mean breaking stable safe code, as all char values that meet those requirements are safe to create today.

I personally don't see the reason for making surrogate code points violate the validity invariant of char. Safety invariant, definitely. They're still valid Code Points, and in fact they are Designated Code Points. It's just that they're designated to never be used as an Assigned Character.

I also fail to see what is actually gained by making this a validity requirement. The niche of 0b11011_NNNNNNNNNNN is weird, and is tiny compared to the niche of any value over 0x10FFFF. I could see some code temporarily putting surrogates into a char, but honestly those should probably be using u16 to work with UTF-16.

I could go either way, but if surrogate code points are forbidden then char can't be simply defined as "a Unicode Code Point".
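The size comparison above can be made concrete (simple arithmetic, not tied to any compiler behavior):

```rust
fn main() {
    // The surrogate gap 0xD800..=0xDFFF holds only 2048 values...
    let surrogate_niche = 0xDFFF_u32 - 0xD800 + 1;
    assert_eq!(surrogate_niche, 0x800);
    // ...while the space above 0x10FFFF in a u32 holds about 4.29 billion.
    let top_niche = u32::MAX - 0x10FFFF;
    assert_eq!(top_niche, 0xFFEF_0000);
    println!("surrogate niche: {surrogate_niche}, top niche: {top_niche}");
}
```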

@RalfJung
Member Author

RalfJung commented Feb 1, 2019

Restricting it beyond 0..=0x10FFFF and not 0xD800..=0xDFFF would mean changing stable safe code, as all char that meet those requirements are safe to create today.

I don't think this is accurate. The following panics:

use std::char;

fn main() {
    let x = char::from_u32(0xD801).unwrap();
}

@CAD97

CAD97 commented Feb 1, 2019

That's not what I intended to state. What I intended to state is that every code point in 0..=0x10FFFF and not in 0xD800..=0xDFFF is a safe char to create today. This includes reserved (unassigned) code points, and at least one comment mentioned valid chars potentially being restricted to assigned code points (maybe unintentionally).

@RalfJung
Member Author

RalfJung commented Feb 1, 2019

Ah sorry, seems like I misunderstood.

I agree that the above is the most restrictive possible validity invariant.

@RalfJung
Member Author

FWIW, the Nomicon says that a char must be in [0x0, 0xD7FF] or [0xE000, 0x10FFFF]: https://doc.rust-lang.org/nightly/nomicon/what-unsafe-does.html

@JakobDegen
Contributor

Closing as resolved
