Validity of char #74
Comments
Indeed, good point @RalfJung -- you can certainly imagine some unsafe code that takes advantage of existing Unicode definitions and is later invalidated. |
Could you give an example? Reducing UB on its own can never cause a problem. |
Oh, I think I see what you mean. We could imagine unsafe code establishing an isomorphism between I propose the following solution to this problem: we define the validity invariant for This is a lot like how I am thinking about layout: given a type like |
@RalfJung yes that is precisely what I meant. And I agree that is probably a good definition for a char "validity invariant". |
I don't think the validity invariant of |
So, it seems like we should say that a (There does seem to perhaps be a safety invariant that a |
Note that we currently emit LLVM metadata which asserts that the range of a |
Not sure how you got that out of the prior discussion.^^ From what @rkruppe said, I'd conclude the invariant is "a |
The relevant definitions from the Unicode Glossary:
Code Point. (1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters. See code point type.
Assigned Character. A code point that is assigned to an abstract character. This refers to graphic, format, control, and private-use characters that have been encoded in the Unicode Standard. (See Section 2.4, Code Points and Characters.)
Designated Code Point. Any code point that has either been assigned to an abstract character (assigned characters) or that has otherwise been given a normative function by the standard (surrogate code points and noncharacters). This definition excludes reserved code points. Also known as assigned code point. (See Section 2.4, Code Points and Characters.)
Reserved Code Point. Any code point of the Unicode Standard that is reserved for future assignment. Also known as an unassigned code point. (See definition D15 in Section 3.4, Characters and Encoding, and Section 2.4, Code Points and Characters.)
Surrogate Code Point. A Unicode code point in the range U+D800..U+DFFF. Reserved for use by UTF-16, where a pair of surrogate code units (a high surrogate followed by a low surrogate) “stand in” for a supplementary code point. |
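To make the glossary terms concrete, here is a minimal sketch that sorts a u32 into "surrogate", "outside the codespace", or "everything else", and cross-checks the result against char::from_u32. The enum CodePointKind and the function classify are hypothetical names introduced for illustration, not part of std. Note that the check does not depend on whether a code point is assigned or reserved, only on the numeric ranges quoted above.

```rust
/// Hypothetical helper type (not part of std): a coarse classification
/// of a u32 based only on the numeric ranges from the Unicode Glossary.
#[derive(Debug, PartialEq, Eq)]
enum CodePointKind {
    /// In 0..=0x10FFFF and not a surrogate.
    Scalar,
    /// In the surrogate range U+D800..=U+DFFF.
    Surrogate,
    /// Above 0x10FFFF, so not a code point at all.
    OutOfRange,
}

/// Hypothetical helper: classify a raw u32.
fn classify(value: u32) -> CodePointKind {
    match value {
        // The surrogate arm comes first because it overlaps the next range.
        0xD800..=0xDFFF => CodePointKind::Surrogate,
        0..=0x10FFFF => CodePointKind::Scalar,
        _ => CodePointKind::OutOfRange,
    }
}

fn main() {
    for v in [0x41, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF, 0x110000] {
        let kind = classify(v);
        // Cross-check: char::from_u32 accepts exactly the non-surrogate
        // values inside the codespace.
        assert_eq!(kind == CodePointKind::Scalar, char::from_u32(v).is_some());
        println!("U+{:04X}: {:?}", v, kind);
    }
}
```

Under this sketch, a reserved (unassigned) code point still counts as Scalar, which matches how char::from_u32 treats it.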
Restricting I personally don't see the reason for making surrogate code points against the validity invariant of I also fail to see what is actually gained by making this a validity requirement. The niche of I could go either way, but if surrogate code points are forbidden then |
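For readers unfamiliar with the niche being discussed, the following is a small sketch of how it can be observed. It only prints sizes. That Option<char> fits in 4 bytes is what current rustc does in practice and is assumed here rather than claimed as a language guarantee.

```rust
use std::mem::size_of;

fn main() {
    // char has the same size as u32.
    assert_eq!(size_of::<char>(), 4);

    // Because some u32 bit patterns are never a valid char, the compiler
    // can reuse one of them to represent None (the "niche").
    println!("char:         {} bytes", size_of::<char>());
    println!("Option<char>: {} bytes", size_of::<Option<char>>());

    // u32 has no invalid bit patterns, so Option<u32> needs extra space.
    println!("Option<u32>:  {} bytes", size_of::<Option<u32>>());
}
```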
I don't think this is accurate. The following panics:
    use std::char;
    fn main() {
        let x = char::from_u32(0xD801).unwrap();
    } |
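The panic above comes from calling unwrap on None: char::from_u32 returns an Option and rejects surrogates. A minimal sketch of handling that case without panicking follows; the function name describe is hypothetical and only used for illustration.

```rust
/// Hypothetical helper: report whether a raw u32 can be turned into a char.
fn describe(value: u32) -> String {
    match char::from_u32(value) {
        // from_u32 returned Some, so the value is a valid char.
        Some(c) => format!("U+{:04X} is a char: {:?}", value, c),
        // from_u32 returned None: a surrogate or a value above 0x10FFFF.
        None => format!("U+{:04X} is not a char", value),
    }
}

fn main() {
    println!("{}", describe(0x41));     // an ordinary letter
    println!("{}", describe(0xD801));   // a surrogate code point
    println!("{}", describe(0x110000)); // outside the Unicode codespace
}
```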
That's not what I intended to state. What I intended to state is that every code point in |
Ah sorry, seems like I misunderstood. I agree that the above is the most restrictive possible validity invariant. |
FWIW, the Nomicon says that a |
Closing as resolved |
Discussing the validity invariant of the char type. The "obvious" choice is that it must be a valid Unicode codepoint, and must not contain any uninitialized bits.
However, a possible issue with this choice is that we will have to extend the set of valid bit patterns whenever new codepoints get added to Unicode. Is that a problem, e.g. when old and new code interact? At first glance it seems like this will only make fewer programs have UB. (@nikomatsakis I think this is related to your "future proofing" concern that you raised elsewhere. Here might be a good place to discuss it with a concrete example.)