Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OsString::from_wide (in Windows OsStringExt) is unsound #72760

Closed
RalfJung opened this issue May 29, 2020 · 9 comments · Fixed by #72683
Closed

OsString::from_wide (in Windows OsStringExt) is unsound #72760

RalfJung opened this issue May 29, 2020 · 9 comments · Fixed by #72683
Labels
C-bug Category: This is a bug. I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness O-windows Operating system: Windows T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@RalfJung
Copy link
Member

The following program causes UB:

use std::os::windows::ffi::{OsStrExt, OsStringExt};
use std::ffi::{OsStr, OsString};

fn main() {
    let base = "a\té \u{7f}💩\r";
    let mut base: Vec<u16> = OsStr::new(base).encode_wide().collect();
    base.push(0xD800);
    let _res = OsString::from_wide(&base);
}

Miri says:

error: Undefined Behavior: type validation failed: encountered 0x0000d800, but expected a valid unicode codepoint
   --> /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/src/libcore/char/convert.rs:102:5
    |
102 |     transmute(i)
    |     ^^^^^^^^^^^^ type validation failed: encountered 0x0000d800, but expected a valid unicode codepoint
    |
    = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
    = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
            
    = note: inside `std::char::from_u32_unchecked` at /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/src/libcore/char/convert.rs:102:5
    = note: inside `std::sys_common::wtf8::Wtf8Buf::push_code_point_unchecked` at /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/src/libstd/sys_common/wtf8.rs:204:26
    = note: inside `std::sys_common::wtf8::Wtf8Buf::from_wide` at /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/src/libstd/sys_common/wtf8.rs:194:21
    = note: inside `<std::ffi::OsString as std::os::windows::ffi::OsStringExt>::from_wide` at /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/src/libstd/sys/windows/ext/ffi.rs:101:44
note: inside `main` at wtf8.rs:8:16
   --> wtf8.rs:8:16
    |
8   |     let _res = OsString::from_wide(&base);
    |                ^^^^^^^^^^^^^^^^^^^^^^^^^^

The problem is this code:

pub fn push(&mut self, code_point: CodePoint) {
if let trail @ 0xDC00..=0xDFFF = code_point.to_u32() {
if let Some(lead) = (&*self).final_lead_surrogate() {
let len_without_lead_surrogate = self.len() - 3;
self.bytes.truncate(len_without_lead_surrogate);
self.push_char(decode_surrogate_pair(lead, trail as u16));
return;
}
}
// No newly paired surrogates at the boundary.
self.push_code_point_unchecked(code_point)
}

This calls push_code_point_unchecked unless the new code point is in 0xDC00..=0xDFFF, but what about surrogates in 0xD800..0xDC00?

This code is unchanged since its introduction in c5369eb. I am not sure what the intended safety contract of push_code_point_unchecked is. That method is not marked unsafe but clearly should be -- it calls char::from_u32_unchecked. So my guess is the safety precondition is that CodePoint must not be part of a surrogate pair, but the thing is, push calls it without actually ensuring that condition. The condition it does ensure is that the codepoint is not in 0xDC00..=0xDFFF, but that does not help.

@RalfJung RalfJung added the I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness label May 29, 2020
@JohnTitor JohnTitor added C-bug Category: This is a bug. O-windows Operating system: Windows T-libs Relevant to the library team, which will review and decide on the PR/issue. labels May 29, 2020
@jonas-schievink jonas-schievink added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label May 30, 2020
@SimonSapin
Copy link
Contributor

0xD800..0xDC00 is excluded from Unicode scalar values a.k.a. char, but it’s fine as a Unicode code point.

I think this bug is not in the contract of push_code_point_unchecked (which is private anyway) but in its implementation, namely going through char. The fix is to either duplicate the logic of char::encode_utf8 into wtf8.rs, or (to avoid duplication) move that logic to a new function in libcore that takes a u32 parameter and that is called by both Wtf8::push_code_point_unchecked and char::encode_utf8. This function would need to be public because wtf8.rs is in a different crate, but it should be prema-unstable and #[doc(hidden)].

@SimonSapin
Copy link
Contributor

I would classify the priority of this bug as low.

Although this call to char::from_u32_unchecked is UB, I expect Miri checking for it explicitly is the only case where that has any consequence in today’s implementation:

  • char::encode_utf8 behaves as expected (for the purpose of WTF-8) in the surrogate range and does not exploit this UB
  • The "layout" of char in rustc excludes values beyond char::MAX but not surrogates:

ty::Char => tcx.intern_layout(Layout::scalar(
self,
Scalar { value: Int(I32, false), valid_range: 0..=0x10FFFF },
)),

@SimonSapin
Copy link
Contributor

By the way, the docs at https://doc.rust-lang.org/std/char/fn.from_u32_unchecked.html#safety feel insufficient:

Safety

This function is unsafe, as it may construct invalid char values.

For a safe version of this function, see the from_u32 function.

Which values are invalid?


Similarly, https://doc.rust-lang.org/std/char/index.html and https://doc.rust-lang.org/std/primitive.char.html say that char represents Unicode Scalar values, but I couldn’t find it documented anywhere that constructing an out of range char is Undefined Behavior rather than "merely" a logic bug.

Is this specified elsewhere?

What is the closest we have to a written down normative resource that specifies what is or isn’t UB in the the Rust language? (For APIs I assume this is the responsibility of their respective doc-comments.)

@RalfJung
Copy link
Member Author

Although this call to char::from_u32_unchecked is UB, I expect Miri checking for it explicitly is the only case where that has any consequence in today’s implementation

I agree the impact is low, but it's not just Miri -- this blocks #72683, which is how I discovered the problem.

What is the closest we have to a written down normative resource that specifies what is or isn’t UB in the the Rust language?

That would be https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html

@RalfJung
Copy link
Member Author

I think this bug is not in the contract of push_code_point_unchecked (which is private anyway) but in its implementation, namely going through char. The fix is to either duplicate the logic of char::encode_utf8 into wtf8.rs, or (to avoid duplication) move that logic to a new function in libcore that takes a u32 parameter and that is called by both Wtf8::push_code_point_unchecked and char::encode_utf8. This function would need to be public because wtf8.rs is in a different crate, but it should be prema-unstable and #[doc(hidden)].

But then what is "unchecked" about it? Elsewhere in that file the code is careful not to construct a char from a CodePoint without checking for surrogates:

pub fn to_char(&self) -> Option<char> {
match self.value {
0xD800..=0xDFFF => None,
_ => Some(unsafe { char::from_u32_unchecked(self.value) }),
}
}

That made me assume the author was aware that surrogates in char are UB.

@SimonSapin
Copy link
Contributor

The Wtf8 type contains a (generalized-UTF-8-encoded) sequence of code point (including potentially surrogates) that doesn’t form a surrogate pair. push_code_point_unchecked doesn’t check for surrogate pairs.

CodePoint::to_char is a public method that is documented as returning a scalar value, therefore excluding all surrogates.

That made me assume the author was aware that surrogates in char are UB.

Ok I looked into it, there are three different people involved.

The original code in 2014 was YOLO transmute https://github.com/SimonSapin/rust-wtf8/blob/76e023dfd56eef27ce36108a0182c156ededde2e/src/lib.rs#L164-L171

Later in 2014, SimonSapin/rust-wtf8@8a42f9e moved to duplicating logic instead.

In 2015, PR #21488 which first imported WTF-8 support in libstd deduplicated that logic by adding a core::char::encode_utf8_raw function that takes u32. (I knew that approach sounded familiar…)

In 2016, PR #32204 changed the signature of char::encode_utf8 and in passing removed encode_utf8_raw. It made wtf8.rs use char::from_u32_unchecked + char::encode_utf8 instead, presumably in the middle of updating many callers other.

@RalfJung
Copy link
Member Author

Okay, so the fix would be to revert the part of #32204 that removed encode_utf8_raw?

@SimonSapin
Copy link
Contributor

That or copy encode_utf8_raw into wtf8.rs. I find them both not great, so meh.

@RalfJung
Copy link
Member Author

Even with that done, there's still UB in another testcase:

use std::os::windows::ffi::{OsStrExt, OsStringExt};
use std::ffi::{OsStr, OsString};

fn main() {
    let mut base: Vec<u16> = OsStr::new("aé ").encode_wide().collect();
    base.push(0xD83D);
    let mut _res: Vec<u16> = OsString::from_wide(&base).encode_wide().collect();
}

This is converting something to a char before calling encode_utf16 -- so probably the same issue as the encode_utf8 above, just for a different encoding.

RalfJung added a commit to RalfJung/rust that referenced this issue May 31, 2020
…imulacrum

from_u32_unchecked: check validity, and fix UB in Wtf8

Fixes rust-lang#72760
@bors bors closed this as completed in 3bbb475 May 31, 2020
bors added a commit to rust-lang/miri that referenced this issue May 31, 2020
test WTF8 encoding corner cases

This adds a Miri-side test for rust-lang/rust#72760.

Blocked on rust-lang/rust#72683.
@jyn514 jyn514 removed the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Feb 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug. I-unsound Issue: A soundness hole (worst kind of bug), see: https://en.wikipedia.org/wiki/Soundness O-windows Operating system: Windows T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants