Skip to content

Conversation

thaliaarchi
Copy link
Contributor

Motor OS (a new target added in #147000) guarantees that strings from the OS are valid UTF-8, yet the OsStrExt/OsStringExt traits, which aim to expose the inner UTF-8 representations, use checked UTF-8 conversions.

The only way for a user to construct an OsStr/OsString from arbitrary bytes is via from_encoded_bytes_unchecked, which requires:

As the encoding is unspecified, callers must pass in bytes that originated as a mixture of validated UTF-8 and bytes from OsStr::as_encoded_bytes from within the same Rust version built for the same target platform.

Thus, callers are required to supply bytes which originated from OsStr::as_encoded_bytes (i.e., guaranteed to be UTF-8) and/or are valid UTF-8, so it is library UB for an OsStr/OsString to contain invalid UTF-8 on Motor OS.

Since the standard library can make these guarantees, I think it is appropriate for these extension traits to perform unchecked conversions. As they stand, they offer no benefit over existing methods.

Also: Replace OsStringExt::as_str with OsStringExt::into_string (just use OsStr::as_str for the other) and mirror the Unix comments for these functions.

cc @lasiotus

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Oct 17, 2025
@rustbot
Copy link
Collaborator

rustbot commented Oct 17, 2025

r? @Mark-Simulacrum

rustbot has assigned @Mark-Simulacrum.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@tgross35
Copy link
Contributor

My thinking during review was indeed that these should be zero-cost, but I missed that it used the to_ methods. What might actually be better here is to add a utf8 module to library/std/src/sys/os_str which is basically a thin wrapper over str/String; this has the same effect as what is in the PR here, but gives us protection against accidentally violating the invariants elsewhere.

@thaliaarchi
Copy link
Contributor Author

I like that approach better and will implement it soon

@thaliaarchi
Copy link
Contributor Author

Superseded by #147932

@rustbot rustbot removed the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Oct 21, 2025
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this pull request Oct 22, 2025
Create UTF-8 version of `OsStr`/`OsString`

Implement a UTF-8 version of `OsStr`/`OsString`, in addition to the existing bytes and WTF-8 platform-dependent encodings.

This is applicable for several platforms, but I've currently only implemented it for Motor OS:

- WASI uses Unicode paths, but currently reexports the Unix bytes-assuming `OsStrExt`/`OsStringExt` traits.
  - [wasi:filesystem](https://wa.dev/wasi:filesystem) APIs:
    > Paths are passed as interface-type `strings`, meaning they must consist of a sequence of Unicode Scalar Values (USVs). Some filesystems may contain paths which are not accessible by this API.
  - In [wasi-filesystem#17](WebAssembly/wasi-filesystem#17 (comment)), it was decided that applications can use any Unicode transformation format, so we're free to use UTF-8 (and probably already do). This was chosen over specifically UTF-8 or an ad hoc encoding which preserves paths not representable in UTF-8.
      > The current API uses strings for filesystem paths, which contains sequences of Unicode scalar values (USVs), which applications can work with using strings encoded in UTF-8, UTF-16, or other Unicode encodings.
    >
    > This does mean that the API is unable to open files which do not have well-formed Unicode encodings, which may want separate APIs for handling such paths or may want something like the arf-strings proposal, but if we need that we should file a new issue for it.
- As of Redox OS [0.7.0](https://www.redox-os.org/news/release-0.7.0/), "All paths are now required to be UTF-8, and the kernel enforces this". This appears to have been implemented in commit [d331f72f](https://gitlab.redox-os.org/redox-os/kernel/-/commit/d331f72f2a51fa577072f24bc2587829fd87368b) (Use UTF-8 for all paths, 2021-02-14). Redox does not have `OsStrExt`/`OsStringExt`.
- Motor OS guarantees that its OS strings are UTF-8 in its [current `OsStrExt`/`OsStringExt` traits](https://github.com/moturus/rust/blob/a828ffcf5f04be5cdd91b5fad608102eabc17ec7/library/std/src/os/motor/ffi.rs), but they're still internally bytes like Unix.

This is an alternate approach to rust-lang#147797, which reuses the existing bytes `OsString` and relies on the safety properties of `from_encoded_bytes_unchecked`. Compared to that, this also gains efficiency from propagating the UTF-8 invariant to the whole type, as it never needs to test for UTF-8 validity.

Note that Motor OS currently does not build until rust-lang#147930 merges.

cc `@tgross35` (for earlier review)
cc `@alexcrichton,` `@rylev,` `@loganek` (for WASI)
cc `@lasiotus` (for Motor OS)
cc `@jackpot51` (for Redox OS)
@thaliaarchi thaliaarchi deleted the motor-osstr branch October 22, 2025 20:41
github-actions bot pushed a commit to rust-lang/miri that referenced this pull request Oct 23, 2025
Create UTF-8 version of `OsStr`/`OsString`

Implement a UTF-8 version of `OsStr`/`OsString`, in addition to the existing bytes and WTF-8 platform-dependent encodings.

This is applicable for several platforms, but I've currently only implemented it for Motor OS:

- WASI uses Unicode paths, but currently reexports the Unix bytes-assuming `OsStrExt`/`OsStringExt` traits.
  - [wasi:filesystem](https://wa.dev/wasi:filesystem) APIs:
    > Paths are passed as interface-type `strings`, meaning they must consist of a sequence of Unicode Scalar Values (USVs). Some filesystems may contain paths which are not accessible by this API.
  - In [wasi-filesystem#17](WebAssembly/wasi-filesystem#17 (comment)), it was decided that applications can use any Unicode transformation format, so we're free to use UTF-8 (and probably already do). This was chosen over specifically UTF-8 or an ad hoc encoding which preserves paths not representable in UTF-8.
      > The current API uses strings for filesystem paths, which contains sequences of Unicode scalar values (USVs), which applications can work with using strings encoded in UTF-8, UTF-16, or other Unicode encodings.
    >
    > This does mean that the API is unable to open files which do not have well-formed Unicode encodings, which may want separate APIs for handling such paths or may want something like the arf-strings proposal, but if we need that we should file a new issue for it.
- As of Redox OS [0.7.0](https://www.redox-os.org/news/release-0.7.0/), "All paths are now required to be UTF-8, and the kernel enforces this". This appears to have been implemented in commit [d331f72f](https://gitlab.redox-os.org/redox-os/kernel/-/commit/d331f72f2a51fa577072f24bc2587829fd87368b) (Use UTF-8 for all paths, 2021-02-14). Redox does not have `OsStrExt`/`OsStringExt`.
- Motor OS guarantees that its OS strings are UTF-8 in its [current `OsStrExt`/`OsStringExt` traits](https://github.com/moturus/rust/blob/a828ffcf5f04be5cdd91b5fad608102eabc17ec7/library/std/src/os/motor/ffi.rs), but they're still internally bytes like Unix.

This is an alternate approach to rust-lang/rust#147797, which reuses the existing bytes `OsString` and relies on the safety properties of `from_encoded_bytes_unchecked`. Compared to that, this also gains efficiency from propagating the UTF-8 invariant to the whole type, as it never needs to test for UTF-8 validity.

Note that Motor OS currently does not build until rust-lang/rust#147930 merges.

cc `@tgross35` (for earlier review)
cc `@alexcrichton,` `@rylev,` `@loganek` (for WASI)
cc `@lasiotus` (for Motor OS)
cc `@jackpot51` (for Redox OS)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

T-libs Relevant to the library team, which will review and decide on the PR/issue.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants