ACP: A substring API for `OsStr` #306

blyxxyz · 2023-11-28T20:14:23Z

Proposal

Problem statement

OsStr and OsString provide access to their (unspecified) internal byte encoding using {as,into}_encoded_bytes() and from_encoded_bytes_unchecked() methods. It's possible to convert an OS string to bytes, and to convert bytes to an OS string.

However, from_encoded_bytes_unchecked() is unsafe, and there is no universal way to validate the safety invariants. Some common string operations (splitting, trimming, replacing) are impossible to implement without unsafe code.

New APIs should ideally discourage relying on any details of the internal encoding (which may be unstable and not meant for interchange).

Motivating examples or use cases

Argument parsers need to extract substrings from command line arguments. For example, --option=somefilename needs to be split into option and somefilename, and the original filename must be preserved without sanitizing it.

clap currently implements strip_prefix and split_once using transmute (equivalent to the stable encoded_bytes APIs).

lexopt (my own crate) currently uses the platform-specific APIs, but I'd like to move to the encoded_bytes API eventually. unsafe is holding me back since I have working code already and I think some of my users would consider it a regression.

The os_str_bytes and osstrtools crates provides high-level string operations for OS strings. os_str_bytes is in the wild mainly used to convert between raw bytes and OS strings (e.g. 1, 2, 3). osstrtools enables reasonable uses of split() to parse $PATH and replace() to fill in command line templates.

Solution sketch

I propose a method to take a substring of an OsStr, based on offsets into the result of as_encoded_bytes(). On Unix any slice (within bounds) would be valid. On Windows this method would panic if the string is not cut on UTF-8 boundaries, exactly like the requirements of from_encoded_bytes_unchecked:

Due to the encoding being self-synchronizing, the bytes from OsStr::as_encoded_bytes can be split either immediately before or immediately after any valid non-empty UTF-8 substring.

Note that this is stricter than the actual internal encoding of OS strings on Windows.

Proposed signature:

impl OsStr {
    /// Takes a substring based on a range that corresponds to the return value of
    /// [`OsStr::as_encoded_bytes`].
    ///
    /// The range's start and end must lie on valid `OsStr` boundaries.
    ///
    /// On Unix any boundaries are valid, as OS strings may contain arbitrary bytes.
    ///
    /// On other platforms such as Windows the internal encoding is currently
    /// unspecified, and a valid `OsStr` boundary is one of:
    /// - The start of the string
    /// - The end of the string
    /// - Immediately before a valid non-empty UTF-8 substring
    /// - Immediately after a valid non-empty UTF-8 substring
    ///
    /// # Panics
    ///
    /// Panics if `range` does not lie on valid `OsStr` boundaries or if it
    /// exceeds the end of the string.
    ///
    /// # Example
    ///
    /// ```
    /// use std::ffi::OsStr;
    ///
    /// let os_str = OsStr::new("foo=bar");
    /// let bytes = os_str.as_encoded_bytes();
    /// if let Some(index) = bytes.iter().position(|b| *b == b'=') {
    ///     let key = os_str.slice_encoded_bytes(..index);
    ///     let value = os_str.slice_encoded_bytes(index + 1..);
    ///     assert_eq!(key, "foo");
    ///     assert_eq!(value, "bar");
    /// }
    /// ```
    fn slice_encoded_bytes<R: RangeBounds<usize>>(&self, range: R) -> &Self;
}

Examples

A proof of concept is implemented here: os_str_slice.rs

With an example port of lexopt to this API: blyxxyz/lexopt@8077851

It should be trivial to port clap's transmuting operations to this API, since they all take substrings of an OsStr.

A string replace function can be implemented using OsStr::slice_encoded_bytes() and OsString::push():

fn replace(source: &OsStr, from: &str, to: &OsStr) -> OsString {
    assert!(!from.is_empty());

    let from = from.as_bytes();
    let mut result = OsString::new();
    let bytes = source.as_encoded_bytes();
    let mut prev = 0;
    let mut index = 0;
    while index < bytes.len() {
        if bytes[index..].starts_with(from) {
            result.push(source.slice_encoded_bytes(prev..index));
            result.push(to);
            index += from.len();
            prev = index;
        } else {
            index += 1;
        }
    }
    result.push(source.slice_encoded_bytes(prev..));
    result
}

Notice that it does require the needle to be non-empty UTF-8.

Guarding the internal encoding

This solution has two attractive properties:

It does not make it any easier to use the internal encoding for interchange (which WTF-8 prohibits).
It can enforce stricter requirements than the actual internal encoding, which makes it hard for user code to rely on unspecified details.

Any API that converts from bytes to an OS string would not have these advantages.

Behavior on niche platforms

The current behavior of OS strings is only documented prominently for Unix and for Windows.

All other platforms reuse the OS string internals of either of these platforms. Almost all of the Unix-alikes expose OsStrExt/OsStringExt, which means they specify the internal encoding to be arbitrary bytes.

Only wasm uses Unix OS strings without the extension traits. I couldn't find the history of this, but it's probably intentional. JavaScript uses potentially ill-formed UTF-16, like Windows, but WebAssembly can be used in other environments as well, so there is no obvious single set of semantics. (There is some loosely related discussion at rustwasm/wasm-bindgen#1348.)

In order to keep the slicing invariants encapsulated it might be necessary to create a third OS string implementation to be used by wasm. Since this platform can currently only legally construct OS strings from UTF-8 strings the implementation could be backed by str/String, with the side benefit of free conversion back into UTF-8 strings (currently UTF-8 validation is performed). The implementation of Unix OS strings is simple, and this implementation would be even simpler.

Alternatives

A checked, safe from_encoded_bytes(&[u8]) method. This is a natural counterpart to from_encoded_bytes_unchecked(&[u8]). But:
- It would encourage relying on the unspecified internal encoding and using it for interchange.
- It would require validating the entire string, which may take O(n) time, while a slicing method can run in O(1) time.
- It would require implementing a WTF-8 validator, which to my knowledge doesn't exist yet. (The original wtf8 crate deliberately does not implement this.)
A split_at(&self, pos: usize) -> (&OsStr, &OsStr) style method. This is ostensibly simpler than a slicing API. But:
- The OMG-WTF-8 encoding does not support this. It requires some splits to produce overlapping slices, so that left.len() + right.len() > orig.len(). The encoded bytes on one side of pos might form a valid OMG-WTF-8 string while those on the other side do not.
  - Even if we fudge it by shifting the boundary this prohibits a mutable variant of the method, since overlapping mutable references are unsound. A mutable version of slice_encoded_bytes() would be unproblematic.
    - (The only currently existing mutating methods on OsStr are make_ascii_lowercase() and make_ascii_uppercase().)
  - This would only become a concern if we switch to this (or a similar) encoding and start specifying it.
  - RFC #2295 that proposes this encoding was merged five years ago (but has not been implemented).
The method could return None rather than panicking.
- Incorrect offsets would typically imply a bug or an assumption about the internal encoding. I can't off-hand think of any uses for this.
The method could enforce the same requirements on all platforms, including Unix (where the internal encoding is specified to be arbitrary bytes).
- This would make it easier to test that user logic is encoding-independent.
  - Writing encoding-independent logic is not very hard to begin with, though.
The method could panic only if even the unspecified invariants of the internal encoding are violated.
- This would encourage depending on the details of the internal encoding.
The method could instead be an Index impl for OsStr.
- This would perhaps be too prominent.
- It would be harder to find the documentation with the exact requirements.
- It may make sense to add this if/when OsStr has a richer API that allows determining offsets without calling as_encoded_bytes().

The method can be implemented using existing APIs, see the proof of concept above. But the fact that it's a minimal safe API that can be used to implement higher-level opinionated operations makes it a natural fit for the standard library.

Links and related work

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

We think this problem seems worth solving, and the standard library might be the right place to solve it.
We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.

The text was updated successfully, but these errors were encountered:

BurntSushi · 2023-11-28T21:07:46Z

Accepted.

cc @epage

epage · 2023-11-28T21:19:35Z

@blyxxyz (since there isn't a tracking issue yet)

The method could enforce the same requirements on all platforms, including Unix (where the internal encoding is specified to be arbitrary bytes).

Personally, I would prefer this alternative so that this API focused on the documented user-facing invariants rather than the per-platform-specific undocumented invariants.

blyxxyz · 2023-11-28T21:36:26Z

@epage
That was my initial preference as well, but I changed my mind while writing the proposal. The internal encoding for Unix is documented and exposed and locked-in through the extension traits, so it feels overly pedantic to go out of our way and perform these checks anyway.

I did really like fuzz testing on Unix and knowing that it would cover any future encoding on any platform. But getting the logic right didn't turn out to be very hard anyway.

epage · 2023-11-28T22:20:40Z

This came up when I was working on the encoded_bytes API (iirc in a very side conversation so none of what I say represents the libs api team endorsing this by merging that feature). We could document the per-platform behavior for the other encoded_bytes APIs but I felt it better to keep things simple from the user perspective and say that those users need to use OsStrExt for that.

This did inform the approach to the conversions docs takes which is that encoded_bytes can only offer cross-platform guarantees and that OsStrExt is still relevant for per-platform guarantees.

blyxxyz · 2023-11-28T23:07:38Z

It's definitely easier to explain, yeah. I wrote "on Unix" in the example doc comment but that's already not what I'm proposing.

It's also less work to implement, so I'll start with that and we can get more input later.

Would it be acceptable to relax the requirements after stabilization?

epage · 2023-11-28T23:50:15Z

From my understanding, it can be relaxed in the future as going from a panicking state to non-panicking shouldn't be breaking.

Add substring API for `OsStr` This adds a method for taking a substring of an `OsStr`, which in combination with [`OsStr::as_encoded_bytes()`](https://doc.rust-lang.org/std/ffi/struct.OsStr.html#method.as_encoded_bytes) makes it possible to implement most string operations in safe code. API: ```rust impl OsStr { pub fn slice_encoded_bytes<R: ops::RangeBounds<usize>>(&self, range: R) -> &Self; } ``` Motivation, examples and research at rust-lang/libs-team#306. Tracking issue: rust-lang#118485 cc `@epage` r? libs-api

Add substring API for `OsStr` This adds a method for taking a substring of an `OsStr`, which in combination with [`OsStr::as_encoded_bytes()`](https://doc.rust-lang.org/std/ffi/struct.OsStr.html#method.as_encoded_bytes) makes it possible to implement most string operations in safe code. API: ```rust impl OsStr { pub fn slice_encoded_bytes<R: ops::RangeBounds<usize>>(&self, range: R) -> &Self; } ``` Motivation, examples and research at rust-lang/libs-team#306. Tracking issue: #118485 cc `@epage` r? libs-api

blyxxyz added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Nov 28, 2023

BurntSushi added the ACP-accepted API Change Proposal is accepted (seconded with no objections) label Nov 28, 2023

BurntSushi closed this as completed Nov 28, 2023

This was referenced Nov 30, 2023

Add substring API for OsStr rust-lang/rust#118484

Merged

Tracking Issue for os_str_slice rust-lang/rust#118485

Open

epage mentioned this issue Dec 7, 2023

ACP: Pattern methods for OsStr without OsStr patterns #311

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ACP: A substring API for `OsStr` #306

ACP: A substring API for `OsStr` #306

blyxxyz commented Nov 28, 2023

BurntSushi commented Nov 28, 2023

epage commented Nov 28, 2023

blyxxyz commented Nov 28, 2023

epage commented Nov 28, 2023 •

edited

Loading

blyxxyz commented Nov 28, 2023

epage commented Nov 28, 2023

ACP: A substring API for OsStr #306

ACP: A substring API for OsStr #306

Comments

blyxxyz commented Nov 28, 2023

Proposal

Problem statement

Motivating examples or use cases

Solution sketch

Examples

Guarding the internal encoding

Behavior on niche platforms

Alternatives

Links and related work

What happens now?

Possible responses

BurntSushi commented Nov 28, 2023

epage commented Nov 28, 2023

blyxxyz commented Nov 28, 2023

epage commented Nov 28, 2023 • edited Loading

blyxxyz commented Nov 28, 2023

epage commented Nov 28, 2023

ACP: A substring API for `OsStr` #306

ACP: A substring API for `OsStr` #306

epage commented Nov 28, 2023 •

edited

Loading