ACP: A substring API for OsStr
#306
Comments
Accepted. cc @epage
@blyxxyz (since there isn't a tracking issue yet)
Personally, I would prefer this alternative so that this API focuses on the documented user-facing invariants rather than the per-platform-specific undocumented invariants.
@epage I did really like fuzz testing on Unix and knowing that it would cover any future encoding on any platform. But getting the logic right didn't turn out to be very hard anyway.
This came up when I was working on the This did inform the approach to the conversions docs takes which is that
It's definitely easier to explain, yeah. I wrote "on Unix" in the example doc comment but that's already not what I'm proposing. It's also less work to implement, so I'll start with that and we can get more input later. Would it be acceptable to relax the requirements after stabilization?
From my understanding, it can be relaxed in the future, as going from a panicking state to a non-panicking one shouldn't be breaking.
Add substring API for `OsStr`

This adds a method for taking a substring of an `OsStr`, which in combination with [`OsStr::as_encoded_bytes()`](https://doc.rust-lang.org/std/ffi/struct.OsStr.html#method.as_encoded_bytes) makes it possible to implement most string operations in safe code.

API:
```rust
impl OsStr {
    pub fn slice_encoded_bytes<R: ops::RangeBounds<usize>>(&self, range: R) -> &Self;
}
```

Motivation, examples and research at rust-lang/libs-team#306.

Tracking issue: rust-lang#118485

cc `@epage`
r? libs-api
Proposal
Problem statement
`OsStr` and `OsString` provide access to their (unspecified) internal byte encoding using `{as,into}_encoded_bytes()` and `from_encoded_bytes_unchecked()` methods. It's possible to convert an OS string to bytes, and to convert bytes to an OS string.

However, `from_encoded_bytes_unchecked()` is unsafe, and there is no universal way to validate the safety invariants. Some common string operations (splitting, trimming, replacing) are impossible to implement without unsafe code.

New APIs should ideally discourage relying on any details of the internal encoding (which may be unstable and not meant for interchange).
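To make the current situation concrete, here is a minimal sketch of the round trip that the existing stable methods allow. The helper name `round_trip` is made up for illustration. The conversion back is only sound here because the bytes are passed through unmodified, which is exactly the kind of argument that can't be checked by the compiler.

```rust
use std::ffi::{OsStr, OsString};

/// Hypothetical helper: converts an OS string to its encoded bytes and back.
fn round_trip(s: &OsStr) -> OsString {
    let bytes: &[u8] = s.as_encoded_bytes();
    // SAFETY: the bytes come unmodified from `as_encoded_bytes()` in the same
    // program, so they are valid for the platform's internal encoding.
    unsafe { OsString::from_encoded_bytes_unchecked(bytes.to_vec()) }
}

fn main() {
    let original = OsStr::new("--option=somefilename");
    assert_eq!(round_trip(original).as_os_str(), original);
}
```

Any operation that cuts or rearranges the bytes in between needs its own safety argument.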
Motivating examples or use cases
Argument parsers need to extract substrings from command line arguments. For example, `--option=somefilename` needs to be split into `option` and `somefilename`, and the original filename must be preserved without sanitizing it.

`clap` currently implements `strip_prefix` and `split_once` using `transmute` (equivalent to the stable `encoded_bytes` APIs).

`lexopt` (my own crate) currently uses the platform-specific APIs, but I'd like to move to the `encoded_bytes` API eventually. `unsafe` is holding me back since I have working code already and I think some of my users would consider it a regression.

The `os_str_bytes` and `osstrtools` crates provide high-level string operations for OS strings. `os_str_bytes` is in the wild mainly used to convert between raw bytes and OS strings (e.g. 1, 2, 3). `osstrtools` enables reasonable uses of `split()` to parse `$PATH` and `replace()` to fill in command line templates.

Solution sketch
I propose a method to take a substring of an `OsStr`, based on offsets into the result of `as_encoded_bytes()`. On Unix any slice (within bounds) would be valid. On Windows this method would panic if the string is not cut on UTF-8 boundaries, exactly like the requirements of `from_encoded_bytes_unchecked`.

Note that this is stricter than the actual internal encoding of OS strings on Windows.

Proposed signature: `pub fn slice_encoded_bytes<R: ops::RangeBounds<usize>>(&self, range: R) -> &Self` on `OsStr`.
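As a rough illustration of the idea, the sketch below emulates such a method on top of the existing stable APIs. It is not the linked proof of concept or the standard library implementation: the free functions `is_boundary`, `slice_encoded_bytes` and `split_long_option` are hypothetical names, the range is a plain `Range<usize>` for brevity, and the boundary check is applied on every platform (the stricter variant discussed in the comments) rather than only on Windows.

```rust
use std::ffi::OsStr;
use std::ops::Range;

/// Returns true if `index` is the start or end of `bytes`, or sits directly
/// before or directly after a valid UTF-8 character.
fn is_boundary(bytes: &[u8], index: usize) -> bool {
    if index == 0 || index == bytes.len() {
        return true;
    }
    if index > bytes.len() {
        return false;
    }
    let (before, after) = bytes.split_at(index);
    // A UTF-8 character is at most 4 bytes long, so 4 bytes of context suffice.
    let after = &after[..after.len().min(4)];
    let starts_char = match std::str::from_utf8(after) {
        Ok(_) => true,
        Err(err) => err.valid_up_to() > 0,
    };
    let ends_char = (1..=before.len().min(4))
        .any(|len| std::str::from_utf8(&before[before.len() - len..]).is_ok());
    starts_char || ends_char
}

/// Hypothetical stand-in for the proposed `OsStr::slice_encoded_bytes()`.
/// Panics if the range is out of bounds or not cut on a checked boundary.
fn slice_encoded_bytes(s: &OsStr, range: Range<usize>) -> &OsStr {
    let bytes = s.as_encoded_bytes();
    assert!(is_boundary(bytes, range.start), "start is not on a boundary");
    assert!(is_boundary(bytes, range.end), "end is not on a boundary");
    // SAFETY: both ends are either an end of the original string or adjacent
    // to valid UTF-8, which is where `from_encoded_bytes_unchecked()`
    // documents that encoded bytes may be split.
    unsafe { OsStr::from_encoded_bytes_unchecked(&bytes[range]) }
}

/// Splits `--option=value` into (`option`, `value`), leaving the value's
/// bytes untouched.
fn split_long_option(arg: &OsStr) -> Option<(&OsStr, &OsStr)> {
    let bytes = arg.as_encoded_bytes();
    let rest = bytes.strip_prefix(b"--")?;
    let eq = rest.iter().position(|&b| b == b'=')? + 2;
    Some((
        slice_encoded_bytes(arg, 2..eq),
        slice_encoded_bytes(arg, eq + 1..bytes.len()),
    ))
}

fn main() {
    let arg = OsStr::new("--option=somefilename");
    let (name, value) = split_long_option(arg).unwrap();
    assert_eq!(name, OsStr::new("option"));
    assert_eq!(value, OsStr::new("somefilename"));
}
```

With the method in the standard library, the `unsafe` block and the boundary check would live in `std`, and callers such as `split_long_option` would stay entirely safe.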
Examples
A proof of concept is implemented here: os_str_slice.rs

With an example port of `lexopt` to this API: blyxxyz/lexopt@8077851

It should be trivial to port `clap`'s transmuting operations to this API, since they all take substrings of an `OsStr`.

A string replace function can be implemented using `OsStr::slice_encoded_bytes()` and `OsString::push()`. Notice that it does require the needle to be non-empty UTF-8.
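A sketch of what such a replace function could look like follows. It is a reconstruction for illustration, not the code from the proposal: since the proposed method may not be available on a stable toolchain, the local closure `slice` stands in for `source.slice_encoded_bytes(start..end)` and uses `unsafe` directly, justified by the fact that every cut is adjacent to a match of the UTF-8 needle.

```rust
use std::ffi::{OsStr, OsString};

/// Replaces every occurrence of the non-empty UTF-8 needle `from` with `to`.
fn replace(source: &OsStr, from: &str, to: &OsStr) -> OsString {
    assert!(!from.is_empty(), "needle must be non-empty");
    let bytes = source.as_encoded_bytes();
    let needle = from.as_bytes();

    // Stand-in for the proposed `source.slice_encoded_bytes(start..end)`.
    let slice = |start: usize, end: usize| {
        // SAFETY: every offset passed below is 0, `bytes.len()`, or directly
        // before or after a match of the valid, non-empty UTF-8 needle.
        unsafe { OsStr::from_encoded_bytes_unchecked(&bytes[start..end]) }
    };

    let mut result = OsString::with_capacity(bytes.len());
    let mut start = 0; // start of the part that hasn't been copied yet
    let mut pos = 0; // current search position
    while pos + needle.len() <= bytes.len() {
        if bytes[pos..].starts_with(needle) {
            result.push(slice(start, pos));
            result.push(to);
            pos += needle.len();
            start = pos;
        } else {
            pos += 1;
        }
    }
    result.push(slice(start, bytes.len()));
    result
}

fn main() {
    let template = OsStr::new("cp {} {}.bak");
    let filled = replace(template, "{}", OsStr::new("notes.txt"));
    assert_eq!(filled, OsString::from("cp notes.txt notes.txt.bak"));
}
```

The needle has to be a `&str` because a match of valid UTF-8 is what guarantees that the cut points are legal. The replacement can be any `OsStr`, since it is only ever appended whole.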
Guarding the internal encoding
This solution has two attractive properties:
Any API that converts from bytes to an OS string would not have these advantages.
Behavior on niche platforms
The current behavior of OS strings is only documented prominently for Unix and for Windows.
All other platforms reuse the OS string internals of either of these platforms. Almost all of the Unix-alikes expose `OsStrExt`/`OsStringExt`, which means they specify the internal encoding to be arbitrary bytes.

Only `wasm` uses Unix OS strings without the extension traits. I couldn't find the history of this, but it's probably intentional. JavaScript uses potentially ill-formed UTF-16, like Windows, but WebAssembly can be used in other environments as well, so there is no obvious single set of semantics. (There is some loosely related discussion at rustwasm/wasm-bindgen#1348.)

In order to keep the slicing invariants encapsulated it might be necessary to create a third OS string implementation to be used by `wasm`. Since this platform can currently only legally construct OS strings from UTF-8 strings, the implementation could be backed by `str`/`String`, with the side benefit of free conversion back into UTF-8 strings (currently UTF-8 validation is performed). The implementation of Unix OS strings is simple, and this implementation would be even simpler.

Alternatives
`from_encoded_bytes(&[u8])` method. This is a natural counterpart to `from_encoded_bytes_unchecked(&[u8])`. But:
`wtf8` crate deliberately does not implement this.)
`split_at(&self, pos: usize) -> (&OsStr, &OsStr)` style method. This is ostensibly simpler than a slicing API. But:
`left.len() + right.len() > orig.len()`. The encoded bytes on one side of `pos` might form a valid OMG-WTF-8 string while those on the other side do not.
`slice_encoded_bytes()` would be unproblematic.
`OsStr` are `make_ascii_lowercase()` and `make_ascii_uppercase()`.)
`None` rather than panicking.
`Index` impl for `OsStr`.
`OsStr` has a richer API that allows determining offsets without calling `as_encoded_bytes()`.

The method can be implemented using existing APIs, see the proof of concept above. But the fact that it's a minimal safe API that can be used to implement higher-level opinionated operations makes it a natural fit for the standard library.
Links and related work
`os_str_bytes`
`osstrtools`
`(str, OsStr)` #114

What happens now?
This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.
Possible responses
The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):
Second, if there's a concrete solution: