Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ACP: Pattern methods for OsStr without OsStr patterns #311

Closed
epage opened this issue Dec 7, 2023 · 32 comments
Closed

ACP: Pattern methods for OsStr without OsStr patterns #311

epage opened this issue Dec 7, 2023 · 32 comments
Labels
ACP-accepted API Change Proposal is accepted (seconded with no objections) api-change-proposal A proposal to add or alter unstable APIs in the standard libraries I-libs-api-nominated Indicates that an issue has been nominated for discussion during a team meeting. T-libs-api

Comments

@epage
Copy link

epage commented Dec 7, 2023

Proposal

Problem statement

With rust-lang/rust#115443, developers, like those writing CLI parsers, can now perform (limited) operations on OsStr but it requires unsafe to get an OsStr back, requiring the developer to understand and follow some very specific safety notes that cannot be checked by miri.

RFC #2295 exists for improving this but its been stalled out. The assumption here is that part of the problem with that RFC is how wide its scope is and that by shrinking the scope, we can get some benefits now.

Motivating examples or use cases

Mostly copied from #306

Argument parsers need to extract substrings from command line arguments. For example, --option=somefilename needs to be split into option and somefilename, and the original filename must be preserved without sanitizing it.

clap currently implements strip_prefix and split_once using transmute (equivalent to the stable encoded_bytes APIs).

The os_str_bytes and osstrtools crates provides high-level string operations for OS strings. os_str_bytes is in the wild mainly used to convert between raw bytes and OS strings (e.g. 1, 2, 3). osstrtools enables reasonable uses of split() to parse $PATH and replace() to fill in command line templates.

Solution sketch

Provide strs Pattern-accepting methods on &OsStr.

Defer out OsStr being used as a Pattern and OsStr indexing support which are specified in RFC #2295.

Example of methods to be added:

impl OsStr {
    pub fn contains<'a, P>(&'a self, pat: P) -> bool
    where
        P: Pattern<&'a Self>;

    pub fn starts_with<'a, P>(&'a self, pat: P) -> bool
    where
        P: Pattern<&'a Self>;

    pub fn ends_with<'a, P>(&'a self, pat: P) -> bool
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn find<'a, P>(&'a self, pat: P) -> Option<usize>
    where
        P: Pattern<&'a Self>;

    pub fn rfind<'a, P>(&'a self, pat: P) -> Option<usize>
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    // (Note: these should return a concrete iterator type instead of `impl Trait`.
    //  For ease of explanation the concrete type is not listed here.)
    pub fn split<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>;

    pub fn split_inclusive<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>;

    pub fn rsplit<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn split_terminator<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>;

    pub fn rsplit_terminator<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn splitn<'a, P>(&'a self, n: usize, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>;

    pub fn rsplitn<'a, P>(&'a self, n: usize, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn split_once<'a, P>(&'a self, delimiter: P) -> Option<(&'a Self, &'a Self)>where
        P: Pattern<&'a Self>;

    pub fn rsplit_once<'a, P>(&'a self, delimiter: P) -> Option<(&'a Self, &'a Self)>where
        P: Pattern<&'a Self>;

    pub fn matches<'a, P>(&'a self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>;

    pub fn rmatches<'a, P>(&self, pat: P) -> impl Iterator<Item = &'a Self>
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn match_indices<'a, P>(&self, pat: P) -> impl Iterator<Item = (usize, &'a Self)>
    where
        P: Pattern<&'a Self>;

    pub fn rmatch_indices<'a, P>(&self, pat: P) -> impl Iterator<Item = (usize, &'a Self)>
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn trim_matches<'a, P>(&'a self, pat: P) -> &'a Self
    where
        P: Pattern<&'a Self>,
        P::Searcher: DoubleEndedSearcher<&'a Self>;

    pub fn trim_start_matches<'a, P>(&'a self, pat: P) -> &'a Self
    where
        P: Pattern<&'a Self>;

    pub fn strip_prefix<'a, P>(&'a self, prefix: P) -> Option<&'a Self> where
    P: Pattern<&'a Self>;

    pub fn strip_suffix<'a, P>(&'a self, prefix: P) -> Option<&'a Self> where
    P: Pattern<&'a Self>;

    pub fn trim_end_matches<'a, P>(&'a self, pat: P) -> &'a Self
    where
        P: Pattern<&'a Self>,
        P::Searcher: ReverseSearcher<&'a Self>;

    pub fn replace<'a, P>(&'a self, from: P, to: &'a Self) -> Self::Owned
    where
        P: Pattern<&'a Self>;

    pub fn replacen<'a, P>(&'a self, from: P, to: &'a Self, count: usize) -> Self::Owned
    where
        P: Pattern<&'a Self>;
}

impl Pattern<&OsStr> for char {}
impl Pattern<&OsStr> for &str {}
impl Pattern<&OsStr> for &String {}
impl Pattern<&OsStr> for &[char] {}
impl Pattern<&OsStr> for &&str {}
impl<const N: usize> Pattern<&OsStr> for &[char; N] {}
impl<F: FnMut(char) -> bool> Pattern<&OsStr> for F {}
impl<const N: usize> Pattern<&OsStr> for [char; N] {}
  • This is meant to match str and if there are any changes between the writing of this ACP and implementation, the focus should be on what str has at the time of implementation (e.g. not adding a deprecated variant but the new one)
  • We likely want to add trim, trim_start, and trim_end to be consistent with trim_start_matches / trim_end_matches
  • for more details, see Add pattern matching API to OsStr rust#109350

This should work because

From an API design perspective, there is strong precedence for it

  • Its copying methods over from str
  • The design is a subset of RFC #2295 (approved) and RFC #1309 (postponed)
    • By deferring support for OsStr as a pattern, we bypass the main dividing point between proposals (split APIs, panic on unpaired surrogates, switching away from WTF-8)

Alternatives

#306 proposes a OsStr::slice_encoded_bytes

  • Still requires writing higher level operations on top, but at least its without unsafe
  • Either takes a performance hit to be consistent across platforms or has per-platform caveats that will be similarly hard to get right for less common platforms among developers (e.g. Windows)
  • As far as I can tell, there isn't precedence for an API design like this meaning more new ground has to be set (naming, deciding the above preconditions, etc)

Links and related work

What happens now?

This issue contains an API change proposal (or ACP) and is part of the libs-api team feature lifecycle. Once this issue is filed, the libs-api team will review open proposals as capability becomes available. Current response times do not have a clear estimate, but may be up to several months.

Possible responses

The libs team may respond in various different ways. First, the team will consider the problem (this doesn't require any concrete solution or alternatives to have been proposed):

  • We think this problem seems worth solving, and the standard library might be the right place to solve it.
  • We think that this probably doesn't belong in the standard library.

Second, if there's a concrete solution:

  • We think this specific solution looks roughly right, approved, you or someone else should implement this. (Further review will still happen on the subsequent implementation PR.)
  • We're not sure this is the right solution, and the alternatives or other materials don't give us enough information to be sure about that. Here are some questions we have that aren't answered, or rough ideas about alternatives we'd want to see discussed.
@epage epage added api-change-proposal A proposal to add or alter unstable APIs in the standard libraries T-libs-api labels Dec 7, 2023
@joshtriplett
Copy link
Member

This seems like a good idea to me. Putting these methods on OsStr will allow code to do simple parsing/splitting/etc in safe code. And because this does not add OsStr to Pattern, it remains a simple addition to API surface without any representation changes.

@pitaj
Copy link

pitaj commented Dec 8, 2023

What implements Pattern<&OsStr>?

@epage
Copy link
Author

epage commented Dec 8, 2023

I've updated the proposal to call that out

impl Pattern<&OsStr> for &str {}
impl Pattern<&OsStr> for char {}
impl Pattern<&OsStr> for &[char] {}
impl<F: FnMut(char) -> bool> Pattern<&OsStr> for F {}
impl Pattern<&OsStr> for &&str {}
impl Pattern<&OsStr> for &String {}

Basically, this is a direct mirror of &strs functionality. I don't think I've seen this called out anywhere in the previous RFCs but my assumption is non-UTF8 byte sequences in OsStr are just considered non-matches.

@pitaj
Copy link

pitaj commented Dec 8, 2023

impl<F: FnMut(char) -> bool> Pattern<&OsStr> for F {}

IIRC, this is currently forbidden due to coherence rules.

@joshtriplett
Copy link
Member

joshtriplett commented Dec 12, 2023

@pitaj Presumably we'd do this the same way we already did for Pattern.

@pitaj
Copy link

pitaj commented Dec 12, 2023

AFAIK all of today's Patterns are defined in core, but OsStr lives in std. That's where the coherence issues come in.

@joshtriplett
Copy link
Member

@pitaj Ah, I see. I think we have a mechanism that allows us to have impls of core traits in std without regard for the orphan rule, which would address that.

@pitaj
Copy link

pitaj commented Dec 12, 2023

Last I checked, there's a mechanism for inherent impls, but not for traits. That said, it's probably something that could be added.

@programmerjake
Copy link
Member

we could put just the OsStr struct in core as an implementation detail, along with necessary trait impls, and put the rest in std with a re-export of OsStr...this would allow implementing:

impl<F: FnMut(char) -> bool> Pattern<&OsStr> for F {}

@m-ou-se
Copy link
Member

m-ou-se commented Dec 19, 2023

We briefly discussed this in last week's libs-api meeting. While we agree it'd be good for OsStr to have a more complete API, we're worried about the amount of string types: should CStr and [ascii::Char] have the same api, for example?

It'd be good to first explore solutions that could benefit all string types, before continuing with this proposal to extend OsStr itself. For example, could a trait or another mechanism be used to make this api availabel to all string types?

@epage
Copy link
Author

epage commented Dec 19, 2023

So if I understand correctly, the desire is to explore the design of a Pattern Extension trait with methods like contains, starts_with, etc that can apply to all string-like types?

Including CStr in this list is interesting because some methods, like strip_prefix, will require allocating into a CString. To be clear, that is expected and part of the design requirements for this?

@programmerjake
Copy link
Member

Including CStr in this list is interesting because some methods, like strip_prefix, will require allocating into a CString.

i think you meant strip_suffix? strip_prefix should be able to return &CStr

@epage
Copy link
Author

epage commented Dec 20, 2023

Good point, and figuring out how we should handle strip_prefix vs all other slicing operations adds another area for design exploration and discussions. Do we have all slicing operations treated the same, having CStr always return an owned type? Or do we hard code into the trait that only leading/middle/arbitrary content might be owned and trailing content is always borrowed.

For example, CStr::split_once could return either (Owned, &Borrowed) or (Owned, Owned) while split would always return impl Iterator<Item=Owned> (or maybe Cow?).

@programmerjake
Copy link
Member

maybe CStr::split_once could return something more like (&[u8], &CStr)?

@blyxxyz
Copy link

blyxxyz commented Dec 20, 2023

CStr operations could return &[u8] and let the caller decide whether to turn it into CString (at the cost of a null byte check). Or they could mutate the original string to insert null bytes, strtok-style (probably a bad idea). Or there could be a CStrSlice type that has no terminator but is guaranteed to contain no null bytes...

Should function arguments be full-blown CStrs? It's important that they not have internal null bytes but for e.g. a search pattern there's no technical reason for the terminator, and in some cases that could require an allocation. (The recently stabilized C string literals would help with this at least.)

But mainly: Should CStr have this API at all? My experience is very limited but it does include a case where raw C string pointers were passed around arbitrarily when they should have been converted ASAP after the function call. How big of a use case is there for manipulating these instead of converting them immediately at the FFI boundary? Unlike OS strings they're just bytes with a validity requirement, so they're easier to convert.

Should [u8] get this API? Cstr would then get lossless access through to_bytes().

@programmerjake
Copy link
Member

wouldn't CStrSlice just be [NonZeroU8] (except that's kinda hard to work with currently due to lacking lots of byte string methods)

@tgross35
Copy link

tgross35 commented Jan 9, 2024

I like the trait idea, especially since it will allow writing combinators that work for any string type and being able to reuse them on both CStrs from FFI and &strs from users.

Maybe a trait could look something like this (ignoring lifetimes):

trait SliceLikePattern: ToOwned {
    // Yes, we don't have associated type defaults...

    /// Result of splitting items
    type Partial = Self;

    /// Rightmost result of split items if different than `Partial`, e.g. for `CStr`
    type PartialRight = Self::Partial;

    /// Pattern type used when a single element will be extracted. `u8` for `&[u8]`,
    /// `str::pattern::Pattern` for str, maybe `u8` or `u16` for `OsStr`
    /// Or maybe  `FnMut(&u8) -> bool` for slices, as in `split_once`
    type ItemPattern;

    /// Pattern type used when a partial (slice) is expected, `&[u8]` for `&[u8]`
    /// still `str::pattern::Pattern` for `str`
    type PartialPattern;

    /// PartialPattern but if there is a specific right-first search
    /// e.g. str's `<P as Pattern<'a>>::Searcher: ReverseSearcher<'a>`
    type PartialPatternReverse = Self::PartialPattern;
    
    fn split_at(&self, mid: usize) -> (&Self::Partial, &Self::PartialRight);
    fn split_at_mut(&self, mid: usize) -> (&mut Self::Partial, &mut Self::PartialRight);
    fn contains<P: Self::ItemPattern>(&self, pat: P) -> bool;
    fn starts_with<P: Self::PartialPattern>(&self, pat: P) -> bool;
    fn ends_with<P: Self::PartialPatternReverse>(&self, pat: P) -> bool;
    fn find<P: Self::PartialPattern>(&self, pat: P) -> Option<usize>;
    fn rfind<P: Self::PartialPatternReverse>(&self, pat: P) -> Option<usize>;
    fn split<P: Self::PartialPattern>(&self, pat: P) -> Split<P>;
    // ... similar variants of iterating splits and matches
    fn split_once<P: Self::ItemPattern>(&self, pat: P) -> Option<(&Self::Partial, &Self::PartialRight)>;
    fn rsplit_once<P: Self::ItemPatternReverse>(&self, pat: P) -> Option<(&Self::Partial, &Self::PartialRight)>;
    // I don't think we can do simple `trim_{start, end}` here or anything else that
    // relies on whitespace knowledge
    fn trim_start_matches<P: Self::PartialPattern>(&self, pat: P) -> &Self::PartialRight;
    fn trim_end_matches<P: Self::PartialPatternReverse>(&self, pat: P) -> &Self::Partial;
    fn strip_prefix<P: Self::PartialPattern>(&self, pat: P) -> Option<&Self::PartialRight>;
    fn strip_suffix<P: Self::PartialPatternReverse>(&self, pat: P) -> Option<&Self::Partial>;
    fn replace<P: Self::PartialPattern>(&'a self, from: P, to: &Self::PartialRight) -> <Self as ToOwned>::ToOwned;
    fn repeat<P: Self::PartialPattern>(&'a self, from: P, repeat: usize) -> <Self as ToOwned>::ToOwned;
}

There probably isn't anything that restricts this to string-like types, I could see a lot of this being beneficial to let this apply to anything.

@joshtriplett
Copy link
Member

joshtriplett commented Dec 10, 2024

We discussed this one again in today's @rust-lang/libs-api meeting, in light of #499.

We'd like to accept this, using the same API proposal as #499: use the same BytePattern type, make sure that BytePattern is implemented for u8/[u8; N]/&[u8]/Vec<u8>``OsStr/OsString/Path/PathBuf/str/String, and use the same *_bytes methods being proposed there.

Sorry that this has been such a long and storied road getting to this point.

@joshtriplett joshtriplett added the ACP-accepted API Change Proposal is accepted (seconded with no objections) label Dec 10, 2024
@epage
Copy link
Author

epage commented Dec 10, 2024

@joshtriplett from my understanding of #499, we'd have functionality like

trait ByteSlice {
    fn split_bytes(&self, pat: impl BytePattern) -> BytesSplit<'_, P>;
}

trait BytePattern { ... }

impl BytePattern for &[u8] { ... }

// `BytesSplit<'_, P>` would be an `Iterator<&[u8]>`

If that is correct, that misses the key goal of this ACP: use of OsStr without unsafe. Safe manipulation of OsStr requires patterns only be UTF-8 and that the output type is OsStr (we could add checked constructors for OsStr but that is extra work that isn't needed in these cases and has its own design decisions to work out).

Use of a ByteSlice trait would require unsafe to turn the result back into an OsStr and a // SAFETY requirement about what types or values are used for BytePattern.

If we limit BytePattern to 7-bit ASCII or UTF-8 and have an associated type for the return value so OsStr can preserve itself, then we could avoid unsafe.

@Amanieu
Copy link
Member

Amanieu commented Dec 11, 2024

No, what is being proposed is that we have inherent methods on OsStr that return OsStr slices, but that the pattern accepted for matching can be any sequence of bytes (via the BytePattern trait).

@blyxxyz
Copy link

blyxxyz commented Dec 11, 2024

I don't think it's sound to accept arbitrary byte patterns for Windows/WTF-8 OS strings. Consider:

let crab = std::ffi::OsStr::new("🦀");
assert_eq!(crab.as_encoded_bytes(), b"\xF0\x9F\xA6\x80");
let (head, tail) = crab.split_once(b"\x9F").unwrap();

head would now be "\xF0", but that's not valid WTF-8, so it causes UB and it can't be transformed into UTF-16.

  • Accepting OsStr patterns is (I think) the most permissive API that's still safe and sensible.
  • Accepting only str patterns like this ACP proposes is slightly less permissive but much easier to implement while still providing almost all the value.
  • Accepting [u8] patterns would be too permissive. (Unless you reject/ignore problematic patterns at runtime, but that seems worse than restricting the patterns in the type system.)

@ChrisDenton
Copy link
Member

split_once can return None if there's no valid way to split the OsStr.

@epage
Copy link
Author

epage commented Dec 11, 2024

In that case, I assume we'd have slicing logic like #306.

@joshtriplett
Copy link
Member

@epage Thanks for drawing attention to this; I think we missed that. Clarifying the issue here: the set of methods isn't a problem here, it's just a question of what types they accept and return, right? It makes sense that BytePattern may be too general here.

@epage
Copy link
Author

epage commented Dec 11, 2024

Agreed

Some options for BytePattern:

  • Mark the functions as unsafe
  • Require UTF-8 boundaries (cross-platform expectations) and
    • panic if its not
    • fine some sentinel answer
  • Require platform-specific bounadires and the same above ways of responding

We could also go back to the original proposal and use Pattern and skip over the non-UTF8 parts. We could still offer the BytePattern variants and return &[u8].

@joshtriplett
Copy link
Member

I definitely don't think we'll want to mark the functions unsafe. That would lose much of the value of providing them.

Of the various solutions, which path would you recommend, balancing simplicity, usability, and consistency with #499?

@joshtriplett joshtriplett reopened this Dec 11, 2024
@joshtriplett joshtriplett added the I-libs-api-nominated Indicates that an issue has been nominated for discussion during a team meeting. label Dec 11, 2024
@tgross35
Copy link

Could #311 (comment) (or a simplified version of it) work? This would provide methods similar to those from BytePattern, with the exception that the return values are associated types.

@epage
Copy link
Author

epage commented Dec 11, 2024

I would propose making Pattern work, matching names to &str (ie the original proposal).

If you consider BytePattern the base case, &str is specialized for its encoding requirements. We could add the _bytes methods to &str but they would be redundant with the str versions. You can always call as_bytes().

I think OsStr mostly fits in the &str camp. There are special encoding requirements and as_encoded_bytes can always be used. When working with platform-specific details, you could benefit from BytePattern but I assume you'd already be wanting to drop down to &[u8].

In both cases, we could always add _bytes functions later to &str and &[u8]

EDIT: We could design a new interface for OsStr but I don't think that will be worth it.

@joshtriplett
Copy link
Member

@epage Got it! So, use Pattern to restrict the input to UTF-8, and rely on OsStr/OsString being supersets of UTF-8?

@epage
Copy link
Author

epage commented Dec 11, 2024

Yes, to restrict the patterns/needles to UTF-8 while the haystack can be anything. Non-UTF-8 content won't match. There might be some details with that to work out but i figure thats what the unstable period will be for.

@joshtriplett
Copy link
Member

We discussed this in today's @rust-lang/libs-api meeting.

We realized that using Pattern would be a problem because of the FnMut(char) -> bool case.

We considered three alternatives:

Option A: use Pattern and call the function with a Unicode replacement character for non-UTF-8 (using the logic from https://doc.rust-lang.org/std/str/struct.Utf8Error.html for how many bytes to treat as the invalid character)

Option B: AsRef<OsStr>.

Option C: A sealed OsStrPattern trait, which initially has all the variants of Pattern except FnMut(char) -> bool.

We decided on option C for the initial experiment, and we can evaluate whether this was the right choice at stabilization time.

@epage
Copy link
Author

epage commented Dec 18, 2024

We realized that using Pattern would be a problem because of the FnMut(char) -> bool case.

For the record, the FnMut problem is that it requires turning parts of the OsStr to a char and you can do |c| c != '/'. In doing a negative match, we need to do something regarding the non-UTF8 data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ACP-accepted API Change Proposal is accepted (seconded with no objections) api-change-proposal A proposal to add or alter unstable APIs in the standard libraries I-libs-api-nominated Indicates that an issue has been nominated for discussion during a team meeting. T-libs-api
Projects
None yet
Development

No branches or pull requests

9 participants