Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a generic string pattern matching API #528

Merged
merged 10 commits into from
Feb 18, 2015

Conversation

Kimundi
Copy link
Member

@Kimundi Kimundi commented Dec 16, 2014

Summary

Stabilize all string functions working with search patterns around a new
generic API that provides a unified way to define and use those patterns.

Motivation

Right now, string slices define a couple of methods for string
manipulation that work with user provided values that act as
search patterns. For example, split() takes an type implementing CharEq
to split the slice at all codepoints that match that predicate.

Among these methods, the notion of what exactly is being used as a search
pattern varies inconsistently: Many work with the generic CharEq,
which only looks at a single codepoint at a time; and some
work with char or &str directly, sometimes duplicating a method to
provide operations for both.

This RFC proposes a set of traits for defining such patterns, which enable efficient iteration through all matches in a string.

Rendered

impl<'a, 'b> Pattern<'a> for &'b str { /* ... */ }

impl<'a, 'b> Pattern<'a> for &'b [char] { /* ... */ }
impl<'a, F> Pattern<'a> for F where F: FnOnce(char) -> bool { /* ... */ }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would have to be FnMut, not FnOnce, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, yeah.

implementations, as the different results would break the requirement for double ended iterators
to behave like a double ended queues where you just pop elements from both sides.

_However_, all iterators will still implement `DoubleEndedIterator` if the underling
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/underling/underlying/

@aturon
Copy link
Member

aturon commented Dec 17, 2014

I wanted to thank @Kimundi for writing this up -- it's been a long process getting to this stage.

I really like this direction. Not only does it clean up the API (by avoiding _char and _str variants and ad hoc traits), but it makes it extensible -- and I love the idea of hooking in regexes to these methods. It's a bit of a shame that, with HKT, we could simplify the Pattern trait a bit more (to remove the lifetime), but given that we need to stabilize the str API for 1.0, I think this is our best bet.

The only hesitation I have is around performance, but I'd say we can move forward with with a prototype at least, and try to get numbers early in the process to ensure that there's little, if any, regression.

@erickt
Copy link

erickt commented Dec 17, 2014

I like the direction this is going, and I think the pattern of putting something we think we might deprecate in the future in an opt-in trait is a nice way to give us expandability in the future. My first reaction to this is if it'd be possible to broaden this to also support &[T] types, or at least &[u8]. I've directly wanted to do this kind of thing on &[u8] when implementing some network protocols, and I can imagine various scientific applications that may want to use regexes to search large datasets as well.

Also, to bikeshed a tiny bit, I'd prefer Pattern::Matcher instead of Pattern::MatcherImpl.

@chris-morgan
Copy link
Member

Also bear in mind that associated types are unusable at present; notably, trait constraints don’t work.

@aturon
Copy link
Member

aturon commented Dec 17, 2014

@chris-morgan

Also bear in mind that associated types are unusable at present; notably, trait constraints don’t work.

That... is an excellent point. While std is relying on associated types working to some degree, it is not currently planned for alpha to ship with constraints on types in trait definitions, though the beta should.

@Kimundi, what do you think of the following migration plan. We stabilize the full set of methods on str roughly as they are today, but leave _char and _str variants unstable. Then we roll out this API when associated types are ready -- it should be a nonbreaking change. Since on alpha you'll still be able to use unstable features, there's no loss in functionality.

That would also relieve some of the pressure to get this done in the alpha timeframe; we have plenty else on our plate.

cc @alexcrichton @nikomatsakis

@aturon
Copy link
Member

aturon commented Dec 17, 2014

(Note, this plan banks on getting bounds done by beta1.)

@aturon
Copy link
Member

aturon commented Dec 17, 2014

In more detail, I think we could do something like the following, where the #[stable] methods should be able to be switched to use the proposed RFC backwards-compatibly, and the #[unstable] methods can be removed at that point:

#[stable]
fn contains(&self, needle: &str) -> bool;

#[unstable]
fn contains_char(&self, needle: char) -> bool;

#[stable]
fn split<Sep: CharEq>(&'a self, sep: Sep) -> CharSplits<'a, Sep>;
#[stable]
fn splitn<Sep: CharEq>(&'a self, count: uint, sep: Sep) -> CharSplitsN<'a, Sep>;
#[stable]
fn rsplitn<Sep: CharEq>(&'a self, count: uint, sep: Sep) -> CharSplitsN<'a, Sep>;
#[stable]
fn split_terminator<Sep: CharEq>(&'a self, sep: Sep) -> CharSplits<'a, Sep>;

#[stable]
fn match_indices(&'a self, sep: &'a str) -> MatchIndices<'a>;

#[unstable]
fn split_str(&'a self, &'a str) -> StrSplits<'a>;

#[stable]
fn starts_with(&self, needle: &str) -> bool;
#[stable]
fn ends_with(&self, needle: &str) -> bool;

#[unstable]
fn trim_chars<C: CharEq>(&'a self, to_trim: C) -> &'a str;
#[unstable]
fn trim_left_chars<C: CharEq>(&'a self, to_trim: C) -> &'a str;
#[unstable]
fn trim_right_chars<C: CharEq>(&'a self, to_trim: C) -> &'a str;

#[stable]
fn find<C: CharEq>(&self, search: C) -> Option<uint>;
#[stable]
fn rfind<C: CharEq>(&self, search: C) -> Option<uint>;
#[unstable]
fn find_str(&self, &str) -> Option<uint>;

#[stable]
fn replace(&self, from: &str, to: &str) -> String { ... }

@Kimundi
Copy link
Member Author

Kimundi commented Dec 17, 2014

@erickt: I haven't really thought about generalizing it beyond string slices, but the exact same system should work there as well. Maybe there is a way to abstract the trait families here over the actual slice type, which would also enable using them for the proposed os_str types... I'll have to think about it.

I mainly used MatcherImpl to have a different name for sake of discussion, and to avoid weird doc issues around the fact that the associated Matcher type implements the Matcher trait.

@chris-morgan, @aturon: Trait constraints on associated types are not actually necessary for this proposal to work. They are nice because the allow early type error detection, but the condition they check can also happen later at the concrete Iterator impls:

trait Pattern {
    type MatcherImpl;
}

impl<P> Iterator<...> for Split where P::MatcherImpl: Matcher { ... }

(Or does that constrain in where clauses also not work?)

@aturon: If we temporary stabilize on a subset of the design, then I'd try to do it the way I layed out as a primary alternative: Change the StrExt methods as proposed here, but make them bound on CharEq instead of Pattern, while keeping the _str variants around as unstable.

@Kimundi
Copy link
Member Author

Kimundi commented Dec 17, 2014

@erickt: Maybe something like this could exist later on as a general mechanism for pattern searches on slices? (With HKT and trait bounds working):

trait Pattern {
    type Slice<'a>;
    type Matcher<'a> where Matcher: Matcher;
    fn into_matcher<'a>(self, slice: Slice<'a>) -> Matcher<'a>;
}

unsafe trait Matcher for Self<'a> {
    fn next_match(&mut self) -> Option<(uint, uint)>
}

This could live in std::pattern or so, with the here-proposed &str version turning into sub trait where Slice<'a> = &'a str.

@arthurprs
Copy link

looks great, but I share the same concern for performance as aturon

@aturon
Copy link
Member

aturon commented Dec 17, 2014

@Kimundi

@erickt: I haven't really thought about generalizing it beyond string slices, but the exact same system should work there as well. Maybe there is a way to abstract the trait families here over the actual slice type, which would also enable using them for the proposed os_str types... I'll have to think about it.

We should definitely give this some careful thought prior to stabilizing these APIs, because it would be great to be usable across many string/slice types.

At the very least, it seems like the Matcher trait could have an associated haystack type, rather than tying to &'a str; the Pattern trait could then use this for its haystack type as well. That might be enough -- there's nothing else that seems to tie these traits to strings.

@Kimundi
Copy link
Member Author

Kimundi commented Dec 18, 2014

Yeah, that might actually be enough

@Kimundi
Copy link
Member Author

Kimundi commented Dec 19, 2014

@aturon: I think I now agree with your plan to stabilize the methods as-is where a later upgrade path is possible. No need to cause unnecessary churn by renaming back and forth if we want to support them anyway in the long term.

@Kimundi
Copy link
Member Author

Kimundi commented Dec 19, 2014

Btw, I think we might want to change match_indices to be a iterator over (uint, &str) instead of the current (uint, uint). Its more consistent in regard to chars, char_indices and the proposed matches, and more useful because you can always cheaply recover the end index by doing start+match.len(), while recovering the matched slice requires an slice() call that will either involve unsafe code or unnecessary bounds checking.

bors added a commit to rust-lang/rust that referenced this pull request Dec 29, 2014
This stabilizes most methods on `&str` working with patterns in a way that is forwards-compatible with a generic string pattern matching API:
- Methods that are using the primary name for their operation are marked as `#[stable]`, as they can be upgraded to a full `Pattern` API later without existing code breaking. Example: `contains(&str)`
- Methods that are using a more specific name in order to not clash with the primary one are marked as `#[unstable]`, as they will likely be removed once their functionality is merged into the primary one. Example: `contains_char<C: CharEq>(C)`
- The method docs got changed to consistently refer to the pattern types as a pattern.
- Methods whose names do not match in the context of the more generic API got renamed. Example: `trim_chars -> trim_matches` 

Additionally, all methods returning iterators got changed to return unique new types with changed names in accordance with the new naming guidelines.

See also rust-lang/rfcs#528

Due to some deprecations and type changes, this is a 

[breaking-change]
@Kimundi
Copy link
Member Author

Kimundi commented Feb 4, 2015

I extended the Matcher traits with methods for finding the next rejected match, which allowed the trim functions to use the previous early-out logic. The results look promising, at worst there is now only a 2x slowdown for the cases of splitting on an ascii symbol in string of only 4-byte codepoints, which might be attributable to the naive implementation of the char matcher.

https://docs.google.com/spreadsheets/d/181oKA1H5oNCR4NuVSUSB6QVxIi36Xq52cOPK5YL7510/edit?usp=sharing

@Kimundi
Copy link
Member Author

Kimundi commented Feb 16, 2015

I managed to get the previous worst case of 2x slowdown down to 1.2 or so locally, but in the combined bencher run its still at 2x:

https://docs.google.com/spreadsheets/d/1q02kDVLc6YL3j73Ig-93IY1DyYtCjODp-9Ls8ztVZ_s/edit?usp=sharing

Looking at the coretest IR, it seems that llvm does not inline some of the generic instantiations at every use side, which means it misses some test-specific optimizations.

@aturon
Copy link
Member

aturon commented Feb 16, 2015

@Kimundi would you mind updating the RFC with the changes to Matcher?

@aturon
Copy link
Member

aturon commented Feb 16, 2015

My only other concern is starts_with and ends_with, which presumably could be much faster with a match_at API or something like that.

I know when I mentioned that before, you worried that we were growing the trait to encompass basically all of the methods provided by str, but:

  • That's not quite true, because the methods offer some additional functionality like providing iterators on top of searching;
  • Even if it does amount to roughly the same functionality, the main motivation (as I see it) isn't so much code reuse as more principled overloading.

@aturon
Copy link
Member

aturon commented Feb 18, 2015

This has been a long time in the making! Thanks for your persistence, @Kimundi

This RFC will go a long way toward making the str API more consistent, ergonomic and flexible, allowing for a wide (and open-ended) variety of patterns to be used with the same set of methods. The prototype performance has gotten within spitting distance of the existing API, and there do not appear to be any fundamental performance drawbacks to the approach. We will likely keep the Searcher trait unstable for longer than the rest of the API, however, to allow continued experimentation and tweaking.

Tracking issue

bors added a commit to rust-lang/rust that referenced this pull request Feb 22, 2015
This is not a complete implementation of the RFC:

- only existing methods got updated, no new ones added
- doc comments are not extensive enough yet
- optimizations got lost and need to be reimplemented

See rust-lang/rfcs#528

Technically a

[breaking-change]
@krdln
Copy link
Contributor

krdln commented Apr 23, 2015

Since the API is still unstabe, I was wondering if we could remove lifetime parameter from the Pattern by renaming current Pattern<'a> to eg. BoundedPattern<'a> and providing new extension trait Pattern: for<'a> BoundedPattern<'a>. From what I've tested, it is possible. Pattern implementors would implement BoundedPattern, but users would just use Pattern. This would simplify a lot of function signatures in std and will make several usecases simpler: currently you sometimes need to make a bound P: for<'s>Pattern<'s>, which is quite complicated syntax for just using one pattern for multiple strings.

@Kimundi
Copy link
Member Author

Kimundi commented Apr 23, 2015

@krdln Wow, didn't realize that this is possible. This sounds very doable indeed.

@aturon
Copy link
Member

aturon commented Apr 23, 2015

@krdln This is a fantastic idea! The 'a was originally working around the lack of HKT, but it didn't occur to me that HRTBs are likely expressive enough already. @Kimundi, would you be willing to prototype this?

@Kimundi
Copy link
Member Author

Kimundi commented Apr 23, 2015

I just tried it, an unfortunately it seems that the HRTB approach does not play well with associated types:

impl<'a, P: Pattern> Clone for $t<'a, P>
    where P::Searcher: Clone
{
    fn clone(&self) -> Self {
        let $s = self;
        $e
    }
}
/media/linux_data/dev/rust/fork2/rust/src/libcore/str/mod.rs:412:19: 412:30 error: cannot extract an associated type from a higher-ranked trait bound in this context [E0212]
/media/linux_data/dev/rust/fork2/rust/src/libcore/str/mod.rs:412             where P::Searcher: Clone
                                                                                   ^~~~~~~~~~~

I get ~54 errors, all because of attempting to use

::Searcher. This includes str methods like rsplit(), so just changing all offending lines to BoundedPattern<'a> would subvert the change.

I could just change all those cases to use BoundedPattern<'a> directly, but that

@krdln
Copy link
Contributor

krdln commented Apr 23, 2015

@Kimundi Does changing the bound to the following work? It should help.

where <P as BoundedPattern<'a>>::Searcher: Clone

By "from what I've tested" I meant that this black magic compiles:

    use std::str::pattern::Pattern;
    trait Pat: for<'a>Pattern<'a> {
        fn into_searcher<'a>(self, haystack: &'a str) -> <Self as Pattern<'a>>::Searcher {
            <Self as Pattern<'a>>::into_searcher(self, haystack)
        }
    }
    impl<P: for<'a>Pattern<'a>> Pat for P {}

And I concluded that rest of usecases are simpler and will work. (btw, by this example I don't want to say if it's a good idea to copy BoundedPattern's method into Pattern)

@Kimundi
Copy link
Member Author

Kimundi commented Apr 24, 2015

@krdln: The issue is that apparently most function signatures that would become easier by using P::Pattern instead become even more verbose by requiring <P as BoundedPattern<'a>>::Pattern.

Like, rsplit() and everything similar seem to require it.

bors added a commit to rust-lang/rust that referenced this pull request Jan 13, 2016
It appears this was left out of RFC rust-lang/rfcs#528 because it might be useful to
also generalize the second argument in some way.  That doesn't seem to
prevent generalizing the first argument now, however.

This is a [breaking-change] because it could cause type-inference to
fail where it previously succeeded.

Also update docs for a few other methods that still referred to `&str` instead of patterns.
@Centril Centril added the A-needle Needle API related proposals & ideas label Nov 23, 2018
@RustyYato
Copy link

RustyYato commented Apr 26, 2020

Edit: there used to be a question here, but it has since been deleted. I am leaving this amswer up because someone linked to it on users.rust-lang.org

You can ask questions like these on users.rust-lang.org or reddit on ghe learnrust subreddit.

You can do this already using the splitn iterator.

let mut split2 = name.splitn(',', 2);
if let (Some(last), Some(first)) = (split2.next(), split2.next()) {
    ...
} else ...

If you have further questions please go to one of the forums. I'm active on users and would be happy to help there.

spikespaz pushed a commit to spikespaz/dotwalk-rs that referenced this pull request Aug 29, 2024
This is not a complete implementation of the RFC:

- only existing methods got updated, no new ones added
- doc comments are not extensive enough yet
- optimizations got lost and need to be reimplemented

See rust-lang/rfcs#528

Technically a

[breaking-change]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-needle Needle API related proposals & ideas
Projects
None yet
Development

Successfully merging this pull request may close these issues.