Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate contains to occursin, deprecate callable regexes #26283

Merged
merged 2 commits into from
Mar 20, 2018

Conversation

ararslan
Copy link
Member

@ararslan ararslan commented Mar 1, 2018

Ref #26211 (comment). There are two distinct commits here:

  1. Add an isfound function that's effectively findfirst(needle, haystack) !== nothing. Edit: this is now a generalized replacement for contains.
  2. Deprecate callable Regex objects in favor of isfound(::Regex, s)

@ararslan ararslan added deprecation This change introduces or involves a deprecation search & find The find* family of functions labels Mar 1, 2018
@ararslan ararslan requested a review from StefanKarpinski March 1, 2018 21:14
@mbauman
Copy link
Member

mbauman commented Mar 1, 2018

Maybe also change the deprecation message for ismatch to use isfound?

@mbauman mbauman added the needs news A NEWS entry is required for this change label Mar 1, 2018
StefanKarpinski
StefanKarpinski previously approved these changes Mar 1, 2018
@ararslan
Copy link
Member Author

ararslan commented Mar 1, 2018

I also haven't yet done the complete rename of contains to isfound as suggested in Stefan's comment.

isfound(needle, haystack)

Determine whether `needle` is found within the collection `haystack`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"is found" isn't very specific. I'd copy the description given for findfirst, and point to it. I think it makes sense to define officially isfound in terms of findfirst, even if more efficient implementations can be used under the hood.

```
"""
isfound(needle, haystack) = findfirst(needle, haystack) !== nothing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would put this in array.jl near findfirst.

@ararslan
Copy link
Member Author

ararslan commented Mar 2, 2018

There's an issue here during bootstrap, where joinpath fails with an UndefVarError from isfound, but oddly enough it happens at the first include(joinpath(...)) in base/grisu/grisu.jl, which is way down the base/sysimg.jl inclusion list. 🤔

@ararslan ararslan removed the needs news A NEWS entry is required for this change label Mar 3, 2018
@JeffBezanson
Copy link
Member

isfound isn't exported yet; maybe the error is in a different module?

@ararslan ararslan force-pushed the aa/isfound branch 4 times, most recently from cb1feeb to ea55930 Compare March 3, 2018 07:50
@ararslan
Copy link
Member Author

ararslan commented Mar 3, 2018

The macOS failure is the spawn thing and the 32-bit Windows failure is a timeout.

@mbauman mbauman changed the title Add isfound and deprecate callable regexes Add isfound and deprecate contains and callable regexes Mar 5, 2018
@mbauman
Copy link
Member

mbauman commented Mar 5, 2018

I've edited your first post and the title to make it a bit more clear what's happening here. Deprecating contains is now the biggest change here. I'm in support, but for posterity, here's the major reason why: #26211 (comment)

It's occurred to me that contains is kind of problematic. We currently only officially define it for strings. This implies that it does contiguous subsequence matching. However, we don't define it for arrays and their subsequences. Moreover, the word "contains" suggests that contains(Set([1,3,2,4]), Set([2,3])) should be true since the first set contains the second set. So there's an ambiguity baked into the very term "contains": does it mean subset containment or subsequence containment?

@mbauman mbauman dismissed StefanKarpinski’s stale review March 5, 2018 16:16

Now also includes deprecation of contains

@ararslan
Copy link
Member Author

ararslan commented Mar 5, 2018

Good call, thanks Matt. 🙂

@ararslan ararslan added this to the 1.0 milestone Mar 6, 2018
@ararslan ararslan requested a review from mbauman March 6, 2018 21:13
@JeffBezanson
Copy link
Member

👍

But, there might be something to be said for contains(haystack::AbstractString, needle::Union{Char, AbstractString}), since that's such a common operation and exists in several other languages. I would consider adding just that method for convenience. Does not block this PR of course.

base/array.jl Outdated

# Examples
```jldoctest
julia> isfound('x', "hexonxonx")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This currently leads to a deprecation warning and an error. I think it would make sense to add a method isfound(::Char, ::AbstractString) to handle this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Though for Char containment in String should we just encourage using in? Having isfound(scalar, collection) work when its documented fallback, findfirst(scalar, collection) !== nothing, does not work is a little weird.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe the only reason findfirst doesn't support that is due to a deprecation, but that case would not have been useful (looking for a string in a char; yes you read that right). So we could add the method to findfirst as well, either now or after 0.7 if we want to be super conservative.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC I didn't add findfirst(::Char, ::String) just because it wasn't strictly necessary (i.e. the standard pattern for general collections based on equalto works well), and because can always be added later. At least that's what I noted in the description of #24673. OTOH the findfirst(::String ::String) method is needed because there is no generic API to looking for a subsequence in general collections at the moment.

It could make sense to add a special case for findfirst(::Char, ::String) for convenience, though, now that the basic API has stabilized.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so action items here:

  • Add findfirst(::Char, ::AbstractString), which facilitates a corresponding isfound method
  • Add back contains(::AbstractString, ::Union{Char,AbstractString})

Currently all methods for contains deal with strings and regular expressions. So effectively rather than replacing contains with isfound, which this PR currently does, we'll add isfound and replace the Regex methods for contains with methods for isfound. That is, deprecate contains(::AbstractString, ::Regex) in favor of isfound and otherwise leave contains alone.

Is that right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could possibly use in(::String, ::String) for this. I think a special case there would be just as valid as having a special contains method. Big plus side: that'd use the same argument ordering as isfound and match, whereas contains would be backwards (cf. #25584, which this PR fixes).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather not do that; in is exceptionally clean and consistent in having all of its methods do the same thing (checking for element containment).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so updated action items:

  • Add findfirst(::Char, ::AbstractString)
  • Revert deprecation of contains

So the PR would effectively be what it originally was, just adding isfound and deprecating callable regexes. Is that right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant to go ahead with removing contains as agreed, and separately consider whether we want contains(string, string) as a convenience function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so just add findfirst(::Char, ::AbstractString) and leave the rest of the PR as-is.

NEWS.md Outdated
@@ -1070,6 +1070,10 @@ Deprecated or removed
* The `remove_destination` keyword argument to `cp`, `mv`, and the unexported `cptree`
has been renamed to `force` ([#25979]).

* `contains` has been deprecated in favor of a more general `isfound` function ([#26283]).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd also mention the arguments, since their order is reversed.

base/regex.jl Outdated
@@ -141,20 +141,18 @@ function getindex(m::RegexMatch, name::Symbol)
end
getindex(m::RegexMatch, name::AbstractString) = m[Symbol(name)]

function contains(s::AbstractString, r::Regex, offset::Integer=0)
function isfound(r::Regex, s::AbstractString; offset::Integer=0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't offset be a positional argument just like the index passed to findnext? Then there could be a generic definition isfound(needle, haystack, i) = findnext(needle, haystack, i) !== nothing.

Another possibility would be to have keyword arguments like start and stop, which would allow specifying the ending index without passing the starting index when needed.

base/path.jl Outdated
contains(a[end:end], path_separator_re) ? string(C,a,b) :
string(C,a,pathsep(a,b),b)
isempty(a) ? string(C,b) :
isfound(path_separator_re, a[end:end]) ? string(C,a,b) :
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really related, but a[end:end] is a one-character string, right? So why call isfound rather than ==? Because of the regex?

@@ -101,6 +101,8 @@ julia> findfirst("Julia", "JuliaLang")
findfirst(pattern::AbstractString, string::AbstractString) =
findnext(pattern, string, firstindex(string))

findfirst(c::AbstractChar, s::AbstractString) = findfirst(equalto(c), s)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While you're at it, better add a findnext method, and change the deprecations to use the new methods where appropriate.

@@ -419,7 +419,7 @@ function afterusing(string::String, startpos::Int)
r = findfirst(r"\s(gnisu|tropmi)\b", rstr)
r === nothing && return false
fr = reverseind(str, last(r))
return contains(str[fr:end], r"^\b(using|import)\s*((\w+[.])*\w+\s*,\s*)*$")
return isfound(r"^\b(using|import)\s*((\w+[.])*\w+\s*,\s*)*$", str[fr:end])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could use offset here.

@@ -65,6 +65,7 @@ end
for str in (u8str, GenericString(u8str))
@test_throws BoundsError findnext(equalto('z'), str, 0)
@test_throws BoundsError findnext(equalto('∀'), str, 0)
@test findfirst('z', str) == nothing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also test in a case that works, just in case? An easy way would be to loop over the predicate, using identity and equalto.

@StefanKarpinski
Copy link
Member

That's a good point: find for looking for items, search for looking for contiguous subsequences, match for looking for patterns that return special objects.

@nalimilan
Copy link
Member

We've just got rid of the confusing search vs. find distinction with great effort, and we would reintroduce search with the non-obvious meaning "look for a contiguous subsequence"? I'd much rather use a term whose definition indicates at least a bit that it works with sequences. The only difference between "search" and "find" appears to be that the former refers to the process and the latter to the result -- there's nothing about subsequences in that distinction.

Anyway there's no hurry to add this to 1.0, right?

@ararslan
Copy link
Member Author

The only thing I really want out of this in 1.0 is something like ismatch for regular expressions. The rest I'm pretty indifferent toward in terms of timeline, but it makes sense to get things ironed out all at once.

@JeffBezanson
Copy link
Member

Yes, the highest priority is to rename contains and ismatch. We can worry about the rest later. Here are all the proposed names:

within, occursin, hasmatch, anymatch, matchany, issubstring

@fredrikekre
Copy link
Member

I like within

@ararslan
Copy link
Member Author

I'm not quite sold on within. We'd have in for a single element and within for some subset?

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Mar 14, 2018

find is currently overloaded with two meanings: look for an item and look for a contiguous subsequence. How do you write the method of findnext which looks for the sequence [2, 3] in the array [1, [2, 3], 2, 3, [4, 5]])? Because that's the analogue of what the regex methods of findnext do but not analogous to looking for individual elements. There is a slight linguistic distinction between finding an element, searching for a substring, and matching a pattern.

@JeffBezanson
Copy link
Member

Yes; "string search" is the classical name for the problem of looking for a substring.

@nalimilan
Copy link
Member

find is currently overloaded with two meanings: look for an item and look for a contiguous subsequence.

To be fair, that's only the case for string and regex needles (because there's no ambiguity).

How do you write the method of findnext which looks for the sequence [2, 3] in the array [1, [2, 3], 2, 3, [4, 5]])? Because that's the analogue of what the regex methods of findnext do but not analogous to looking for individual elements.

Yes, currently there's no way to do that. The Search & Find Julep evoked several possible solutions. Proposals 1 and 2 included searchseq and findseq, but they don't fit in the general API which has been implemented. Proposal 3 (the one by Jeff, which has been implemented in large parts) mentioned the possibility of simply interpreting non-predicate needles as subsequences (since they are currently an error):

Predicates are always used instead of separate functions for different kinds of searches when possible. This potentially allows using the same function for sequence searching, since a subsequence to look for is unlikely to be confused with a predicate.

I'm not saying that would necessarily be a good idea. I'd also like something like findfirst(subseq(needle), haystack), another series of functions (with a better name than search), or maybe even a subsequence::Bool keyword argument.

Anyway I think for 1.0 it would be enough to find a good replacement for contains.

@JeffBezanson
Copy link
Member

Triage sees some more votes for occursin (which is now available).

@StefanKarpinski StefanKarpinski removed the triage This should be discussed on a triage call label Mar 15, 2018
@ararslan
Copy link
Member Author

Game plan from triage:

  • Rename contains(a, b) to occursin(b, a)
  • Deprecate callable regular expressions

Will update the PR for this.

@ararslan ararslan changed the title Add isfound and deprecate contains and callable regexes Deprecate contains to occursin, deprecate callable regexes Mar 15, 2018
@ararslan
Copy link
Member Author

Windows failures are timeouts.

@ararslan ararslan requested a review from JeffBezanson March 16, 2018 22:12
NEWS.md Outdated
* `contains` has been deprecated in favor of a more general `occursin` function, which
takes its arguments in reverse order from `contains` ([#26283]).

* `Regex` objects are no longer callable. Use `isfound` instead ([#26283]).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

occursin !

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HAH whoops. 🙃

base/array.jl Outdated
false
```
"""
occursin(needle, haystack) = findfirst(needle, haystack) !== nothing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be removed. The change should just deprecate contains to occursin.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then the findfirst commit is irrelevant, so I'll drop that

Verified

This commit was signed with the committer’s verified signature.
ssbarnea Sorin Sbarnea
occursin takes its arguments in reverse order from contains.
@JeffBezanson JeffBezanson merged commit 33fdd32 into master Mar 20, 2018
@JeffBezanson JeffBezanson deleted the aa/isfound branch March 20, 2018 02:32
Keno pushed a commit that referenced this pull request Jun 5, 2024
Deprecate contains to occursin, deprecate callable regexes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
deprecation This change introduces or involves a deprecation search & find The find* family of functions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants