Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change sentinel in find(first|next|prev|last) to nothing #25472

Merged
merged 2 commits into from
Jan 13, 2018

Conversation

timholy
Copy link
Sponsor Member

@timholy timholy commented Jan 9, 2018

Closes JuliaLang/Juleps#47. One could discuss returning a NotFound value instead.

base/array.jl Outdated
@@ -1547,15 +1546,15 @@ julia> findfirst(A)
2

julia> findfirst(falses(3))
0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe do findfirst(...) == nothing giving true, that will be more explicit? Same below.

base/array.jl Outdated
```
"""
findlast(testf::Function, A) = findprev(testf, A, endof(A))

zero_sentinel(i) = i === nothing ? 0 : i
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, this is just coalesce(i, 0), so probably not worth defining a new function?

Copy link
Sponsor Member Author

@timholy timholy Jan 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some bootstrap issues (can't use coalesce in files in Core), but yes overall this is better.

base/bitset.jl Outdated
@@ -68,14 +68,14 @@ function _bits_findnext(b::Bits, start::Int)
# start is 0-based
# @assert start >= 0
_div64(start) + 1 > length(b) && return -1
unsafe_bitfindnext(b, start+1) - 1
zero_sentinel(unsafe_bitfindnext(b, start+1)) - 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not keep unsafe_bitfindnext returning 0 for no match, since it's internal? Wouldn't that involve less changes?

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's one of each; I decided to change it for greater consistency. But indeed I followed this advice for other internal methods.

throw(BoundsError(s, i))
end
@inbounds isvalid(s, i) || string_index_err(s, i)
c = pred.x
c ≤ '\x7f' && return _search(s, c % UInt8, i)
c ≤ '\x7f' && return nothing_sentinel(_search(s, c % UInt8, i))
Copy link
Member

@nalimilan nalimilan Jan 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really useful to change _search too, given that it's internal?

EDIT: actually, I read it backwards. Shouldn't _search be changed too? It appears to be always used with nothing_sentinel (which could be removed since it's only used in this file AFAICT). BTW, looks like you missed this method:

findnext(t::AbstractString, s::AbstractString, i::Integer) = _search(s, t, i)

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't that one always return a range? _search caused me more headaches than any other here, so I could be misunderstanding.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, right. Though my first remark about changing _search still holds I think.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, but I don't feel like doing it 😄. In truth this is what I tried first, spent half a day just trying to get Julia to build, and then punted and started from scratch. It's possible I was stymied by something very trivial, but at this stage even half a day is not worth it for an internal interface that can be changed later.

I think the problem is that the range-returning version calls the scalar-returning version and I don't know this part of Julia well enough to untangle it without investing more time than I can afford right now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that feeling too, that code is complex.

Let's leave it that way, but please move nothing_sentinel to this file since it isn't used elsewhere.

@@ -1114,7 +1114,7 @@ timesofar("nnz&find")
b1 = trues(v1)
b2 = falses(v1)
for i = 1:v1
@test findprev(b1, i) == findprev(b1, true, i) == findprev(identity, b1, i)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not test findprev(equalto(true), b1, i)?

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same codepath is covered by identity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, given findnext(pred::EqualTo, B::BitArray, start::Integer) now exists.

@@ -125,7 +125,7 @@ function invpermute!!(a, p::AbstractVector{<:Integer})
count = 0
start = 0
while count < length(a)
start = findnext(!iszero, p, start+1)
start = findnext(!iszero, p, start+1)::Int
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we actually sure all AbstractArray implementations will return an Int? You could also try using notnothing(...), which throws when nothing is passed to it, which should allow the compiler to assert the type (need to check that).

BTW, how did you find the places where this assertion can be added?

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anyplace where the code didn't seem to check for a 0 return but then indexed it, I assumed it was guaranteed to succeed.

base/bitarray.jl Outdated
@@ -1540,18 +1541,18 @@ function unsafe_bitfindprev(Bc::Vector{UInt64}, start::Integer)
end
end
end
return 0
return nothing
end

# returns the index of the previous non-zero element, or 0 if all zeros
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"or nothing"

base/bitarray.jl Outdated
@@ -1459,13 +1459,13 @@ function unsafe_bitfindnext(Bc::Vector{UInt64}, start::Integer)
end
end
end
return 0
return nothing
end

# returns the index of the next non-zero element, or 0 if all zeros
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"or nothing"

@timholy
Copy link
Sponsor Member Author

timholy commented Jan 10, 2018

New version.

@andyferris
Copy link
Member

This seems a safer than 0 and is infinitely better for offset arrays! :)

This was said in another forum but I’ll say it one time more - this seems like the kind of condition that can only be dealt with properly when the caller explicitly checks the result. To me this is a textbook case of using something like Nullable or the much discussed Some. Using some wrapped (and then explicitly unrwapped, by the caller) return type would allow the API to extend to other containers like AbstractDicts which allow anything as keys.

Just thoughts - don’t let this slow progress on AbstractArray.

@nalimilan
Copy link
Member

@andyferris Better continue the discussion about Some at JuliaLang/Juleps#47. The problem with that solution (which is indeed more correct in general) is that it would be quite inconvenient when working with arrays, which are a major use case for find*.

@andyferris
Copy link
Member

@nalimilan On one hand, I completely agree, let's just move forward with a simple thing which is correct for arrays (including offset arrays).

The problem with that solution (which is indeed more correct in general) is that it would be quite inconvenient when working with arrays

On the other hand, the underlying thing which bothers me here and in JuliaLang/Juleps#47 is that we discuss this "inconvenience" in a strange way - even for arrays, it is absolutely necessary for the caller to do something to deal with the fact that the index might not be found. Whichever way we disguise it (with sentinel values, union types, wrapping types, etc), the caller really needs to check some boolean condition whether a matching index was found or not. My judgement is that wrapping the return in a Nullable or Some is the best way to make the caller explicitly aware that this simply has to happen (or else an error is thrown). The union approach might mean that a suite of unit tests could pass if the author doesn't remember to cover the general case you might see with unknown, production data where no index is found.

But yes, perhaps this is something better discussed elsewhere! Like I tried to say earlier, I only want to write this for the record on JuliaLang/julia for future reference (say, for v2.0 development), rather than do anything to block this PR (which is clearly better than the status quo, and therefore quite acceptable).

@timholy timholy force-pushed the teh/find_sentinel branch 2 times, most recently from f6d7053 to d195c91 Compare January 10, 2018 13:57
@timholy
Copy link
Sponsor Member Author

timholy commented Jan 10, 2018

OK, barring more objections I think this is ready. Let's first get a performance test:

@nanosoldier runbenchmarks(ALL, vs=":master")

@nalimilan
Copy link
Member

Thanks Tim! BTW, I think you missed my question at #24774 (comment), which is now hidden.

@nanosoldier
Copy link
Collaborator

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`sudo cset shield -e su nanosoldier -- -c ./benchscript.sh`, ProcessExited(1)) [1]

Logs and partial data can be found here
cc @ararslan

@ararslan
Copy link
Member

Nanosoldier is hitting a MethodError related to the new return type from find. Perhaps something that will need to be addressed in BaseBenchmarks before Nanosoldier will work here.

@nalimilan
Copy link
Member

Unfortunately, it's going to be difficult to support both the new and the old API. I guess we'll need to add a Compat.find function returning nothing even on 0.6.

@timholy
Copy link
Sponsor Member Author

timholy commented Jan 11, 2018

I see slight regressions in

julia> @btime perf_findnext($x)    # length-1000 Bool vector
  1.234 μs (0 allocations: 0 bytes)

vs

julia> @btime perf_findnext($x)
  837.929 ns (0 allocations: 0 bytes)

There are a couple of apparent big regressions in what appears to be completely unrelated code (e.g., a ccall inside of resize!), I'm pretty puzzled about that, but it also doesn't seem to be very reproducible.

@timholy
Copy link
Sponsor Member Author

timholy commented Jan 12, 2018

I'm not seeing any complaints about that regression. Any objections to merging? If not, I'll fix the conflicts.

@timholy timholy merged commit 5c3f580 into master Jan 13, 2018
@timholy timholy deleted the teh/find_sentinel branch January 13, 2018 21:17
samoconnor added a commit to JuliaWeb/HTTP.jl that referenced this pull request Jan 17, 2018
@garrison
Copy link
Sponsor Member

garrison commented Jan 20, 2018

I just noticed that indexin still uses 0 to denote missing values. I previously proposed some indexin improvements in #23845, but I am not sure how to handle this. Note that the function, with or without #23845, fails when an array with custom indices is passed as the second argument. A (relatively extensive) list of options I see:

  1. Leave it as is (perhaps eventually incorporating Make indexin first argument accept any iterable #23845). But then there is pretty serious inconsistency in the API, as I know of no other function that still uses 0 to denote missing data.
  2. Use nothing as the sentinel for indexin, which would allow expanding its usage to arrays with custom indices. But then the most common case of returning an array means that each element is boxed IIUC, as the data type of an element is Union{Nothing,Int}.
  3. Error on missing data. This will be most efficient in the use cases where missing data is not anticipated, and will also allow using arrays with custom indices. But it is a break from current behavior.
  4. Some combination of the above two options, based on a keyword argument (or a similar mechanism, if that were to break inference).
  5. Remove indexin from Base altogether. It is not used in Base, and it seems to be used somewhat rarely throughout the package ecosystem (complete list of uses here). Given that there are multiple behaviors one might expect for missing data, perhaps it is better for each individual user to write this few-line function using the error handling that they expect. The (only) downside (that I can think of) is that this removes an interface that might be nice for some data structures, which would be missed by people rolling their own implementations. For instance, UniqueVectors implements this method in a way that skips the step of building the lookup dictionary (since it already has access to such a Dict). But this seems a relatively specialized use case, not a compelling case for keeping this function in Base.

My overall vote (at the moment) is the last option: removing indexin from Base. I think a decent case could also be made for the third or fourth options, but even so I doubt the function actually belongs in Base, rather it should be in a (standard?) library somewhere.

@nalimilan
Copy link
Member

I had noticed that too, and I think we should return nothing instead of 0 for consistency with other functions. Union{Nothing,Int} arrays use an efficient memory representation on 0.7 already, and since we return nothing with all other find* functions, we already assume performance is OK (or can be fixed).

@andreasnoack andreasnoack added the needs news A NEWS entry is required for this change label Feb 15, 2018
@andreasnoack
Copy link
Member

This needs a NEWS entry

@ararslan
Copy link
Member

#26067

@KristofferC KristofferC removed the needs news A NEWS entry is required for this change label Nov 13, 2018
c42f pushed a commit to JuliaWeb/URIs.jl that referenced this pull request Oct 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:search & find The find* family of functions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants