Add `insorted` for `in` function in sorted arrays #37490

remi-garcia · 2020-09-09T17:09:41Z

Closes #37442
The purpose is to add a function that determine whether an item is in the given sorted collection.

EDIT: Usage:

julia> insorted(4, [1, 2, 4, 5, 5, 7]) # single match
true
julia> insorted(5, [1, 2, 4, 5, 5, 7]) # multiple matches
true
julia> insorted(3, [1, 2, 4, 5, 5, 7]) # no match
false
julia> insorted(9, [1, 2, 4, 5, 5, 7]) # no match
false
julia> insorted(0, [1, 2, 4, 5, 5, 7]) # no match
false

Implemented in a similar way as searchsorted. Left a TODO in insorted(a::AbstractRange{<:Real}, x::Real, o::DirectOrdering)::Bool

jw3126 · 2020-09-11T17:31:51Z

Thanks a lot! It would be nice to have such a function! This PR adds a lot of code though, maybe that can be reduced.

Can this maybe share more code with searchsorted? Following @goretkin Feature Request: in sorted array #37442 (comment)

function insorted(x, coll; kw...)
    !isempty(searchsorted(coll, x; kw...))
end

There is are already optimized in methods for ranges. If these are faster then the above for ranges, we could just call them.

base/sort.jl

remi-garcia · 2020-09-13T06:23:44Z

* Can this maybe share more code with `searchsorted`? Following @goretkin [#37442 (comment)](https://github.com/JuliaLang/julia/issues/37442#issuecomment-689248118)

function insorted(x, coll; kw...)
    !isempty(searchsorted(coll, x; kw...))
end

I'll report this possibility on benchmarks. It should be close because intuitively O(insorted)=O(searchsorted) in both cases. Yet I tried to avoid using length(coll) since, if I remember well, it is in O(n).

* There is are already optimized `in` methods for ranges. If these are faster then the above for ranges, we could just call them.

To be honnest I didn't check the source code of in for ranges. But, now that got me thinking, why there is a searchsorted for ranges while find should be as efficient. Ranges are not always sorted?

jw3126 · 2020-09-13T06:41:30Z

I'll report this possibility on benchmarks.

Awesome!

Yet I tried to avoid using length(coll) since, if I remember well, it is in O(n).

It of depends on the collection, but usually either a collection does not implement length at all or length is fast.
If we have slow/no length, then I don't think we can implement a fast insorted anyway. At least not in a generic way, maybe in a specialized way for a collection at hand.
I think the right way is to have a short insorted method as above. And if it turns out to be too slow on some collection, prefer adding faster searchsorted methods over adding insorted methods.

But, now that got me thinking, why there is a searchsorted for ranges while find should be as efficient.

The main reason is that a code that uses searchsorted does not crash, when you feed it a range.

Ranges are not always sorted?

Yes they are. At least by the standard <. They are not sorted e.g. by=sin. Anyway if searchsorted (with standard order) is not superfast on a range, that's a bug.

remi-garcia · 2020-09-14T11:00:47Z

Julia 1.0.5:

using Base.Order

midpoint(lo::T, hi::T) where T<:Integer = lo + ((hi - lo) >>> 0x01)
midpoint(lo::Integer, hi::Integer) = midpoint(promote(lo, hi)...)

function _insortedfirst(v::AbstractVector, x, lo::T, hi::T, o::Ordering)::Bool where T<:Integer
    if lt(o, v[hi], x)
        return false
    end
    u = T(1)
    lo = lo - u
    hi = hi + u
    @inbounds while lo < hi - u
        m = midpoint(lo, hi)
        if lt(o, v[m], x)
            lo = m
        else
            hi = m
        end
    end
    if v[hi] == x
        return true
    end
    return false
end

function _insortedlast(v::AbstractVector, x, lo::T, hi::T, o::Ordering)::Bool where T<:Integer
    if lt(o, x, v[lo])
        return false
    end
    u = T(1)
    lo = lo - u
    hi = hi + u
    @inbounds while lo < hi - u
        m = midpoint(lo, hi)
        if lt(o, x, v[m])
            hi = m
        else
            lo = m
        end
    end
    if v[lo] == x
        return true
    end
    return false
end

function insorted(v::AbstractVector, x, ilo::T, ihi::T, o::Ordering)::Bool where T<:Integer
    if isempty(v)
        return false
    elseif lt(o, v[ihi], x)
        return false
    end
    u = T(1)
    lo = ilo - u
    hi = ihi + u
    @inbounds while lo < hi - u
        m = midpoint(lo, hi)
        if lt(o, v[m], x)
            lo = m
        elseif lt(o, x, v[m])
            hi = m
        else
            return _insortedfirst(v, x, max(lo,ilo), m, o) || _insortedlast(v, x, m, min(hi,ihi), o)
        end
    end
    if hi <= ihi && v[hi] == x
        return true
    end
    return false
end

function other_insorted(v::AbstractVector, x, ilo::T, ihi::T, o::Ordering)::Bool where T<:Integer
    r = searchsorted(v, x)
    return length(r) > 0
end

for s in [:insorted, :other_insorted]
    @eval begin
        $s(v::AbstractVector, x, o::Ordering) = (inds = axes(v, 1); $s(v,x,first(inds),last(inds),o))
        $s(v::AbstractVector, x;
           lt=isless, by=identity, rev::Union{Bool,Nothing}=nothing, order::Ordering=Forward) =
            $s(v,x,ord(lt,by,rev,order))
    end
end

using BenchmarkTools

@benchmark:

julia> @benchmark insorted([1:5; 1:5; 1:5], 1, 6, 10, Forward)
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  1
  --------------
  minimum time:     66.106 ns (0.00% GC)
  median time:      67.544 ns (0.00% GC)
  mean time:        80.966 ns (11.95% GC)
  maximum time:     47.349 μs (99.72% GC)
  --------------
  samples:          10000
  evals/sample:     979

julia> @benchmark other_insorted([1:5; 1:5; 1:5], 1, 6, 10, Forward)
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  1
  --------------
  minimum time:     70.034 ns (0.00% GC)
  median time:      70.964 ns (0.00% GC)
  mean time:        84.005 ns (11.40% GC)
  maximum time:     46.641 μs (99.75% GC)
  --------------
  samples:          10000
  evals/sample:     976

julia> @benchmark insorted([1,2,3], 0)
BenchmarkTools.Trial: 
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     45.492 ns (0.00% GC)
  median time:      46.430 ns (0.00% GC)
  mean time:        58.798 ns (17.14% GC)
  maximum time:     47.850 μs (99.82% GC)
  --------------
  samples:          10000
  evals/sample:     990

julia> @benchmark other_insorted([1,2,3], 0)
BenchmarkTools.Trial: 
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     45.584 ns (0.00% GC)
  median time:      46.561 ns (0.00% GC)
  mean time:        59.128 ns (17.20% GC)
  maximum time:     46.606 μs (99.75% GC)
  --------------
  samples:          10000
  evals/sample:     990

julia> @benchmark insorted([1,2,3], 2)
BenchmarkTools.Trial: 
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     47.190 ns (0.00% GC)
  median time:      48.155 ns (0.00% GC)
  mean time:        62.074 ns (17.19% GC)
  maximum time:     48.128 μs (99.82% GC)
  --------------
  samples:          10000
  evals/sample:     988

julia> @benchmark other_insorted([1,2,3], 2)
BenchmarkTools.Trial: 
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     48.591 ns (0.00% GC)
  median time:      49.597 ns (0.00% GC)
  mean time:        62.978 ns (17.44% GC)
  maximum time:     50.001 μs (99.83% GC)
  --------------
  samples:          10000
  evals/sample:     988

julia> @benchmark insorted([1,2,3], 4)
BenchmarkTools.Trial: 
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     43.004 ns (0.00% GC)
  median time:      43.998 ns (0.00% GC)
  mean time:        57.017 ns (19.12% GC)
  maximum time:     47.788 μs (99.85% GC)
  --------------
  samples:          10000
  evals/sample:     991

julia> @benchmark other_insorted([1,2,3], 4)
BenchmarkTools.Trial: 
  memory estimate:  112 bytes
  allocs estimate:  1
  --------------
  minimum time:     45.670 ns (0.00% GC)
  median time:      46.598 ns (0.00% GC)
  mean time:        59.793 ns (18.05% GC)
  maximum time:     46.788 μs (99.80% GC)
  --------------
  samples:          10000
  evals/sample:     990

Using a dedicated insorted implementation or searchsorted (other_insorted) does not make a big change for AbstractVector. Should I remove the dedicated function?

I checked, for ranges, insorted could be equal to in.

jw3126 · 2020-09-14T11:18:34Z

I think this PR introduces a lot of code, not all of which is really needed. So I would approach it like this. Take a step back and replace all the code by just:

function insorted(x, coll; kw...)
    !isempty(searchsorted(coll, x; kw...))
end

Is it missing features? Is it slow in some cases? Is it buggy?
If the answer is yes to one of these, we should think about how to fix it.

remi-garcia · 2020-09-14T14:19:37Z

Ok, here are some @benchmark with the simplified insorted:

julia> @benchmark insorted(1, coll) setup=(coll=sort(rand(1:100, 50)))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     12.659 ns (0.00% GC)
  median time:      13.394 ns (0.00% GC)
  mean time:        14.758 ns (0.00% GC)
  maximum time:     57.431 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark in(1, coll) setup=(coll=sort(rand(1:100, 50)))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.641 ns (0.00% GC)
  median time:      30.255 ns (0.00% GC)
  mean time:        19.357 ns (0.00% GC)
  maximum time:     68.061 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark insorted(100, coll) setup=(coll=sort(rand(1:100, 50)))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     10.492 ns (0.00% GC)
  median time:      10.973 ns (0.00% GC)
  mean time:        12.726 ns (0.00% GC)
  maximum time:     43.886 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark in(100, coll) setup=(coll=sort(rand(1:100, 50)))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     34.232 ns (0.00% GC)
  median time:      43.665 ns (0.00% GC)
  mean time:        44.449 ns (0.00% GC)
  maximum time:     101.706 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     994

julia> @benchmark insorted(30, coll) setup=(coll=sort(rand(1:100, 50)))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.148 ns (0.00% GC)
  median time:      15.348 ns (0.00% GC)
  mean time:        15.536 ns (0.00% GC)
  maximum time:     37.629 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark in(30, coll) setup=(coll=sort(rand(1:100, 50)))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.726 ns (0.00% GC)
  median time:      30.250 ns (0.00% GC)
  mean time:        22.406 ns (0.00% GC)
  maximum time:     57.277 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark insorted(60, coll) setup=(coll=sort(rand(1:100, 50)))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.010 ns (0.00% GC)
  median time:      14.994 ns (0.00% GC)
  mean time:        15.677 ns (0.00% GC)
  maximum time:     43.850 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark in(60, coll) setup=(coll=sort(rand(1:100, 50)))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     11.078 ns (0.00% GC)
  median time:      44.064 ns (0.00% GC)
  mean time:        39.295 ns (0.00% GC)
  maximum time:     77.879 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

For ranges it is slower:

julia> @benchmark insorted(1, coll) setup=(coll=rand(1:100):rand(1:100))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     9.176 ns (0.00% GC)
  median time:      9.247 ns (0.00% GC)
  mean time:        9.483 ns (0.00% GC)
  maximum time:     24.069 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark in(1, coll) setup=(coll=rand(1:100):rand(1:100))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.208 ns (0.00% GC)
  median time:      2.217 ns (0.00% GC)
  mean time:        2.225 ns (0.00% GC)
  maximum time:     12.624 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark insorted(100, coll) setup=(coll=rand(1:100):rand(1:100))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.752 ns (0.00% GC)
  median time:      10.023 ns (0.00% GC)
  mean time:        10.087 ns (0.00% GC)
  maximum time:     38.033 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark in(100, coll) setup=(coll=rand(1:100):rand(1:100))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.208 ns (0.00% GC)
  median time:      2.219 ns (0.00% GC)
  mean time:        2.236 ns (0.00% GC)
  maximum time:     34.658 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark insorted(50, coll) setup=(coll=rand(1:100):rand(1:100))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     8.745 ns (0.00% GC)
  median time:      9.210 ns (0.00% GC)
  mean time:        9.273 ns (0.00% GC)
  maximum time:     37.059 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     999

julia> @benchmark in(50, coll) setup=(coll=rand(1:100):rand(1:100))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.207 ns (0.00% GC)
  median time:      2.218 ns (0.00% GC)
  mean time:        2.225 ns (0.00% GC)
  maximum time:     30.371 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

I suggest to keep insorted = in for AbstractRange.

jw3126 · 2020-09-14T14:44:09Z

Thanks for taking a look at ranges. I think for arrays insorted will really begin to shine for larger arrays. Say > 100 elements.
insorted doing binary search has more branches then in doing a for loop. And for small array sizes the former cost can matter.
So how about this:

"""
some docs....
"""
function insorted end
insorted(x, coll; kw...) = !isempty(searchsorted(coll, x; kw...))
insorted(x, coll::AbstractRange; kw...) = in(x, coll)

remi-garcia · 2020-09-14T19:04:29Z

I just pushed the last changes that roughly correspond to what you suggested 👍🏼

base/sort.jl

test/sorting.jl

base/sort.jl

jw3126

Thanks! Needs also an entry in News.md.

jw3126

Looks good to me. Thanks a lot @remi-garcia !

remi-garcia · 2020-09-15T07:47:04Z

I just removed tests that called the "old" insorted(x,v,lo,hi,o)` and fixed a typo. Should be working fine now!
Thanks to you @jw3126 for guiding me through this PR! 👍

base/sort.jl

remi-garcia · 2020-09-23T06:11:51Z

I don't get why the buildbot is so "mad" at me. Is that normal or I'm missing something I should fix?

dkarrasch

You can click on the "details" button behind the individual CI tests, and check the stack trace of error messages, just like in the normal REPL. Then you can start fixing bugs as usual. Some more style comments on the tests: for functions that return Bools and sound like "a test" already (insorted), you don't need to compare to true or false, but can simply do @test insorted(...) and @test !insorted(...), resp.

test/sorting.jl

remi-garcia · 2020-09-23T07:05:52Z

Thanks for the correction and suggestions. I just understood that I had to click on view all 17373 lines to see something that I understand in CI tests Details... Whoops.

dkarrasch · 2020-09-24T20:16:48Z

base/operators.jl

@@ -1063,7 +1063,8 @@ splat(f) = args->f(args...)
    in(x)

 Create a function that checks whether its argument is [`in`](@ref) `x`, i.e.
-a function equivalent to `y -> y in x`.
+a function equivalent to `y -> y in x`. See also [`insorted`](@ref) for the use
+in sorted collections.


Suggested change

in sorted collections.

with sorted collections.

?

One last comment, maybe a native speaker could quickly help, and then we merge this.

Co-authored-by: Daniel Karrasch <daniel.karrasch@posteo.de>

JeffBezanson · 2020-09-28T20:28:39Z

Is there any precedent for this function in other systems? If so, I wonder what it has been called before? We should make sure to use a standard name if one exists.

remi-garcia · 2020-09-29T05:36:36Z

I found binary_search or bsearch. I wanted to be consistent with search-sorted but I should have looked into the std before 😕

jw3126 · 2020-09-29T05:48:54Z

I like insorted. I agree, if there is sufficient precedence in other languages, we should adapt out name. binary_search is not a good name for us I think. Since it is neither search (in the sense of searchsorted and friends) nor guaranteed to be "binary" (e.g. it is O(1) for ranges).

kmsquire · 2020-09-29T12:43:40Z

There has been some discussion in the past about creating a separate SortedArray type. This would have the benefit of code which uses in (or ∈) just working, as it does for most other container types, and wouldn't require an additional function.

More generally, a simple Sorted wrapper class could have the same benefits, again without introducing a new function.

xs = sort(ys)  # xs is an array
...
if element in Sorted(xs)
    ...
end

For a more pie-in-the sky idea, it would be nice to have a way of tagging arrays with attributes (like "sorted"):

xs = sort(ys)     # xs is tagged with "sorted"
issorted(xs)      # O(1) if xs is tagged with "sorted"
if element in xs  # O(log N) if xs is tagged with "sorted"
    ...
end

Related: @andyferris's AcceleratedArrays.jl

jw3126 · 2020-09-29T13:39:57Z

@goretkin had basically the same suggestion in the issue fixed by this PR

remi-garcia added 5 commits September 8, 2020 08:59

Implement insorted

5f353ed

Compiling

7589e6f

Add tests

1463dd8

Modify

2fb34a7

Fixed

fa55b06

remi-garcia changed the title ~~Add insorted for in function in sorted arrays~~ Add insorted for in function in sorted arrays Sep 9, 2020

jw3126 reviewed Sep 11, 2020

View reviewed changes

base/sort.jl Outdated Show resolved Hide resolved

insorted(x, coll) and insorted=in for AbsractRange

8777311

insorted = !isempty(searchsorted)

9491a4f