Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

What to do with searchsorted* functions #24883

Closed
nalimilan opened this issue Dec 2, 2017 · 2 comments
Closed

What to do with searchsorted* functions #24883

nalimilan opened this issue Dec 2, 2017 · 2 comments
Labels
search & find The find* family of functions

Comments

@nalimilan
Copy link
Member

As part of #10593, I have tried to find a way to unify searchsorted, searchsortedfirst and searchsortedlast with the rest of find* functions. In the Search & Find Julep, I suggested replacing them with find* methods dispatching on a Sorted/SortedVector wrapper which would indicate that the input vector is sorted (see also JuliaCollections/DataStructures.jl#290 for a similar but more powerful type).

For searchsorted itself, this plan would work quite well. It could be replaced with find(equalto(needle), Sorted(haystack)), which would just call a specialization of find which would return a range instead of a Vector, while still following the generic AbstractArray API.

The situation is more complex for searchsortedfirst and searchsortedlast. I'll detail the problems for the former, knowing that they are equivalent for the latter. searchsortedfirst returns "the first value in a greater than or equal to x, according to the specified order". With the introduction of the Sorted wrapper, the specified order is the one passed when constructing the wrapper, which indicates how the underlying vector is sorted. So far, so good. But what kind of predicate can we use to reflect this order?

  • It would be natural to replace searchsortedfirst(needle, haystack) with findfirst(greaterthan(needle), Sorted(haystack)), but then greaterthan cannot have any meaning on its own: it would depend on the order used by haystack, and "greater" could even mean "lower" if rev=true.
  • We could use findfirst(sortedafter(needle), Sorted(haystack)), which would better reflect that the predicate has no intrinsic meaning. But then we lose the generality of the syntax: it cannot be used on an unsorted vector to find the first value greater than x, which defeats the idea of unifying functions.

Then there are other subtleties, but these are less serious:

  • greaterthan should actually be greaterthanorequalto and sortedafter be sortedatorafter to accurately reflect the behavior of the function.
  • searchsortedfirst has an internal variant which allow passing a start and end index indicating the range to be searched, but findfirst only accepts a start index. This feature is used by the sparse matrix code. It could be replaced with array views, or we could just have a method which is more flexible than the other findfirst methods.
  • searchsortedfirst(x) returns length(x)+1 when no match is found, while findfirst currently returns 0, and may well return nothing in the future (Find Julep: issue with sentinel values Juleps#47). This doesn't sound like an issue, as the caller can easily replace 0 or nothing with length(x)+1 without a significant performance cost if needed.

I would appreciate any help, especially from people who use these functions (I don't). If fixing this in time for 0.7 is too complex, we could unexport these functions (which are needed for sparse matrices), and move them e.g. to SortingAlgorithms.

@nalimilan
Copy link
Member Author

#25133 moves these functions to the SortedSearch stdlib module, so that we can consider improving the API later.

@nalimilan
Copy link
Member Author

Looks like the general opinion is that we should keep these functions as they are.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
search & find The find* family of functions
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant