Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ordered query API #1195

Closed
wants to merge 5 commits into from
Closed

ordered query API #1195

wants to merge 5 commits into from

Conversation

Gankra
Copy link
Contributor

@Gankra Gankra commented Jul 9, 2015

rendered

Add the following to BTreeMap

  • min
  • max
  • get_le
  • get_lt
  • get_ge
  • get_gt
  • min_mut
  • max_mut
  • get_le_mut
  • get_lt_mut
  • get_ge_mut
  • get_gt_mut

and to BTreeSet:

  • min
  • max
  • get_le
  • get_lt
  • get_ge
  • get_gt

@Gankra Gankra added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Jul 9, 2015
@Gankra Gankra self-assigned this Jul 9, 2015
@Gankra
Copy link
Contributor Author

Gankra commented Jul 9, 2015

CC @bluss @aturon @apasel422

@apasel422
Copy link
Contributor

Looks good to me.

I assume you went with {first, last} instead of {min, max} to be consistent with the sequence types?

Should there also be remove_{first, last, pred_inc, pred_exc, succ_inc, succ_exc}?

Along those lines, we could also provide something like

fn last_entry(&mut self) -> Option<OccupiedEntry<K, V>>;

fn pred_inc_entry<Q: ?Sized>(&mut self, key: &Q) -> Option<OccupiedEntry<K, V>>
    where K: Borrow<Q>, Q: Ord;

...

I've been experimenting with this in my BST library. It's a niche use-case, but allows code to inspect the key and value before deciding whether to remove it.


where `pred(Unbounded)` is max, and `succ(Unbounded)` in min by assuming you're getting the
predecessor and successor of positive and negative infinity. This RFC does not propose this
API because it is crazy-pants and would make our users cry.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a serious alternative.

Bound and .range() have existed for a while, are they not something we want to keep? Can I drag the alternative of using range syntax into this? (Bounded2 = Inclusive | Exclusive) so std::ops::Range<Bounded2> etc could be an alternative.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to just shuffle the combinatorics around and make the calling convention more awkard, as far as I can tell. No?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's inconsistent if we want to keep using Bound as it is (or even changed) in some places (.range()), and then have these methods not use it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I regard Bound as a necessary evil for range because the combinatorics there seem truly catastrophic (18 iterator methods). That said I've never been super happy with the range design. Someone once suggested a builder pattern to me like:

// unbounded RHS
.range().from(x).into_iter()
// bounded RHS
.range().from(x).to(y).into_iter()
...etc

Might be worth considering that more seriously.

@Gankra
Copy link
Contributor Author

Gankra commented Jul 9, 2015

@apasel422 Also if your type is actually ordered max..min; min and max is super confusing IMO.

I actually intended to add remove stuff this morning, but obviously totally forgot!

I had also concluded that an Entry API was silly since VacantEntry is nonsensical, but I suppose Option<OccupiedEntry> actually makes sense. If we have Entry do we also want remove?

@apasel422
Copy link
Contributor

@gankro Presumably a dedicated removal method can be (slightly) more efficient than removing through the entry API, due to less bookkeeping. I hate to increase the combinatoric problem even more, but since the map types already have OccupiedEntry::remove and Map::remove, I'm inclined to add both.

@Gankra
Copy link
Contributor Author

Gankra commented Jul 9, 2015

It's not clear to me that Only OccupiedEntry would have overhead. Constructing an OccupiedEntry is literally running remove and then copying some local variables into a struct instead of finishing the job. Creating an Entry has extra potential overhead because you need to be ready to make a VacantEntry.

@apasel422
Copy link
Contributor

It might be fine to just have the *_entry methods for now then. We could always add remove_* later.

* succ_exc
* first
* last

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these names have precedence from other libraries? They seem a bit too succinct to me (although a big plus one to the actual functionality, I've wanted this).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java: higher/lower/ceil/floor
C++: lower_bound/upper_bound (these names are terrible and I explicitly killed them in collections reform)
Everything else I looked at: chaos or doesn't seem to have this precise collection/functionality.

I briefly pondered before/after and next/prev before letting my theory background take over and demand predecessor/successor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another potential naming scheme could involve {lt, le, ge, gt}, optionally with a prefix or suffix if we're concerned about conflicting with PartialOrd's methods.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some ideas:
before, after, before_eq, after_eq
find_{lt, le, ge, gt}
get_{lt, le, ge, gt}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incidentally the lack of genericity over mutability is killing me. Don't how I'd do it, but there's so much repetition in API's these days because of it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh damn right I wanted to avoid that auuuugh.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is {next, next_or_eq, prev, prev_or_eq, first, last}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do really like that lt/leq/etc is an established naming convention that people can bring into understanding.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leq or le? The former might be easier to grok, but the latter is consistent with PartialOrd and has the minor benefit of having the same number of characters as {lt, gt}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh whoops, I thought that PartialOrd used leq.

@benaryorg
Copy link
Contributor

Honestly, I would prefer using the builder suggestion from above.

@Gankra
Copy link
Contributor Author

Gankra commented Jul 17, 2015

@benaryorg These are orthogal API discussions. One is for doing direct queries, one is for iterating ranges. While one can be implemented in terms of the other, this is not necessarily efficient or desirable.

@benaryorg
Copy link
Contributor

@gankro So you are planning to build two APIs, which one of them might (please) use a builder pattern and the other being cursor-like?

Sorry if I do not quite get the idea behind the second API.

@Gankra
Copy link
Contributor Author

Gankra commented Jul 17, 2015

This RFC is proposing an API just for answering queries of the form "who is the predecessor/successor/minimum/etc". All it does is return Option<(&K, &V)>. In principle this can be implemented as efficiently as possible.

The range API that was being discussed above would produce an Iterator<Item = (&K, &V)>. In principle, this can be as effecient as the query API, but in general it can't (more state needs to be maintained to be able to go to the next/prev).

Cursors are Yet Another thing that are not currently being proposed here, and that the standard library does not currently have a notion of. Cursors and iterators -- particularly &mut ones -- must be implemented as separate types because they have different semantics. Iterators say you can always call next, which enables you to get multiple mutable references into the same container safely. However Cursors explicitly must forbid this to be able to soundly perform other mutation operations or even to just "revisit" an element to avoid invalidating references or aliasing mutable references (both are Undefined Behaviour).

@benaryorg
Copy link
Contributor

Okay, I understand now.

I'll leave function naming to you as I am the worst at that.

@Gankra
Copy link
Contributor Author

Gankra commented Jul 19, 2015

@apasel422 Would you be fine with punting on remove/entry APIs until BTreeMap is rewritten to use parent pointers?

I believe they can be added afterwards without an RFC based on "natural API holes" logic.

@apasel422
Copy link
Contributor

@gankro Absolutely.

@Gankra
Copy link
Contributor Author

Gankra commented Jul 20, 2015

I've renamed the APIs per discussion.

@shepmaster
Copy link
Member

@gankro you may want to update your original comment. I read succ_exc_mut and was about to go on a trip to pick out a new color for the bikeshed.

@Kimundi
Copy link
Member

Kimundi commented Jul 29, 2015

Hm, how about combating the combinatoric explosion with type paramters?

.get_rel::<LE>(&Q) -> Option<(&K, &V)>;

If get where unstable it could have even been defined as

fn get<Ord=EQ>(&Q) -> Option<(&K, &V)>;

@apasel422
Copy link
Contributor

@Kimundi I've talked about something like that with @gankro in the past: Gankra/collect-rs#120 (comment).

@Gankra
Copy link
Contributor Author

Gankra commented Jul 29, 2015

🔔 HERE YE HERE YE THIS RFC IS ENTERING ITS FINAL COMMENT PERIOD 🔔

@Gankra Gankra added the final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. label Jul 29, 2015
modulo, but this is a more general problem for the *ordered map* API. There are surely types for
which a straight-up query will be cheaper than iterator initialization.

It is also siginificantly more ergonomic/discoverable to have `pred_inc_mut(&K)` over
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/pred_inc_mut/get_le_mut/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/siginificantly/significantly/

@huonw
Copy link
Member

huonw commented Aug 5, 2015

(It would be good for the typos that @apasel422 has noticed to be fixed if/before this is merged.)

I find the combinatorics here really really bad. It seems a little crazy to add so many methods for what I suspect are relatively niche use-cases. @gankro, I know you're rabidly against any use of enums in APIs to collapse functionality (i.e. the pred/next Bound-based API you propose as an alternative) but it seems to me that this case is a pretty good way to collapse an explosion into something that's actually managable. The methods could even be named something like max_before/min_after or max_upto/min_from (or something like that) which makes their use much more obvious.

@apasel422
Copy link
Contributor

Returning an OccupiedEntry for {min, max, lt, le, ge, gt} subsumes remove and get_mut. I propose that we only add get_* and *_entry varieties of these queries, regardless of whether we combine them via an enum or provide separate methods:

impl<K, V> Map<K, V> {
    fn get_lt<Q: ?Sized>(&self, key: &Q) -> Option<(&K, &V)>;
    fn lt_entry<Q: ?Sized>(&mut self, key: &Q) -> Option<OccupiedEntry<K, V>>;
    // ... `le`, `ge`, `gt`
}

or

enum Query<T> {
    Min,
    Lt(T),
    Le(T),
    Ge(T),
    Gt(T),
    Max,
}

impl<K, V> Map<K, V> {
    fn query<Q: ?Sized = K>(&self, query: Query<&Q>) -> Option<(&K, &V)>;
    fn query_entry<Q: ?Sized = K>(&mut self, query: Query<&Q>) -> Option<OccupiedEntry<K, V>>;
}

@Gankra
Copy link
Contributor Author

Gankra commented Aug 5, 2015

I disagree that it's niche -- it's one of the primary reasons to use an ordered map.

@diwic
Copy link

diwic commented Aug 5, 2015

Is there also use for a "nearest" version? I e, if the treemap looks for 1000 and can only find 900 and 1010, it will choose 1010 because it is nearest. That seems useful - although maybe that will require some additional trait bound (e g Sub or Add)?

Bikeshed wise, I don't know why min and max are not labelled get_min and get_max, if they - like get_le and friends - get an item in the map? That seems inconsistent.

@arthurprs
Copy link

This, please! I'm missing this for a while. Otherwise there's very little reason to have an ordered map!

It's a shame it requires so much code though, we need those parent pointers.

@Gankra
Copy link
Contributor Author

Gankra commented Aug 6, 2015

@arthurprs parent pointers wouldn't solve the duplication, it's a pure descent algorithm. (I suppose it would reduce duplication with other APIs.)

@gnzlbg
Copy link
Contributor

gnzlbg commented Aug 6, 2015

In think that the following is missing:

  • a description of what each function in the API does. I cannot find anywhere a description of what get_le does (should be obvious to everybody, but it also should be in the text of the RFC).
  • what is the algorithmic complexity of each API function (at least the worst case complexity should be specified)
  • what is the space complexity of each API function (does any of the functions allocate memory?)

When seeing all the get_le|gt|... I wonder:

  • why can't a function get(K, CMP) be provided instead, where the user can pass whatever cmp it wants (e.g. via a lambda)?

I see two main use-cases for the query API:

  • queries with the same "order" as the one used by the data-structure,
  • queries with an arbitrary different order.

So I basically expected to see two functions get(K) and get(K, cmp) (+ their _mut counterparts) for these two use-cases.

@gankro you asked for feedback, I hope this is some constructive one :P

@apasel422
Copy link
Contributor

The time and space complexity of these operations is an implementation
detail that will end up changing once parent pointers are added. However,
it may be useful to point out that the simplest implementation is based on
iterators. get_lt(k), for example, is map.iter().rev().find(|e| e.0 < k).

Can you provide an example of how you would use the user-provided closure
API, and the full signature of the method? I'm not sure how the predecessor
function, for example, could be written in terms of it.

On Thursday, August 6, 2015, gnzlbg notifications@github.com wrote:

In think that the following is missing:

  • a description of what each function in the API does. I cannot find
    anywhere a description of what get_le does (should be obvious to
    everybody, but it also should be in the text of the RFC).
  • what is the algorithmic complexity of each API function (at least
    the worst case complexity should be specified)
  • what is the space complexity of each API function (does any of the
    functions allocate memory?)

When seeing all the get_le|gt|... I wonder:

  • why can't a function get(K, CMP) be provided instead, where the user
    can pass whatever cmp it wants (e.g. via a lambda)? (why restrict ourself
    to le|gt|...).

I see two main use-cases for the query API:

  • queries with the same "order" as the one used by the data-structure,
  • queries with an arbitrary different order.

So I basically expected to see two functions get(K) and get(K, cmp) (+
their _mut counterparts) for these two use-cases.

@gankro https://github.com/gankro you asked for feedback, I hope this
is some constructive one :P


Reply to this email directly or view it on GitHub
#1195 (comment).

@gnzlbg
Copy link
Contributor

gnzlbg commented Aug 6, 2015

The time and space complexity of these operations is an implementation
detail that will end up changing once parent pointers are added. However,
it may be useful to point out that the simplest implementation is based on
iterators. get_lt(k), for example, is map.iter().rev().find(|e| e.0 < k).

Basically I just want to know if I can call these in a loop without ending up in N^2 complexity or blowing up the stack. If I'm doing something latency-related I also need to know if they allocate any memory in the heap.

If the implementation improves the complexity in the future, that is a non-breaking change, but the current complexity guarantees should be there.

Can you provide an example of how would you use the user-provided closure
API, and the full signature of the method? I'm not sure how the predecessor
function, for example, could be written in terms of it.

I'm not sure either, but for the _lt|_gt|_... methods I'd rather write get(k, |a, e| e.0 < a.0) or get(k, lt) (supposing we provide a std::lt that works here).

@apasel422
Copy link
Contributor

I don't think this RFC needs to make complexity guarantees. People can decide to use these methods based on the public documentation of their current complexity, not the contents of this RFC. But if you need the predecessor of a key for a certain algorithm, you have to find it somehow, so providing these APIs will be beneficial regardless of their complexity. Even if they are implemented completely naively at first, code that calls the methods instead of doing a manual iterator-based search will be made more efficient automatically when the implementation improves.

I don't understand what get(k, |a, e| e.0 < a.0) would do. What is the purpose of passing k there?

@cristicbz
Copy link

An alternative to the enum would also be a static dispatch version, similar to the way Range worked out

struct Min;
struct Max;
struct Le<Q: ?Sized>(Q);
// ...

trait Query<K, V, Selector: ?Sized> {
    fn query(&self, query: &Selector) -> Option<(&K, &V)>;
    fn query_mut(&mut self, query: &Selector) -> Option<(&K, &mut V)>;
    // maybe query_entry as well
}

impl<K, V> Query<K, V, Min> for Map<K, V> {
    /* ... */
}

impl<K, V, Q> Query<K, V, Le<Q>> for Map<K, V>
        where K: Borrow<Q> {
    /* ... */
}
// ...

Cons: you'd have to import std::collections::Query to get these methods on your map.

@apasel422
Copy link
Contributor

You actually wouldn't have to import the query trait, because we could add inherent methods to the map that simply call out to the appropriate impl. You would have to import the query structs themselves, though, and there would have to be a different trait for set queries, which won't expose mutable elements.

I'm not opposed to the static dispatch approach, because it can be nice to represent the queries themselves as values (e.g. passing around Lt(&5)).

@cristicbz
Copy link

To avoid having a separate trait for Set, you could have Query and QueryMut and make the return an associated type and implement them on referefences (to avoid needing HKT-s).

trait Query<Selector: ?Sized> {
    type Output;
    fn query(self, query: &Selector) -> Option<Self::Output>;
}

trait QueryMut<Selector: ?Sized> {
    type Output;
    fn query_mut(self, query: &Selector) -> Option<Self::Output>;
}

impl<'a, K, V> Query<Min> for &'a Map<K, V> {
    type Output = (&'a K, &'a V);

    fn query(self, query: &Selector) -> Option<Self::Output> {
        /* ... */
    }
}

impl<'a, K, V> QueryMut<Min> for &'a mut Map<K, V> {
    type Output = (&'a K, &'a mut V);

    fn query_mut(self, query: &Selector) -> Option<Self::Output> {
        /* ... */
    }
}

impl<'a, E> Query<Min> for &'a Set<E> {
    type Output = &'a E;

    fn query(self, query: &Selector) -> Option<Self::Output> {
        /* ... */
    }
}

If you do provide inherent methods though, I don't know if there is much value in using the same trait (which would really only ever show up to bound the argument of the inherent methods).

@apasel422
Copy link
Contributor

@cristicbz I've put a POC implementation of what you are suggesting here: https://github.com/apasel422/bst/tree/query.

@aturon
Copy link
Member

aturon commented Aug 7, 2015

Here's a thought on an API variant to deal with combinatorics while still being friendly:

fn max<Q: ?Sized, R>(&self, range: R) -> Option<(&K, &V)>
    where K: Borrow<Q>, Q: Ord, AnyRange<&Q>;

fn min<Q: ?Sized, R>(&self, range: R) -> Option<(&K, &V)>
    where K: Borrow<Q>, Q: Ord, R: AnyRange<&Q>;

fn max_entry<Q: ?Sized, R>(&mut self, range: R) -> Option<OccupiedEntry<K, V>>
    where K: Borrow<Q>, Q: Ord, R: AnyRange<&Q>;

fn min_entry<Q: ?Sized, R>(&mut self, range: R) -> Option<OccupiedEntry<K, V>>
    where K: Borrow<Q>, Q: Ord, R: AnyRange<&Q>;

Given inclusive ranges, you can cover all of the cases you wanted to with your API, without requiring any extra imports or names to be used.

UPDATE: in case the above is unclear, here are some examples:

// get_le
map.max(...&k)

// get_lt
map.max(..&k)

// get_ge
map.min(&k..)

// get the smallest element:
map.min(..)

However, @gankro points out on IRC that since exclusive ranges only exclude on the right, we can't express get_gt this way. Too bad!

@cristicbz
Copy link

@aturon The inability to express get_gt is what lead me to suggest new types. Agreed, ranges would be neater, but short of introducing a new range type (>x.., the unary greater than operator :P) the result would be inconsistent.

@huonw
Copy link
Member

huonw commented Aug 12, 2015

FWIW, it seems C++'s equivalent container std::map only offers functionality (essentially) equivalent to what we currently offer, but with C++'s iterators (i.e. independent endpoints, and syntactically nicer for the simplest case) and const-overloading.

@Gankra
Copy link
Contributor Author

Gankra commented Aug 12, 2015

The libs team has decided to close this RFC pending investigating alternative API solutions.

In particular I think there's a promising opportunity with a range builder pattern.

For now this functionality could be provided by an external crate -- at least semantically, not necessarily perf-wise -- on top of range. A parent pointer impl should address perf.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.