Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Add item recovery collection APIs #1194

Merged
merged 6 commits into from
Aug 27, 2015
Merged

RFC: Add item recovery collection APIs #1194

merged 6 commits into from
Aug 27, 2015

Conversation

apasel422
Copy link
Contributor

@apasel422 apasel422 commented Jul 8, 2015

@Gankra Gankra self-assigned this Jul 8, 2015
@Gankra Gankra added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Jul 8, 2015
@Gankra
Copy link
Contributor

Gankra commented Jul 8, 2015

CC @cmr @eddyb @bluss @seppo0010 (can't remember all the people interested...)


# Alternatives

Do nothing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hold on, to be clear: "Do nothing" here means "Do nothing and let users write such caches via, e.g., HashMap<T, ()> " ... right?

I don't particular mind adding the functionality described here to HashSet, but I'm also not sure its strictly necessary, unless I have missed something with how HashMap<T, ()> would work.

Update: Ah, re-reading the RFC, I now see that our current HashMap API would not support that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not possible to use HashMap that way, because it doesn't provide any methods that return &K (or K) other than via its iterators.

@blaenk
Copy link
Contributor

blaenk commented Jul 16, 2015

It'd be cool if you could add another copy of the code that demonstrates the problem but showing how to use your proposed APIs to resolve it.

@apasel422
Copy link
Contributor Author

@blaenk I'd actually like to use a more concrete motivating example, but I'll add the revised code as well.

@apasel422
Copy link
Contributor Author

@gankro Do you have any ideas for a better motivating example, like an algorithm that uses a set as a cache?

@eddyb
Copy link
Member

eddyb commented Jul 17, 2015

@apasel422 that's the usecase in the compiler: sets of hundreds of thousands of elements, used for interning/caching, that have to be identity maps right now, wasting some memory space.

@benaryorg
Copy link
Contributor

@eddyb Does that mean, if this RFC is implemented, the compiler can be tweaked to use less memory?

@apasel422
Copy link
Contributor Author

I've added a WIP implementation of this RFC for BTreeMap at https://github.com/apasel422/rust/tree/rfc-1194.

@Gankra
Copy link
Contributor

Gankra commented Jul 20, 2015

@apasel422 It could be argued by metaphor to the current naming in #1195 that these methods could just called get_eq, remove_eq, etc...

@Gankra
Copy link
Contributor

Gankra commented Jul 20, 2015

This is a bit more dubious for HashMap; but not crazy.

@apasel422
Copy link
Contributor Author

@gankro I actually had that same thought a little while ago, but I don't think it fully translates, and would be even weirder for entries:

impl<'a, K, V> OccupiedEntry<'a, K, V> {
    fn get(&self) -> &V;
    fn get_eq(&self) -> (&K, &V); // what does eq have to do with this?
    ...
}

{key_value, key_value_mut, remove_key_value} are quite verbose, but have parallel set names {element, remove_element}. {kv, kv_mut, remove_kv} are more succinct, but don't have an obvious set version.

@Gankra
Copy link
Contributor

Gankra commented Jul 20, 2015

Hmm... Entry does seem to mess things up. That said... is it a tragedy if it's a bit misaligned from everything else?

@apasel422
Copy link
Contributor Author

I think they should be consistent.

Here are some options that work for both maps and occupied entries (in addition to Map::replace, which I think is non-controversial):

  1. {kv, kv_mut, remove_kv} with VacantEntry::insert_kv
  2. {get_kv, get_kv_mut, remove_kv} with VacantEntry::insert_kv
  3. {key_val, key_val_mut, remove_key_val} with VacantEntry::insert_key_val
  4. {key_value, key_value_mut, remove_key_value} with VacantEntry::insert_key_value

And here are some options for sets (assuming that the changes in rust-lang/rust#27135 canonicalize "element" over "value" when referring to sets):

  1. {elem, remove_elem}
  2. {element, remove_element}

@i30817
Copy link

i30817 commented Jul 23, 2015

One common optimization that can't be done in java because of set item recovery is to just store hashes instead of elements, for the case where identity-mapping is not desirable. Are you going to give up this special case?

@apasel422
Copy link
Contributor Author

@i30817 Rust's HashMap and HashSet already prevent that optimization: The keys (elements) must implement Eq in addition to Hash, and the types provide iterators over their contents. A hypothetical map or set providing that optimization could omit these operations, but would also have to omit iteration. In any case, just storing hashes seems more like a bloom filter than a hash table, due to the absence of a check on Eq.

@i30817
Copy link

i30817 commented Jul 23, 2015

Mmm makes sense. Still, it's a somewhat common optimization, maybe a bloom filter type could be added to the language.

@apasel422
Copy link
Contributor Author

@i30817 Off-topic, but [https://crates.io/search?q=bloom filter](https://crates.io/search?q=bloom filter)

@shepmaster
Copy link
Member

I'm definitely in favor of the ideas laid out here. I have the same motivating problem - a cache of strings. I've used some unsafe code to avoid double-allocating the strings, but I still have essentially a HashMap<&str, &str>.

@bkoropoff
Copy link

How would you feel about modifying OccupiedEntry to hold on to the key passed to entry so you can recover it?

@apasel422
Copy link
Contributor Author

@bkoropoff That could be done, but it has the problem that a new key is only present for OccupiedEntrys that are obtained through Map::entry. The theoretical {max_entry, min_entry, ...} provided by RFC #1195 will not have a new key to return.

I'm therefore more inclined to add that kind of key-recovery functionality as

pub enum Entry<'a, K: 'a, V: 'a> {
    Occupied(OccupiedEntry<'a, K, V>, K),
    Vacant(VacantEntry<'a, K, V>),
}

instead of

pub struct OccupiedEntry<'a, K: 'a, V: 'a> {
    new_key: K,
    // ...
}

impl<'a, K, V> OccupiedEntry<'a, K, V> {
    /// Returns the key that was used to acquire this entry.
    // This could return `Option<K>` in order to better model the `max_entry` situation
    pub fn into_new_key(self) -> K { self.key }
}

but that would not be a backwards-compatible change. Additionally, storing the new key in the struct itself has the benefit of allowing us to provide an additional OccupiedEntry::replace_key method that updates the new key and returns the old one, much as Map::replace would:

impl<'a, K, V> OccupiedEntry<'a, K, V> {
    /// Replaces the entry's key with the one that was used to acquire this entry, if any, and
    /// returns the old key.
    ///
    /// This method always return `None` after the first call to it and for all entries
    /// acquired through `max_entry` etc.
    pub fn replace_key(&mut self) -> Option<K>;
}

This adds some complexity to the API surface and makes it harder to reason about what the behavior is. It's possible that we could add what you're proposing in a subsequent RFC instead.

@Gankra
Copy link
Contributor

Gankra commented Jul 29, 2015

🔔 HERE YE HERE YE THIS RFC IS ENTERING ITS FINAL COMMENT PERIOD 🔔

@aturon
Copy link
Member

aturon commented Aug 7, 2015

Sorry to be late to this party (I also had to miss the libs meeting this week). I'm on board with the basic motivation here, and regret the stabilization of the bool-centric methods on sets. As @gankro said, the team decided to go forward with this RFC, modulo some bikeshedding.

That said, I feel like the RFC is proposing significantly more API expansion than is actually needed to solve the original problem -- in particular, I don't see why any changes to the entry API are needed. Could we instead take the following as a starting point (bikesheds painted in my favorite colors):

impl<T> Set<T> {
    // Like `contains`, but returns a reference to the element if the set contains it.
    fn get<Q: ?Sized>(&self, element: &Q) -> Option<&T>;

    // Like `remove`, but returns the element if the set contained it.
    fn take<Q: ?Sized>(&mut self, element: &Q) -> Option<T>;

    // Like `insert`, but replaces the element with the given one and returns the previous element
    // if the set contained it.
    fn replace(&mut self, element: T) -> Option<T>;
}

impl<K, V> Map<K, V> {
    // Like `get`, but additionally returns a reference to the entry's key.
    fn key_value<Q: ?Sized>(&self, key: &Q) -> Option<(&K, &V)>;

    // Like `get_mut`, but additionally returns a reference to the entry's key.
    fn key_value_mut<Q: ?Sized>(&mut self, key: &Q) -> Option<(&K, &mut V)>;

    // Like `remove`, but additionally returns the entry's key.
    fn remove_key_value<Q: ?Sized>(&mut self, key: &Q) -> Option<(K, V)>;

    // Like `insert`, but additionally replaces the key with the given one and returns the previous
    // key and value if the map contained it.
    fn replace(&mut self, key: K, value: V) -> Option<(K, V)>;
}

In particular, the fact that the entry APIs need an owned key to use (today, at least) seems to make the key-accessing functionality questionable. But maybe I'm missing something?

@apasel422
Copy link
Contributor Author

@aturon We will presumably want the entry methods once #1195 is accepted, but they could be omitted for now. I think that both RFCs need to be considered together though, and it probably makes sense to avoid a proliferation of {get, get_mut, remove, and entry} methods when just {get, entry} suffice. We don't need to add take or key_value_mut on Map directly if we just add them to OccupiedEntry now, and this will benefit the queries added in #1195 as well. For example:

impl<K, V> Map<K, V> {
    fn get_pair(&self, key: &Q) -> Option<(&K, &V)>;
    fn get_entry(&mut self, key: &Q) -> Option<OccupiedEntry<K, V>>;
    fn replace(&mut self, key: K, val: V) -> Option<(K, V)>;

    fn get_max(&self) -> Option<(&K, &V)>;
    fn max_entry(&mut self) -> Option<OccupiedEntry<K, V>>>;

    fn get_lt(&self, key: &Q) -> Option<(&K, &V)>;
    fn lt_entry(&mut self, key: &Q) -> Option<OccupiedEntry<K, V>>>;

    // get_* and *_entry for le, ge, gt, min
}

impl<'a, K, V> OccupiedEntry<'a, K, V> {
    fn pair(&self) -> (&K, &V);
    fn pair_mut(&mut self) -> (&K, &mut V);
    fn into_pair_mut(self) -> (&'a K, &'a mut V);
    fn take(self) -> (K, V);
}

@alexcrichton
Copy link
Member

The libs team discussed this RFC today, and our conclusion was that it may be best to hone this down to what's precisely necessary to satisfy the motivation in the outset. To that end would it be possible to only include the set methods? Specifically:

impl<T> Set<T> {
    fn get<Q: ?Sized>(&self, element: &Q) -> Option<&T>;
    fn take<Q: ?Sized>(&mut self, element: &Q) -> Option<T>;
    fn replace(&mut self, element: T) -> Option<T>;
}

@Gankra
Copy link
Contributor

Gankra commented Aug 12, 2015

Specifically, I believe the supporting map methods were also decided to just be doc(hidden) and unstable.

@alexcrichton
Copy link
Member

I'd personally prefer the methods to be freestanding in the module so they're private to the outside world but public to the crate rather than having them in the inherent API at all.

@Gankra
Copy link
Contributor

Gankra commented Aug 13, 2015

@alexcrichton How does that work? Privacy can only reach up, and not down or sideways. Maps and Sets are defined in sibling modules.

@sfackler
Copy link
Member

They could be defined in a crate private trait. That's how I've gotten
around visibility issues before.

On Wed, Aug 12, 2015, 8:01 PM Alexis Beingessner notifications@github.com
wrote:

@alexcrichton https://github.com/alexcrichton How does that work?
Privacy can only reach up, and not down or sideways. Maps and Sets are
defined in sibling modules.


Reply to this email directly or view it on GitHub
#1194 (comment).

@Gankra
Copy link
Contributor

Gankra commented Aug 20, 2015

@apasel422 Can you amend the RFC to be minimal per aturon's request? I think we're good to go when that's done.

@seppo0010
Copy link

I don't understand the motivation to allow item recover from a Set but not key recover from a Map.

I was actually expecting that feature to move items from one Map to another without cloning its keys.

@SimonSapin
Copy link
Contributor

Same here, the use case that got me here was with a HashMap, not a set.

@eddyb
Copy link
Member

eddyb commented Aug 20, 2015

@seppo0010 The usecase I hit that needed the feature for Set was sort of a cache.
One quick example would be HashSet<Rc<str>>, you want to be able to find an existing key with &str and clone the Rc handle.
To get closer to the compiler, HashSet<&'arena str> could be used as a cache, queried with a temporary &str, that would get copied on the arena if not found in the set.

@alexcrichton
Copy link
Member

@apasel422 ping about the RFC update, would love to merge!

@apasel422
Copy link
Contributor Author

@alexcrichton I haven't updated yet because it seems like there's still some dissent, based on the last few comments.

@shepmaster
Copy link
Member

As another voice, only having it on sets would be acceptable for me. I am in the same boat as @eddyb — a cache.

@apasel422
Copy link
Contributor Author

@alexcrichton Updated.

@alexcrichton
Copy link
Member

Ok, thanks @apasel422! The consensus of the libs team is that this is a great step forward for sets and we can continue to explore the problem space for maps as the needs arise, but it seems like the most pressing parts to work with are sets today.

And of course, thanks again for the RFC @apasel422!

@rpjohnst
Copy link

rpjohnst commented Oct 3, 2017

I find myself needing this for maps.

In my case, I am building a string interner using a HashMap<Box<str>, u32> where the u32s track the order of insertion. I use this to determine whether a string is in one of a few fixed sets of symbols by doing range checks on its corresponding u32.

The OrderMap crate provides this part of the API under the name get_pair, so that name now has some precedent.

Alternatively, it would be useful for the Entry API to provide a reference to the key as well as the value after an or_insert/or_insert_with-like operation. This might be the better option since it allows the lookup to be reused when inserting new entries.

What is the best route forward here? Should I write up a new RFC?

fschutt added a commit to fschutt/polyclip that referenced this pull request Dec 16, 2017
@fschutt
Copy link

fschutt commented Dec 16, 2017

Well, I needed this for a function where I insert into a set, but then I immediately want an iterator to that last, inserted element in the set. Since a BTreeSet is just a BTreeMap internally, this should be possible.

The application is a scanline algorithm, where the set consists out of ordered points. I need to insert a point into a scanline and then know where it has been inserted (the position), so that I can construct an iterator to the next and previous point in the (ordered) scanline.

So for now I've forked the std::collections::BTreeSet. Watching this for when it eventually gets merged.

@Centril Centril added the A-collections Proposals about collection APIs label Nov 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-collections Proposals about collection APIs final-comment-period Will be merged/postponed/closed in ~10 calendar days unless new substational objections are raised. T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

Successfully merging this pull request may close these issues.