Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove the raw feature and make RawTable private #546

Merged
merged 1 commit into from
Aug 23, 2024

Conversation

Amanieu
Copy link
Member

@Amanieu Amanieu commented Aug 22, 2024

This will give more freedom for the internal implementation details of hashbrown to evolve without the need for regular releases with breaking changes.

All existing users of RawTable should migrate to the HashTable API which is entirely safe while providing the same flexibility as RawTable.

This also removes the following features which were only exposed under RawTable:

  • RawTable::iter_hash
  • RawIter::reflect_insert and RawIter::reflect_remove
  • RawTable::clone_from_with_hasher
  • RawTable::insert_no_grow and RawTable::try_insert_no_grow
  • RawTable::allocation_info
  • RawTable::try_with_capacity(_in)
  • HashMap::raw_table(_mut) and HashSet::raw_table(_mut)

If anyone was previously relying on this functionaly, please raise a comment. It may be possible to re-introduce it as a safe API in HashTable and/or HashMap.

@Amanieu
Copy link
Member Author

Amanieu commented Aug 22, 2024

cc @clarfonthey

This will give more freedom for the internal implementation details of
hashbrown to evolve without the need for regular releases with breaking
changes.

All existing users of `RawTable` should migrate to the `HashTable` API
which is entirely safe while providing the same flexibility as
`RawTable`.

This also removes the following features which were only exposed under
`RawTable`:
- `RawTable::iter_hash`
- `RawIter::reflect_insert` and `RawIter::reflect_remove`
- `RawTable::clone_from_with_hasher`
- `RawTable::insert_no_grow` and `RawTable::try_insert_no_grow`
- `RawTable::allocation_info`
- `RawTable::try_with_capacity(_in)`
- `HashMap::raw_table(_mut)` and `HashSet::raw_table(_mut)`
@Dandandan
Copy link

Dandandan commented Aug 23, 2024

I think https://github.com/Apache/arrow-datafusion

uses:

  • RawTable::try_insert_no_grow
  • RawTable::allocation_info

@clarfonthey
Copy link
Contributor

clarfonthey commented Aug 23, 2024

Since you're already doing this, would you mind pushing a release after too so rust-lang/rust#128711 can go through? Please and thank you. <3

IMHO the sooner we do this, the better, since it gives people time to migrate away from the API before we start following up on the threat to improve the code. And, since it's a published crate, it's not a big deal, since the old (non-breaking) versions will continue to work correctly.

@Amanieu
Copy link
Member Author

Amanieu commented Aug 23, 2024

I think Apache/arrow-datafusion uses:

Datafusion seem to actually require the full functionality of RawTable since it directly stores bucket indices in a BinaryHeap so that it can delete arbitrary elements later (and rebuilds the heap on rehash to keep the indices valid).

This isn't something that can be supported on HashTable since it relies on low-level implementation details of RawTable such as the fact that buckets are never moved except in a rehash. Perhaps in the future the new raw API that @clarfonthey is planning (see #545) will be publicly exposed, but in the meantime datafusion can just stay on the previous hashbrown version.

@Dandandan
Copy link

FYI @alamb @avantgardnerio

@alamb
Copy link

alamb commented Aug 23, 2024

We have also been discussing implementing our own custom hash table in apache/datafusion#7095, so perhaps this would be another potential reason to pursue that idea

I agree that using the low level details of how the hash table in hashbrown is implemented is not ideal (e.g. it constrains how hashbrown can version releases)

@Amanieu I wonder if you would consider a feature flag on hashbrown like hashbrown_unstable (following the model of tokio_unstable). That would let DataFusion keep using hashbrown internals (and it would be on us to deal with any changes during upgrade)

@avantgardnerio
Copy link

This will require work in DataFusion to fix TopK aggregations, and then after it is fixed it will cause significant (40% IIRC) performance regressions.

@Amanieu
Copy link
Member Author

Amanieu commented Aug 23, 2024

I don't think that's a good idea because every minor release of hashbrown could break users of your crate.

However you can just keep using the 0.14 version of hashbrown which will still have the raw feature. It's only being removed in 0.15.

@alamb
Copy link

alamb commented Aug 23, 2024

However you can just keep using the 0.14 version of hashbrown which will still have the raw feature. It's only being removed in 0.15.

Another option is for us to to fork the crate which might be reasonable if we can get more performance by doing so

@Dandandan
Copy link

I think we should create some tickets in DataFusion for moving our usage towards HashTable / own hashtable (e.g. better vectorized) / forked code.

@Amanieu
Copy link
Member Author

Amanieu commented Aug 23, 2024

@bors r+

@bors
Copy link
Contributor

bors commented Aug 23, 2024

📌 Commit 26ef4a1 has been approved by Amanieu

It is now in the queue for this repository.

@bors
Copy link
Contributor

bors commented Aug 23, 2024

⌛ Testing commit 26ef4a1 with merge aa1411b...

@bors
Copy link
Contributor

bors commented Aug 23, 2024

☀️ Test successful - checks-actions
Approved by: Amanieu
Pushing aa1411b to master...

@bors bors merged commit aa1411b into rust-lang:master Aug 23, 2024
24 checks passed
@the-mikedavis
Copy link
Contributor

I was using RawTable::iter_hash for a multi-map-like type (kind of like HashMap<K, Vec<T>>) and I don't believe there's a way to look up multiple entries with the same key from HashTable<(K, V)>: HashTable::find stops after the first matching element. Would it be possible to introduce something safe in HashTable to take its place? I can share more code if you'd like more context and maybe work on a PR if you think it's a reasonable addition.

@Amanieu
Copy link
Member Author

Amanieu commented Aug 28, 2024

I think iter_hash makes sense to expose on HashTable. Feel free to send a PR, you can do a partial revert of this one to get iter_hash back in RawTable.

bors added a commit that referenced this pull request Sep 2, 2024
Add `HashTable::iter_hash`, `HashTable::iter_hash_mut`

This is a follow-up to #546 ([comment](#546 (comment))). `iter_hash` from the old raw API can be useful for reading from a "bag" / "multi map" type which allows duplicate key-value pairs. Exposing it safely in `HashTable` takes a fairly small wrapper around `RawIterHash`. This PR partially reverts #546 to restore `RawTable::iter_hash` and its associated types.
Amanieu added a commit to Amanieu/hashbrown that referenced this pull request Sep 12, 2024
This was previously removed from `RawTable` in rust-lang#546. This is now added
as a public API on `HashMap`, `HashSet` and `HashTable`.
@xacrimon
Copy link

xacrimon commented Sep 14, 2024

@Amanieu dashmap relies on RawTable functionality. With this merged we're be effectively locked to 0.14 forever.

@Amanieu
Copy link
Member Author

Amanieu commented Sep 14, 2024

What specific functionality do you need that isn't available though HashTable which is the replacement for RawTable?

@cuviper
Copy link
Member

cuviper commented Sep 14, 2024

I took a brief look, and the main roadblock that I see will be iterators and entry structs that currently contain RwLock guards with corresponding RawIter / Bucket / InsertSlot. With HashTable, those second parts all have lifetimes that would need to borrow through the guard, so they become self-referential.

@clarfonthey
Copy link
Contributor

Added a comment to #545 mentioning to look into dashmap's internals when I go about ripping out the raw table API. It may be the case that we need to offer some kind of lifetime-erased version of the various types to get it to work, but I'm hoping we can get around that.

bors added a commit that referenced this pull request Sep 18, 2024
Re-introduce a way to get the allocation size of a table

This was previously removed from `RawTable` in #546. This is now added as a public API on `HashMap`, `HashSet` and `HashTable`.

Fixes #238
Fixes #506
@Amanieu
Copy link
Member Author

Amanieu commented Oct 1, 2024

I took a brief look, and the main roadblock that I see will be iterators and entry structs that currently contain RwLock guards with corresponding RawIter / Bucket / InsertSlot. With HashTable, those second parts all have lifetimes that would need to borrow through the guard, so they become self-referential.

I thought about this a bit and fundamentally it's not a problem with HashTable but rather a limitation of the MutexGuard API. I think you can work around this by using a Mutex<()> and an UnsafeCell<HashTable> where you manage the locking manually.

@kyren
Copy link
Contributor

kyren commented Dec 8, 2024

Hi, two projects I'm affiliated with are stuck on hashbrown 0.14 right now due to using the raw API in a way that can't be replicated in 0.15. They both use RawTable for nearly the same purpose, which I will try to explain.

https://github.com/ruffle-rs/ruffle uses the RawTable API to implement AVM2's iteration protocol for object properties. I'm really unfamiliar with this, @moulins and other people on the Ruffle team can probably explain in better detail, but the bytecode assumes that object properties are stored in a flat bucket based hash table, and the bytecode can get entries from an object based on a bucket index and also has an instruction like HasNext(object, index) -> next_index to find the next valid bucket.

https://github.com/kyren/piccolo (which is something I'm much better equipped to explain) uses the RawTable API for something similar. Lua has a function called next that can be used to find the "next" key in a Lua table, which is normally determined by its "raw" order in a hash table, or in hashbrown terms the bucket index (not all versions of PUC-Rio Lua implement tables this way so this could equivalently be some order in a red-black tree). It is used like this:

local my_table = { a = 1, b = 2, c = 3 }
local first_key = next(my_table, nil)
local next_key = next(my_table, first_key)
-- and so on...

The core issue is that both projects implement VMs for languages whose implementation assumes that a hash table has some kind of observable order that is stable ONLY when the hash table is not modified in certain ways. For piccolo (and this is the same in the internals of PUC-Rio Lua), that requirement is actually that the hash table order only be stable under no inserts or removals whatsoever, only mutating values and (transparent) modifications of keys. How this is accomplished is with something the PUC-Rio Lua internals refer to as "dead keys", having a version of a key that compares identically to a normal live key but doesn't own a value for the purposes of garbage collection. When removing an entry in a hash table, you set the key to its "dead key" equivalent and set the value to nil (an entry with a nil value in Lua is indistinguishable from an entry not being present). Ruffle's AVM2 does not use this trick right now and instead relies on the order being guaranteed not to change without a rehash, but AIUI it is possible to efficiently implement tombstone entries in the AVM2 VM in a similar way to piccolo, loosening the requirements.

We (some of the Ruffle devs and I) believe we have a minimal API that satisfies both our use cases and comes with hopefully minimal guarantees.

If it were only piccolo, the required API would be to get the next live entry after some given entry (and some way to get the first entry), in whatever the bucket order is, with no guarantees about that order being the same after any sort of mutation of the HashTable itself at all. AVM2 is slightly different due to needing to get an entry specifically by bucket index (which the AVM2 VM also caches internally in certain cases), but it never needs to get the bucket index by looking up an existing key (which piccolo must have to implement next). Therefore, a union of both requirements could look like this:

  1. Access an entry by bucket index
  2. Ask for the first valid bucket index greater than or equal to a given one (0 for the first valid bucket)
  3. Ask for the bucket index corresponding to an occupied entry / find a bucket index by key and eq predicate
  4. No guarantees whatsoever for iteration order after any kind of "table mutation", both insertion and removal

There is probably a lot of room for equivalent APIs that would work here also. The core issue is that we need to observe the bucket order for a hash table that is not being modified. To put it in maybe a less restrictive way, we really don't even care about the bucket order, what both projects really need is to be able to iterate over a HashTable in such a way that the iteration can be suspended and resumed (for piccolo, resuming from a specific key, for Ruffle, resuming from some other sort of token). It is possible ofc that both projects could create a less efficient custom data structure with auxiliary arrays for an entry order, but in both cases these are pretty much the core data structures for the implemented language, so this is valuable enough that our best alternative other than some kind of API change here is a custom fork of hashbrown (or to stay on 0.14 indefinitely).

@clarfonthey
Copy link
Contributor

So, I just want to clarify here that the reason why RawTable is being deprecated is because it's old, not written with modern Rust features (like MaybeUninit) in mind, and is already soft-replaced by the HashTable API. So, all of this functionality you want to use seems extremely reasonable to add onto the HashTable API, since it all feels reasonable to me: the goal isn't to get rid of the lower-level APIs, but rather to offer them in a better place.

I think that the ability to keep track of bucket indices beyond just hashes seems reasonable, although I think that at least linking a few demonstrations of existing code you cannot rewrite under the HashTable API would be useful, so we know specifically what methods you're using and how they could best be replaced. I get the vague idea based upon what you're describing, but code examples would be the most helpful, since that way we can think of a suitable replacement API.

I think it would be most helpful to open a separate issue for these features (or multiple, depending on how different they are) so they can be tracked and followed better. There are a bunch of different features that people want from RawTable that haven't made their way onto HashTable yet, and it's best to separate out the discussions so we make sure nothing gets missed.

@kyren
Copy link
Contributor

kyren commented Dec 9, 2024

So, I just want to clarify here that the reason why RawTable is being deprecated is because it's old, not written with modern Rust features (like MaybeUninit) in mind, and is already soft-replaced by the HashTable API. So, all of this functionality you want to use seems extremely reasonable to add onto the HashTable API, since it all feels reasonable to me: the goal isn't to get rid of the lower-level APIs, but rather to offer them in a better place.

Ah I see, I may have gotten the wrong impression from reading the PR comments and from discussion with Ruffle devs, I don't think I correctly understood the primary motivation for getting rid of RawTable (implementation freedom vs just being old and needing to be rewritten). I understand better now, thank you!

I think that the ability to keep track of bucket indices beyond just hashes seems reasonable, although I think that at least linking a few demonstrations of existing code you cannot rewrite under the HashTable API would be useful, so we know specifically what methods you're using and how they could best be replaced. I get the vague idea based upon what you're describing, but code examples would be the most helpful, since that way we can think of a suitable replacement API.

Sure thing, the code I'm the most familiar with is is https://github.com/kyren/piccolo/blob/master/src/table/raw.rs, specifically the RawTable::next method. The way that Lua tables work and Lua table iteration works is a bit strange especially if you're not familiar with Lua, but hopefully it should be possible to follow.

I think it would be most helpful to open a separate issue for these features (or multiple, depending on how different they are) so they can be tracked and followed better. There are a bunch of different features that people want from RawTable that haven't made their way onto HashTable yet, and it's best to separate out the discussions so we make sure nothing gets missed.

I'd be happy to open a new issue for the features that piccolo and Ruffle need, I think they're both close enough that a single issue for the both of them could cover it?

@Berrysoft
Copy link

https://github.com/Berrysoft/wepoll2 I used the raw feature to make a panic-free HashMap. The insert method of HashMap calls reserve which will panic if allocation fails. I have to reimplement a try_insert method to call try_reserve to remove the panic code path. Do you have some advice on porting to v0.15?

@clarfonthey
Copy link
Contributor

I think that falliable allocation APIs are a very reasonable addition to all the types in the crate, so, it's worth filing an issue for.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.