Incorrect HashSet documentation? #175
Comments
Here is another passage from the documentation that is inaccurate for the same reason:
It gets worse:

```rust
use im::hashset::HashSet;
use std::collections::hash_map::DefaultHasher;
use std::hash::BuildHasherDefault;
use std::cmp::Ordering;

fn main() {
    let mut set1 = HashSet::with_hasher(<BuildHasherDefault<DefaultHasher>>::default());
    let mut set2 = HashSet::with_hasher(<BuildHasherDefault<DefaultHasher>>::default());
    for i in 0..15_001 {
        set1.insert(i);
        set2.insert(15_000 - i);
    }
    // The sets are equal according to `Eq`...
    assert_eq!(set1, set2);
    // ...but not according to `Ord`, causing this assertion to fail.
    assert_eq!(set1.cmp(&set2), Ordering::Equal);
}
```
ryzhyk
added a commit
to ddlog-dev/im-rs
that referenced
this issue
Feb 20, 2021
Add ordered iterators over HAMT and `HashSet`. Given two sets with the same elements and the same hashers, the new iterators enumerate these sets in the same order. This is in contrast to existing iterators that do not guarantee consistent ordering for elements with the same hash values. Consistent ordering is achieved by sorting collision nodes on the fly during iteration. We use the new iterators to fix the implementation of `Ord` and `PartialOrd` for `HashSet` (bodil#175).
ryzhyk
added a commit
to ddlog-dev/im-rs
that referenced
this issue
Feb 21, 2021
Add ordered iterators over HAMT and `HashSet`. Given two sets with the same elements and the same hashers, the new iterators enumerate these sets in the same order. This is in contrast to existing iterators that do not guarantee consistent ordering for elements with the same hash values. Consistent ordering is achieved by sorting collision nodes on the fly during iteration. We use the new iterators to fix the implementation of `Ord`, `PartialOrd`, and `Hash` for `HashSet` (bodil#175).
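The core idea of the fix, sorting collision nodes on the fly so that iteration order no longer depends on insertion order, can be illustrated with a minimal standalone sketch. This is not the actual patch; `ordered_bucket` is a hypothetical stand-in for a HAMT collision node, and the sketch only assumes elements implement `Ord` (which they must for the `Ord` impl on the set anyway):

```rust
// Hypothetical simplified model of a HAMT collision node: a slice of
// values that all share the same hash. Plain iteration would yield them
// in insertion order, which differs between otherwise-equal sets.
fn ordered_bucket<T: Ord + Clone>(bucket: &[T]) -> Vec<T> {
    // Sort by the element's `Ord` before yielding, so two buckets with
    // the same contents but different insertion orders enumerate
    // identically.
    let mut v = bucket.to_vec();
    v.sort();
    v
}

fn main() {
    // Same elements, inserted in different orders...
    let a = [3, 1, 2];
    let b = [2, 3, 1];
    // ...enumerate identically once the bucket is sorted.
    assert_eq!(ordered_bucket(&a), ordered_bucket(&b));
}
```

With every collision node enumerated in sorted order, two sets that are equal under `Eq` traverse their elements in the same sequence, which is what a consistent `Ord` (and `Hash`) implementation needs.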
ryzhyk
added a commit
to ddlog-dev/differential-datalog
that referenced
this issue
Feb 25, 2021
At the API level, functional hashsets (aka immutable hashsets) behave just like regular hashsets; however, their internal implementation supports cloning a hashset in time O(1) by sharing the entire internal state between the clone and the original. Modifying the clone updates only the affected state in a copy-on-write fashion, with the rest of the state still shared with the parent.

Example use case (added to `test-stream.sh`): computing the set of all unique ids that appear in a stream. At every iteration, we add all newly observed ids to the set of ids computed so far. This would normally amount to cloning and modifying a potentially large set in time `O(n)`, where `n` is the size of the set. With functional sets, the cost is `O(1)`.

Functional data types are generally a great match for working with immutable collections, e.g., collections stored in DDlog relations. We therefore plan to introduce more functional data types in the future, possibly even replacing the standard collections (`Set`, `Map`, `Vec`) with functional versions.

Implementation: we implement the library as a DDlog wrapper around the `im` crate. Unfortunately, the crate is no longer maintained, and in fact it had some correctness issues described here: bodil/im-rs#175. I forked the crate and fixed the bugs in my fork: ddlog-dev/im-rs@46f13d8. We may need to switch to a different crate in the future, e.g., `rpds`, which is less popular but seems to be better maintained.

Performance considerations: while functional sets are faster to copy, they are still expensive to hash and compare (just like normal sets, but potentially even more so due to their more complex internal design). My initial implementation of the unique-ids use case stored aggregates in a relation. It was about as slow as the implementation using non-functional sets, with most of the time spent comparing sets as they were deleted from/inserted into relations. The stream-based implementation is >20x faster, as it does not compute deltas, and is 8x faster than an equivalent implementation using regular sets.
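The O(1)-clone, copy-on-write behavior described above can be sketched with nothing but the standard library's `Arc`. This is a toy model, not the `im` crate's HAMT implementation: a `Vec` wrapped in `Arc` stands in for the shared internal state, and `Arc::make_mut` plays the role of the copy-on-write step (a real HAMT copies only the affected path, not the whole structure):

```rust
use std::sync::Arc;

fn main() {
    // A persistent "set" sketch: cloning shares the allocation (O(1));
    // mutation copies the data only when it is shared (copy-on-write).
    let original: Arc<Vec<i32>> = Arc::new((0..5).collect());
    let mut clone = Arc::clone(&original); // O(1): shares all state

    // Before any mutation, both handles point to the same allocation.
    assert!(Arc::ptr_eq(&original, &clone));

    // `Arc::make_mut` clones the underlying data only now, because it
    // is shared; the parent keeps its original state.
    Arc::make_mut(&mut clone).push(5);

    assert_eq!(original.len(), 5); // parent unchanged
    assert_eq!(clone.len(), 6);    // clone sees the update
    assert!(!Arc::ptr_eq(&original, &clone));
}
```

The difference from this toy is that a structure like `im`'s HAMT applies the same trick per node, so an update clones only an O(log n) path while the rest of the tree stays shared between parent and clone.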
ryzhyk
added a commit
to vmware/differential-datalog
that referenced
this issue
Feb 28, 2021
Documentation for `HashSet` contains the following sentence:

I don't believe this is accurate. Even when replacing `RandomState` with a deterministic hasher, the order of values during enumeration is only predictable assuming there are no collision nodes in the HAMT. In the presence of collisions, the order depends on the order in which values were added to the collision node. Here is a simple repro that generates two sets with identical contents but different enumeration orders.