Functional hashsets library.

At the API level, functional hashsets (aka immutable hashsets) behave just like regular hashsets; however, their internal implementation supports cloning a hashset in `O(1)` time by sharing the entire internal state between the clone and the original. Modifying the clone updates only the affected state in a copy-on-write fashion, with the rest of the state still shared with the parent.

Example use case (added to `test-stream.sh`): computing the set of all unique ids that appear in a stream. At every iteration, we add all newly observed ids to the set of ids computed so far. This would normally amount to cloning and modifying a potentially large set in time `O(n)`, where `n` is the size of the set. With functional sets, the cost is `O(1)`.

Functional data types are generally a great match for working with immutable collections, e.g., collections stored in DDlog relations. We therefore plan to introduce more functional data types in the future, possibly even replacing the standard collections (`Set`, `Map`, `Vec`) with functional versions.

Implementation: we implement the library as a DDlog wrapper around the `im` crate. Unfortunately, the crate is no longer maintained, and it had some correctness issues described here: bodil/im-rs#175. I forked the crate and fixed the bugs in my fork: ddlog-dev/im-rs@46f13d8. We may need to switch to a different crate in the future, e.g., `rpds`, which is less popular but seems to be better maintained.

Performance considerations: while functional sets are faster to copy, they are still expensive to hash and compare (just like normal sets, but potentially even more so due to their more complex internal design). My initial implementation of the unique ids use case stored aggregates in a relation. It was about as slow as the implementation using non-functional sets, with most of the time spent comparing sets as they were deleted from and inserted into relations. The stream-based implementation is >20x faster as it does not compute deltas, and is 8x faster than the equivalent implementation using regular sets.
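
The structural sharing described in this message can be illustrated with a small, self-contained Rust sketch. It is not the `im` crate's implementation (which uses a hash trie); `IdSet`, `Node`, `insert_node` and `snapshots` are hypothetical names, and an unbalanced binary search tree with path copying stands in for the real data structure. It only shows that a clone is a pointer copy and that an insert copies the path it touches while sharing everything else.

```rust
use std::rc::Rc;

// Hypothetical persistent set of u64 ids, for illustration only.
// An unbalanced binary search tree with path copying; the `im` crate
// uses a different (hash trie based) layout.
#[derive(Clone)]
struct IdSet {
    root: Option<Rc<Node>>,
    len: usize,
}

struct Node {
    key: u64,
    left: Option<Rc<Node>>,
    right: Option<Rc<Node>>,
}

impl IdSet {
    fn new() -> Self {
        IdSet { root: None, len: 0 }
    }

    fn len(&self) -> usize {
        self.len
    }

    fn contains(&self, key: u64) -> bool {
        let mut cur = &self.root;
        while let Some(n) = cur {
            if key == n.key {
                return true;
            }
            cur = if key < n.key { &n.left } else { &n.right };
        }
        false
    }

    // Returns a new set that includes `key`; `self` is left untouched.
    fn insert(&self, key: u64) -> IdSet {
        if self.contains(key) {
            return self.clone(); // cloning only bumps a reference count
        }
        IdSet { root: Some(insert_node(&self.root, key)), len: self.len + 1 }
    }
}

// Copies only the nodes on the path from the root to the insertion point.
// Every subtree off that path is shared with the original via `Rc` clones.
fn insert_node(node: &Option<Rc<Node>>, key: u64) -> Rc<Node> {
    match node {
        None => Rc::new(Node { key, left: None, right: None }),
        Some(n) if key < n.key => Rc::new(Node {
            key: n.key,
            left: Some(insert_node(&n.left, key)),
            right: n.right.clone(),
        }),
        Some(n) => Rc::new(Node {
            key: n.key,
            left: n.left.clone(),
            right: Some(insert_node(&n.right, key)),
        }),
    }
}

// Mirrors the unique-ids use case: keep one snapshot of the set per batch
// of a stream. Taking a snapshot is a cheap clone, not a deep copy.
fn snapshots(batches: &[Vec<u64>]) -> Vec<IdSet> {
    let mut seen = IdSet::new();
    let mut out = Vec::new();
    for batch in batches {
        for &id in batch {
            seen = seen.insert(id);
        }
        out.push(seen.clone());
    }
    out
}
```

In this sketch an insert costs time proportional to the depth of the tree rather than the size of the set, and older snapshots stay valid because no existing node is ever mutated.
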
vmware · Feb 28, 2021 · d0a1a69 · d0a1a69
1 parent a80272f
commit d0a1a69

Showing 16 changed files with 6,846 additions and 5 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,29 @@ All notable changes to this project will be documented in this file.
 
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
+## [Unreleased]
+
+### Libraries
+
+- Functional HashSets, aka immutable hashsets (`lib/hashset.dl`).  At the API
+  level functional hashsets behave just like regular hashsets; however their
+  internal implementation supports cloning a hashset in time O(1), by sharing the
+  entire internal state between the clone and the original.  Modifying the clone
+  updates only the affected state in a copy-on-write fashion, with the
+  rest of the state still shared with the parent.
+
+  Example use case: computing the set of all unique id's that appear in a
+  stream.  At every iteration, we add all newly observed ids to the set of
+  id's computed so far.  This would normally amount to cloning and modifying a
+  potentially large set in time `O(n)`, where `n` is the size of the set.  With
+  functional sets, the cost is `O(1)`.
+
+  Functional data types are generally a great match for working with immutable
+  collections, e.g., collections stored in DDlog relations.  We therefore plan
+  to introduce more functional data types in the future, possibly even
+  replacing the standard collections (`Set`, `Map`, `Vec`) with functional
+  versions.
+
 ## [0.37.1] - Feb 23, 2021
 
 ### Optimizations

diff --git a/lib/hashset.dl b/lib/hashset.dl
@@ -0,0 +1,200 @@
+/* Immutable hash sets.
+ * This module contains bindings for the `HashSet` type
+ * from the `im` crate. */
+
+#[iterate_by_ref=iter:'A]
+extern type HashSet<'A>
+
+extern function hashset_singleton(x: 'X): HashSet<'X>
+extern function hashset_empty(): HashSet<'X>
+
+function size(s: HashSet<'X>): usize {
+    hashset_size(s)
+}
+
+function insert(s: mut HashSet<'X>, v: 'X) {
+    hashset_insert(s, v)
+}
+
+function insert_imm(s: HashSet<'X>, v: 'X): HashSet<'X> {
+    hashset_insert_imm(s, v)
+}
+
+function remove(s: mut HashSet<'X>, v: 'X) {
+    hashset_remove(s, v)
+}
+
+function remove_imm(s: HashSet<'X>, v: 'X): HashSet<'X> {
+    hashset_remove_imm(s, v)
+}
+
+function contains(s: HashSet<'X>, v: 'X): bool {
+    hashset_contains(s, v)
+}
+
+function is_empty(s: HashSet<'X>): bool {
+    hashset_is_empty(s)
+}
+
+function nth(s: HashSet<'X>, n: usize): Option<'X> {
+    hashset_nth(s, n)
+}
+
+function to_vec(s: HashSet<'A>): Vec<'A> {
+    hashset_to_vec(s)
+}
+
+function to_hashset(v: Vec<'A>): HashSet<'A> {
+    var res = hashset_empty();
+    for (x in v) {
+        res.insert(x);
+    };
+    res
+}
+
+function to_hashset(g: Group<'K, 'A>): HashSet<'A> {
+    var res = hashset_empty();
+    for ((x, _) in g) {
+        res.insert(x);
+    };
+    res
+}
+
+function to_hashset(o: Option<'X>): HashSet<'X> {
+    match (o) {
+        Some{x} -> hashset_singleton(x),
+        None -> hashset_empty()
+    }
+}
+
+function union(s1: HashSet<'X>, s2: HashSet<'X>): HashSet<'X> {
+    hashset_union(s1, s2)
+}
+
+function union(sets: Vec<HashSet<'X>>): HashSet<'X> {
+    hashset_unions(sets)
+}
+
+function union(sets: Group<'K, HashSet<'X>>): HashSet<'X> {
+    group_hashset_unions(sets)
+}
+
+function intersection(s1: HashSet<'X>, s2: HashSet<'X>): HashSet<'X> {
+    hashset_intersection(s1, s2)
+}
+
+function difference(s1: HashSet<'X>, s2: HashSet<'X>): HashSet<'X> {
+    hashset_difference(s1, s2)
+}
+
+/* Applies closure `f` to each element of the set. */
+function map(s: HashSet<'A>, f: function('A): 'B): HashSet<'B> {
+    var res = hashset_empty();
+    for (x in s) {
+        res.insert(f(x))
+    };
+    res
+}
+
+/* Returns the element that gives the minimum value from the specified function.
+ * If several elements are equally minimum, the first element is returned.
+ * If the set is empty, `None` is returned. */
+function arg_min(s: HashSet<'A>, f: function('A): 'B): Option<'A> {
+    hashset_arg_min(s, f)
+}
+
+/* Returns the element that gives the maximum value from the specified function.
+ * If several elements are equally maximum, the first element is returned.
+ * If the set is empty, `None` is returned. */
+function arg_max(s: HashSet<'A>, f: function('A): 'B): Option<'A> {
+    hashset_arg_max(s, f)
+}
+
+/* Returns the first element of the set that satisfies predicate `f` or
+ * `None` if none of the elements satisfy the predicate. */
+function find(s: HashSet<'A>, f: function('A): bool): Option<'A> {
+    for (x in s) {
+        if (f(x)) {
+            return Some{x}
+        }
+    };
+    None
+}
+
+/* Returns a set containing only those elements in `s` that satisfy predicate
+ * `f`. */
+function filter(s: HashSet<'A>, f: function('A): bool): HashSet<'A> {
+    var res = hashset_empty();
+    for (x in s) {
+        if (f(x)) {
+            res.insert(x)
+        }
+    };
+    res
+}
+
+/* Both filters and maps the set.
+ *
+ * Calls the closure on each element of the set.  If the closure returns
+ * `Some{element}`, then that element is added to the resulting set. */
+function filter_map(s: HashSet<'A>, f: function('A): Option<'B>): HashSet<'B> {
+    var res = hashset_empty();
+    for (x in s) {
+        match (f(x)) {
+            None -> (),
+            Some{y} -> res.insert(y)
+        }
+    };
+    res
+}
+
+/* Returns `true` iff all elements of the set satisfy predicate `f`. */
+function all(s: HashSet<'A>, f: function('A): bool): bool {
+    for (x in s) {
+        if (not f(x)) {
+            return false
+        }
+    };
+    true
+}
+
+/* Returns `true` iff at least one element of the set satisfies predicate `f`. */
+function any(s: HashSet<'A>, f: function('A): bool): bool {
+    for (x in s) {
+        if (f(x)) {
+            return true
+        }
+    };
+    false
+}
+
+/* Iterates over the set, aggregating its contents using `f`.
+ *
+ * `f` - takes the previous value of the accumulator and the next element in the
+ * set and returns the new value of the accumulator.
+ *
+ * `initializer` - initial value of the accumulator. */
+function fold(s: HashSet<'A>, f: function('B, 'A): 'B, initializer: 'B): 'B {
+    var res = initializer;
+    for (x in s) {
+        res = f(res, x)
+    };
+    res
+}
+
+extern function hashset_arg_min(s: HashSet<'A>, f: function('A): 'B): Option<'A>
+extern function hashset_arg_max(s: HashSet<'A>, f: function('A): 'B): Option<'A>
+extern function hashset_size(s: HashSet<'X>): usize
+extern function hashset_insert(s: mut HashSet<'X>, v: 'X)
+extern function hashset_remove(s: mut HashSet<'X>, v: 'X)
+extern function hashset_insert_imm(s: HashSet<'X>, v: 'X): HashSet<'X>
+extern function hashset_remove_imm(s: HashSet<'X>, v: 'X): HashSet<'X>
+extern function hashset_contains(s: HashSet<'X>, v: 'X): bool
+extern function hashset_is_empty(s: HashSet<'X>): bool
+extern function hashset_nth(s: HashSet<'X>, n: usize): Option<'X>
+extern function hashset_to_vec(s: HashSet<'A>): Vec<'A>
+extern function hashset_union(s1: HashSet<'X>, s2: HashSet<'X>): HashSet<'X>
+extern function hashset_unions(sets: Vec<HashSet<'X>>): HashSet<'X>
+extern function group_hashset_unions(sets: Group<'K, HashSet<'X>>): HashSet<'X>
+extern function hashset_intersection(s1: HashSet<'X>, s2: HashSet<'X>): HashSet<'X>
+extern function hashset_difference(s1: HashSet<'X>, s2: HashSet<'X>): HashSet<'X>
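
The wrappers in `lib/hashset.dl` come in two flavours: `insert`/`remove` take `s: mut HashSet<'X>` and update it in place, while `insert_imm`/`remove_imm` return a new set and leave the argument alone. The Rust side of these extern functions is not part of this excerpt, so the following is only a sketch of the two shapes, assuming `std::collections::HashSet` as a stand-in and the hypothetical name `insert_in_place`. With the standard set the clone is `O(n)`; the point of the `im`-backed type is that the same clone is cheap.

```rust
use std::collections::HashSet;
use std::hash::Hash;

// In-place flavour, analogous in shape to `insert(s: mut HashSet<'X>, v: 'X)`.
fn insert_in_place<T: Eq + Hash>(s: &mut HashSet<T>, v: T) {
    s.insert(v);
}

// Value-returning flavour, analogous in shape to `insert_imm`: the input set
// is not modified and a new set is returned.
fn insert_imm<T: Eq + Hash + Clone>(s: &HashSet<T>, v: T) -> HashSet<T> {
    let mut out = s.clone(); // O(n) for the std set used in this sketch
    out.insert(v);
    out
}
```
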
diff --git a/lib/hashset.flatbuf.rs b/lib/hashset.flatbuf.rs
@@ -0,0 +1,42 @@
+impl<'a, T, F> FromFlatBuffer<fbrt::Vector<'a, F>> for typedefs::hashset::HashSet<T>
+where
+    T: Hash + Eq + Clone + FromFlatBuffer<F::Inner>,
+    F: fbrt::Follow<'a> + 'a,
+{
+    fn from_flatbuf(fb: fbrt::Vector<'a, F>) -> ::std::result::Result<Self, String> {
+        let mut set = typedefs::hashset::HashSet::new();
+        for x in FBIter::from_vector(fb) {
+            set.insert(T::from_flatbuf(x)?);
+        }
+        Ok(set)
+    }
+}
+
+// For scalar types, the FlatBuffers API returns slice instead of 'Vector'.
+impl<'a, T> FromFlatBuffer<&'a [T]> for typedefs::hashset::HashSet<T>
+where
+    T: Hash + Eq + Clone,
+{
+    fn from_flatbuf(fb: &'a [T]) -> ::std::result::Result<Self, String> {
+        let mut set = typedefs::hashset::HashSet::new();
+        for x in fb.iter() {
+            set.insert(x.clone());
+        }
+        Ok(set)
+    }
+}
+
+impl<'b, T> ToFlatBuffer<'b> for typedefs::hashset::HashSet<T>
+where
+    T: Hash + Eq + Clone + Ord + ToFlatBufferVectorElement<'b>,
+{
+    type Target = fbrt::WIPOffset<fbrt::Vector<'b, <T::Target as fbrt::Push>::Output>>;
+
+    fn to_flatbuf(&self, fbb: &mut fbrt::FlatBufferBuilder<'b>) -> Self::Target {
+        let vec: ::std::vec::Vec<T::Target> = self
+            .iter()
+            .map(|x| x.to_flatbuf_vector_element(fbb))
+            .collect();
+        fbb.create_vector(vec.as_slice())
+    }
+}