Functional hashsets library.
At the API level functional hashsets (aka immutable hashsets) behave
just like regular hashsets; however their internal implementation
supports cloning a hashset in time O(1), by sharing the entire internal
state between the clone and the original.  Modifying the clone updates
only the affected state in a copy-on-write fashion, with the rest of the
state still shared with the parent.
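
To make the sharing idea concrete, here is a minimal Rust sketch of a copy-on-write set. It is a toy and not how the `im` crate is implemented (which uses a hash trie); the `CowSet` type, its fixed `BUCKETS` table, and the `u64` element type are assumptions made purely for illustration. Cloning copies one reference-counted pointer, and an insert copies only the bucket table and the single bucket it touches.

```rust
use std::rc::Rc;

// Hypothetical fixed bucket count, chosen only to keep the sketch small.
const BUCKETS: usize = 16;

/// Toy copy-on-write set of `u64` (illustrative; not the `im` data structure).
#[derive(Clone)]
struct CowSet {
    // Cloning the set clones only this outer `Rc`, i.e. one pointer copy.
    buckets: Rc<Vec<Rc<Vec<u64>>>>,
}

impl CowSet {
    fn new() -> Self {
        // All buckets initially share one empty vector.
        CowSet { buckets: Rc::new(vec![Rc::new(Vec::new()); BUCKETS]) }
    }

    fn contains(&self, v: u64) -> bool {
        self.buckets[(v as usize) % BUCKETS].contains(&v)
    }

    fn insert(&mut self, v: u64) {
        if self.contains(v) {
            return;
        }
        // Copy the bucket table only if another set still shares it.
        let table = Rc::make_mut(&mut self.buckets);
        // Copy the one affected bucket only if it is shared; the rest stay shared.
        Rc::make_mut(&mut table[(v as usize) % BUCKETS]).push(v);
    }
}
```

After `let mut b = a.clone(); b.insert(3);`, the set `a` is unchanged, and every bucket other than the one holding `3` is still the same allocation in both sets.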

Example use case (added to `test-stream.sh`): computing the set of all
unique ids that appear in a stream.  At every iteration, we add all
newly observed ids to the set of ids computed so far.  This would
normally amount to cloning and modifying a potentially large set in time
`O(n)`, where `n` is the size of the set.  With functional sets, the
cost of the clone is `O(1)`.
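
For comparison, the regular-set baseline described above might look like the following Rust sketch. The `step` function and the `u64` id type are hypothetical names for illustration; the point is only that the previous result must stay intact, so each iteration pays for a full copy before adding the new ids.

```rust
use std::collections::HashSet;

/// One iteration with a regular set: the previous result is kept unchanged,
/// so it is cloned in full before the new batch of ids is added.
fn step(prev: &HashSet<u64>, batch: &[u64]) -> HashSet<u64> {
    let mut next = prev.clone(); // O(n) deep copy of everything seen so far
    next.extend(batch.iter().copied()); // work proportional to the batch size
    next
}
```

With a functional set, the `clone` line becomes a constant-time operation, and only the inserts for the new batch remain.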

Functional data types are generally a great match for working with immutable
collections, e.g., collections stored in DDlog relations.  We therefore plan
to introduce more functional data types in the future, possibly even
replacing the standard collections (`Set`, `Map`, `Vec`) with functional
versions.

Implementation: we implement the library as a DDlog wrapper around the
`im` crate.  Unfortunately, the crate is no longer maintained and in
fact it had some correctness issues described here:
bodil/im-rs#175.  I forked the crate and fixed
the bugs in my fork:
ddlog-dev/im-rs@46f13d8.

We may need to switch to a different crate in the future, e.g., `rpds`,
which is less popular but seems to be better maintained.

Performance considerations.  While functional sets are faster to copy,
they are still expensive to hash and compare (just like normal sets, but
potentially even more so due to their more complex internal design).  My
initial implementation of the unique ids use case stored aggregates in
a relation.  It was about as slow as the implementation using
non-functional sets, with most of the time spent comparing sets as
they were deleted from/inserted into relations.  The stream-based
implementation is >20x faster as it does not compute deltas, and is 8x
faster than the equivalent implementation using regular sets.
ryzhyk committed Feb 28, 2021
1 parent a80272f commit d0a1a69
Showing 16 changed files with 6,846 additions and 5 deletions.
23 changes: 23 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,29 @@ All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).

## [Unreleased]

### Libraries

- Functional HashSets, aka immutable hashsets (`lib/hashset.dl`). At the API
  level functional hashsets behave just like regular hashsets; however their
  internal implementation supports cloning a hashset in time `O(1)`, by sharing
  the entire internal state between the clone and the original. Modifying the
  clone updates only the affected state in a copy-on-write fashion, with the
  rest of the state still shared with the parent.

  Example use case: computing the set of all unique ids that appear in a
  stream. At every iteration, we add all newly observed ids to the set of
  ids computed so far. This would normally amount to cloning and modifying a
  potentially large set in time `O(n)`, where `n` is the size of the set. With
  functional sets, the cost of the clone is `O(1)`.

  Functional data types are generally a great match for working with immutable
  collections, e.g., collections stored in DDlog relations. We therefore plan
  to introduce more functional data types in the future, possibly even
  replacing the standard collections (`Set`, `Map`, `Vec`) with functional
  versions.

## [0.37.1] - Feb 23, 2021

### Optimizations
200 changes: 200 additions & 0 deletions lib/hashset.dl
@@ -0,0 +1,200 @@
/* Immutable hash sets.
 * This module contains bindings for the `HashSet` type
 * from the `im` crate. */

#[iterate_by_ref=iter:'A]
extern type HashSet<'A>

extern function hashset_singleton(x: 'X): HashSet<'X>
extern function hashset_empty(): HashSet<'X>

function size(s: HashSet<'X>): usize {
    hashset_size(s)
}

function insert(s: mut HashSet<'X>, v: 'X) {
    hashset_insert(s, v)
}

function insert_imm(s: HashSet<'X>, v: 'X): HashSet<'X> {
    hashset_insert_imm(s, v)
}

function remove(s: mut HashSet<'X>, v: 'X) {
    hashset_remove(s, v)
}

function remove_imm(s: HashSet<'X>, v: 'X): HashSet<'X> {
    hashset_remove_imm(s, v)
}

function contains(s: HashSet<'X>, v: 'X): bool {
    hashset_contains(s, v)
}

function is_empty(s: HashSet<'X>): bool {
    hashset_is_empty(s)
}

function nth(s: HashSet<'X>, n: usize): Option<'X> {
    hashset_nth(s, n)
}

function to_vec(s: HashSet<'A>): Vec<'A> {
    hashset_to_vec(s)
}

function to_hashset(v: Vec<'A>): HashSet<'A> {
    var res = hashset_empty();
    for (x in v) {
        res.insert(x);
    };
    res
}

function to_hashset(g: Group<'K, 'A>): HashSet<'A> {
    var res = hashset_empty();
    for ((x, _) in g) {
        res.insert(x);
    };
    res
}

function to_hashset(o: Option<'X>): HashSet<'X> {
    match (o) {
        Some{x} -> hashset_singleton(x),
        None -> hashset_empty()
    }
}

function union(s1: HashSet<'X>, s2: HashSet<'X>): HashSet<'X> {
    hashset_union(s1, s2)
}

function union(sets: Vec<HashSet<'X>>): HashSet<'X> {
    hashset_unions(sets)
}

function union(sets: Group<'K, HashSet<'X>>): HashSet<'X> {
    group_hashset_unions(sets)
}

function intersection(s1: HashSet<'X>, s2: HashSet<'X>): HashSet<'X> {
    hashset_intersection(s1, s2)
}

function difference(s1: HashSet<'X>, s2: HashSet<'X>): HashSet<'X> {
    hashset_difference(s1, s2)
}

/* Applies closure `f` to each element of the set. */
function map(s: HashSet<'A>, f: function('A): 'B): HashSet<'B> {
    var res = hashset_empty();
    for (x in s) {
        res.insert(f(x))
    };
    res
}

/* Returns the element that gives the minimum value from the specified function.
 * If several elements are equally minimum, the first element is returned.
 * If the set is empty, `None` is returned. */
function arg_min(s: HashSet<'A>, f: function('A): 'B): Option<'A> {
    hashset_arg_min(s, f)
}

/* Returns the element that gives the maximum value from the specified function.
 * If several elements are equally maximum, the first element is returned.
 * If the set is empty, `None` is returned. */
function arg_max(s: HashSet<'A>, f: function('A): 'B): Option<'A> {
    hashset_arg_max(s, f)
}

/* Returns the first element of the set that satisfies predicate `f` or
 * `None` if none of the elements satisfy the predicate. */
function find(s: HashSet<'A>, f: function('A): bool): Option<'A> {
    for (x in s) {
        if (f(x)) {
            return Some{x}
        }
    };
    None
}

/* Returns a set containing only those elements in `s` that satisfy predicate
 * `f`. */
function filter(s: HashSet<'A>, f: function('A): bool): HashSet<'A> {
    var res = hashset_empty();
    for (x in s) {
        if (f(x)) {
            res.insert(x)
        }
    };
    res
}

/* Both filters and maps the set.
 *
 * Calls the closure on each element of the set. If the closure returns
 * `Some{element}`, then that element is added to the resulting set. */
function filter_map(s: HashSet<'A>, f: function('A): Option<'B>): HashSet<'B> {
    var res = hashset_empty();
    for (x in s) {
        match (f(x)) {
            None -> (),
            Some{y} -> res.insert(y)
        }
    };
    res
}

/* Returns `true` iff all elements of the set satisfy predicate `f`. */
function all(s: HashSet<'A>, f: function('A): bool): bool {
    for (x in s) {
        if (not f(x)) {
            return false
        }
    };
    true
}

/* Returns `true` iff at least one element of the set satisfies predicate `f`. */
function any(s: HashSet<'A>, f: function('A): bool): bool {
    for (x in s) {
        if (f(x)) {
            return true
        }
    };
    false
}

/* Iterates over the set in ascending order, aggregating its contents using `f`.
 *
 * `f` - takes the previous value of the accumulator and the next element in the
 * set and returns the new value of the accumulator.
 *
 * `initializer` - initial value of the accumulator. */
function fold(s: HashSet<'A>, f: function('B, 'A): 'B, initializer: 'B): 'B {
    var res = initializer;
    for (x in s) {
        res = f(res, x)
    };
    res
}

extern function hashset_arg_min(s: HashSet<'A>, f: function('A): 'B): Option<'A>
extern function hashset_arg_max(s: HashSet<'A>, f: function('A): 'B): Option<'A>
extern function hashset_size(s: HashSet<'X>): usize
extern function hashset_insert(s: mut HashSet<'X>, v: 'X)
extern function hashset_remove(s: mut HashSet<'X>, v: 'X)
extern function hashset_insert_imm(s: HashSet<'X>, v: 'X): HashSet<'X>
extern function hashset_remove_imm(s: HashSet<'X>, v: 'X): HashSet<'X>
extern function hashset_contains(s: HashSet<'X>, v: 'X): bool
extern function hashset_is_empty(s: HashSet<'X>): bool
extern function hashset_nth(s: HashSet<'X>, n: usize): Option<'X>
extern function hashset_to_vec(s: HashSet<'A>): Vec<'A>
extern function hashset_union(s1: HashSet<'X>, s2: HashSet<'X>): HashSet<'X>
extern function hashset_unions(sets: Vec<HashSet<'X>>): HashSet<'X>
extern function group_hashset_unions(sets: Group<'K, HashSet<'X>>): HashSet<'X>
extern function hashset_intersection(s1: HashSet<'X>, s2: HashSet<'X>): HashSet<'X>
extern function hashset_difference(s1: HashSet<'X>, s2: HashSet<'X>): HashSet<'X>
42 changes: 42 additions & 0 deletions lib/hashset.flatbuf.rs
@@ -0,0 +1,42 @@
impl<'a, T, F> FromFlatBuffer<fbrt::Vector<'a, F>> for typedefs::hashset::HashSet<T>
where
    T: Hash + Eq + Clone + FromFlatBuffer<F::Inner>,
    F: fbrt::Follow<'a> + 'a,
{
    fn from_flatbuf(fb: fbrt::Vector<'a, F>) -> ::std::result::Result<Self, String> {
        let mut set = typedefs::hashset::HashSet::new();
        for x in FBIter::from_vector(fb) {
            set.insert(T::from_flatbuf(x)?);
        }
        Ok(set)
    }
}

// For scalar types, the FlatBuffers API returns slice instead of 'Vector'.
impl<'a, T> FromFlatBuffer<&'a [T]> for typedefs::hashset::HashSet<T>
where
    T: Hash + Eq + Clone,
{
    fn from_flatbuf(fb: &'a [T]) -> ::std::result::Result<Self, String> {
        let mut set = typedefs::hashset::HashSet::new();
        for x in fb.iter() {
            set.insert(x.clone());
        }
        Ok(set)
    }
}

impl<'b, T> ToFlatBuffer<'b> for typedefs::hashset::HashSet<T>
where
    T: Hash + Eq + Clone + Ord + ToFlatBufferVectorElement<'b>,
{
    type Target = fbrt::WIPOffset<fbrt::Vector<'b, <T::Target as fbrt::Push>::Output>>;

    fn to_flatbuf(&self, fbb: &mut fbrt::FlatBufferBuilder<'b>) -> Self::Target {
        let vec: ::std::vec::Vec<T::Target> = self
            .iter()
            .map(|x| x.to_flatbuf_vector_element(fbb))
            .collect();
        fbb.create_vector(vec.as_slice())
    }
}