A fast implementation of Reservoir Sampling with Immutable Persistent data structures.
*** WORK IS STILL IN PROGRESS ***
In order to use the library add the dependency to your project.clj
;; Leiningen project
[com.brunobonacci/reservoir "0.1.0-SNAPSHOT"]
;; deps.edn format
{:deps { com.brunobonacci/reservoir {:mvn/version "0.1.0-SNAPSHOT"}}}
Then require the namespace:
(ns foo.bar
(:require [com.brunobonacci.reservoir :as rsv]))
Check the online documentation
;; create a reservoir
(def r (rsv/reservoir 3))
;; add some items, returns a new reservoir
(def r' (conj r :item1 :item2 :item3 :item4 :item4))
(frequencies r')
;; => {:item4 2, :item3 1}
;; to add many items you can use `into`
(def r2
(->> (repeatedly #(rand-int 1000))
(take 100000)
(into (rsv/reservoir 1000))))
;; the distribution of the samples should mimic the normal
;; distribution (pseudo)
(->> (frequencies r2)
(sort-by second >)
(take 10))
;; => ([481 6]
;; [186 5]
;; [647 5]
;; [194 5]
;; [362 5]
;; [887 4]
;; [592 4]
;; [694 4]
;; [21 4]
;; [708 4])
;; to see how many items passed by the reservoir
(rsv/total-items r2)
;; => 100000
;; to get the samples
(rsv/samples r2)
;; => [205 831 423 830 238 ...]
;; reservoir and just normal clojure sequences
(seq r2)
;; => (205 831 423 830 238 ...)
(def r3 (into (rsv/reservoir 100) (range 30)))
;; this is the total capacity of the reservoir
(rsv/capacity r3)
;; => 100
;; here the current number of items in the reservoir
;; it will be always between 0 and `capacity`
(count r3)
;; => 30
;; this is the total number of items which passed
;; into the reservoir. let's add more elements
(rsv/total-items (into r3 (range 120)))
;; => 150
;; you can merge two or more reservoir as following
;; the resulting reservoir will have the characteristics
;; of the first reservoir with all the sample combined
;; based on their probability to be in the final sample.
(rsv/merge-reservoirs
(into (rsv/reservoir 5) (range 20))
(into (rsv/reservoir 3) (range 10))
(into (rsv/reservoir 4 :algorithm :algorithm-R) (range 20)))
;; => #com.brunobonacci/reservoir [:algorithm-L 5 50 [0 5 8 17 10]]
There are two implementation of the reservoir sampling.
The Simple algorithm (also known as Algorithm-R
which has a time complexity of O(n)
and an optimal version called
Algorithm-L
which is O(k(1+log(n/k)))
. The default implementation is Algorithm-L
;; :simple :algorithm-R
(quick-bench
(into (reservoir 10000 :algorithm :simple) (range 1000000))
;; Time per million of items
;; --------------------------------
;; Execution time mean : 48.510345 ms
;; Execution time std-deviation : 1.994757 ms
;; Execution time lower quantile : 46.083439 ms ( 2.5%)
;; Execution time upper quantile : 51.081795 ms (97.5%)
;; Overhead used : 2.309092 ns
;; :algorithm-L (default)
(quick-bench
(into (reservoir 10000 :algorithm :algorithm-L) (range 1000000)))
;; Time per million of items
;; --------------------------------
;; Execution time mean : 30.131117 ms
;; Execution time std-deviation : 1.487733 ms
;; Execution time lower quantile : 28.247301 ms ( 2.5%)
;; Execution time upper quantile : 31.998845 ms (97.5%)
;; Overhead used : 8.211514 ns
This reservoir implementation is up to two orders or magnitude faster (~50x) than the excellent BigML/sampling implementation.
(require '[bigml.sampling.reservoir :as reservoir])
(quick-bench
(into (reservoir/create 10000 :implementation :efraimdis) (range 1000000)))
;; Execution time mean : 1439.723 ms
;; Execution time std-deviation : 56.219255 ms
;; Execution time lower quantile : 1.384248 sec ( 2.5%)
;; Execution time upper quantile : 1.513384 sec (97.5%)
;; Overhead used : 1.827186 ns
(quick-bench
(into (reservoir/create 10000 :implementation :insertion) (range 1000000)))
;; Execution time mean : 462.243933 ms
;; Execution time std-deviation : 8.206758 ms
;; Execution time lower quantile : 447.970058 ms ( 2.5%)
;; Execution time upper quantile : 469.125423 ms (97.5%)
;; Overhead used : 1.827186 ns
Copyright © 2020 Bruno Bonacci - Distributed under the Apache License v2.0