reservoir

A fast implementation of Reservoir Sampling with Immutable Persistent data structures.

*** WORK IS STILL IN PROGRESS ***

Usage

In order to use the library add the dependency to your project.clj

;; Leiningen project
[com.brunobonacci/reservoir "0.1.0-SNAPSHOT"]

;; deps.edn format
{:deps { com.brunobonacci/reservoir {:mvn/version "0.1.0-SNAPSHOT"}}}

Current version:

Then require the namespace:

(ns foo.bar
  (:require [com.brunobonacci.reservoir :as rsv]))

Check the online documentation

;; create a reservoir
(def r (rsv/reservoir 3))

;; add some items, returns a new reservoir
(def r' (conj r :item1 :item2 :item3 :item4 :item4))

(frequencies r')
;; => {:item4 2, :item3 1}

;; to add many items you can use `into`
(def r2
  (->> (repeatedly #(rand-int 1000))
    (take 100000)
    (into (rsv/reservoir 1000))))

;; the distribution of the samples should mimic the normal
;; distribution (pseudo)
(->> (frequencies r2)
  (sort-by second >)
  (take 10))
;; => ([481 6]
;;     [186 5]
;;     [647 5]
;;     [194 5]
;;     [362 5]
;;     [887 4]
;;     [592 4]
;;     [694 4]
;;     [21 4]
;;     [708 4])

;; to see how many items passed by the reservoir
(rsv/total-items r2)
;; => 100000

;; to get the samples
(rsv/samples r2)
;; => [205 831 423 830 238  ...]

;; reservoir and just normal clojure sequences
(seq r2)
;; => (205 831 423 830 238  ...)


(def r3 (into (rsv/reservoir 100) (range 30)))

;; this is the total capacity of the reservoir
(rsv/capacity r3)
;; => 100

;; here the current number of items in the reservoir
;; it will be always between 0 and `capacity`
(count r3)
;; => 30

;; this is the total number of items which passed
;; into the reservoir. let's add more elements
(rsv/total-items (into r3 (range 120)))
;; => 150


;; you can merge two or more reservoir as following
;; the resulting reservoir will have the characteristics
;; of the first reservoir with all the sample combined
;; based on their probability to be in the final sample.
(rsv/merge-reservoirs
  (into (rsv/reservoir 5) (range 20))
  (into (rsv/reservoir 3) (range 10))
  (into (rsv/reservoir 4 :algorithm :algorithm-R) (range 20)))
;; => #com.brunobonacci/reservoir [:algorithm-L 5 50 [0 5 8 17 10]]

Performances

There are two implementation of the reservoir sampling. The Simple algorithm (also known as Algorithm-R which has a time complexity of O(n) and an optimal version called Algorithm-L which is O(k(1+log(n/k))). The default implementation is Algorithm-L

  ;; :simple :algorithm-R
  (quick-bench
    (into (reservoir 10000 :algorithm :simple) (range 1000000))

  ;; Time per million of items
  ;; --------------------------------
  ;; Execution time mean : 48.510345 ms
  ;; Execution time std-deviation : 1.994757 ms
  ;; Execution time lower quantile : 46.083439 ms ( 2.5%)
  ;; Execution time upper quantile : 51.081795 ms (97.5%)
  ;; Overhead used : 2.309092 ns


  ;; :algorithm-L  (default)
  (quick-bench
    (into (reservoir 10000 :algorithm :algorithm-L) (range 1000000)))

  ;; Time per million of items
  ;; --------------------------------
  ;; Execution time mean : 30.131117 ms
  ;; Execution time std-deviation : 1.487733 ms
  ;; Execution time lower quantile : 28.247301 ms ( 2.5%)
  ;; Execution time upper quantile : 31.998845 ms (97.5%)
  ;; Overhead used : 8.211514 ns

This reservoir implementation is up to two orders or magnitude faster (~50x) than the excellent BigML/sampling implementation.

  (require '[bigml.sampling.reservoir :as reservoir])

  (quick-bench
    (into (reservoir/create  10000 :implementation :efraimdis) (range 1000000)))

  ;; Execution time mean : 1439.723 ms
  ;; Execution time std-deviation : 56.219255 ms
  ;; Execution time lower quantile : 1.384248 sec ( 2.5%)
  ;; Execution time upper quantile : 1.513384 sec (97.5%)
  ;; Overhead used : 1.827186 ns

  (quick-bench
    (into (reservoir/create  10000 :implementation :insertion) (range 1000000)))

  ;; Execution time mean : 462.243933 ms
  ;; Execution time std-deviation : 8.206758 ms
  ;; Execution time lower quantile : 447.970058 ms ( 2.5%)
  ;; Execution time upper quantile : 469.125423 ms (97.5%)
  ;; Overhead used : 1.827186 ns

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.circleci		.circleci
dev-resources		dev-resources
dev		dev
src		src
test/com/brunobonacci		test/com/brunobonacci
.gitignore		.gitignore
.midje.clj		.midje.clj
LICENSE		LICENSE
README.md		README.md
project.clj		project.clj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

reservoir

Usage

Performances

License

About

Releases

Packages

Languages

License

BrunoBonacci/reservoir

Folders and files

Latest commit

History

Repository files navigation

reservoir

Usage

Performances

License

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages