moritzmucha/fastdigest

fastDigest

fastDigest is a Rust-powered Python extension module that provides a lightning-fast implementation of the t-digest data structure and algorithm, offering a lightweight suite of online statistics for streaming and distributed data.

Features

  • Online statistics: Compute highly accurate estimates of quantiles, the CDF, and derived quantities such as the (trimmed) mean.
  • Updating: Update a t-digest incrementally with streaming data or batches of large datasets.
  • Merging: Merge many t-digests into one, enabling parallel compute operations such as map-reduce.
  • Serialization: Use the to_dict/from_dict methods or the pickle module for serialization.
  • Easy API: The fastDigest API is designed to be intuitive and to keep high overlap with popular libraries.
  • Blazing fast: Thanks to its Rust backbone, this module is up to hundreds of times faster than other Python implementations.

Installation

Installing from PyPI

Compiled wheels are available on PyPI. Simply install via pip:

pip install fastdigest

Installing from source

To build and install fastDigest from source, you will need Rust and maturin.

  1. Install the Rust toolchain → see https://rustup.rs

  2. Install maturin via pip:

pip install maturin

  3. git clone or download and extract this repository, open a terminal in its root directory, then build and install the package:

maturin build --release
pip install target/wheels/fastdigest-0.9.2-<platform-tag>.whl

Usage

The following examples are intended to give you a quick start. See the API reference for the full documentation.

Initialization

Simply call TDigest() to create a new instance, or use TDigest.from_values to directly create a digest of any sequence of numbers:

from fastdigest import TDigest

digest = TDigest()
digest = TDigest.from_values([2.71, 3.14, 1.42])

Mathematical functions

Estimate the value at the rank q using quantile(q):

digest = TDigest.from_values(range(1001))
print("99th percentile:", digest.quantile(0.99))

Or the inverse - use cdf to find the rank (cumulative probability) of a given value:

print("cdf(990) =", digest.cdf(990))

Compute the arithmetic mean, or the trimmed_mean between two quantiles:

data = list(range(11))  # numbers 0-10
data[-1] = 100_000  # extreme outlier
digest = TDigest.from_values(data)
print(f"        Mean: {digest.mean()}")
print(f"Trimmed mean: {digest.trimmed_mean(0.1, 0.9)}")
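
For reference, the arithmetic behind this example can be worked out exactly with the standard library alone. The helper exact_trimmed_mean below is hypothetical and not part of fastDigest. It assumes that a value straddling a cut-off contributes only the fraction of its weight that lies inside the interval. The digest computes its estimate from centroids, so its internals may differ in detail.

```python
def exact_trimmed_mean(values, q1, q2):
    """Mean of the sorted values whose rank lies between q1 and q2.

    Each value carries weight 1; a value that straddles a cut-off
    contributes only the fraction of its weight inside the interval.
    """
    ordered = sorted(values)
    n = len(ordered)
    lo, hi = q1 * n, q2 * n  # cut-offs in units of cumulative weight
    total = weight = 0.0
    for i, x in enumerate(ordered):
        # Value i occupies the cumulative-weight interval [i, i + 1).
        w = max(0.0, min(i + 1, hi) - max(i, lo))
        total += w * x
        weight += w
    return total / weight


data = list(range(11))  # numbers 0-10
data[-1] = 100_000  # extreme outlier

print("Exact mean:", sum(data) / len(data))
print("Exact trimmed mean:", exact_trimmed_mean(data, 0.1, 0.9))
```

The outlier dominates the plain mean, while trimming 10% from each end discards it entirely, which is why the two figures differ by several orders of magnitude.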

Updating a TDigest

Use batch_update to merge a sequence of many values at once, or update to add one value at a time:

digest = TDigest()
digest.batch_update([0, 1, 2])
digest.update(3)

Note: These methods are not the same - they are optimized for different use-cases, and there can be significant performance differences.

Merging TDigest objects

Use the + operator to create a new instance from two TDigests, or += to merge in-place:

digest1 = TDigest.from_values(range(20))
digest2 = TDigest.from_values(range(20, 51))
digest3 = TDigest.from_values(range(51, 101))

digest1 += digest2
merged_new = digest1 + digest3

The merge_all function offers an easy way to merge an iterable of many TDigests:

from fastdigest import TDigest, merge_all

digests = [TDigest.from_values(range(i, i+10)) for i in range(0, 100, 10)]
merged = merge_all(digests)

Dict conversion

Obtain a dictionary representation by calling to_dict() and load it into a new instance with TDigest.from_dict:

from fastdigest import TDigest
import json

digest = TDigest.from_values(range(101))
td_dict = digest.to_dict()
print(json.dumps(td_dict, indent=2))
restored = TDigest.from_dict(td_dict)

Migration

The fastDigest API is designed to be backward compatible with the tdigest Python library. Migrating is as simple as changing your import statement.

Dicts created by tdigest can also natively be used by fastDigest.

Benchmarks

  • Task: Construct a digest of 1,000,000 uniformly distributed random values and estimate their median (average of 10 consecutive runs).
  • Test environment: Python 3.12.12, MacBook Pro (M4 Pro), macOS 15.7.2 Sequoia

Library      Time (ms)   Relative speed
tdigest          9,773               1x
pytdigest           54             180x
fastdigest          20             480x

If you want to try it yourself, install fastDigest (and optionally tdigest and/or pytdigest) and run:

python benchmark.py

License

fastDigest is licensed under the MIT License. See the LICENSE file for details.

Acknowledgements

Credit goes to Ted Dunning for inventing the t-digest. Special thanks to Andy Lok and Paul Meng for creating the tdigests and tdigest Rust libraries, respectively, as well as to all PyO3 contributors.
