T-Digest Design Proposal #2542
LindaSummer
started this conversation in General

-
Currently the data structures in kvrocks are NOT composable, which means you cannot simply construct a Redis list and declare that it is part of your new data structure. All RocksDB keys inside one object (at the Redis level) must share the same user key and differ only in their sub keys. So you may need to clearly describe the format of all sub keys in your data structure.
-
Introduction
Redis Stack supports a new probabilistic data structure, t-digest.
In #2423, we plan to support t-digest as well.
Basics of T-Digest
The original paper is [Computing Extremely Accurate Quantiles Using t-Digests](https://arxiv.org/abs/1902.04023) [1].
Thanks to the blog T-Digest in Python [7] and the slides [8], I gained a better understanding of t-digest.
The main idea of t-digest is to divide the data into small bins and store the mean and count of each bin, a.k.a. centroids.
This compresses a range of data into a single centroid holding just a mean and a weight: the mean is the average of the data in the bin, and the weight is the number of data points compressed into it.
This technique is called sketching; a sketch is necessary when we need to handle a large amount of data.
We use these centroids, together with interpolation, to estimate quantiles of the data.
We lose some precision on the original data, but gain the ability to merge digests cheaply and estimate quantiles.
We use the potential (scale) function `k` to control the distribution of the bins. The function takes a parameter `delta` ($\delta$) that controls the error of the distribution.

Implementation
The original implementation [2] contains two variants: `MergingDigest` and `AVLTreeDigest`. Since we build on RocksDB, we need to map the internal structure onto RocksDB.
After reading the Apache Arrow implementation [5], I think we can follow its approach and use RocksDB to store the internal state.
metadata
In the metadata, we should store the compression ($1/\delta$), the total weight, and the minimum and maximum of the t-digest.
centroids
The main structure of a t-digest is a sorted array of centroids. The Redis `SortedSet` should be a good fit: for each centroid, we can use the member key as the mean and the score as the weight.
To make doubles sort correctly as byte strings, we can use an order-preserving serialization for doubles:
if the number is negative, flip all of its bits; otherwise flip only the sign bit.
When searching for centroids, we can use `ZRANGE`-like logic (including its `REV` variant) to fetch adjacent centroids.

temp buffer
The temp buffer holds data waiting to be merged into the centroids.
It doesn't need to be ordered, so a simple Redis `List` is fine.

concurrency safety
Reads can simply use a RocksDB snapshot; no lock is needed.
Writes should be done in a transaction covering the metadata and all related keys.
References
[1] Computing Extremely Accurate Quantiles Using t-Digests, https://arxiv.org/abs/1902.04023
[2] https://github.com/tdunning/t-digest
[3] https://issues.apache.org/jira/browse/ARROW-11367
[4] apache/arrow#9310
[5] https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/tdigest.cc
[6] https://redis.io/docs/latest/develop/data-types/probabilistic/t-digest/
[7] T-Digest in Python (blog post)
[8] https://blog.bcmeng.com/pdf/TDigest.pdf