T-Digest Design Proposal #2542
LindaSummer
started this conversation in General

-
Currently the data structures in kvrocks are NOT composable, which means you cannot simply construct a Redis list and declare that it is part of your new data structure. All RocksDB keys inside one object (at the Redis level) must share the same user key and differ only in their sub keys. So you may need to clearly describe the format of all sub keys in your data structure.
-
Introduction
Redis Stack supports a new probabilistic data structure, t-digest.
In #2423, we plan to support t-digest as well.
Basics of T-Digest
The original paper is [Computing Extremely Accurate Quantiles Using t-Digests](https://arxiv.org/abs/1902.04023) [1].
Thanks to the blog T-Digest in Python [7] and the slides [8], I gained a better understanding of t-digest.
The main idea of t-digest is to divide the data into small bins and store the mean and count of each bin, a.k.a. centroids.
This compresses a range of data into a single centroid holding just a mean and a weight: the mean is the average of the data in the bin, and the weight is the number of data points compressed into it.
This technique is called sketching; a sketch is necessary when we need to handle a large amount of data.
We use these centroids, together with interpolation, to estimate quantiles of the data.
We lose some precision on the original data, but gain the ability to merge digests cheaply and estimate quantiles.
We use the potential (scale) function `k` to control the distribution of the bins. The function takes a parameter `delta` ($\delta$) that controls the error of the distribution.

Implementation
The original implementation [2] contains two variants: `MergingDigest` and `AVLTreeDigest`. Since we build on RocksDB, we need to map the internal structure onto RocksDB.
After reading the Apache Arrow implementation [5], I think we can follow its approach and use RocksDB to store the internal state.
metadata
In the metadata, we should store the compression ($1/\delta$), the total weight, and the minimum and maximum of the t-digest.
centroids
The main structure of a t-digest is a sorted array of centroids. The Redis `SortedSet` should be a good fit: for each centroid, we can use the member key as the mean and the score as the weight.
To make doubles sort correctly as byte strings, we can use an order-preserving serialization for doubles:
if the number is negative, flip all of its bits; otherwise flip only the sign bit.
When searching for centroids, we can use `ZRANGE`-like logic (including its `REV` variant) to fetch adjacent centroids.

temp buffer
The temp buffer holds data waiting to be merged into the centroids.
It doesn't need to be ordered, so a simple Redis `List` is fine.

concurrency safety
Reads can simply use a RocksDB snapshot; no lock is needed.
Writes should be done in a transaction covering the metadata and all related keys.
References
[1] Computing Extremely Accurate Quantiles Using t-Digests, https://arxiv.org/abs/1902.04023
[2] https://github.com/tdunning/t-digest
[3] https://issues.apache.org/jira/browse/ARROW-11367
[4] apache/arrow#9310
[5] https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/tdigest.cc
[6] https://redis.io/docs/latest/develop/data-types/probabilistic/t-digest/
[7] T-Digest in Python (blog post)
[8] https://blog.bcmeng.com/pdf/TDigest.pdf