Possible improvement to the speed of the algorithm #61
To the above formula for …
Yes. Head sum is the critical issue. What I would recommend is that you take a look at the MergingArray implementation. No need for trees. Very fast insertion. The algorithm is even simpler than the tree versions.

The idea is that you simply accumulate new data into an array. When the array fills, you sort it and merge it with the existing t-digest centroids. Because the merge traverses the merged data in order, all of the head-sum work is handled trivially. This also allows an incremental improvement suggested by otmar: we can get a very precise criterion for whether new data and existing centroids should merge, so that there is a finite bound on the size of the entire data structure. As a result, we can also avoid all dynamic allocation while the algorithm is running, while preserving accuracy as before.

So far, I have only been able to match the performance of the existing tree-based algorithm (the AVL tree), but I really think that we could beat it significantly. Right now, the cost is about 200 ns per new data point. I think that can drop noticeably.
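For concreteness, here is a minimal sketch of that buffer-and-merge idea in C++. This is an editorial illustration, not the MergingDigest source: the `BufferedDigest` name and the `4·N·q·(1−q)/δ` size limit are assumptions taken from the original t-digest bound, not from this thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Centroid {
    double mean;
    double count;
};

// Hedged sketch of the buffer-and-merge idea (not the actual MergingDigest code).
class BufferedDigest {
public:
    BufferedDigest(double compression, std::size_t buffer_size = 8192)
        : compression_(compression), buffer_size_(buffer_size) {
        buffer_.reserve(buffer_size_);
    }

    void add(double x) {
        buffer_.push_back({x, 1.0});
        if (buffer_.size() >= buffer_size_) flush();
    }

    // Sort the buffered points together with the existing centroids and merge
    // them in a single ordered pass; the head sum is just the weight already
    // finalized to the left, so no tree or per-insertion rebalancing is needed.
    void flush() {
        if (buffer_.empty()) return;
        std::vector<Centroid> all(centroids_);
        all.insert(all.end(), buffer_.begin(), buffer_.end());
        buffer_.clear();
        std::sort(all.begin(), all.end(),
                  [](const Centroid& a, const Centroid& b) { return a.mean < b.mean; });

        double total = 0;
        for (const Centroid& c : all) total += c.count;

        std::vector<Centroid> merged;
        merged.push_back(all[0]);
        double head = 0;  // weight of centroids already emitted to the left
        for (std::size_t i = 1; i < all.size(); ++i) {
            Centroid& cur = merged.back();
            double w = cur.count + all[i].count;
            double q = (head + w / 2) / total;                      // centre of the would-be merge
            double limit = 4 * total * q * (1 - q) / compression_;  // assumed admissible weight
            if (w <= limit) {
                // Merge: weighted average of the means.
                cur.mean = (cur.mean * cur.count + all[i].mean * all[i].count) / w;
                cur.count = w;
            } else {
                head += cur.count;
                merged.push_back(all[i]);
            }
        }
        centroids_ = std::move(merged);
    }

private:
    double compression_;
    std::size_t buffer_size_;
    std::vector<Centroid> buffer_;
    std::vector<Centroid> centroids_;
};
```

Because the merge pass only touches two flat arrays, the per-point cost is dominated by the amortised sort of the buffer, which is what makes the approach competitive with the tree-based digests.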
Ah! So sorry, somehow I missed …
Hmm. I tried using the merging condition of …
Yes, the integrated location approach does give fewer centroids. In fact, … Bounds are good.
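The "integrated location" criterion referred to here is, as far as I can tell, the scale-function rule described in later t-digest write-ups: map a quantile q to a value k(q) and only merge neighbours whose combined span stays within one unit of k, which is what bounds the total number of centroids. A hedged sketch follows, assuming the arcsine scale function; the exact normalisation used in this thread is not shown.

```cpp
#include <cmath>

// Assumed arcsine scale function: maps a quantile q in [0, 1] to the k-scale,
// where delta is the compression parameter. k(0) = 0 and k(1) = delta.
double k_scale(double q, double delta) {
    const double pi = std::acos(-1.0);
    return delta * (std::asin(2 * q - 1) / pi + 0.5);
}

// Two adjacent centroids covering quantiles [q_left, q_right] may be merged
// only if the merged centroid spans at most one unit of k. Since the whole
// k-range is delta, a fully merged digest can hold only on the order of
// delta centroids, regardless of how much data has been seen.
bool can_merge(double q_left, double q_right, double delta) {
    return k_scale(q_right, delta) - k_scale(q_left, delta) <= 1.0;
}
```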
Is there further action required here? Did you get a fully functional C++ version going? Did you want to host it as part of this main repo?
No, I think … As for C++ implementations, I've done several, and I feel like once you understand the algorithm it becomes very easy to reimplement and adapt it in a way suitable for the task at hand, while a generic implementation worth making official would take too much work or become very complex and hard to understand. There are a lot of different ways of computing quantiles and percentiles, for example, as well as choices of data types. I'm not sure I would be able to summon enough will to do the grunt work in my free time.
OK. Btw... I have some new ideas coming that could improve accuracy and …
Hi, I've recently learned about t-digest and really like the algorithm. I coded a very naive implementation in C++ using `std::multimap` for the centroids and found 3 µs per measurement way too expensive (I'm trying to use it in a fast map/reduce-like setting with up to 10 million measurements per chunk, and 3 µs made the response time unacceptable).

However, while looking for ways to improve the speed, I found that too much time was spent computing the head sum after finding the closest centroids, and that this cost could be amortised: I first collect a big enough array of measurements (e.g. 8192), which, when full, is turned into centroids with weight 1 and sorted together with the rest of the summary. Then I go over the centroids from left to right and compress them by checking whether two adjacent centroids can be merged: I compute `q = sum + (l.count + r.count) / 2` (for the possibly merged centroid) and compute `k` based on it. If `l.count + r.count` is less than or equal to `k`, I merge the two centroids into one, which becomes the new `l`; otherwise I consider the next pair, and so forth.

In my very simple checks on random data so far, this seems to give less accurate results than the original t-digest, but the accuracy was still pretty good considering the final summary size, and the best part is that this variant gives me around 70 ns per measurement. I considered other variants, like computing all the head sums and then trying to merge the closest or smallest centroids first, but that didn't improve accuracy much while slowing things down.

I'm wondering what you might think about this extremely simplified version and whether there are any fatal flaws in using it like this?
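A minimal sketch of the compression pass described in this comment, as I read it. Note that the comment does not spell out how `k` is derived from `q`; the sketch assumes the original t-digest size bound `4·N·q·(1−q)/δ`, and the `compress` function and its parameters are illustrative names, not code from the thread.

```cpp
#include <vector>

struct Centroid {
    double mean;
    double count;
};

// In-place left-to-right compression of a mean-sorted centroid list.
// `total` is the total weight of all centroids, `delta` the compression.
void compress(std::vector<Centroid>& cs, double total, double delta) {
    if (cs.size() < 2) return;
    std::vector<Centroid> out;
    out.push_back(cs[0]);
    double head = 0;  // total weight of centroids already finalized to the left
    for (std::size_t i = 1; i < cs.size(); ++i) {
        Centroid& l = out.back();
        const Centroid& r = cs[i];
        double merged = l.count + r.count;
        double q = (head + merged / 2) / total;     // quantile of the would-be merged centroid
        double k = 4 * total * q * (1 - q) / delta; // assumed admissible size at that quantile
        if (merged <= k) {
            // Merge r into l; l remains the left element for the next comparison.
            l.mean = (l.mean * l.count + r.mean * r.count) / merged;
            l.count = merged;
        } else {
            head += l.count;
            out.push_back(r);
        }
    }
    cs = std::move(out);
}
```

In the scheme described above, this pass would be invoked once per flush: the buffered measurements are appended as weight-1 centroids, sorted together with the existing summary, and then compressed in a single scan, which is where the amortised ~70 ns per measurement comes from.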