
Add lock-free histogram synchronization for atomic reads #130

Closed
NikodemGapski opened this issue Dec 22, 2024 · 5 comments
Labels
wontfix This will not be worked on

Comments

@NikodemGapski

histogram: Add lock-free synchronization for atomic reads

Issue

As of now, the implementation of the atomic histogram has no support for atomically taking a snapshot of the histogram's state. That is, there is no way to guarantee that there existed a moment in time at which the histogram's state was exactly the same as the taken snapshot.

The provided drain and load methods might overlap with increment operations, skewing the data towards higher bucket indices (buckets read later in the scan have more time to accumulate new increments than those read earlier).

I adapted this crate to use a lock-free mechanism for writes that also offers atomic reads. The code can be found here. I think such a mechanism could prove useful to the users of this crate, so I would like to contribute said implementation to this repository. I believe its guarantees and read/write speed make it close to a universal, go-to solution for users of atomic histograms.

The solution I linked (from the Scylla Rust driver fork repo) is still a little rough around the edges and lacks tests. I am going to add those very soon, though, so please do not be discouraged by the current state of the code.

Algorithm

The idea is taken from the Prometheus library in Go (talk, repo).

Overview

We divide the histogram users into two categories: observations (incrementing a bucket) and snapshots (atomically reading the histogram state). We assume the number of observations far exceeds the number of snapshots, so that is what we optimize for.

Guarantees:

  • an observation might occur at any moment without delay,
  • observations are fast,
  • only one snapshot can be happening at a time (others wait on a mutex).

With such guarantees we can design an algorithm around two bucket arrays (effectively two histograms): a hot one and a cold one (which is which changes dynamically). The sum of their values will reflect the current state of the histogram, the cold values will reflect a frozen state from the specific moment of a snapshot, and all observations will start on the hot array.

Synchronization

We need two atomic counters (one for each array) counting the number of finished observations recorded there, plus an additional atomic flag-counter that packs the total number of started observations together with a flag (the index of the hot array).
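For illustration, here is a minimal sketch of that layout in Rust. The names (LockFreeHistogram, hot_idx_and_count, and the rest) are hypothetical stand-ins for the description above, not code taken from the linked implementation:

```rust
use std::sync::atomic::AtomicU64;
use std::sync::Mutex;

// Top bit of the flag-counter = index of the hot array,
// lower 63 bits = number of started observations.
const HOT_BIT: u64 = 1 << 63;
const COUNT_MASK: u64 = HOT_BIT - 1;

struct LockFreeHistogram {
    // Flag-counter described above.
    hot_idx_and_count: AtomicU64,
    // Two bucket arrays; which one is hot changes on every snapshot.
    buckets: [Box<[AtomicU64]>; 2],
    // Per-array counters of finished observations.
    finished: [AtomicU64; 2],
    // Only one snapshot may run at a time.
    snapshot_lock: Mutex<()>,
}

impl LockFreeHistogram {
    fn new(num_buckets: usize) -> Self {
        let array = || -> Box<[AtomicU64]> {
            (0..num_buckets).map(|_| AtomicU64::new(0)).collect()
        };
        Self {
            hot_idx_and_count: AtomicU64::new(0),
            buckets: [array(), array()],
            finished: [AtomicU64::new(0), AtomicU64::new(0)],
            snapshot_lock: Mutex::new(()),
        }
    }
}
```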

That way, an observation can atomically read the hot index and update the number of started observations. Later (after it is finished) it can update the number of finished observations on the array it was operating on.
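Continuing that sketch, an observation could look roughly like this. The memory orderings are purely illustrative; as discussed further down in the thread, choosing them correctly is the subtle part:

```rust
use std::sync::atomic::Ordering;

impl LockFreeHistogram {
    // Observation: one read-modify-write on the flag-counter, one bucket
    // increment, one finished-counter increment.
    fn increment(&self, bucket_idx: usize) {
        // Atomically bump the started-observation count and learn which
        // array is currently hot.
        let state = self.hot_idx_and_count.fetch_add(1, Ordering::AcqRel);
        let hot = ((state & HOT_BIT) != 0) as usize;

        self.buckets[hot][bucket_idx].fetch_add(1, Ordering::Relaxed);

        // Mark the observation as finished so a concurrent snapshot
        // can stop spin-waiting on this array.
        self.finished[hot].fetch_add(1, Ordering::Release);
    }
}
```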

On the other hand, a snapshot can atomically read the number of started observations and change the hot index. Then it can spin-wait for all started observations to finish, as they are still operating on the now-cold array (observations are fast, so this is not a problem). Afterwards, the snapshot can be sure it has exclusive access to a frozen state of the histogram reflecting all observations that started up to the moment of changing the hot index flag.

Once the snapshot is taken, it suffices to drain the cold array and counter into the hot ones (that way the hot array keeps the full state of the histogram until the next snapshot) and release the snapshot's mutex.
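And the snapshot path under the same assumptions, again as a hedged sketch rather than the actual proposed code:

```rust
impl LockFreeHistogram {
    // Snapshot: flip the hot index, spin-wait for in-flight observations on
    // the now-cold array, read it, then drain it back into the hot side.
    fn snapshot(&self) -> Vec<u64> {
        // Only one snapshot at a time.
        let _guard = self.snapshot_lock.lock().unwrap();

        // Atomically flip the hot bit; the returned value tells us which
        // array just became cold and how many observations have started.
        let old = self.hot_idx_and_count.fetch_xor(HOT_BIT, Ordering::AcqRel);
        let cold = ((old & HOT_BIT) != 0) as usize;
        let started = old & COUNT_MASK;

        // Spin until every observation that started on the cold array
        // has finished (observations are fast, so this is short).
        while self.finished[cold].load(Ordering::Acquire) < started {
            std::hint::spin_loop();
        }

        // We now have exclusive access to a frozen state: copy it out.
        let frozen: Vec<u64> = self.buckets[cold]
            .iter()
            .map(|b| b.load(Ordering::Relaxed))
            .collect();

        // Drain the cold buckets and counter into the hot ones, so the hot
        // array again holds the full histogram state until the next snapshot.
        let hot = 1 - cold;
        for (c, h) in self.buckets[cold].iter().zip(self.buckets[hot].iter()) {
            h.fetch_add(c.swap(0, Ordering::Relaxed), Ordering::Relaxed);
        }
        let done = self.finished[cold].swap(0, Ordering::Relaxed);
        self.finished[hot].fetch_add(done, Ordering::Relaxed);

        frozen
        // snapshot_lock is released when _guard goes out of scope.
    }
}
```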

Concurrency guarantees

Thanks to the snapshot mutex and the fact that the hot flag only changes during snapshots, we get a guarantee that the flag will not change while the histogram data is being accessed*, as well as a guarantee of information-flow liveness (consecutive stacked snapshots will not freeze the accessed data entirely).

*The flag may still get modified then, but only when the number of observations exceeds 2^63 - 1 and the started-observation count overflows into the flag bit. That number is very close to the functional limit of the histogram anyway, though.

The question

The main question is, would you even be interested in such a solution for this crate? And if so, should it be a separate module (as in the provided implementation), or get integrated into the AtomicHistogram class?

@NikodemGapski
Author

Update: I already implemented some concurrency-related tests (along with a few bug-fixes).
The code can be found here.

@brayniac
Contributor

brayniac commented Jan 6, 2025

Hi @NikodemGapski -

Some thoughts on this. First, as a question: have you seen the AtomicHistogram, which does not provide an atomic snapshot, cause measurable skew, or is this more of a theoretical concern?

A warning that this is all back-of-the-envelope calculation and my numbers are based on single-threaded operations in isolation, but considering the existing implementation:

On my Apple M1 Max MacBook, load() takes 2.4µs and each increment takes 2.1ns. Assuming secondly snapshots and an increment rate of 500M/s, that's less than 1500 increments, which would be 0.0003% of the samples in that second. Even for a worst-case scenario (increments all landing in the max bucket index), I'm having a hard time imagining this creating some meaningful skew. This is for a 7, 64 histogram, which is enough to represent any 64-bit (u64) value with a relative error of ~0.781%.

I'm guessing from the code that you care more about a 12, 18 config. That's closer to 9.1µs to snapshot on the same machine, which is enough to shift a 99.999th percentile, but I'm not sure it would really shift a 99.99th, for example. We're talking about something on the order of 0.001% of the samples getting incremented during the snapshot. I think even in the worst case this isn't going to shift the 99.99th percentile on a secondly basis? The relative error (for this config it's 0.0244%) is so much higher than this possible skew.
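For reference, a quick sketch of that arithmetic; the figures are the estimates quoted above, not new measurements:

```rust
fn main() {
    let rate = 5.0e8_f64; // assumed increment rate: 500M increments/s

    // 7, 64 config: load() ~ 2.4 µs -> ~1_200 overlapping increments,
    // i.e. roughly 0.0002-0.0003% of that second's samples.
    let overlap_7_64 = rate * 2.4e-6;

    // 12, 18 config: load() ~ 9.1 µs -> ~4_550 overlapping increments,
    // i.e. roughly 0.001% of that second's samples.
    let overlap_12_18 = rate * 9.1e-6;

    println!("{overlap_7_64:.0} and {overlap_12_18:.0} increments during load()");
}
```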

It's possible you care about finer-grained timescales, in which case that could creep in and affect the tail if there were a run of increments for only the highest bucket index during the snapshot process. It's also possible you've done some more formal analysis and/or have evidence that this is a real-world problem.

Side note: I'm not sure the math here is correct. The stated range (limited by the 2^18 histogram power) is 0..262144. With a grouping power of 12, that's 28672 buckets - or about 230KB per histogram. I'm not quite sure why the calculated size there is so much higher.
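For what it's worth, the bucket math works out as follows, assuming the (max_value_power - grouping_power + 1) * 2^grouping_power bucket layout that the 28672 figure implies:

```rust
fn main() {
    // grouping_power = 12, max_value_power = 18 (the "12, 18" config above).
    let (a, b) = (12u32, 18u32);
    let buckets = u64::from(b - a + 1) << a; // 7 * 4096 = 28_672 buckets
    let bytes = buckets * 8;                 // 229_376 B, about 230 KB per bucket array
    println!("{buckets} buckets, {bytes} bytes");
}
```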

I think overall the algorithm makes sense. If we move to include this, I think it should be its own implementation within the crate. There are now 3 atomic read-modify-writes for every increment instead of just 1, and two of those are on atomics that other threads will also be incrementing. I'm curious if you have done any benchmarks around this implementation, particularly when two or more threads are incrementing at the same time (with each thread pinned to a different physical core). Ideally we would also have some idea of the performance in NUMA situations too.

I think if we include this, we'd need to be able to articulate in documentation when this implementation should be used over the existing AtomicHistogram. I'm curious in which situations you see there to be a meaningful impact in the correctness.

@NikodemGapski
Author

Here's a bunch of thoughts regarding the points you mentioned.

The concern around AtomicHistogram guarantees is, as of now, theoretical. I haven't (yet) created any specific benchmarks to show whether it's significant in practice, mostly due to my hesitation to treat my personal computer as a reference point.

That being said, while I'm working on designing these benchmarks, I can think of some scenarios where the relative error due to .load() skew increases to a significant degree.

Skew error scenarios

First, as you stated, finer-grained timescales increase the relative error proportionately.

In the example you provided, the core assumption was an even density of increments across the time period. However, if increments happen in bursts, with the time between bursts close to or exceeding the snapshot period, the relative error of some measurements skyrockets.

Taking:

  • the numbers from your measurements (1 snapshot ≈ 1500 increments),
  • secondly snapshots,
  • on average secondly bursts,
  • a burst length of n * 1500 increments,

we can calculate the probability of a snapshot occurring with a relative error of at least 1/n:

// Probability that a given (secondly) snapshot overlaps a burst and thus
// has a relative error of at least 1/n:
fn faulty_snapshot_prob(n: f64) -> f64 {
    let inc_time = 2.0;                                // one increment ~ 2 ns
    let second_inc = 1e9 / inc_time;                   // ~5e8 increments fit in a second
    let snapshot_inc = 1500.0;                         // one snapshot spans ~1500 increments
    let target_time_period = (n - 1.0) * snapshot_inc; // #{moments the burst can start}
    target_time_period / second_inc                    // target_prob ~ (n - 1) / 3e5
}

So, e.g., the resulting probability of a relative error of at least 6.25% (n = 16) is 15/3e5 = 5e-5, which is, on average, one such faulty snapshot every 5.5 hours; for a long-running server that might be relatively often.

Obviously, when a combination of the above conditions is met, the relative error and/or the probability of it occurring increase even more, which for some systems may become unacceptable.

Side note: the 12, 18 config was just a "rule of thumb" kind of choice in my commit, so I wouldn't read too much into it. It does, however, highlight another condition which may increase the relative skew error in some scenarios (a choice of a higher grouping power).

Also, it's important to note that while it takes some effort to think of scenarios where the skew error would be noticeable for common percentile values, the Histogram also offers an .into_iter() API. Because of this, it might be consumed bucket-wise in some cases, and then the skew-related errors become more apparent.

Ordering guarantees

There is one more thing, which isn't necessarily important for the Scylla driver itself, that might make LockFreeHistogram a desirable choice over AtomicHistogram for some.

Note: I think there is actually a bug related to this issue in my implementation right now. It has to do with the currently-used weak ordering for atomic counters and buckets (Go, which this algorithm was adapted from, has stronger ordering guarantees than that, as I now see).

Because AtomicHistogram uses weak ordering for bucket increments (understandably, for performance reasons, I believe), there is no ordering guarantee between increments to different buckets, even from a single thread. To stress it even more, there is very little possibility of formal (OS/hardware-independent) reasoning about the histogram's state at any point in time. I can imagine use cases which depend on such reasoning (e.g. high-assurance scenarios where a concurrent program must be provable). Such use cases probably don't include plain metrics, which, I guess, tend to have some general error tolerance, but I believe the users of this crate are a superset of those using it only for metrics.

LockFreeHistogram, on the other hand, by routing all increment/snapshot traffic through a read-modify-write sequence on one atomic variable (hot_idx_and_count), serialises these accesses into a single order (respecting per-thread instruction sequencing) that is the same for all observers.
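To make the distinction concrete, here is a small illustration (not code from either crate) of what Relaxed increments to different buckets do and do not guarantee:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static BUCKET_A: AtomicU64 = AtomicU64::new(0);
static BUCKET_B: AtomicU64 = AtomicU64::new(0);

// One thread increments two different buckets in program order.
fn writer() {
    BUCKET_A.fetch_add(1, Ordering::Relaxed);
    BUCKET_B.fetch_add(1, Ordering::Relaxed);
}

// Another thread may nevertheless observe the *second* increment without
// the first: (a, b) == (0, 1) is a permitted outcome under Relaxed ordering,
// because nothing orders the two buckets relative to each other.
fn reader() -> (u64, u64) {
    let b = BUCKET_B.load(Ordering::Relaxed);
    let a = BUCKET_A.load(Ordering::Relaxed);
    (a, b)
}

// Routing every increment and snapshot through a read-modify-write on a
// single shared atomic (hot_idx_and_count in the proposal) instead gives
// all threads one agreed-upon order for those operations.
```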

Lock-free histogram benchmarks

When it comes to benchmarking the lock-free histogram implementation, I can, again, make some guesses (I don't have access to any server, so my benchmarking capabilities for multiple NUMA nodes are limited; for other scenarios I'm going to write some benchmarks soon).

The main difference in performance should stem from the serialisation of all operations, which requires cache synchronisation across threads. Specifically, while AtomicHistogram's total increment duration was bounded below by (roughly) AVG_SYNC_TIME * MAX{bucket[i].count()} (synchronisation was needed only bucket-wise), for LockFreeHistogram it is instead bounded below by AVG_SYNC_TIME * SUM{bucket[i].count()}, which might be significantly longer, depending on the bucket distribution. (Even though there are 3 atomic counters in each increment operation which need to be serialised, I believe the total duration doesn't depend on that, as accesses to these counters are somewhat "pipelined" by consecutive increments*.)

*I'm not yet sure if that's correct (it should be true with the aforementioned bug present, but in the end it may depend on the ordering I choose to fix it with).

In particular, given that (if I recall correctly) an atomic increment requires synchronisation in:

  • L2 cache for threads on one NUMA node,
  • L3 cache for threads on different NUMA nodes,

I would guess the sync time bounds to be around 20ns for L2 and 240ns for L3. That is assuming 3 accesses per increment:

  • a miss on a cache line owned by a foreign core,
  • an update of the current core's value into the foreign core's value (possibly bypassing the corresponding cache and notifying the current core directly; I'm not entirely sure if that's the case),
  • a write-back to cache with the incremented value (this doesn't cause a delay for this core, but it might for all the others).

That being said, I would take these calculations with a grain of salt, as I might have missed something and the exact mechanisms are possibly architecture-specific. However, a key takeaway is that due to its stronger guarantees, LockFreeHistogram most likely comes with a significant performance decrease compared to AtomicHistogram.

Other notes

BTW: you're absolutely right about my config math being incorrect; I made a mistake there. I'm hoping this time I did the math right.

@brayniac
Contributor

Thanks for the thoughtful response.

For the skew error scenarios, I'd like to note that IntoIter is only implemented for the non-atomic histogram. One must .load() the atomic histogram into an owned copy first, and this copy does not allow concurrent mutation (it takes &mut for all writes). I think the only exposure here is during .load(), and I'm still not convinced this is a significant issue in practice when thinking about long-running services (e.g. caches, databases, etc.).

In terms of ordering guarantees, I'm unaware of any users desiring different guarantees there. You're correct that the weak ordering is for performance reasons where these increments happen on the critical traffic-serving path for a few projects I work on. I'm hesitant to try to solve a problem that has no concrete use-case.

I'd be curious to see benchmark results. I believe that most modern micro-architectures do not have a unified L2 for an entire NUMA domain, so I'm really curious what the performance of the suggested implementation looks like.

I think at this point I'm a bit hesitant to accept a contribution along these lines, as I'm still not convinced that this is going to cause skew that's measurable in practical applications. I believe that for the Scylla DB driver the existing implementation would be sufficient to provide good signal into performance.

@brayniac
Contributor

Closing this out for now.
