ConcurrentLfu time-based expiry #516

Merged: 34 commits merged into main from users/alexpeck/lfuexpire, May 8, 2024

Conversation

@bitfaster (Owner) commented Nov 25, 2023

Support all time-based expiry modes for ConcurrentLfu.

  • Extend node types to support the different access order lists as described here (see the sketch after this list)
    • AccessOrderNode: the existing scheme is access order, but buffer array ops are slower when the node type is not sealed (due to extra type checking), so derive a new sealed type (this becomes a generic arg to ConcurrentLfu later)
    • AccessOrderExpiringNode: supports expire after access; just needs a time stamp
    • TimeOrderNode: supports both expire after write and custom modes; requires a time order list plus a time stamp
    • Think of some better names for these
  • Port TimerWheel from Caffeine
  • Define node policy interface required to expire items for ExpireAfterWrite, ExpireAfterAccess and ExpireAfter
  • Plug all the policy hook points into ConcurrentLfuCore
  • Implement an ExpireAfter policy that uses TimerWheel and IExpiryCalculator
  • Implement ConcurrentTLfu based on the ExpireAfter policy
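
To make the node layering concrete, here is a rough sketch of the shapes implied by the list above. It is illustrative only - the field names and the base type are assumptions, not the actual definitions in BitFaster.Caching/Lfu.

```csharp
// Illustrative sketch only; the real node types in BitFaster.Caching/Lfu differ in detail.
public class LfuNode<K, V>
{
    public K Key = default!;
    public V Value = default!;

    // access order list pointers (window, probation, protected)
    public LfuNode<K, V>? Previous;
    public LfuNode<K, V>? Next;
}

// Sealed so that buffer array operations avoid extra type checks.
public sealed class AccessOrderNode<K, V> : LfuNode<K, V>
{
}

// Expire after access: reuses the access order lists, only needs a time stamp.
public sealed class AccessOrderExpiringNode<K, V> : LfuNode<K, V>
{
    public long TimeToExpire;
}

// Expire after write and custom expiry: time stamp plus a second linked list
// kept in time order (the timer wheel buckets).
public sealed class TimeOrderNode<K, V> : LfuNode<K, V>
{
    public long TimeToExpire;
    public TimeOrderNode<K, V>? PrevTimeOrder;
    public TimeOrderNode<K, V>? NextTimeOrder;
}
```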

@bitfaster marked this pull request as draft November 25, 2023 07:11
@coveralls commented Nov 25, 2023

Coverage Status

coverage: 99.304% (+0.06%) from 99.242% when pulling 3e24514 on users/alexpeck/lfuexpire into 434a7be on main.

@ben-manes

fwiw, I implemented expireAfterAccess and expireAfterWrite prior to having the timing wheel. Otherwise I’d likely not have bothered and made them convenience builder methods that set the expireAfter policy. Instead I didn’t redo working code and left both approaches. Just fyi so you don’t think it was a performance need or something, I simply hadn’t known how to do the variable policy originally in a way that fit my design goals.

@bitfaster (Owner, Author)

Thanks for the insight, it makes total sense. I'm figuring out how to fit this together - from reading your Caffeine code, I see you generate a Node class with the required backing fields for each policy to minimize storage overhead.

For expireAfterAccess, since each of window, probation and protected is already in access order, it seems like it will be easy to walk the existing lists from the LRU position until a non-expired item is found (I think you suggested this last year, so I had it in my mind that this part is much easier to implement). This requires a node type with 1 additional field for the expiry time (so the overhead is prev+next+time).

For expireAfterWrite and expireAfter, another linked list is required for time order, so I planned to define a node type with an extra prev/next ptr (so the overhead is prev+next+time+timeprev+timenext). Last night I figured out these can both use the same node class, but I hadn't made the mental leap that both could use the timing wheel, which would indeed be simpler.

It seems like it's still worth implementing expireAfterAccess separately to piggyback on the existing access order lists and compact the node size.
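
A minimal sketch of that walk, assuming TimeToExpire is an absolute timestamp on the same clock as now; the node type and list handling are stand-ins rather than ConcurrentLfu's actual code.

```csharp
using System;

// Sketch only: walk an access ordered list from the LRU end, evicting expired
// entries until the first live one is found.
sealed class ExpiringNode
{
    public ExpiringNode? Prev;
    public ExpiringNode? Next;
    public long TimeToExpire;   // absolute time, same clock as 'now'
}

static class AccessOrderExpiry
{
    public static void ExpireFromLru(ref ExpiringNode? lruHead, long now, Action<ExpiringNode> evict)
    {
        // The list is in access order, so once a non-expired node is reached
        // everything after it is also non-expired and the walk can stop.
        while (lruHead is { } node && node.TimeToExpire - now < 0)
        {
            lruHead = node.Next;
            if (lruHead != null) lruHead.Prev = null;
            node.Next = null;
            evict(node);
        }
    }
}
```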

@ben-manes

A caveat for using the access order lists is that the read buffer is allowed to drop events. That could delay the eviction of the expired entry if the head is pinned to a non-expired item, as this user experienced. This shouldn't typically be a problem if the size is bounded or the workload varies, but if the user unintentionally pins the entry then there is unbounded growth. Of course the lossiness of the read buffer is crucial for concurrent read performance, whereas writes are not lossy for correctness so expireAfterWrite does not suffer from this problem.

In that user's scenario the solution was to switch to the timing wheel based approach. This corrected the problem because the wheels turn independently and, if the wheel encounters a misplaced entry, it will reschedule the entry rather than halt the expiration early. That makes it a more robust algorithm for malicious workloads.

I think that the vast majority of expiration should use expireAfterWrite (time-to-live) as a freshness mechanism, so I tend to view most usages of expireAfterAccess (time-to-idle) as a likely mistake. It's certainly very useful when applied correctly, just not commonly needed. That might mean that the optimization to remove those extraneous fields might not be worth your effort. When it works correctly it is very satisfying how elegant and simple it feels, so I am certainly not disfavoring it as a reasonable algorithmic choice if that's what you prefer.

@bitfaster (Owner, Author)

I hadn't anticipated that. I had thought the opposite might occur - the dropped reads would fail to update the access order and expiry time of a Node, and it would be erroneously expired due to having a stale time stamp. Most likely I'm thinking of doing something that won't actually work in practice, and I haven't figured it out yet.

Thanks for all these gems of wisdom. I'm still getting my head around the timer wheel algorithm and before I get too deep into that I need to do a few rounds of changes so I can add this cleanly.

@ben-manes

ben-manes commented Nov 26, 2023

I probably updated the timestamp on the request since the entry would be buffered for some unknown amount of time and figured that fuzziness of the replay order wouldn't skew it too badly.

The timer wheel is neat but confusing, and I'm not sure how much of that is inherent or my fault. I didn't have a code reference and, since I wanted to make it as fast as I could for fun, I avoided expensive division/modulus by using power-of-two time spans, which makes the code harder to follow. I later found other implementations and most seem to use a chain-of-responsibility by having wheel objects where items flow downward. Those seemed more generic/flexible, but scatter the algorithm across multiple files and perhaps do extra work. I guess at the cost of being brittle and confusing, mine is crazy fast and perfect for what I needed. I'll be interested to see if you'd approach it differently once you get comfortable with it all.
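
A tiny sketch of the power-of-two trick mentioned above: when a wheel's tick span and bucket count are both powers of two, finding a bucket is a shift and a mask rather than a division and a modulus. The constants here are illustrative, not Caffeine's actual values.

```csharp
static class WheelMath
{
    // Illustrative constants: ticks of 2^30 ns (~1.07 s) in a 64 bucket wheel.
    const int SpanShift = 30;
    const int BucketCount = 64;          // power of two
    const int BucketMask = BucketCount - 1;

    // Instead of (time / span) % bucketCount, shift and mask.
    public static int BucketIndex(long timeNanos)
    {
        long ticks = timeNanos >> SpanShift;   // time / 2^30
        return (int)(ticks & BucketMask);      // ticks % 64
    }
}
```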

@bitfaster (Owner, Author)

It's good to start out with the fast version and I expect I will learn a lot from it. Your frequency sketch code translated to C# really well and that was also a fun exercise.

I will try to get deeper into it this week; there are many questions in my mind about how to get this working, and I first need to retrofit the generic node type into ConcurrentLfu without making a mess. I need some focus time to analyze TimerWheel - it looks very interesting, and I also want to read the paper you linked in your source code for background.

Comment on lines +123 to +126
if ((node.GetTimestamp() - time) < 0)
{
    cache.Evict(node);
}

@ben-manes

fyi, prior to the loop the linked list is detached, so I rescheduled the node if it was not expired. Presumably that could be due to a concurrent update so the reordering had not happened yet, or similar scenarios. If you don't reschedule then detaching is probably unnecessary. If you do, then detaching avoids an accidental loop, e.g. the overflow bucket (1+ weeks) might scan over items that expire far into the distant future and reschedule them back into itself. Or maybe rescheduling at the head instead of the tail would resolve it. Just something to review and consider how you prefer handling these scenarios.
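
A hedged sketch of the shape being suggested, reusing the TimeOrderNode layout sketched in the PR description above and treating the evict and schedule operations as placeholders; this is not the PR's actual TimerWheel code.

```csharp
using System;

static class BucketExpiry
{
    // Sketch only: drain one detached timer wheel bucket. Because the bucket is
    // detached before the walk, rescheduling a live node cannot loop back into
    // the list currently being traversed.
    public static void ExpireDetachedBucket<K, V>(TimeOrderNode<K, V>? head, long time,
        Action<TimeOrderNode<K, V>> evict, Action<TimeOrderNode<K, V>> schedule)
    {
        for (var node = head; node != null; )
        {
            var next = node.NextTimeOrder;
            node.NextTimeOrder = null;
            node.PrevTimeOrder = null;

            if (node.TimeToExpire - time < 0)
            {
                evict(node);       // expired
            }
            else
            {
                schedule(node);    // not yet expired (e.g. concurrently updated): cascade it back into the wheel
            }

            node = next;
        }
    }
}
```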

@bitfaster (Owner, Author)

Thanks! You probably just saved me several hours of debugging - so far I had only tested superficially by replicating some of your unit tests and single stepping in the debugger to understand how it works. Now that you've pointed this out, it seems that I must call schedule here, otherwise nodes would always stay in the same bucket and never cascade down through the wheels. I had distracted myself trying to understand under what conditions a resurrection could occur and missed it.

I'm not sure if this is very useful to you at this point, but I don't think you have a unit test that would detect my mistake (at least in TimerWheelTest.java), since the only test you have calling advance more than once is advance_backwards.

I think I have mapped out how to fit everything together, but I need to do a lot more testing. You made a great point that all time-based policies can be implemented via TimerWheel, so I will start out with that and add the more specialized expire-after-access policy later as an optimization.

@ben-manes

good point! I'll look into adding an explicit unit test for this. I hope that it is covered indirectly by other tests due to the data integrity checks that are performed after a cache test passes.

@bitfaster (Owner, Author)

I suspected you had it covered - that's effectively what I would like to do next: a higher-level functional test suite, plus something more like a concurrent/stress test with constant expiry and integrity checks - so it's good to have that link for inspiration.

@ben-manes

I ported your test case, works great! 😄

@bitfaster (Owner, Author)

> I ported your test case, works great! 😄

It was more or less a copy-paste of your other test. It wasn't correct for me - since I am not using nanoseconds (Duration can use different time sources with different units to get better perf on different platforms), I had to adjust it a bit to get nodes into each wheel.

I have now added a way to assert on the position and print out the wheel within the tests; the next job is the schedule tests. Likely it is completely working with what I have, but I am still poking around to understand it.

ben-manes added a commit to ben-manes/caffeine that referenced this pull request Dec 10, 2023
@bitfaster (Owner, Author)

Baseline

| Method                  | Runtime            |      Mean |     Error |    StdDev | Ratio | Allocated |
|------------------------ |------------------- |----------:|----------:|----------:|------:|----------:|
| ConcurrentDictionary    | .NET 6.0           |  7.128 ns | 0.1347 ns | 0.1497 ns |  1.00 |         - |
| ConcurrentLfuBackground | .NET 6.0           | 21.053 ns | 0.4485 ns | 0.9060 ns |  2.89 |         - |
| ConcurrentLfuForeround  | .NET 6.0           | 48.797 ns | 0.9918 ns | 1.0185 ns |  6.86 |         - |
| ConcurrentLfuThreadPool | .NET 6.0           | 53.980 ns | 2.2947 ns | 6.7658 ns |  6.32 |         - |
| ConcurrentLfuNull       | .NET 6.0           | 20.419 ns | 0.4308 ns | 0.4789 ns |  2.87 |         - |
| ConcurrentDictionary    | .NET Framework 4.8 |  9.301 ns | 0.2144 ns | 0.2633 ns |  1.00 |         - |
| ConcurrentLfuBackground | .NET Framework 4.8 | 35.731 ns | 0.7128 ns | 0.9014 ns |  3.84 |         - |
| ConcurrentLfuForeround  | .NET Framework 4.8 | 72.231 ns | 1.3833 ns | 1.4206 ns |  7.77 |         - |
| ConcurrentLfuThreadPool | .NET Framework 4.8 | 33.419 ns | 0.6971 ns | 1.3093 ns |  3.60 |         - |
| ConcurrentLfuNull       | .NET Framework 4.8 | 24.960 ns | 0.5225 ns | 0.5132 ns |  2.68 |         - |

With time-based expiry changes:

| Method                  | Runtime            |      Mean |     Error |     StdDev | Ratio | Allocated |
|------------------------ |------------------- |----------:|----------:|-----------:|------:|----------:|
| ConcurrentDictionary    | .NET 6.0           |  7.309 ns | 0.1417 ns |  0.1842 ns |  1.00 |         - |
| ConcurrentLfuBackground | .NET 6.0           | 21.333 ns | 0.4566 ns |  0.9926 ns |  2.94 |         - |
| ConcurrentLfuForeround  | .NET 6.0           | 49.709 ns | 0.6833 ns |  0.6058 ns |  6.75 |         - |
| ConcurrentLfuThreadPool | .NET 6.0           | 34.427 ns | 0.6164 ns |  0.8841 ns |  4.71 |         - |
| ConcurrentLfuNull       | .NET 6.0           | 20.406 ns | 0.3131 ns |  0.2929 ns |  2.77 |         - |
| ConcurrentDictionary    | .NET Framework 4.8 | 11.454 ns | 0.2267 ns |  0.2784 ns |  1.00 |         - |
| ConcurrentLfuBackground | .NET Framework 4.8 | 37.182 ns | 0.7228 ns |  0.7733 ns |  3.25 |         - |
| ConcurrentLfuForeround  | .NET Framework 4.8 | 73.309 ns | 1.4388 ns |  1.5992 ns |  6.41 |         - |
| ConcurrentLfuThreadPool | .NET Framework 4.8 | 51.505 ns | 3.6956 ns | 10.8964 ns |  4.09 |         - |
| ConcurrentLfuNull       | .NET Framework 4.8 | 26.443 ns | 0.5488 ns |  0.5636 ns |  2.30 |         - |

@bitfaster marked this pull request as ready for review December 15, 2023 11:42
@bitfaster (Owner, Author)

The last missing piece is to defend against frequent updates saturating the write queue, e.g. if the item was updated within the last second or so, don't enqueue/reschedule.

{
    var currentExpiry = node.TimeToExpire - current;
    node.TimeToExpire = current + expiryCalculator.GetExpireAfterRead(node.Key, node.Value, currentExpiry);
    wheel.Reschedule(node);
@bitfaster (Owner, Author) commented Jan 27, 2024

Need a guard here to avoid rescheduling the same node into the same bucket if it was read twice or more, or if it was very recently read.

{
    var currentExpiry = node.TimeToExpire - current;
    node.TimeToExpire = current + expiryCalculator.GetExpireAfterUpdate(node.Key, node.Value, currentExpiry);
    wheel.Reschedule(node);
@bitfaster (Owner, Author)

Need a guard here to avoid rescheduling the same node into the same bucket if it was updated twice or more, or if it was very recently changed.

@bitfaster (Owner, Author)

The simplest cheap thing is to compare currentExpiry to node.TimeToExpire, and only reschedule if expiry has changed. But we can still execute the expiryCalculator many times for the same item, in the same maintenance cycle.

One strategy is to store the time scheduled in each node (either count each time maintenance runs, or store the timestamp), and only re-schedule when the time has changed (or has changed by more than some delta like 15ms, the worst-case clock resolution).

Both AccessOrderNode and TimeOrderNode have 2 bytes of padding when the key and value are a reference type, so using this padding we could count 65,535 update cycles without increasing node memory size. Storing an additional long would give more accuracy but will increase size by 8 bytes.

If a node is not rescheduled (assuming approximate time using the method above and a node is not touched for 65,535 update cycles) and has expired, it will remain in the timer wheel until it is accessed or the bucket it is in is expired. That is only a problem if the expiry calculator can generate an expiry far in the future, then a much shorter expiry for the same item. Not the common case, but something someone might do.

There is scope to combine the bools wasAccessed, wasDeleted and the int Position enum (1 byte + 1 byte + 4 bytes) - these could be packed into a single byte. Position should most likely be switched to a smaller underlying type - this will eliminate the padding.
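
One way to read the "only reschedule if expiry has changed" guard above, reusing the placeholder names from the snippet under review; this is a sketch, not the final PR code.

```csharp
var currentExpiry = node.TimeToExpire - current;
var newExpiry = current + expiryCalculator.GetExpireAfterRead(node.Key, node.Value, currentExpiry);

// Only touch the wheel when the calculator actually moved the expiry time;
// repeated reads of the same node in one maintenance pass then skip the
// reschedule (the calculator itself still runs each time, as noted above).
if (newExpiry != node.TimeToExpire)
{
    node.TimeToExpire = newExpiry;
    wheel.Reschedule(node);
}
```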

@ben-manes

fyi, I might be misunderstanding.

If the entry was close to expiration and its lifetime was then extended (e.g. by a write), how would the approaches behave? If the call to the expiryCalculator happens during a maintenance cycle then there is a delay, so would the entry expire (and be visibly absent) or would it be resurrected? The latter could be confusing, especially if a load was triggered - would it now be abandoned, or would it replace the present value?

The naive but less confusing approach is to perform the expiryCalculator during the caller's operation to set the timestamp when the node's expires. Then on every maintenance cycle the accessed nodes would be reordered, possibly unnecessarily. That's hopefully ignorable since the cost is a cheap O(1) reordering.

Like you mentioned, there is a flaw of a missed reordering when the expiry time shrinks. I think that could happen in either approach. It would only happen due to GetExpireAfterRead shrinking the duration, because those events can be discarded by the buffer, whereas GetExpireAfterUpdate won't be, as writes cannot be lossy. I think it's an okay scenario to ignore and to rely on size eviction if it becomes a problem. A fix could be to promote the read event to instead submit into the write buffer if the duration shrinks by a significant amount. Then it would rarely take the penalty of a write buffer promotion and would handle this edge case, so hopefully users wouldn't encounter it often enough to notice.
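
A sketch of the "compute on the caller's thread" approach described above; dictionary, readBuffer, clock and expiryCalculator are placeholder names, not the PR's actual API.

```csharp
// Sketch only: the read path computes and stores the new absolute expiry
// immediately, so maintenance only has to reorder the node in the wheel when
// it replays the (lossy) read buffer entry.
bool TryGet(K key, out V value)
{
    var now = clock.Now();

    if (dictionary.TryGetValue(key, out var node) && node.TimeToExpire - now > 0)
    {
        var remaining = node.TimeToExpire - now;
        node.TimeToExpire = now + expiryCalculator.GetExpireAfterRead(node.Key, node.Value, remaining);
        readBuffer.TryAdd(node);

        value = node.Value;
        return true;
    }

    value = default!;
    return false;
}
```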

@bitfaster (Owner, Author)

I got back to this at last :) But now it's not fresh in my mind.

I have probably diverged somewhat from your code - this is a summary:

  • I deferred calling expiryCalculator so that it happens during maintenance. That way lookups only need to compare expiry time to current time, in addition to writing to the buffer. My thinking was to favor optimizing lookups at all costs (I realize this is into the hair-splitting zone). This also means that only 1 thread is ever updating the expiry time, because maintenance is already protected by a lock.
  • Each time maintenance runs, the current time is evaluated only once, and all items in the read and write buffers will reflect the timestamp for that maintenance call. This assumes that the delay between reading/writing an item and maintenance running is very small.
  • Now, the read/write buffer could contain a hundred instances of the same node, with the access timestamp. Assuming the expiry calculator produces an identical expiry for the same inputs, this would mean needlessly doing the same computation hundreds of times.

In summary, I was thinking about how to defend against the read buffer having a high number of duplicates and optimizing this busy work. Just as you wrote your reply I was studying your code to see if you had a similar guard - I guessed it would be in the onAccess method here.

Another simple thing I considered is just storing a pointer to the last node I processed during maintenance, so if you get a bunch in a row it would collapse them and do expiry calc + reschedule fewer times.

Good point that the shrinking expiry time is possible anyway for dropped reads; I was too deep into the impact of this hypothetical optimization to realize it already occurs. I feel like it's a benign case because it falls back to a misprediction/cache pollution, rather than something bad like unbounded growth.

@ben-manes

I'm fairly rusty on this all too and happy to learn when you diverge.

I didn't optimize for duplicates and decided that the cost was too small to be worth optimizing. There is some hashing, pointer swaps, etc. which we tried to make very cheap. Then if I had to coalesce duplicates into a single sketch increment, for example, that likely means some extra hashing for building an intermediate multiset. Since each read buffer has a max of 16 elements * NCPUs, that's 256 read buffer entries on a 16-core machine per maintenance run. I think that I tried a naive version and didn't see any difference in a benchmark, which made sense since lossy buffers avoid a client-side penalty, so I couldn't convince myself it would be beneficial and dropped the idea.

@bitfaster (Owner, Author)

I made a fix here (#603); my update code path can have a similar problem where we might not write the updated node to the write buffer, but that can only occur under load.

The overhead of the time APIs in .NET is inconsistent across operating systems - I currently have user space calls on Windows/Linux and a system call on macOS (because unfortunately all the built-in .NET time APIs are mapped to system calls on Mac).
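
Not the library's Duration type - just a sketch of the two built-in .NET clock sources whose cost/resolution tradeoff the comment above is weighing; Stopwatch.GetTimestamp and Environment.TickCount64 are real APIs, while the cost characterization follows the comment.

```csharp
using System;
using System.Diagnostics;

// Sketch only: two built-in clock sources with different cost and resolution.
static class TickClock
{
    // Coarse (roughly 10-16 ms resolution) but very cheap millisecond counter.
    public static long NowMillis() => Environment.TickCount64;

    // High resolution counter; its cost varies by OS because it maps to
    // different platform primitives under the hood.
    public static long NowTicks() => Stopwatch.GetTimestamp();
}
```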

@bitfaster (Owner, Author)

I got some focus time to look at this properly, and now I fully understand the consequences of updating expiry during maintenance:

  1. Any entry dropped when the buffer is full (either a read or update) will have a stale expiry time.
  2. Maintenance must run after read & update without being deferred. For write-heavy workloads, this doesn't change much. For read-heavy workloads with infrequent cache activity, maintenance will run on every read, increasing overhead.

The fix I did yesterday only addresses part of 2. Handling 1 & 2 with this design will increase complexity. With hindsight, updating the expiry as part of the lookup, as you have done, is the better tradeoff; I will fix this completely next weekend.

I hesitated with this feature back in December because I had some gaps, then came back recently and did stress tests where this wasn't apparent. Thanks for pointing me in the right direction - I will get it fixed.

@ben-manes

Your other option is to honor the entry as expiring early, so it’s still linearizable on queries by not reappearing. That might mean a slightly lower hit rate for a slightly faster read. I’d likely prefer the hit rate but it seems rare enough to be personal preference.

@bitfaster (Owner, Author)

I agree - I would prefer a better hit rate at the expense of a tiny per-read overhead. Perf aside, fixing this also makes it much more intuitive.

@bitfaster (Owner, Author)

This is now fixed in v2.5.1. Thanks again for the heads up.

@bitfaster merged commit de72842 into main May 8, 2024
13 checks passed
@bitfaster deleted the users/alexpeck/lfuexpire branch May 10, 2024 06:38