
Fix flakes in ./topdown/cache #7188

Merged
merged 1 commit into open-policy-agent:main on Dec 4, 2024

Conversation

@evankanderson (Contributor) commented Nov 22, 2024:

Why the changes in this PR are needed?

In #7180, I got a few test failures in untouched code in ./topdown/cache. Running the tests with go test -count 200 ./topdown/cache (which takes around 400s), I got 5 failures across a few different tests.

The underlying cause seems to be the timing of the ticker relative to the tests. The tests create a cache with a 1s expiry scan (which starts a timer to fire at 1s, then re-creates the timer to fire 1s later, and so on), and then immediately go on to create objects with a 1s expiry. The test then sleeps for 2s and checks whether the items have been evicted. Unfortunately, the following timing can happen (a standalone sketch of this pattern follows the list):

  • Ticker is created at +0.000s
  • Cache item is created, expiring at +1.001s
  • Ticker ticks for cache cleanup at +1.000s, the item is still valid
  • Cache cleanup completes at +1.001s, and the next tick is scheduled for +2.001s
  • Check code runs at +2.000s, and finds the item not yet evicted
  • Ticker ticks at +2.001s, and cleans up cache items after the test has failed
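
For concreteness, here is a minimal, self-contained sketch of this pattern; the cache type, field names, and cleanupLoop function below are illustrative stand-ins rather than the actual topdown/cache code:

package main

import (
	"fmt"
	"sync"
	"time"
)

// Illustrative stand-in for the inter-query cache: entries carry an absolute
// expiry time and a background goroutine evicts stale ones.
type entry struct{ expires time.Time }

type cache struct {
	mu    sync.Mutex
	items map[string]entry
}

// cleanupLoop mirrors the existing scheme: after each scan the ticker is
// stopped and re-created, so the next scan is scheduled one full period
// after the previous scan finishes and the cycle drifts slightly later.
func (c *cache) cleanupLoop(period time.Duration, done <-chan struct{}) {
	ticker := time.NewTicker(period)
	for {
		select {
		case <-ticker.C:
			c.mu.Lock()
			now := time.Now()
			for k, e := range c.items {
				if now.After(e.expires) {
					delete(c.items, k)
				}
			}
			c.mu.Unlock()
			ticker.Stop()
			ticker = time.NewTicker(period) // next tick is scheduled only now
		case <-done:
			ticker.Stop()
			return
		}
	}
}

func main() {
	c := &cache{items: map[string]entry{}}
	done := make(chan struct{})
	go c.cleanupLoop(time.Second, done)

	// Insert an entry that expires ~1s from now, a hair after the first tick.
	c.mu.Lock()
	c.items["x"] = entry{expires: time.Now().Add(time.Second)}
	c.mu.Unlock()

	time.Sleep(2 * time.Second) // the flaky assertion point in the tests
	c.mu.Lock()
	_, present := c.items["x"]
	c.mu.Unlock()
	fmt.Println("still present after 2s:", present) // occasionally true
	close(done)
}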

What are the changes in this PR?

I'm willing to reduce this PR to just the timing changes, but I tried to improve the codebase a bit. 😁

  • It may be slightly surprising to users that items can remain in the cache for up to stale_entry_eviction_period_seconds after their expiration. I made the ticker tick twice per interval, halving the window in which expired items remain visible.
  • The current code re-creates and re-schedules the ticker for each cache scan. This generates a certain amount of extra work in exchange for a possible pacing benefit: if the cache cleanup takes most of the stale_entry_eviction_period_seconds duration, the next cleanup won't be scheduled until stale_entry_eviction_period_seconds after the previous one finishes. On the other hand, this also means that items can expire even later than expected. This change maintains a fixed ticker cycle regardless of the size of the cache (see the sketch following this list). I suspect any OPA instance where cache clearing approaches stale_entry_eviction_period_seconds is going to be pretty unhappy regardless of the pacing.
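
As a sketch of the fixed-cycle variant from the second bullet (reusing the illustrative cache type from the sketch above; again, not the actual OPA implementation), a single time.Ticker fires twice per eviction period and is never re-created:

// cleanupLoopFixed keeps one ticker for the lifetime of the loop and fires
// twice per eviction period, so an expired entry stays visible for at most
// half a period rather than up to a full one. Illustrative only.
func (c *cache) cleanupLoopFixed(period time.Duration, done <-chan struct{}) {
	ticker := time.NewTicker(period / 2)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			c.mu.Lock()
			now := time.Now()
			for k, e := range c.items {
				if now.After(e.expires) {
					delete(c.items, k)
				}
			}
			c.mu.Unlock()
		case <-done:
			return
		}
	}
}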

To reduce test execution time and make higher-count go test -count N runs more practical, I reduced the execution time of ./topdown/cache from 2.2s to around 1.2s by tweaking the various expirations and sleeps.

Notes to assist PR review:

My gut feel is that the additional cost of running the cache scan twice as frequently will be mostly offset by not needing to clear and re-create Ticker objects (and the associated channels, etc). Also, hopefully, this cost should be in the noise for most installations.

A more conservative change would be simply to increase the 2 * time.Second sleeps to 2.2 seconds; in any case, I'd preserve the extra comments, as the code seems slightly subtle.

(A third, rejected option would be to check the expiry in unsafeGet, and pretend the object was not in the cache as soon as it expired. This would add a time fetch to each unsafeGet, which seemed more expensive than necessary.)
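
For reference, a sketch of that rejected approach, again using the illustrative cache type from the first sketch; unsafeGet here is a stand-in name, not the real method signature:

// unsafeGet treats an expired entry as a cache miss. This tightens expiry
// behavior, but every read now pays for a time.Now() call, which is why
// this option was rejected. Illustrative only.
func (c *cache) unsafeGet(k string) (entry, bool) {
	e, ok := c.items[k]
	if !ok || time.Now().After(e.expires) {
		return entry{}, false // pretend expired items were never cached
	}
	return e, true
}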

Further comments:

This passed successfully on a 500-count run, as well as several smaller runs (MacBook M2, moderate other load):

$ go test -count 500 ./topdown/cache
ok      github.com/open-policy-agent/opa/topdown/cache  550.795s

@ashutosh-narkar (Member):

@evankanderson thanks for looking into this. I think it's fine to fix the tests in this change and open a separate issue with more discussion around stale_entry_eviction_period_seconds; @rudrakhp may have some thoughts on that. I don't quite understand the change made in the cache, and why we should make that change just to make the tests work.

@evankanderson (Contributor, Author):

As I said, I may have gotten overzealous in the cleanup!

I was confused at first why an entry with a 1s lifespan was still alive 2s later, but this is existing behavior, so I'm willing to leave it as-is and have the tests run a bit longer. (About 0.5s)

The main impact is on folks running the tests hundreds of times, which is admittedly not a common use-case.

@ashutosh-narkar (Member):

@rudrakhp any thoughts about the tests?

@rudrakhp (Contributor) commented Nov 23, 2024:

@evankanderson thanks for the catch! We can choose to sleep for 3 * stale_entry_eviction_period_seconds in the tests, which will ensure at least one cleanup-routine run after the cache entry expires. Or, if we are conservative and don't want to tune the tests for folks running them a large number of times concurrently, 2.3 * stale_entry_eviction_period_seconds also works.

> My gut feel is that the additional cost of running the cache scan twice as frequently will be mostly offset by not needing to clear and re-create Ticker objects (and the associated channels, etc). Also, hopefully, this cost should be in the noise for most installations.

Irrespective of whether any items have expired, we do a linear scan of the cache in the cleanup routine. Unless I am missing something here, the cost of running the cleanup routine (on each tick) seems to be far greater than the cost of recreating the ticker and the associated channel. I understand that the proposal is to run the routine twice as often to ensure stale entries are deleted, but I am not sure the correctness gain is worth what we lose in runtime performance.

> To reduce test execution time and make higher-count go test -count N runs more practical, I reduced the execution time of ./topdown/cache from 2.2s to around 1.2s by tweaking the various expirations and sleeps.

Agree that the tests take a long time to run, as 1s is the minimum expiry time we can set today. From a runtime performance perspective, I don't see why anyone would want to schedule the (costly) cleanup routine more frequently. But we could accept floating-point values instead of integers for stale_entry_eviction_period_seconds, which would allow us to schedule more frequent cleanups in the tests.
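
A hypothetical illustration of that idea (field and method names below are invented for the sketch, assuming the time package from the earlier sketches; this is not OPA's actual config API):

// evictionConfig sketches a config that accepts fractional seconds, e.g.
// stale_entry_eviction_period_seconds: 0.1 for fast-running tests,
// while production configs keep using whole seconds.
type evictionConfig struct {
	StaleEntryEvictionPeriodSeconds float64 `json:"stale_entry_eviction_period_seconds"`
}

func (c evictionConfig) period() time.Duration {
	return time.Duration(c.StaleEntryEvictionPeriodSeconds * float64(time.Second))
}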

Like @ashutosh-narkar already mentioned, I don't think we should change the cache cleanup behaviour just for the tests, especially if it's something like invoking the cleanup twice as frequently, which users might not expect. We could brainstorm on this aspect in a separate PR though.

@evankanderson force-pushed the cache-flakes branch 2 times, most recently from a1d8cee to ffa2447 on November 23, 2024 16:09
@evankanderson (Contributor, Author):

I removed the frequency change, and validated the new test timings for 500 runs.

netlify bot commented Dec 2, 2024:

Deploy Preview for openpolicyagent ready!
🔨 Latest commit: 2a34e71
🔍 Latest deploy log: https://app.netlify.com/sites/openpolicyagent/deploys/67506f42d2efbd0008f28cf1
😎 Deploy Preview: https://deploy-preview-7188--openpolicyagent.netlify.app

@evankanderson (Contributor, Author) left a comment:

I'm back from Thanksgiving break! I've returned to the original code configuration, and resisted the urge to make further changes (such as trying to optimize cleanStaleValues() by fetching time.Now() only once).

topdown/cache/cache.go (resolved review thread)
@evankanderson (Contributor, Author):

Hmm, the test failure is in the general area that I'd tried to make non-flaky, so I'll need to investigate tomorrow...

@evankanderson (Contributor, Author):

Okay, it looks like the failure in ./topdown on TestInterQueryCache_ClientError is about a 0.05% flake rate on my MacBook (which is probably faster than the GitHub runners). I suspect it is unrelated: it looks like we're sometimes only making one call to the HTTP server in the error case, and the other one is being cached despite being a partial evaluation. I don't think it is related to this PR.

@evankanderson (Contributor, Author):

(rebased)

@ashutosh-narkar (Member) left a comment:

Thanks for the contribution @evankanderson! Just a small comment and we can get this in. Also, if you could author a more descriptive commit message, that would be helpful. See guidelines here.

for {
	select {
	case <-cleanupTicker.C:
		// NOTE: We stop the ticker and create a new one here to ensure that applications
		// get _at least_ staleEntryEvictionTimePeriodSeconds with the cache unlocked;
		// see https://github.com/open-policy-agent/opa/pull/7188/files#r1855342998

Member:
Instead of putting a link here, it would be helpful to add some text explaining the situation.

Contributor Author:

The link is to the discussion in this PR; I'm happy to add more detail here, but I thought that the link to the original was probably also useful background. Would the following work?

Suggested change:
- // see https://github.com/open-policy-agent/opa/pull/7188/files#r1855342998
+ // for more background, see https://github.com/open-policy-agent/opa/pull/7188/files#r1855342998

Member:

A single line summary of the convo with the link would be great!

Contributor Author:

I think that's what the two lines above the link are -- a summary of the discussion, and then a link for more information.

Member:

Ok, can you please update the commit message by adding a description in the body and we can get this in.

Contributor Author:

Updated the commit message to cover this discussion as well as the fixing of the flakes in the test (which was the primary motivation).

Per discussion in open-policy-agent#7188, do *not* adjust the timing of cache scans to be more frequent; we want to wait at least staleEntryEvictionTimePeriodSeconds between cache scans, even if we hold the lock for a substantial period of time.

Signed-off-by: Evan Anderson <evan@stacklok.com>
@ashutosh-narkar merged commit 84b554c into open-policy-agent:main on Dec 4, 2024
28 checks passed