Fix flakes in ./topdown/cache #7188
Conversation
@evankanderson thanks for looking into this. I think it's fine to fix the tests in this change and open a separate issue with more discussion around
As I said, I may have gotten overzealous in the cleanup! I was confused at first why an entry with a 1s lifespan was still alive 2s later, but this is existing behavior, so I'm willing to leave it as-is and have the tests run a bit longer (about 0.5s). The main impact is on folks running the tests hundreds of times, which is admittedly not a common use case.
@rudrakhp any thoughts about the tests?
@evankanderson thanks for the catch! We can choose to sleep for a bit longer, but:

Irrespective of whether any items have expired, we do a linear scan of the cache in the cleanup routine. Unless I am missing something here, the cost of running the cleanup routine (on each tick) seems to be way greater than the cost to recreate the ticker and associated channel. I understand that the proposal is to run the routine twice as frequently to ensure stale entries are deleted, but I am not sure if correctness is worth what we lose in runtime performance.

Agree that it takes a lot of time for the tests to run, as 1s is the minimum expiry time we can set today. From a runtime performance aspect, I didn't see why anyone would want to schedule the (costly) cleanup routine more frequently. But we could set floating point numbers instead of integers for the expiry period. Like @ashutosh-narkar already mentioned, I don't think we should change the cache cleanup behaviour just for the tests, especially if it's something like invoking the cleanup twice as frequently, which users might not expect. We could brainstorm on this aspect in a separate PR though.
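To make the tradeoff concrete, here is a minimal Go sketch of the two scheduling strategies being weighed; the `cache` type and `cleanStaleValues` stand-in are illustrative, not OPA's actual implementation:

```go
package cache

import "time"

// Stand-in for the real inter-query cache; cleanStaleValues does a linear
// scan over every entry, regardless of how many are actually stale.
type cache struct{}

func (c *cache) cleanStaleValues() { /* linear scan of all entries */ }

// Fixed cadence: the ticker keeps firing every period, so a slow cleanup
// pass eats into the time the cache spends unlocked.
func runFixedCycle(c *cache, period time.Duration, done <-chan struct{}) {
	t := time.NewTicker(period)
	defer t.Stop()
	for {
		select {
		case <-t.C:
			c.cleanStaleValues()
		case <-done:
			return
		}
	}
}

// Stop-and-recreate (the existing behavior): the next tick is armed only
// after the current pass finishes, guaranteeing at least one full period
// with the cache unlocked between scans, at the cost of allocating a new
// Ticker (and channel) per pass, and of entries sometimes outliving their
// nominal expiry.
func runStopAndRecreate(c *cache, period time.Duration, done <-chan struct{}) {
	t := time.NewTicker(period)
	for {
		select {
		case <-t.C:
			t.Stop()
			c.cleanStaleValues()
			t = time.NewTicker(period)
		case <-done:
			t.Stop()
			return
		}
	}
}
```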
Force-pushed from a1d8cee to ffa2447
I removed the frequency change and validated the new test timings for 500 runs.
Force-pushed from ffa2447 to 845962f
I'm back from Thanksgiving break! I've returned to the original code configuration, and resisted the urge to make further changes (such as trying to optimize `cleanStaleValues()` by fetching `time.Now()` only once).
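For illustration, a sketch of what that deferred micro-optimization might look like, assuming a simple map-based cache; the types and fields here are invented, not OPA's actual code:

```go
package cache

import "time"

type entry struct {
	value     any
	expiresAt time.Time
}

type cache struct {
	entries map[string]entry
}

// Hypothetical shape of the optimization: read the clock once per sweep
// rather than once per entry examined.
func (c *cache) cleanStaleValues() {
	now := time.Now() // single clock read for the whole linear scan
	for key, e := range c.entries {
		if now.After(e.expiresAt) {
			delete(c.entries, key) // deleting while ranging over a map is safe in Go
		}
	}
}
```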
Hmm, the test failure is in the general area that I'd tried to make non-flaky, so I'll need to investigate tomorrow...
Okay, it looks like the failure in
Force-pushed from 845962f to 2a34e71
(rebased)
Force-pushed from 2a34e71 to 13ea41a
Thanks for the contribution @evankanderson! Just a small comment and we can get this in. Also if you could author a more descriptive commit message that would be helpful. See guidelines here.
```go
for {
	select {
	case <-cleanupTicker.C:
		// NOTE: We stop the ticker and create a new one here to ensure that applications
		// get _at least_ staleEntryEvictionTimePeriodSeconds with the cache unlocked;
		// see https://github.com/open-policy-agent/opa/pull/7188/files#r1855342998
```
Instead of putting a link here, it would be helpful to add some text explaining the situation.
The link is to the discussion in this PR; I'm happy to add more detail here, but I thought that the link to the original was probably also useful background. Would the following work?
```diff
- // see https://github.com/open-policy-agent/opa/pull/7188/files#r1855342998
+ // for more background, see https://github.com/open-policy-agent/opa/pull/7188/files#r1855342998
```
A single line summary of the convo with the link would be great!
I think that's what the two lines above the link are -- a summary of the discussion, and then a link for more information.
Ok, can you please update the commit message by adding a description in the body and we can get this in.
Updated the commit message to cover this discussion as well as the fixing of the flakes in the test (which was the primary motivation).
> Per discussion in open-policy-agent#7188, do *not* adjust the timing of cache scans to be more frequent; we want to wait at least staleEntryEvictionTimePeriodSeconds between cache scans, even if we hold the lock for a substantial period of time.
>
> Signed-off-by: Evan Anderson <evan@stacklok.com>
Force-pushed from 13ea41a to c474080
Why the changes in this PR are needed?
In #7180, I got a few test failures in untouched code in `./topdown/cache`. Running the tests with `go test -count 200 ./topdown/cache` (which takes around 400s), I got 5 failures across a few different tests.

The underlying cause seems to be the timing of the ticker in comparison to the tests. The tests create a cache with a 1s expiry scan (which starts a timer to tick at 1s, then re-creates the timer to run 1s later, etc.), and then immediately go on to create objects with 1s expiry. The test then sleeps for 2s, and checks to see if the items have been evicted. Unfortunately, the following timing can happen: the first scan fires at ~1s, an instant before the new objects turn stale, and because the next scan is only armed after the first one completes, it lands slightly after the 2s check, leaving the supposedly expired items in place.
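To make the race concrete, here is an illustrative, self-contained sketch of the failing pattern; `newCacheWithCleanupPeriod`, `insert`, and `get` are invented stand-ins, not the actual `./topdown/cache` API:

```go
package cache

import (
	"sync"
	"testing"
	"time"
)

// Minimal stand-in cache whose cleanup goroutine stops and re-creates its
// ticker after each pass, mirroring the behavior described above.
type testCache struct {
	mu      sync.Mutex
	entries map[string]time.Time // key -> expiry
}

func newCacheWithCleanupPeriod(period time.Duration) *testCache {
	c := &testCache{entries: map[string]time.Time{}}
	go func() { // no shutdown path; fine for a throwaway sketch
		t := time.NewTicker(period)
		for {
			<-t.C
			t.Stop()
			c.mu.Lock()
			now := time.Now()
			for k, exp := range c.entries {
				if now.After(exp) {
					delete(c.entries, k)
				}
			}
			c.mu.Unlock()
			t = time.NewTicker(period) // next tick armed only after the sweep
		}
	}()
	return c
}

func (c *testCache) insert(k string, ttl time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.entries[k] = time.Now().Add(ttl)
}

func (c *testCache) get(k string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.entries[k]
	return ok
}

// Flaky: the first sweep can fire at ~1s, an instant before the entry turns
// stale; the second sweep is only armed after the first completes, so it can
// land just after the 2s check below.
func TestEvictionTiming(t *testing.T) {
	c := newCacheWithCleanupPeriod(1 * time.Second)
	c.insert("key", 1*time.Second)
	time.Sleep(2 * time.Second)
	if c.get("key") {
		t.Fatal("expected stale entry to be evicted")
	}
}
```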
What are the changes in this PR?
I'm willing to reduce the changes to simply timing changes, but I tried to improve the codebase a bit. 😁

- Previously, if a cleanup pass consumed part of the `stale_entry_eviction_period_seconds` duration, the next cache cleanup wouldn't be scheduled until `stale_entry_eviction_period_seconds` after the previous cleanup succeeded. On the other hand, this also means that items can expire even later than expected. This change maintains a fixed ticker cycle, regardless of the size of the cache. I suspect any OPA instance where cache clearing approaches `stale_entry_eviction_period_seconds` is going to be pretty unhappy regardless of the pacing.
- In order to reduce test execution time and prefer higher counts of `go test -count N`, I reduced the execution time of `./topdown/cache` from 2.2s to around 1.2s by tweaking the various expirations and sleeps.

Notes to assist PR review:
My gut feel is that the additional cost of running the cache scan twice as frequently will be mostly offset by not needing to clear and re-create Ticker objects (and the associated channels, etc). Also, hopefully, this cost should be in the noise for most installations.
A more conservative change would be simply to increase the `2 * time.Second` sleeps to be 2.2 seconds; in any case, I'd preserve the extra comments, as the code seems slightly subtle.

(A third, rejected option would be to check the expiry in `unsafeGet`, and pretend the object was not in the cache as soon as it expired. This would add a time fetch to each `unsafeGet`, which seemed more expensive than necessary.)
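For reference, a rough sketch of what that rejected option would have looked like, again with invented names rather than the actual implementation:

```go
package cache

import "time"

type entry struct {
	value     any
	expiresAt time.Time
}

type cache struct {
	entries map[string]entry
}

// Rejected approach: treat expired entries as absent at read time instead of
// waiting for the background sweep. The downside is a clock read on every
// lookup.
func (c *cache) unsafeGet(key string) (any, bool) {
	e, ok := c.entries[key]
	if !ok {
		return nil, false
	}
	if time.Now().After(e.expiresAt) {
		return nil, false // expired: pretend it was never cached
	}
	return e.value, true
}
```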
Further comments:

This passed successfully on a 500-count run, as well as several smaller runs (MacBook M2, moderate other load):