This repository has been archived by the owner on Aug 2, 2022. It is now read-only.
Auto flush checkpoint queue if too many are waiting #279
Issue #, if available:
Description of changes:
Our performance testing found that the checkpoint queue can grow quickly. This can happen during maintenance if there are many entities in the cache and very few cache swap-outs happened in the past hour. When an entity's state is swapped out, we try to save a checkpoint. If we haven't saved one for an entity within the last hour, we put the checkpoint in a buffer and flush at the end of maintenance. Since we only flush the first 1000 queued requests to disk, many requests may still wait in the queue until the next flush happens. This is not ideal and can cause memory outages.
This PR triggers another flush after the previous flush finishes if there are a lot of queued requests.
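The re-flush behavior can be pictured with a minimal sketch. This is not the plugin's code: the class name `CheckpointQueueSketch`, the `reflushThreshold` parameter, and the synchronous loop are assumptions made for illustration (the real flush is presumably asynchronous and chained from a completion callback). Only the batch limit of 1000 comes from the description above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch: all names and the threshold are illustrative assumptions.
class CheckpointQueueSketch {
    private final Queue<String> pending = new ArrayDeque<>();
    private final int batchSize;        // max requests written per flush
    private final int reflushThreshold; // assumed: flush again while more than this many remain
    private final List<Integer> flushedBatchSizes = new ArrayList<>();

    CheckpointQueueSketch(int batchSize, int reflushThreshold) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.reflushThreshold = reflushThreshold;
    }

    void enqueue(String entityId) {
        pending.add(entityId);
    }

    // Flush one batch, then keep flushing while too many requests are still waiting.
    void flush() {
        do {
            int n = Math.min(batchSize, pending.size());
            for (int i = 0; i < n; i++) {
                pending.poll(); // stand-in for writing one checkpoint to disk
            }
            flushedBatchSizes.add(n);
        } while (pending.size() > reflushThreshold);
    }

    int pendingCount() {
        return pending.size();
    }

    List<Integer> flushedBatchSizes() {
        return flushedBatchSizes;
    }
}
```

With a batch size of 1000 and an assumed threshold of 500, a queue of 2600 requests drains in three consecutive batches, while a queue of 1300 stops after one batch and leaves 300 requests for the next maintenance run.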
This PR also corrects the LimitExceededException thrown when a circuit breaker is open. Previously, we sent a LimitExceededException that stopped the detector immediately, which left no room for the detector to recover. This PR fixes that by setting the exception's stop-now flag to false, giving the detector a few more intervals to recover.
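The effect of the stop-now flag can be sketched as follows. The classes `LimitExceededSketch` and `DetectorRunnerSketch`, and the `maxConsecutiveFailures` budget, are hypothetical stand-ins chosen for illustration; the plugin's real exception class and job runner may handle the flag differently.

```java
// Hypothetical stand-in for the plugin's exception; only the stop-now idea comes from the PR.
class LimitExceededSketch extends RuntimeException {
    private final boolean stopNow;

    LimitExceededSketch(String message, boolean stopNow) {
        super(message);
        this.stopNow = stopNow;
    }

    boolean isStopNow() {
        return stopNow;
    }
}

// Hypothetical runner: stops at once on stopNow, otherwise tolerates a few failed intervals.
class DetectorRunnerSketch {
    private final int maxConsecutiveFailures; // assumed retry budget
    private int consecutiveFailures = 0;
    private boolean stopped = false;

    DetectorRunnerSketch(int maxConsecutiveFailures) {
        this.maxConsecutiveFailures = maxConsecutiveFailures;
    }

    void onIntervalFailure(LimitExceededSketch e) {
        if (stopped) {
            return;
        }
        consecutiveFailures++;
        if (e.isStopNow() || consecutiveFailures > maxConsecutiveFailures) {
            stopped = true;
        }
    }

    // A successful interval resets the failure count, so the detector has recovered.
    void onIntervalSuccess() {
        consecutiveFailures = 0;
    }

    boolean isStopped() {
        return stopped;
    }
}
```

In this sketch, an open circuit breaker that reports `stopNow = false` only counts against the retry budget, so a breaker that closes again within a few intervals lets the detector keep running.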
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.