
Auto flush checkpoint queue if too many are waiting #279

Merged
merged 1 commit into from
Oct 19, 2020

Conversation

kaituo
Member

@kaituo kaituo commented Oct 19, 2020

Issue #, if available:

Description of changes:
Our performance testing found that the checkpoint queue can grow quickly. This can happen during maintenance if there are a lot of entities in the cache and very few cache swap-outs happened in the past hour. When an entity's state is swapped out, we try to save a checkpoint. If we haven't saved a checkpoint for an entity within the past hour, we put the checkpoint into a buffer and flush it at the end of maintenance. Since we only flush the first 1,000 queued requests to disk, many requests may still wait in the queue until the next flush happens. This is not ideal and can cause out-of-memory issues.

This PR triggers another flush after the previous flush finishes if there are a lot of queued requests.
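Roughly, the idea is to chain the next flush off the completion of the previous one whenever a large backlog remains. Below is a minimal Java sketch of that pattern, not the plugin's actual CheckpointDao code; names such as CheckpointFlusher, bulkWrite, and FLUSH_BATCH_SIZE are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class CheckpointFlusher {
    private static final int FLUSH_BATCH_SIZE = 1000; // flush at most 1000 requests per bulk call
    private final Queue<CheckpointRequest> requestQueue = new ConcurrentLinkedQueue<>();

    /** Drain up to FLUSH_BATCH_SIZE requests and write them; if a large backlog
     *  remains after the write completes, immediately trigger another flush. */
    public void flush() {
        List<CheckpointRequest> batch = new ArrayList<>(FLUSH_BATCH_SIZE);
        CheckpointRequest next;
        while (batch.size() < FLUSH_BATCH_SIZE && (next = requestQueue.poll()) != null) {
            batch.add(next);
        }
        if (batch.isEmpty()) {
            return;
        }
        bulkWrite(batch, () -> {
            // Callback runs after the previous flush finishes; keep flushing
            // while many requests are still queued so the backlog cannot grow unbounded.
            if (requestQueue.size() >= FLUSH_BATCH_SIZE) {
                flush();
            }
        });
    }

    // Placeholder for an asynchronous bulk index call (e.g., an Elasticsearch bulk request).
    private void bulkWrite(List<CheckpointRequest> batch, Runnable onSuccess) {
        // ... issue the bulk request, then invoke onSuccess when it completes ...
        onSuccess.run();
    }

    // Placeholder request type standing in for a queued checkpoint write.
    public static class CheckpointRequest {}
}
```

Chaining the follow-up flush off the completion callback keeps at most one bulk write in flight at a time while still letting a large backlog drain across successive flushes.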

This PR also corrects the LimitExceededException behavior when a circuit breaker is open: previously, we sent a LimitExceededException that stops the detector immediately, which leaves no room for the detector to recover. This PR fixes that by changing the LimitExceededException's stop-now flag to false, giving the detector a few more intervals to recover.
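For illustration only, here is a minimal sketch of the stop-now behavior described above. EntityResultHandler, CircuitBreaker, and this simplified LimitExceededException are stand-ins, not the plugin's actual classes.

```java
public class EntityResultHandler {

    /** Simplified stand-in for a memory circuit breaker. */
    public interface CircuitBreaker {
        boolean isOpen();
    }

    /** Simplified stand-in for the plugin's LimitExceededException. */
    public static class LimitExceededException extends RuntimeException {
        private final boolean stopNow;

        public LimitExceededException(String detectorId, String message, boolean stopNow) {
            super(detectorId + ": " + message);
            this.stopNow = stopNow;
        }

        public boolean isStopNow() {
            return stopNow;
        }
    }

    private final CircuitBreaker circuitBreaker;

    public EntityResultHandler(CircuitBreaker circuitBreaker) {
        this.circuitBreaker = circuitBreaker;
    }

    public void handle(String detectorId, Runnable onAccepted) {
        if (circuitBreaker.isOpen()) {
            // stopNow = false: report the failure but let the detector
            // retry in later intervals instead of stopping it immediately.
            throw new LimitExceededException(detectorId, "memory circuit breaker is open", false);
        }
        onAccepted.run();
    }
}
```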

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@kaituo kaituo added the enhancement New feature or request label Oct 19, 2020
@codecov

codecov bot commented Oct 19, 2020

Codecov Report

Merging #279 into master will decrease coverage by 1.96%.
The diff coverage is 50.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #279      +/-   ##
============================================
- Coverage     73.29%   71.33%   -1.97%     
- Complexity     1751     1764      +13     
============================================
  Files           180      185       +5     
  Lines          8119     8436     +317     
  Branches        670      722      +52     
============================================
+ Hits           5951     6018      +67     
- Misses         1859     2082     +223     
- Partials        309      336      +27     
Flag      Coverage Δ                    Complexity Δ
#cli      79.27% <ø> (ø)                0.00 <ø> (ø)
#plugin   70.59% <50.00%> (-2.12%) ⬇️   1764.00 <0.00> (+13.00)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ Complexity Δ
...pendistroforelasticsearch/ad/ml/CheckpointDao.java 83.59% <0.00%> (-0.90%) 57.00 <0.00> (ø)
...arch/ad/transport/EntityResultTransportAction.java 85.07% <100.00%> (+0.22%) 13.00 <0.00> (ø)
...elasticsearch/ad/AnomalyDetectorProfileRunner.java 50.98% <0.00%> (-20.69%) 41.00% <0.00%> (+2.00%) ⬇️
...search/ad/transport/GetAnomalyDetectorRequest.java 84.61% <0.00%> (-12.16%) 10.00% <0.00%> (+1.00%) ⬇️
...ticsearch/ad/transport/ProfileTransportAction.java 70.83% <0.00%> (-11.52%) 6.00% <0.00%> (ø%)
...forelasticsearch/ad/transport/ProfileResponse.java 87.30% <0.00%> (-8.62%) 16.00% <0.00%> (ø%)
...d/transport/GetAnomalyDetectorTransportAction.java 47.25% <0.00%> (-3.97%) 9.00% <0.00%> (ø%)
.../ad/util/MultiResponsesDelegateActionListener.java 71.79% <0.00%> (-3.89%) 11.00% <0.00%> (ø%)
...stroforelasticsearch/ad/model/DetectorProfile.java 31.01% <0.00%> (-3.38%) 16.00% <0.00%> (+2.00%) ⬇️
...icsearch/ad/rest/RestGetAnomalyDetectorAction.java 40.00% <0.00%> (-2.11%) 4.00% <0.00%> (ø%)
... and 13 more

@kaituo kaituo merged commit 1c02172 into opendistro-for-elasticsearch:master Oct 19, 2020