
Auto flush checkpoint queue if too many are waiting #279

Merged
merged 1 commit into from
Oct 19, 2020

Conversation

kaituo
Member

@kaituo kaituo commented Oct 19, 2020

Issue #, if available:

Description of changes:
Our performance testing found that the checkpoint queue can grow quickly. This can happen during maintenance if there are a lot of entities in the cache and very few cache swap-outs happened in the past hour. When an entity's state is swapped out, we try to save a checkpoint. If we haven't saved a checkpoint for an entity within the past hour, we put the checkpoint into a buffer and flush it at the end of maintenance. Since we only flush the first 1,000 queued requests to disk, many requests may still wait in the queue until the next flush happens. This is not ideal and can cause out-of-memory issues.

This PR triggers another flush after the previous flush finishes if there are a lot of queued requests.
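Roughly, the idea is to chain the next flush off the completion of the previous one whenever a large backlog remains. Below is a minimal Java sketch of that pattern, not the plugin's actual CheckpointDao code; names such as CheckpointFlusher, bulkWrite, and FLUSH_BATCH_SIZE are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class CheckpointFlusher {
    private static final int FLUSH_BATCH_SIZE = 1000; // flush at most 1000 requests per bulk call
    private final Queue<CheckpointRequest> requestQueue = new ConcurrentLinkedQueue<>();

    /** Drain up to FLUSH_BATCH_SIZE requests and write them; if a large backlog
     *  remains after the write completes, immediately trigger another flush. */
    public void flush() {
        List<CheckpointRequest> batch = new ArrayList<>(FLUSH_BATCH_SIZE);
        CheckpointRequest next;
        while (batch.size() < FLUSH_BATCH_SIZE && (next = requestQueue.poll()) != null) {
            batch.add(next);
        }
        if (batch.isEmpty()) {
            return;
        }
        bulkWrite(batch, () -> {
            // Callback runs after the previous flush finishes; keep flushing
            // while many requests are still queued so the backlog cannot grow unbounded.
            if (requestQueue.size() >= FLUSH_BATCH_SIZE) {
                flush();
            }
        });
    }

    // Placeholder for an asynchronous bulk index call (e.g., an Elasticsearch bulk request).
    private void bulkWrite(List<CheckpointRequest> batch, Runnable onSuccess) {
        // ... issue the bulk request, then invoke onSuccess when it completes ...
        onSuccess.run();
    }

    // Placeholder request type standing in for a queued checkpoint write.
    public static class CheckpointRequest {}
}
```

Chaining the follow-up flush off the completion callback keeps at most one bulk write in flight at a time while still letting a large backlog drain across successive flushes.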

This PR also corrects the LimitExceededException behavior when a circuit breaker is open: previously, we sent a LimitExceededException that stops the detector immediately, which leaves no room for the detector to recover. This PR fixes that by changing the LimitExceededException's stop-now flag to false, giving the detector a few more intervals to recover.
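For illustration only, here is a minimal sketch of the stop-now behavior described above. EntityResultHandler, CircuitBreaker, and this simplified LimitExceededException are stand-ins, not the plugin's actual classes.

```java
public class EntityResultHandler {

    /** Simplified stand-in for a memory circuit breaker. */
    public interface CircuitBreaker {
        boolean isOpen();
    }

    /** Simplified stand-in for the plugin's LimitExceededException. */
    public static class LimitExceededException extends RuntimeException {
        private final boolean stopNow;

        public LimitExceededException(String detectorId, String message, boolean stopNow) {
            super(detectorId + ": " + message);
            this.stopNow = stopNow;
        }

        public boolean isStopNow() {
            return stopNow;
        }
    }

    private final CircuitBreaker circuitBreaker;

    public EntityResultHandler(CircuitBreaker circuitBreaker) {
        this.circuitBreaker = circuitBreaker;
    }

    public void handle(String detectorId, Runnable onAccepted) {
        if (circuitBreaker.isOpen()) {
            // stopNow = false: report the failure but let the detector
            // retry in later intervals instead of stopping it immediately.
            throw new LimitExceededException(detectorId, "memory circuit breaker is open", false);
        }
        onAccepted.run();
    }
}
```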

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@kaituo kaituo added the enhancement New feature or request label Oct 19, 2020
@codecov

codecov bot commented Oct 19, 2020

Codecov Report

Merging #279 into master will decrease coverage by 1.96%.
The diff coverage is 50.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master     #279      +/-   ##
============================================
- Coverage     73.29%   71.33%   -1.97%     
- Complexity     1751     1764      +13     
============================================
  Files           180      185       +5     
  Lines          8119     8436     +317     
  Branches        670      722      +52     
============================================
+ Hits           5951     6018      +67     
- Misses         1859     2082     +223     
- Partials        309      336      +27     
Flag      Coverage Δ                    Complexity Δ
#cli      79.27% <ø> (ø)                0.00 <ø> (ø)
#plugin   70.59% <50.00%> (-2.12%) ⬇️   1764.00 <0.00> (+13.00)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ Complexity Δ
...pendistroforelasticsearch/ad/ml/CheckpointDao.java 83.59% <0.00%> (-0.90%) 57.00 <0.00> (ø)
...arch/ad/transport/EntityResultTransportAction.java 85.07% <100.00%> (+0.22%) 13.00 <0.00> (ø)
...elasticsearch/ad/AnomalyDetectorProfileRunner.java 50.98% <0.00%> (-20.69%) 41.00% <0.00%> (+2.00%) ⬇️
...search/ad/transport/GetAnomalyDetectorRequest.java 84.61% <0.00%> (-12.16%) 10.00% <0.00%> (+1.00%) ⬇️
...ticsearch/ad/transport/ProfileTransportAction.java 70.83% <0.00%> (-11.52%) 6.00% <0.00%> (ø%)
...forelasticsearch/ad/transport/ProfileResponse.java 87.30% <0.00%> (-8.62%) 16.00% <0.00%> (ø%)
...d/transport/GetAnomalyDetectorTransportAction.java 47.25% <0.00%> (-3.97%) 9.00% <0.00%> (ø%)
.../ad/util/MultiResponsesDelegateActionListener.java 71.79% <0.00%> (-3.89%) 11.00% <0.00%> (ø%)
...stroforelasticsearch/ad/model/DetectorProfile.java 31.01% <0.00%> (-3.38%) 16.00% <0.00%> (+2.00%) ⬇️
...icsearch/ad/rest/RestGetAnomalyDetectorAction.java 40.00% <0.00%> (-2.11%) 4.00% <0.00%> (ø%)
... and 13 more

@kaituo kaituo merged commit 1c02172 into opendistro-for-elasticsearch:master Oct 19, 2020