
SP threshold is not updated correctly if >15% of client instances are not in UP status #1127

Closed
mgtriffid opened this issue Oct 5, 2018 · 6 comments

Comments

@mgtriffid
Contributor

Environment:
Cluster of 2 Eureka nodes running the latest version

eureka.enableSelfPreservation=true
eureka.expectedClientRenewalIntervalSeconds=15
eureka.renewalThresholdUpdateIntervalMs=240000 //default is 15 minutes, we just want to speed up testing

16 client instances talking to this cluster

eureka.client.refresh.interval=15

Experiment 1: confirming the issue
15:00: 16 client instances in UP plus 2 Eurekas. The threshold is 61 = (int) (18 * 4 * 0.85); 68 renews come in per minute, renews are above the threshold, all is fine.
15:02: 4 client instances were put into OUT_OF_SERVICE state via Eureka’s REST API.
15:03: Logs indicate that the threshold-updating method (updateRenewalThreshold) was called, and the threshold is still 61. This is expected: we did not delete anything, so the threshold should stay the same.
15:04: We kill -9 one of the clients and wait for eviction to happen.
15:06: Logs show that the eviction happened.
15:06: The /eureka/status endpoint says that only 64 renews came in the last minute. Reasonable: we had 68 per minute, killed one service, and now we have 64.
15:07: Logs indicate that the threshold-updating method was called again, and the threshold is still 61. Now this is becoming problematic. Common sense says: “Only 1 instance was evicted, 1 < 0.15 * 18, evictions happen from time to time, so we should adjust the self-preservation threshold now.” But the tricky condition in updateRenewalThreshold comes into play:

eureka/eureka-core/src/main/java/com/netflix/eureka/registry/PeerAwareInstanceRegistryImpl.java, lines 533 to 537 in b2c5ea6:

    if ((count) > (serverConfig.getRenewalPercentThreshold() * expectedNumberOfClientsSendingRenews)
            || (!this.isSelfPreservationModeEnabled())) {
        this.expectedNumberOfClientsSendingRenews = count;
        updateRenewsPerMinThreshold();
    }

This code now thinks that count == 13 (11 instances in UP plus the two Eurekas), not 17, because the OUT_OF_SERVICE instances are filtered out of the count. The previous value of expectedNumberOfClientsSendingRenews was 18, since at the beginning everything was in UP. The condition evaluates to
((13) > (0.85 * 18)) == ((13) > 15.3) == false,
so the threshold is not updated and expectedNumberOfClientsSendingRenews stays at 18.
15:11: We start the previously killed client again. It registers, and on the new registration Eureka increments expectedNumberOfClientsSendingRenews, which is now 19. It also recalculates the threshold: (int) (19 * 0.85 * 60 / 15) == 64.
So we returned the system to its previous state, yet the threshold changed from 61 to 64. This is wrong, and it can actually lead to unwanted self-preservation kicking in if a significant number of instances are in a status other than UP. (A worked sketch of this arithmetic follows below.)
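
As an aside, here is a minimal, self-contained Java sketch of the arithmetic above. It is illustrative only: the class and constant names are made up, and it merely assumes the threshold formula that the observed numbers imply (clients * (60 / renewalIntervalSeconds) * renewalPercentThreshold, truncated to int).

    // Illustrative sketch only; names are hypothetical, not the actual Eureka fields.
    public class ThresholdSketch {
        static final double RENEWAL_PERCENT_THRESHOLD = 0.85;  // eureka.renewalPercentThreshold default
        static final int CLIENT_RENEWAL_INTERVAL_SECONDS = 15; // eureka.expectedClientRenewalIntervalSeconds in this setup

        // Renews-per-minute threshold as implied by the numbers in both experiments.
        static int renewsPerMinThreshold(int expectedClientsSendingRenews) {
            return (int) (expectedClientsSendingRenews
                    * (60.0 / CLIENT_RENEWAL_INTERVAL_SECONDS)
                    * RENEWAL_PERCENT_THRESHOLD);
        }

        public static void main(String[] args) {
            System.out.println(renewsPerMinThreshold(18)); // 61 -> initial threshold (16 clients + 2 Eurekas)
            System.out.println(renewsPerMinThreshold(19)); // 64 -> after the killed client re-registers
            System.out.println(renewsPerMinThreshold(17)); // 57 -> Experiment 2, where OUT_OF_SERVICE is not filtered

            // Why Experiment 1 never updates the threshold: the UP-only view yields count == 13,
            // and 13 > 0.85 * 18 (i.e. 13 > 15.3) is false, so expectedNumberOfClientsSendingRenews stays at 18.
            System.out.println(13 > 0.85 * 18); // false (Experiment 1: no update)
            System.out.println(17 > 0.85 * 18); // true  (Experiment 2: update happens)
        }
    }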
Experiment 2: looking for a fix
This time we configure eureka.shouldFilterOnlyUpInstances=false on the Eureka nodes.
15:33: 16 client instances in UP plus 2 Eurekas. The threshold is 61 = (int) (18 * 4 * 0.85); 68 renews come in per minute, renews are above the threshold, all is fine.
15:34: 4 client instances were put into OUT_OF_SERVICE state via Eureka’s REST API.
15:37: Logs indicate that the threshold-updating method was called, and the threshold is still 61. This is expected: we did not delete anything, so the threshold should stay the same.
15:43: We kill -9 one of the clients and wait for eviction to happen.
15:44: Logs show that the eviction happened.
15:45: The /eureka/status endpoint says that only 64 renews came in the last minute. Reasonable: we had 68 per minute, killed one service, and now we have 64.
15:45: The threshold is updated to 57, and this is what makes the difference! Now the check sees count == 17:
((17) > (0.85 * 18)) == ((17) > 15.3) == true,
so expectedNumberOfClientsSendingRenews becomes 17 and the threshold is updated to (int) (17 * 0.85 * 60 / 15) == 57.
Conclusion:
We should probably instruct users to configure eureka.shouldFilterOnlyUpInstances=false on the Eureka server, and add this configuration as the default for the eureka-server module that assembles the .war file (a configuration sketch follows below).
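
For illustration only, a minimal sketch of what that could look like in the server-side client properties (the file name and comments are assumptions; the property name is the one discussed in this issue):

    # eureka-client.properties on the Eureka server node (file name is an assumption)
    # Let the server's internal client see instances in all statuses, not just UP,
    # so OUT_OF_SERVICE instances still count toward the renewal threshold calculation.
    eureka.shouldFilterOnlyUpInstances=false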
Question:
But why do we actually calculate the expected registry size using the Eureka client, and not, well, the registry size itself? It looks like I am missing something non-obvious but critical. Could you kindly clarify?

Thanks!

@holy12345
Contributor

Hi @mgtriffid,

Thanks for this question, I find it interesting. I notice that in this case you have two Eureka servers, and each of them acts as a Eureka client for the other. A Eureka client fetches InstanceInfo from a Eureka server, so each server fetches all InstanceInfo from the other and filters out the instances whose status is not UP, which causes the issue you mention (I think this is the key point).

Thanks :)

@pparth

pparth commented Oct 9, 2018

@qiangdavidliu Any help here?

@qiangdavidliu
Contributor

Sorry for the delayed response, and thanks for the detailed write-up. I am also unsure of the reason why the calculation was done on the client rather than the registry, and I found this PR from 2013 (!) showing that registry-based calculation was possible at one point (4cd9c6e). I suspect that the registry was eventually not used due to the concurrent nature of accessing the server's internal registry at scale.

As for your suggestion that we apply eureka.shouldFilterOnlyUpInstances=false as a default on the server side (and internally, we do run with that set to false too), that makes a lot of sense.

qiangdavidliu added a commit that referenced this issue Oct 20, 2018
The eureka server's internal eureka-client needs to be configured to fetch instances in all statuses, not just UP. This is necessary to properly calculate the correct self-preservation threshold. See issue #1127 for the discussion.
@yikangfeng

(Quotes the original issue description above.)

Why is count 13 after killing a service in Experiment 1? Shouldn't it be 17? Thank you.

@yikangfeng

(Quotes the original issue description and their previous question above.)

Sorry, I see it now: the count excludes the four OUT_OF_SERVICE services plus the one killed service, right?
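
For what it is worth, here is my back-of-the-envelope reading of the numbers from the write-up above (an interpretation, not an authoritative answer):

    Experiment 1 (UP-only view):    16 clients - 4 OUT_OF_SERVICE - 1 killed + 2 Eurekas = 13
    Experiment 2 (unfiltered view): 16 clients - 1 killed + 2 Eurekas = 17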

@troshko111
Contributor

I assume #1140 should address this for now; please re-open if you disagree.
