
SP threshold is not updated correctly if >15% of client instances are not in UP status #1127

Closed
mgtriffid opened this issue Oct 5, 2018 · 6 comments

Comments

@mgtriffid
Contributor

Environment:
Cluster of 2 Eureka nodes running the latest version

eureka.enableSelfPreservation=true
eureka.expectedClientRenewalIntervalSeconds=15
eureka.renewalThresholdUpdateIntervalMs=240000 //default is 15 minutes, we just want to speed up testing

16 client instances talking to this cluster

eureka.client.refresh.interval=15

Experiment 1: confirming the issue
15:00: 16 client instances in UP plus 2 Eurekas. The threshold is 61 = (int) (18 * 4 * 0.85); 68 renews come in per minute, renews are above the threshold, all is fine.
15:02: 4 client instances were put into OUT_OF_SERVICE state via Eureka’s REST API.
15:03: Logs indicate that the threshold-updating method (updateRenewalThreshold) was called, and the threshold is still 61. This is expected: we did not delete anything, so the threshold should stay the same.
15:04: We kill -9 one of the clients and wait for eviction to happen.
15:06: Logs show that the eviction happened.
15:06: The /eureka/status endpoint says that only 64 renews came in the last minute. Reasonable: we had 68 per minute, killed one service, and now we have 64.
15:07: Logs indicate that the threshold-updating method was called again, and the threshold is still 61. Now this is becoming problematic. Common sense says: “Only 1 instance was evicted, 1 < 0.15 * 18, evictions happen from time to time, so we should adjust the self-preservation threshold now.” But the tricky condition in updateRenewalThreshold comes into play:

eureka/eureka-core/src/main/java/com/netflix/eureka/registry/PeerAwareInstanceRegistryImpl.java, lines 533 to 537 in b2c5ea6:

    if ((count) > (serverConfig.getRenewalPercentThreshold() * expectedNumberOfClientsSendingRenews)
            || (!this.isSelfPreservationModeEnabled())) {
        this.expectedNumberOfClientsSendingRenews = count;
        updateRenewsPerMinThreshold();
    }

This code now thinks that count == 13 (11 instances in UP plus the two Eurekas), not 17, because the OUT_OF_SERVICE instances are filtered out of the count. The previous value of expectedNumberOfClientsSendingRenews was 18, since at the beginning everything was in UP. The condition evaluates to
((13) > (0.85 * 18)) == ((13) > 15.3) == false,
so the threshold is not updated and expectedNumberOfClientsSendingRenews stays at 18.
15:11: We start the previously killed client again. It registers, and on the new registration Eureka increments expectedNumberOfClientsSendingRenews, which is now 19. It also recalculates the threshold: (int) (19 * 0.85 * 60 / 15) == 64.
So we returned the system to its previous state, yet the threshold changed from 61 to 64. This is wrong, and it can actually lead to unwanted self-preservation kicking in if a significant number of instances are in a status other than UP. (A worked sketch of this arithmetic follows below.)
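
As an aside, here is a minimal, self-contained Java sketch of the arithmetic above. It is illustrative only: the class and constant names are made up, and it merely assumes the threshold formula that the observed numbers imply (clients * (60 / renewalIntervalSeconds) * renewalPercentThreshold, truncated to int).

    // Illustrative sketch only; names are hypothetical, not the actual Eureka fields.
    public class ThresholdSketch {
        static final double RENEWAL_PERCENT_THRESHOLD = 0.85;  // eureka.renewalPercentThreshold default
        static final int CLIENT_RENEWAL_INTERVAL_SECONDS = 15; // eureka.expectedClientRenewalIntervalSeconds in this setup

        // Renews-per-minute threshold as implied by the numbers in both experiments.
        static int renewsPerMinThreshold(int expectedClientsSendingRenews) {
            return (int) (expectedClientsSendingRenews
                    * (60.0 / CLIENT_RENEWAL_INTERVAL_SECONDS)
                    * RENEWAL_PERCENT_THRESHOLD);
        }

        public static void main(String[] args) {
            System.out.println(renewsPerMinThreshold(18)); // 61 -> initial threshold (16 clients + 2 Eurekas)
            System.out.println(renewsPerMinThreshold(19)); // 64 -> after the killed client re-registers
            System.out.println(renewsPerMinThreshold(17)); // 57 -> Experiment 2, where OUT_OF_SERVICE is not filtered

            // Why Experiment 1 never updates the threshold: the UP-only view yields count == 13,
            // and 13 > 0.85 * 18 (i.e. 13 > 15.3) is false, so expectedNumberOfClientsSendingRenews stays at 18.
            System.out.println(13 > 0.85 * 18); // false (Experiment 1: no update)
            System.out.println(17 > 0.85 * 18); // true  (Experiment 2: update happens)
        }
    }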
Experiment 2: looking for a fix
This time we configure eureka.shouldFilterOnlyUpInstances=false on the Eureka nodes.
15:33: 16 client instances in UP plus 2 Eurekas. The threshold is 61 = (int) (18 * 4 * 0.85); 68 renews come in per minute, renews are above the threshold, all is fine.
15:34: 4 client instances were put into OUT_OF_SERVICE state via Eureka’s REST API.
15:37: Logs indicate that the threshold-updating method was called, and the threshold is still 61. This is expected: we did not delete anything, so the threshold should stay the same.
15:43: We kill -9 one of the clients and wait for eviction to happen.
15:44: Logs show that the eviction happened.
15:45: The /eureka/status endpoint says that only 64 renews came in the last minute. Reasonable: we had 68 per minute, killed one service, and now we have 64.
15:45: The threshold is updated to 57, and this is what makes the difference! Now the check sees count == 17:
((17) > (0.85 * 18)) == ((17) > 15.3) == true,
so expectedNumberOfClientsSendingRenews becomes 17 and the threshold is updated to (int) (17 * 0.85 * 60 / 15) == 57.
Conclusion:
We should probably instruct users to configure eureka.shouldFilterOnlyUpInstances=false on the Eureka server, and add this configuration as the default for the eureka-server module that assembles the .war file (a configuration sketch follows below).
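
For illustration only, a minimal sketch of what that could look like in the server-side client properties (the file name and comments are assumptions; the property name is the one discussed in this issue):

    # eureka-client.properties on the Eureka server node (file name is an assumption)
    # Let the server's internal client see instances in all statuses, not just UP,
    # so OUT_OF_SERVICE instances still count toward the renewal threshold calculation.
    eureka.shouldFilterOnlyUpInstances=false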
Question:
But why do we actually calculate the expected registry size using the Eureka client, and not, well, the registry size itself? It looks like I am missing something non-obvious but critical. Could you kindly clarify?

Thanks!

@holy12345
Contributor

Hi @mgtriffid,

Thanks for this question, I find it interesting. I notice that in this case you have two Eureka servers, and each of them acts as a Eureka client for the other. A Eureka client fetches InstanceInfo from a Eureka server, so each server fetches all InstanceInfo from the other and filters out the instances whose status is not UP, which causes the issue you mention (I think this is the key point).

Thanks :)

@pparth

pparth commented Oct 9, 2018

@qiangdavidliu Any help here?

@qiangdavidliu
Contributor

Sorry for the delayed response, and thanks for the detailed write-up. I am also unsure of the reason why the calculation was done on the client rather than the registry, and I found this PR from 2013 (!) showing that registry-based calculation was possible at one point (4cd9c6e). I suspect that the registry was eventually not used due to the concurrent nature of accessing the server's internal registry at scale.

As for your suggestion that we apply eureka.shouldFilterOnlyUpInstances=false as a default on the server side (and internally, we do run with that set to false too), that makes a lot of sense.

qiangdavidliu added a commit that referenced this issue Oct 20, 2018
The eureka server's internal eureka-client needs to be configured to fetch instances in all statuses, not just UP. This is necessary to properly calculate the correct self-preservation threshold. See issue #1127 for the discussion.
@yikangfeng

(Quotes the original issue description above.)

Why is count 13 after killing a service in Experiment 1? Shouldn't it be 17? Thank you.

@yikangfeng

(Quotes the original issue description and their previous question above.)

Sorry, I see it now: the count excludes the four OUT_OF_SERVICE services plus the one killed service, right?
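
For what it is worth, here is my back-of-the-envelope reading of the numbers from the write-up above (an interpretation, not an authoritative answer):

    Experiment 1 (UP-only view):    16 clients - 4 OUT_OF_SERVICE - 1 killed + 2 Eurekas = 13
    Experiment 2 (unfiltered view): 16 clients - 1 killed + 2 Eurekas = 17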

@troshko111
Contributor

I assume #1140 should address this for now; please re-open if you disagree.
