SP threshold is not updated correctly if >15% of client instances are not in UP status #1127
Comments
Hi @mgtriffid, thanks for this question; I think it is an interesting one. I notice in this case you have two thanks :)
@qiangdavidliu Any help here?
Sorry for the delayed response, and thanks for the detailed write-up. I am also unsure of the reason why the calculation was done on the client rather than the registry, and found this PR from 2013 (!) showing that registry-based calculation was possible at one point (4cd9c6e). I suspect the registry eventually stopped being used due to the concurrent nature of accessing the server's internal registry at scale. As for your suggestion that we apply
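To make the "registry-based" alternative concrete: below is a purely hypothetical sketch of counting renew-senders from a registry view (an Applications object) rather than through the embedded eureka-client. The helper name is made up, and this is not what eureka-core does today; it only illustrates the idea being discussed.

```java
import com.netflix.discovery.shared.Application;
import com.netflix.discovery.shared.Applications;

// Hypothetical helper (not Eureka code): count every registered instance in a
// registry view, regardless of status. Applied to the server's own view of the
// registry, this would include OUT_OF_SERVICE instances that the default,
// UP-filtering internal client never sees.
final class RegistrySizeSketch {
    static int countAllInstances(Applications apps) {
        int count = 0;
        for (Application app : apps.getRegisteredApplications()) {
            count += app.getInstances().size();
        }
        return count;
    }
}
```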
The Eureka server's internal eureka-client needs to be configured to get instances for all statuses, not just UP. This is necessary to properly calculate the correct self-preservation threshold. See issue #1127 for a discussion.
Why is count 13 after killing a service in Experiment 1? Should it be 17? Thank you.
Sorry, it should be four OUT_OF_SERVICE services plus one killed service, right?
I assume #1140 should do it for now; please re-open if you disagree.
Environment:
A cluster of 2 Eureka nodes running the latest version
16 client instances talking to this cluster
Experiment 1: confirming the issue
15:00: 16 client instances in UP plus 2 Eurekas. The threshold is 61 = (int)(18 * 4 * 0.85) (the arithmetic is sketched at the end of this experiment); 68 renews arrive every minute, renews are above the threshold, all is fine.
15:02: 4 client instances were put into OUT_OF_SERVICE state via Eureka’s REST API.
15:03: Logs indicate that the threshold-updating method was called and that the threshold is still 61. This is expected: we did not delete anything, so the threshold should stay the same.
15:04: We kill -9 one of the clients and wait for eviction to happen.
15:06: Logs show that eviction happened.
15:06: The /eureka/status endpoint says that only 64 renews came in during the last minute. Reasonable: we had 68 per minute, killed one service, so now we have 64.
15:07: Logs indicate that the threshold-updating method was called and that the threshold is still 61. Now this is becoming problematic. Common sense says: “Only 1 instance was evicted, 1 < 0.15 * 18, evictions happen from time to time, we should adjust the self-preservation threshold now.” But the tricky condition of the method updateRenewalThreshold comes into play:
eureka/eureka-core/src/main/java/com/netflix/eureka/registry/PeerAwareInstanceRegistryImpl.java, lines 533 to 537 in b2c5ea6
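The embedded source snippet does not render here. Roughly, the logic around those lines looks like the following; this is a paraphrase based on the behaviour described in this issue and my reading of eureka-core around that commit, so field and helper names should be treated as approximate and the exact line numbers depend on the commit.

```java
// Inside PeerAwareInstanceRegistryImpl.updateRenewalThreshold(), roughly:
Applications apps = eurekaClient.getApplications(); // the server's internal client view,
                                                    // which by default contains only UP instances
int count = 0;
for (Application app : apps.getRegisteredApplications()) {
    for (InstanceInfo instance : app.getInstances()) {
        if (this.isRegisterable(instance)) {
            ++count;
        }
    }
}
synchronized (lock) {
    // Only adopt the new count (and recompute the per-minute renewal threshold)
    // if it clears the renewal-percent threshold (0.85 here) of the previously
    // expected client count, or if self-preservation is disabled.
    if ((count) > (serverConfig.getRenewalPercentThreshold() * expectedNumberOfClientsSendingRenews)
            || (!this.isSelfPreservationModeEnabled())) {
        this.expectedNumberOfClientsSendingRenews = count;
        updateRenewalsPerMinThreshold();
    }
}
```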
This code now thinks that count == 13, not 17 (11 client instances in UP plus the two Eurekas), while the previous value of expectedNumberOfClientsSendingRenews was 18, since in the beginning everything was in UP. The condition evaluates to ((13) > (0.85 * 18)) == ((13) > 15.3) == false, so the threshold is not updated and expectedNumberOfClientsSendingRenews is still 18.
15:11: Started the previously killed client again. The client registers, and on the new registration Eureka increments expectedNumberOfClientsSendingRenews, which is now 19. It also recalculates the threshold: (int) (19 * 0.85 * 60 / 15) == 64.
So, we returned the system to its previous state, but the threshold changed from 61 to 64. This is wrong, and it can actually lead to unwanted self-preservation whenever a significant number of instances is in a status other than UP.
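To make the numbers above easy to check, here is a minimal, self-contained sketch of the threshold arithmetic as described in this write-up (not the actual Eureka code), assuming our 15-second renewal interval, i.e. 60 / 15 = 4 expected renews per client per minute:

```java
// Sketch of the self-preservation threshold arithmetic used in this write-up.
final class ThresholdMath {
    static int renewalThreshold(int expectedClientsSendingRenews,
                                int renewalIntervalSecs,
                                double renewalPercentThreshold) {
        // expected clients * renews per client per minute * percent threshold, truncated to int
        return (int) (expectedClientsSendingRenews
                * (60.0 / renewalIntervalSecs)
                * renewalPercentThreshold);
    }

    public static void main(String[] args) {
        System.out.println(renewalThreshold(18, 15, 0.85)); // 61 -> value at 15:00
        System.out.println(renewalThreshold(19, 15, 0.85)); // 64 -> value after the 15:11 re-registration
        System.out.println(renewalThreshold(17, 15, 0.85)); // 57 -> value seen in Experiment 2 below
    }
}
```

The 61 to 64 jump comes purely from expectedNumberOfClientsSendingRenews going from 18 to 19, while the four OUT_OF_SERVICE instances were never subtracted.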
Experiment 2: looking for a fix
This time we configure eureka.shouldFilterOnlyUpInstances=false on the Eureka nodes.
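For concreteness, one way this could be wired up is shown below. The property name comes from this issue; the file name and placement (the Archaius-loaded eureka-client.properties on the server's classpath, since it is the server's internal eureka-client that must stop filtering) are an assumption about a stock setup, not a verified recipe.

```properties
# Assumption: stock Archaius property loading on the Eureka server nodes.
# The server's internal eureka-client reads this, so it stops filtering out
# non-UP instances when fetching the registry for the threshold calculation.
eureka.shouldFilterOnlyUpInstances=false
```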
15:33: 16 client instances in UP plus 2 Eurekas. The threshold is 61 = (int)(18 * 4 * 0.85); 68 renews arrive every minute, renews are above the threshold, all is fine.
15:34: 4 client instances were put into OUT_OF_SERVICE state via Eureka’s REST API.
15:37: Logs indicate that the threshold-updating method was called and that the threshold is still 61. This is expected: we did not delete anything, so the threshold should stay the same.
15:43: We kill -9 one of the clients and wait for eviction to happen.
15:44: Logs show that eviction happened.
15:45: The /eureka/status endpoint says that only 64 renews came in during the last minute. Reasonable: we had 68 per minute, killed one service, so now we have 64.
15:45: The threshold was updated, and it is 57, and this is what makes the difference! This time the check saw count == 17: ((17) > (0.85 * 18)) == ((17) > 15.3) == true, so expectedNumberOfClientsSendingRenews becomes 17 and the threshold is updated to (int) (17 * 0.85 * 60 / 15) == 57.
Conclusion:
Probably, we should instruct users to configure eureka.shouldFilterOnlyUpInstances=false on the Eureka server, and add this configuration as the default for the eureka-server module which assembles the .war file.
Question:
But why do we actually calculate the registry size using the Eureka client and not, well, the registry size itself? It looks like I am missing something non-obvious but critical. Could you kindly clarify?
Thanks!