-
Notifications
You must be signed in to change notification settings - Fork 184
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Retain addresses on service discovery failure #1115
Conversation
__Motivation__ Service discovery errors are retried by clients to avoid transient failures cause request errors. DNS service discoverer eagerly clears addresses for errors of type `UnknownHostException` but such errors can also be transient. This DNS discoverer behavior effectively defeats the retries done by the client as it results in request errors. We should avoid service discovery implementations to react pessimistically to service discovery errors as service discovery results are just a hint and do not effectively represent the current state of the servers. Server availability may change asynchronously and may not accurately be reflected in service discovery result due to delays at multiple levels. Instead, we should assume that all service discovery errors are transient and retain the addresses till a subsequent success from service discovery. __Modification__ Introduced a new contract for `ServiceDiscoveryRetryStrategy` and added a default implementation that retains addresses between an SD error and success. Old addresses are retained till a configured percentage of new addresses are removed post retry. This is done because service discovery events are emitted one at a time and hence does not provide a natural boundary which indicates completion of a successful resolution. Users can tune this percentage; the default is 75%. Removed DNS service discoverer behavior of clearing addresses on certain errors. __Result__ Clients better tolerate transient service discoverer errors irrespective of individual service discoverer implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach LGTM, a few minor comments/questions:
...ttp-netty/src/main/java/io/servicetalk/http/netty/DefaultSingleAddressHttpClientBuilder.java
Outdated
Show resolved
Hide resolved
...alk-http-api/src/main/java/io/servicetalk/http/api/DefaultServiceDiscoveryRetryStrategy.java
Show resolved
Hide resolved
...alk-http-api/src/main/java/io/servicetalk/http/api/DefaultServiceDiscoveryRetryStrategy.java
Outdated
Show resolved
Hide resolved
...alk-http-api/src/main/java/io/servicetalk/http/api/DefaultServiceDiscoveryRetryStrategy.java
Outdated
Show resolved
Hide resolved
...http-api/src/test/java/io/servicetalk/http/api/DefaultServiceDiscoveryRetryStrategyTest.java
Outdated
Show resolved
Hide resolved
...http-api/src/test/java/io/servicetalk/http/api/DefaultServiceDiscoveryRetryStrategyTest.java
Outdated
Show resolved
Hide resolved
...http-api/src/test/java/io/servicetalk/http/api/DefaultServiceDiscoveryRetryStrategyTest.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🚢
One nit comment that you can ignore :)
} | ||
|
||
@Override | ||
public Completable apply(final int count, final Throwable cause) { | ||
// As we are retrying indefinitely (unless closed), cap the backoff on MAX_RETRIES retries to avoid | ||
// impractical backoffs | ||
return delegate.apply(count % (MAX_RETRIES - 1), cause); | ||
return delegate.apply(max(1, count % (maxRetries)), cause); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With max
approach it will return 1
two times. Not a big deal. But if you don't like it, alternative could be (count % MAX_RETRIES) + 1
.
Also, IIRC, the power of 2 value will be a bit more optimal for %. How about 8 or 16 instead of 10? Or just a mask of 0x0F
?
* @return {@code this}. | ||
*/ | ||
public Builder<ResolvedAddress, E> retainAddressesTillSuccess(final int retainTillReceivePercentage) { | ||
if (retainTillReceivePercentage < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor ask : Requires upper cap as well retainTillReceivePercentage should not be greater than 100
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not adding an upper cap is intentional here as it enables folks to retain older addresses for a longish time.
} else { | ||
retainedAddresses.putAll(activeAddresses); | ||
} | ||
targetSize = (int) (ceil(retainTillReceivePercentage / 100d) * activeAddresses.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
System.out.println("Output for 0 :" +(int)Math.ceil(0/100d)); System.out.println("Output for 10 :" +(int)Math.ceil(10/100d)); System.out.println("Output for 75 :" +(int)Math.ceil(75/100d)); System.out.println("Output for 100 :" +(int)Math.ceil(100/100d));
Output for 0 :0 Output for 10 :1 Output for 75 :1 Output for 100 :1
Seems like Percentage computation is bit off. Irrespective of percentage targetSize = 1 * activeAddresses.size()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch!
Wrong placement of brackets 😢
I will fix and add tests, thanks!
Motivation
Service discovery errors are retried by clients to avoid transient failures cause request errors. DNS service discoverer eagerly clears addresses for errors of type
UnknownHostException
but such errors can also be transient. This DNS discoverer behavior effectively defeats the retries done by the client as it results in request errors.We should avoid service discovery implementations to react pessimistically to service discovery errors as service discovery results are just a hint and do not effectively represent the current state of the servers. Server availability may change asynchronously and may not accurately be reflected in service discovery result due to delays at multiple levels. Instead, we should assume that all service discovery errors are transient and retain the addresses till a subsequent success from service discovery.
Modification
Introduced a new contract for
ServiceDiscoveryRetryStrategy
and added a default implementation that retains addresses between an SD error and success. Old addresses are retained till a configured percentage of new addresses are removed post retry. This is done because service discovery events are emitted one at a time and hence does not provide a natural boundary which indicates completion of a successful resolution. Users can tune this percentage; the default is 75%.Removed DNS service discoverer behavior of clearing addresses on certain errors.
Result
Clients better tolerate transient service discoverer errors irrespective of individual service discoverer implementation.