Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retain addresses on service discovery failure #1115

Merged
merged 3 commits into from
Aug 6, 2020

Conversation

NiteshKant
Copy link
Collaborator

Motivation

Service discovery errors are retried by clients to avoid transient failures cause request errors. DNS service discoverer eagerly clears addresses for errors of type UnknownHostException but such errors can also be transient. This DNS discoverer behavior effectively defeats the retries done by the client as it results in request errors.
We should avoid service discovery implementations to react pessimistically to service discovery errors as service discovery results are just a hint and do not effectively represent the current state of the servers. Server availability may change asynchronously and may not accurately be reflected in service discovery result due to delays at multiple levels. Instead, we should assume that all service discovery errors are transient and retain the addresses till a subsequent success from service discovery.

Modification

Introduced a new contract for ServiceDiscoveryRetryStrategy and added a default implementation that retains addresses between an SD error and success. Old addresses are retained till a configured percentage of new addresses are removed post retry. This is done because service discovery events are emitted one at a time and hence does not provide a natural boundary which indicates completion of a successful resolution. Users can tune this percentage; the default is 75%.

Removed DNS service discoverer behavior of clearing addresses on certain errors.

Result

Clients better tolerate transient service discoverer errors irrespective of individual service discoverer implementation.

__Motivation__

Service discovery errors are retried by clients to avoid transient failures cause request errors. DNS service discoverer eagerly clears addresses for errors of type `UnknownHostException` but such errors can also be transient. This DNS discoverer behavior effectively defeats the retries done by the client as it results in request errors.
We should avoid service discovery implementations to react pessimistically to service discovery errors as service discovery results are just a hint and do not effectively represent the current state of the servers. Server availability may change asynchronously and may not accurately be reflected in service discovery result due to delays at multiple levels. Instead, we should assume that all service discovery errors are transient and retain the addresses till a subsequent success from service discovery.

__Modification__

Introduced a new contract for `ServiceDiscoveryRetryStrategy` and added a default implementation that retains addresses between an SD error and success. Old addresses are retained till a configured percentage of new addresses are removed post retry. This is done because service discovery events are emitted one at a time and hence does not provide a natural boundary which indicates completion of a successful resolution. Users can tune this percentage; the default is 75%.

Removed DNS service discoverer behavior of clearing addresses on certain errors.

__Result__

Clients better tolerate transient service discoverer errors irrespective of individual service discoverer implementation.
Copy link
Member

@idelpivnitskiy idelpivnitskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach LGTM, a few minor comments/questions:

Copy link
Member

@idelpivnitskiy idelpivnitskiy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢
One nit comment that you can ignore :)

}

@Override
public Completable apply(final int count, final Throwable cause) {
// As we are retrying indefinitely (unless closed), cap the backoff on MAX_RETRIES retries to avoid
// impractical backoffs
return delegate.apply(count % (MAX_RETRIES - 1), cause);
return delegate.apply(max(1, count % (maxRetries)), cause);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With max approach it will return 1 two times. Not a big deal. But if you don't like it, alternative could be (count % MAX_RETRIES) + 1.
Also, IIRC, the power of 2 value will be a bit more optimal for %. How about 8 or 16 instead of 10? Or just a mask of 0x0F?

* @return {@code this}.
*/
public Builder<ResolvedAddress, E> retainAddressesTillSuccess(final int retainTillReceivePercentage) {
if (retainTillReceivePercentage < 0) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor ask : Requires upper cap as well retainTillReceivePercentage should not be greater than 100

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not adding an upper cap is intentional here as it enables folks to retain older addresses for a longish time.

} else {
retainedAddresses.putAll(activeAddresses);
}
targetSize = (int) (ceil(retainTillReceivePercentage / 100d) * activeAddresses.size());

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

System.out.println("Output for 0 :" +(int)Math.ceil(0/100d)); System.out.println("Output for 10 :" +(int)Math.ceil(10/100d)); System.out.println("Output for 75 :" +(int)Math.ceil(75/100d)); System.out.println("Output for 100 :" +(int)Math.ceil(100/100d));

Output for 0 :0 Output for 10 :1 Output for 75 :1 Output for 100 :1

Seems like Percentage computation is bit off. Irrespective of percentage targetSize = 1 * activeAddresses.size()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great catch!

Wrong placement of brackets 😢

I will fix and add tests, thanks!

@NiteshKant NiteshKant merged commit ab1b686 into apple:main Aug 6, 2020
@NiteshKant NiteshKant deleted the sd-err-retain branch August 6, 2020 02:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants