Enhance clarifications for `ServiceDiscoverer` contract #3002

idelpivnitskiy · 2024-07-12T16:27:01Z

Motivation:

Some assumptions that are currently in place are not clarified in documentation for new implementations.

Modifications:

Clarify that the returned publisher should never complete and must support re-subscribes to handle failures and cancellations.
Clarify motivation for always starting with the "state of the world" for interpreting ServiceDiscovererEvent(s).
Add recommendation to implement a ServiceDiscoverer using a single thread.

Result:

More details for expected behavior of a ServiceDiscoverer implementation.

Motivation: Some assumptions that are currently in place are not clarified in documentation for new implementations. Modifications: - Clarify that the returned publisher should never complete and must support re-subscribes to handle failures and cancellations. - Clarify motivation for always starting with the "state of the world" for interpreting `ServiceDiscovererEvent`(s). - Add recommendation to implement a `ServiceDiscoverer` using a single thread. Result: More details for expected behavior of a `ServiceDiscoverer` implementation.

Motivation: Currently, both implementations of the `LoadBalancer` assume 2 possibilities for the first events collection after re-subscribe: "state of the world" or "delta" and tries to infer which one it is based on what event statuses are present in the collection. However, it doesn't provide any guarantees because even if SD impl keeps returning deltas after a re-subscribe, a delta can contain only `AVAILABLE` events, but it will be incorrectly interpreted as "state of the world". Because of Reactive Streams rule clarified in apple#3002, `LoadBalancer` implementations should always expect "state of the world" for the first events after re-subscribe. Modifications: - Remove inference of the first collection of events after resubscribe and always expect a new subscriber to start from "state of the world". - Update `resubscribeToEventsWhenAllHostsAreUnhealthy()` test to reflect behavior change. Result: Both `LoadBalancer` implementations follow `ServiceDiscoverer` contract and expect the first collection of events after re-subscribe to represent a "state of the world".

Motivation: Clients have a configurable `serviceDiscovererRetryStrategy` to guarantee a steady stream of events to the `LoadBalancer` that never fails. It's necessary at the client level to avoid hanging requests indefinitely and let requests observe failures from ServiceDiscoverer. Also, for `PartitionedHttpClient` it's necessary to guarantee that `GroupedPublisher` never fails. Retry is effectively a re-subscribe. According to `ServiceDiscoverer` contract (clarified in apple#3002), each `Subscriber` receives a "state of the world" as the first collection of events. The problem is that the state may change significantly between retries, as a result unavailable addresses can remain inside the `LoadBalancer` forever. Example: T1. SD delivers [a,b] T1. LB receives [a,b] T1. SD delivers error T2. SD info changed ("a" got revoked) T3. Client retries SD T3. SD delivers [b] T3. LB receives [b] (but still holds "a") When we retry `ServiceDiscoverer` errors, we should keep pushing deltas downstream or purge events that are not present in the new "state of the world". We previously had this protection but it was mistakenly removed in apple#1949 as part of a broader refactoring around `ServiceDiscoverer` <-> `LoadBalancer` contract. Modifications: - Add `RetryingServiceDiscoverer` that handles retries and keeps the state between retries. - Use it in `DefaultSingleAddressHttpClientBuilder` and `DefaultPartitionedHttpClientBuilder`. - Use `CastedServiceDiscoverer` to allow modifications for `ServiceDiscovererEvent` after we started to use a wildcard type in apple#2379. - Pass consistent `targetResource` identifier to both `RetryingServiceDiscoverer` and `LoadBalancerFactory` to allow state correlation when inspecting heap dump. Result: Client keeps pushing deltas to `LoadBalancer` after retrying `ServiceDiscoverer` errors, keeping its state consistent with `ServiceDiscoverer`.

servicetalk-client-api/src/main/java/io/servicetalk/client/api/ServiceDiscoverer.java

bryce-anderson · 2024-07-12T22:24:06Z

servicetalk-client-api/src/main/java/io/servicetalk/client/api/ServiceDiscoverer.java

+ * Because typically a {@link ServiceDiscoverer} implementation runs in the background and doesn't require many compute
+ * resources, it's recommended (but not required) to run it and deliver updates on a single thread either for all
+ * discoveries or at least for all {@link Subscriber Subscribers} to the same {@link Publisher}.


fwiw, it's not clear what this buys the implementors.

I don't want to put it as a requirement, it's just a nice-to-have recommendation bcz most ppl don't think about it when they implement.
For RRLB it will help with this case: #3005 (comment)
But we are moving away from RRLB anyway.

In general, it makes debugging easier bcz you can track all events from a single thread and logs will represent the exact serial behavior. If they use multiple threads, logs may mess up order of events when LB cancels and resubscribes.

Do you have a recommendation how to rephrase?

My best suggestion is basically what you said in the reply. 🙂 I added as a suggestion but I'm also not strongly opinionated about it. If you prefer to keep it short and not try to sell it then that's perfectly fine with me.

servicetalk-client-api/src/main/java/io/servicetalk/client/api/ServiceDiscoverer.java

…#3004) Motivation: Currently, both implementations of the `LoadBalancer` assume 2 possibilities for the first events collection after re-subscribe: "state of the world" or "delta" and try to infer which one it is based on what event statuses are present in the collection. However, it doesn't provide any guarantees because even if SD impl keeps returning deltas after a re-subscribe, a delta can contain only `AVAILABLE` events, but it will be incorrectly interpreted as "state of the world". Because of Reactive Streams rule clarified in #3002, `LoadBalancer` implementations should always expect "state of the world" for the first events after re-subscribe. Modifications: - Remove inference of the first collection of events after re-subscribe and always expect a new subscriber to start from "state of the world". - Update `resubscribeToEventsWhenAllHostsAreUnhealthy()` test to reflect behavior change. Result: Both `LoadBalancer` implementations follow `ServiceDiscoverer` contract and expect the first collection of events after re-subscribe to represent a "state of the world".

…3006) Motivation: Clients have a configurable `serviceDiscovererRetryStrategy` to guarantee a steady stream of events to the `LoadBalancer` that never fails. It's necessary at the client level to avoid hanging requests indefinitely and let requests observe failures from ServiceDiscoverer. Also, for `PartitionedHttpClient` it's necessary to guarantee that `GroupedPublisher` never fails. Retry is effectively a re-subscribe. According to `ServiceDiscoverer` contract (clarified in #3002), each `Subscriber` receives a "state of the world" as the first collection of events. The problem is that the state may change significantly between retries. As a result, unavailable addresses can remain inside the `LoadBalancer` forever. Example: T1. SD delivers [a,b] T1. LB receives [a,b] T1. SD delivers error T2. SD info changed ("a" got revoked) T3. Client retries SD T3. SD delivers [b] T3. LB receives [b] (but still holds "a") When we retry `ServiceDiscoverer` errors, we should keep pushing deltas downstream or purge events that are not present in the new "state of the world". We previously had this protection but it was mistakenly removed in #1949 as part of a broader refactoring around `ServiceDiscoverer` <-> `LoadBalancer` contract. Modifications: - Add `RetryingServiceDiscoverer` that handles retries and keeps the state between retries. - Use it in `DefaultSingleAddressHttpClientBuilder` and `DefaultPartitionedHttpClientBuilder`. - Use `CastedServiceDiscoverer` to allow modifications for `ServiceDiscovererEvent` after we started to use a wildcard type in #2379. - Pass consistent `targetResource` identifier to both `RetryingServiceDiscoverer` and `LoadBalancerFactory` to allow state correlation when inspecting heap dump. Result: Client keeps pushing deltas to `LoadBalancer` after retrying `ServiceDiscoverer` errors, keeping its state consistent with `ServiceDiscoverer`.

idelpivnitskiy mentioned this pull request Jul 12, 2024

LoadBalancer should always consider the first events as initial state #3004

Merged

idelpivnitskiy self-assigned this Jul 12, 2024

idelpivnitskiy mentioned this pull request Jul 12, 2024

Incorrect state possible after retrying ServiceDiscoverer events #3006

Merged

idelpivnitskiy requested review from bryce-anderson and tkountis July 12, 2024 17:59

bryce-anderson approved these changes Jul 12, 2024

View reviewed changes

bryce-anderson reviewed Jul 12, 2024

View reviewed changes

servicetalk-client-api/src/main/java/io/servicetalk/client/api/ServiceDiscoverer.java Outdated Show resolved Hide resolved

clarify motivation for a single threaded model

456183d

bryce-anderson approved these changes Jul 13, 2024

View reviewed changes

tkountis approved these changes Jul 17, 2024

View reviewed changes

idelpivnitskiy merged commit b700ef4 into apple:main Jul 17, 2024
11 checks passed

idelpivnitskiy deleted the sd-contract branch July 17, 2024 23:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enhance clarifications for `ServiceDiscoverer` contract #3002

Enhance clarifications for `ServiceDiscoverer` contract #3002

idelpivnitskiy commented Jul 12, 2024

bryce-anderson Jul 12, 2024

idelpivnitskiy Jul 12, 2024 •

edited

Loading

bryce-anderson Jul 12, 2024

Enhance clarifications for ServiceDiscoverer contract #3002

Enhance clarifications for ServiceDiscoverer contract #3002

Conversation

idelpivnitskiy commented Jul 12, 2024

bryce-anderson Jul 12, 2024

Choose a reason for hiding this comment

idelpivnitskiy Jul 12, 2024 • edited Loading

Choose a reason for hiding this comment

bryce-anderson Jul 12, 2024

Choose a reason for hiding this comment

Enhance clarifications for `ServiceDiscoverer` contract #3002

Enhance clarifications for `ServiceDiscoverer` contract #3002

idelpivnitskiy Jul 12, 2024 •

edited

Loading