Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhance clarifications for ServiceDiscoverer contract #3002

Merged
merged 2 commits into from
Jul 17, 2024

Conversation

idelpivnitskiy
Copy link
Member

Motivation:

Some assumptions that are currently in place are not clarified in documentation for new implementations.

Modifications:

  • Clarify that the returned publisher should never complete and must support re-subscribes to handle failures and cancellations.
  • Clarify motivation for always starting with the "state of the world" for interpreting ServiceDiscovererEvent(s).
  • Add recommendation to implement a ServiceDiscoverer using a single thread.

Result:

More details for expected behavior of a ServiceDiscoverer implementation.

Motivation:

Some assumptions that are currently in place are not clarified in
documentation for new implementations.

Modifications:

- Clarify that the returned publisher should never complete and must
support re-subscribes to handle failures and cancellations.
- Clarify motivation for always starting with the "state of the world"
for interpreting `ServiceDiscovererEvent`(s).
- Add recommendation to implement a `ServiceDiscoverer` using a single
thread.

Result:

More details for expected behavior of a `ServiceDiscoverer`
implementation.
idelpivnitskiy added a commit to idelpivnitskiy/servicetalk that referenced this pull request Jul 12, 2024
Motivation:

Currently, both implementations of the `LoadBalancer` assume 2
possibilities for the first events collection after re-subscribe:
"state of the world" or "delta" and tries to infer which one it is based
on what event statuses are present in the collection. However, it
doesn't provide any guarantees because even if SD impl keeps returning
deltas after a re-subscribe, a delta can contain only `AVAILABLE`
events, but it will be incorrectly interpreted as "state of the world".

Because of Reactive Streams rule clarified in apple#3002, `LoadBalancer`
implementations should always expect "state of the world" for the
first events after re-subscribe.

Modifications:

- Remove inference of the first collection of events after resubscribe
and always expect a new subscriber to start from "state of the world".
- Update `resubscribeToEventsWhenAllHostsAreUnhealthy()` test to reflect
behavior change.

Result:

Both `LoadBalancer` implementations follow `ServiceDiscoverer` contract
and expect the first collection of events after re-subscribe to
represent a "state of the world".
@idelpivnitskiy idelpivnitskiy self-assigned this Jul 12, 2024
idelpivnitskiy added a commit to idelpivnitskiy/servicetalk that referenced this pull request Jul 12, 2024
Motivation:

Clients have a configurable `serviceDiscovererRetryStrategy` to
guarantee a steady stream of events to the `LoadBalancer` that never
fails. It's necessary at the client level to avoid hanging requests
indefinitely and let requests observe failures from ServiceDiscoverer.
Also, for `PartitionedHttpClient` it's necessary to guarantee that
`GroupedPublisher` never fails.

Retry is effectively a re-subscribe. According to `ServiceDiscoverer`
contract (clarified in apple#3002), each `Subscriber` receives a "state of
the world" as the first collection of events. The problem is that the
state may change significantly between retries, as a result unavailable
addresses can remain inside the `LoadBalancer` forever. Example:

T1. SD delivers [a,b]
T1. LB receives [a,b]
T1. SD delivers error
T2. SD info changed ("a" got revoked)
T3. Client retries SD
T3. SD delivers [b]
T3. LB receives [b] (but still holds "a")

When we retry `ServiceDiscoverer` errors, we should keep pushing deltas
downstream or purge events that are not present in the new "state of the
world".

We previously had this protection but it was mistakenly removed in apple#1949
as part of a broader refactoring around `ServiceDiscoverer` <->
`LoadBalancer` contract.

Modifications:

- Add `RetryingServiceDiscoverer` that handles retries and keeps the
state between retries.
- Use it in `DefaultSingleAddressHttpClientBuilder` and
`DefaultPartitionedHttpClientBuilder`.
- Use `CastedServiceDiscoverer` to allow modifications for
`ServiceDiscovererEvent` after we started to use a wildcard type in
apple#2379.
- Pass consistent `targetResource` identifier to both
`RetryingServiceDiscoverer` and `LoadBalancerFactory` to allow state
correlation when inspecting heap dump.

Result:

Client keeps pushing deltas to `LoadBalancer` after retrying
`ServiceDiscoverer` errors, keeping its state consistent with
`ServiceDiscoverer`.
Comment on lines 30 to 32
* Because typically a {@link ServiceDiscoverer} implementation runs in the background and doesn't require many compute
* resources, it's recommended (but not required) to run it and deliver updates on a single thread either for all
* discoveries or at least for all {@link Subscriber Subscribers} to the same {@link Publisher}.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fwiw, it's not clear what this buys the implementors.

Copy link
Member Author

@idelpivnitskiy idelpivnitskiy Jul 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to put it as a requirement, it's just a nice-to-have recommendation bcz most ppl don't think about it when they implement.
For RRLB it will help with this case: #3005 (comment)
But we are moving away from RRLB anyway.

In general, it makes debugging easier bcz you can track all events from a single thread and logs will represent the exact serial behavior. If they use multiple threads, logs may mess up order of events when LB cancels and resubscribes.

Do you have a recommendation how to rephrase?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My best suggestion is basically what you said in the reply. 🙂 I added as a suggestion but I'm also not strongly opinionated about it. If you prefer to keep it short and not try to sell it then that's perfectly fine with me.

idelpivnitskiy added a commit that referenced this pull request Jul 15, 2024
…#3004)

Motivation:

Currently, both implementations of the `LoadBalancer` assume 2 possibilities for the first events collection after re-subscribe: "state of the world" or "delta" and try to infer which one it is based on what event statuses are present in the collection. However, it doesn't provide any guarantees because even if SD impl keeps returning deltas after a re-subscribe, a delta can contain only `AVAILABLE` events, but it will be incorrectly interpreted as "state of the world".

Because of Reactive Streams rule clarified in #3002, `LoadBalancer` implementations should always expect "state of the world" for the first events after re-subscribe.

Modifications:

- Remove inference of the first collection of events after re-subscribe and always expect a new subscriber to start from "state of the world".
- Update `resubscribeToEventsWhenAllHostsAreUnhealthy()` test to reflect behavior change.

Result:

Both `LoadBalancer` implementations follow `ServiceDiscoverer` contract and expect the first collection of events after re-subscribe to represent a "state of the world".
@idelpivnitskiy idelpivnitskiy merged commit b700ef4 into apple:main Jul 17, 2024
11 checks passed
@idelpivnitskiy idelpivnitskiy deleted the sd-contract branch July 17, 2024 23:54
idelpivnitskiy added a commit that referenced this pull request Jul 18, 2024
…3006)

Motivation:

Clients have a configurable `serviceDiscovererRetryStrategy` to guarantee a steady stream of events to the `LoadBalancer` that never fails. It's necessary at the client level to avoid hanging requests indefinitely and let requests observe failures from ServiceDiscoverer. Also, for `PartitionedHttpClient` it's necessary to guarantee that `GroupedPublisher` never fails.

Retry is effectively a re-subscribe. According to `ServiceDiscoverer` contract (clarified in #3002), each `Subscriber` receives a "state of the world" as the first collection of events. The problem is that the state may change significantly between retries. As a result, unavailable addresses can remain inside the `LoadBalancer` forever. Example:

T1. SD delivers [a,b]
T1. LB receives [a,b]
T1. SD delivers error
T2. SD info changed ("a" got revoked)
T3. Client retries SD
T3. SD delivers [b]
T3. LB receives [b] (but still holds "a")

When we retry `ServiceDiscoverer` errors, we should keep pushing deltas downstream or purge events that are not present in the new "state of the world".

We previously had this protection but it was mistakenly removed in #1949 as part of a broader refactoring around `ServiceDiscoverer` <-> `LoadBalancer` contract.

Modifications:

- Add `RetryingServiceDiscoverer` that handles retries and keeps the state between retries.
- Use it in `DefaultSingleAddressHttpClientBuilder` and `DefaultPartitionedHttpClientBuilder`.
- Use `CastedServiceDiscoverer` to allow modifications for `ServiceDiscovererEvent` after we started to use a wildcard type in #2379.
- Pass consistent `targetResource` identifier to both `RetryingServiceDiscoverer` and `LoadBalancerFactory` to allow state correlation when inspecting heap dump.

Result:

Client keeps pushing deltas to `LoadBalancer` after retrying `ServiceDiscoverer` errors, keeping its state consistent with `ServiceDiscoverer`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants