Ensure services in failfast can become ready #858

olix0r · 2021-01-19T06:34:19Z

When a Service is in failfast, the inner service is only polled as new
requests are processed. This means it's theoretically possible for
certain service tasks to be starved.

This change pairs each failfast layer with a SpawnReady layer so that the
inner service is always driven to readiness, even when no requests arrive.
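
The starvation described above can be illustrated with a toy, std-only model. This is a sketch under stated assumptions, not the proxy's or tower's actual code: `Inner`, `first_ready_tick`, and the notion of a "tick" are hypothetical, and readiness is modeled as "polled N times" rather than with real futures and wakers.

```rust
/// Hypothetical stand-in for an inner service that only makes progress
/// toward readiness when it is polled.
struct Inner {
    polls_needed: u32,
    polls_seen: u32,
}

impl Inner {
    /// Returns true once the service has been polled `polls_needed` times.
    fn poll_ready(&mut self) -> bool {
        if self.polls_seen < self.polls_needed {
            self.polls_seen += 1;
        }
        self.polls_seen >= self.polls_needed
    }
}

/// Simulates `max_ticks` units of time and returns the first tick at which
/// the inner service reports readiness, if any.
///
/// Without `spawn_ready`, the inner service is polled only on ticks where a
/// request arrives (the failfast behavior described above). With
/// `spawn_ready`, a background driver polls it on every tick.
fn first_ready_tick(
    polls_needed: u32,
    request_ticks: &[u32],
    spawn_ready: bool,
    max_ticks: u32,
) -> Option<u32> {
    let mut inner = Inner { polls_needed, polls_seen: 0 };
    for tick in 0..max_ticks {
        let request_arrived = request_ticks.contains(&tick);
        if (spawn_ready || request_arrived) && inner.poll_ready() {
            return Some(tick);
        }
    }
    None
}
```

In this model, a service that needs three polls and sees requests only at ticks 0 and 50 never becomes ready within 100 ticks, whereas the same service with a background driver is ready by tick 2.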

This could potentially explain the behavior described in
linkerd/linkerd2#5183, though we don't have strong evidence to support
that. In any case, this seems like a healthy defensive measure.

This change also improves stack commentary to favor larger descriptive
comments over layer-level annotations.

While auditing services for readiness, an unnecessary buffer has been
removed from the ingress HTTP stack.


hawkw

this seems right to me, and +1 for the new commentary.

it would be nice to have more testing around failfast, but since we don't have a tight repro for the issues some people have been seeing, that can probably wait.

hawkw · 2021-01-19T18:04:09Z

linkerd/app/outbound/src/http/logical.rs

        .push_on_response(
            svc::layers()
-                .push(svc::layer::mk(svc::SpawnReady::new))
+                .push(http::BoxRequest::layer())


just out of curiosity, why put the box above metrics now?

mostly I want the box as close to the inner service as possible, as it's responsible for satisfying the type signature of the inner service.

hawkw · 2021-01-19T18:06:30Z

linkerd/app/outbound/src/ingress.rs

+                .push(svc::layer::mk(svc::SpawnReady::new))
+                .push(svc::FailFast::layer("HTTP Logical", dispatch_timeout))


TIOLI: might be worth having a "push spawn ready and failfast in one method" helper since we now want to pair them in most places. might also help avoid accidentally adding new failfasts without a spawn ready?

Agreed, but we don't have a nameable spawn ready layer at the moment.
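
One shape the helper suggested above might take, sketched with simplified stand-in types. Everything here is hypothetical: the `Layer` trait, `Stack`, `SpawnReadyLayer`, `FailFastLayer`, and `push_failfast` are illustrative names only and are not the proxy's or tower's API. The sketch assumes a nameable spawn-ready layer exists, which the reply above notes is not yet the case.

```rust
use std::time::Duration;

/// Hypothetical, simplified layer trait: wraps an inner service in an outer one.
trait Layer<S> {
    type Service;
    fn layer(&self, inner: S) -> Self::Service;
}

/// Stand-in for a service whose inner readiness is driven on a background task.
#[derive(Debug, PartialEq)]
struct SpawnReady<S>(S);

/// Stand-in for a service that fails requests after `timeout` of unavailability.
#[derive(Debug, PartialEq)]
struct FailFast<S> {
    inner: S,
    scope: &'static str,
    timeout: Duration,
}

struct SpawnReadyLayer;

impl<S> Layer<S> for SpawnReadyLayer {
    type Service = SpawnReady<S>;
    fn layer(&self, inner: S) -> Self::Service {
        SpawnReady(inner)
    }
}

struct FailFastLayer {
    scope: &'static str,
    timeout: Duration,
}

impl<S> Layer<S> for FailFastLayer {
    type Service = FailFast<S>;
    fn layer(&self, inner: S) -> Self::Service {
        FailFast { inner, scope: self.scope, timeout: self.timeout }
    }
}

/// Minimal stack builder: each `push` wraps the current service in a new layer.
struct Stack<S>(S);

impl<S> Stack<S> {
    fn push<L: Layer<S>>(self, layer: L) -> Stack<L::Service> {
        Stack(layer.layer(self.0))
    }

    /// Hypothetical helper: always installs SpawnReady directly beneath
    /// FailFast, so one cannot be added without the other.
    fn push_failfast(
        self,
        scope: &'static str,
        timeout: Duration,
    ) -> Stack<FailFast<SpawnReady<S>>> {
        self.push(SpawnReadyLayer)
            .push(FailFastLayer { scope, timeout })
    }

    fn into_inner(self) -> S {
        self.0
    }
}
```

Encoding the pairing in the return type (`FailFast<SpawnReady<S>>`) means that a caller using only `push_failfast` cannot build a failfast layer without its spawn-ready companion.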

hawkw · 2021-01-19T18:07:02Z

linkerd/app/outbound/src/http/logical.rs

-                .push(http::BoxRequest::layer()),
+                // Ensure individual endpoints are driven to readiness so that
+                // the balancer need not drive them all directly.
+                .push(svc::layer::mk(svc::SpawnReady::new)),


TIOLI: could be nice to add a SpawnReady::layer?

tower-rs/tower#536

This release improves diagnostics about the proxy's failfast state:

* Warnings are now emitted when the failfast state is entered;
* The "max concurrency exhausted" gRPC message has been changed to more clearly indicate a failfast state error; and
* Failfast recovery has been made more robust, ensuring that a service can recover independently of new requests being received.

Furthermore, metric labeling has been improved:

* TCP server metrics are now annotated with the original `target_addr`;
* The `tls` label is now set to true for inbound TLS connections that lack a client ID. This is mostly helpful to clarify inbound metrics on the `identity` controller;
* Outbound `tls` metrics could be reported incorrectly when a proxy was configured to not use identity. This has been corrected.

Finally, socket-level errors now include a _client_ or _server_ prefix to indicate which side of the proxy encountered the error.

---

* stack: remove `map_response` (linkerd/linkerd2-proxy#835)
* replace `RequestFilter` with Tower's upstream impl (linkerd/linkerd2-proxy#842)
* tracing: fix incorrect field format when logging in JSON (linkerd/linkerd2-proxy#845)
* replace `FutureService` with Tower's upstream impl (linkerd/linkerd2-proxy#839)
* integration: improve tracing in tests (linkerd/linkerd2-proxy#846)
* service-profiles: Prevent Duration coercion panics (linkerd/linkerd2-proxy#844)
* inbound: Separate HTTP server logic from protocol detection (linkerd/linkerd2-proxy#843)
* Correct gRPC 'max-concurrency exhausted' error messages (linkerd/linkerd2-proxy#847)
* Update tonic to v0.4 (linkerd/linkerd2-proxy#849)
* failfast: Improve diagnostic logging (linkerd/linkerd2-proxy#848)
* Update the base docker image (linkerd/linkerd2-proxy#850)
* stack: Implement Clone for ResultService (linkerd/linkerd2-proxy#851)
* Ensure services in failfast can become ready (linkerd/linkerd2-proxy#858)
* tests: replace string matching on metrics with parsing (linkerd/linkerd2-proxy#859)
* Decouple tls::accept from TcpStream (linkerd/linkerd2-proxy#853)
* metrics: Handle NoPeerIdFromRemote properly (linkerd/linkerd2-proxy#857)
* metrics: Reorder metrics labels (linkerd/linkerd2-proxy#856)
* Rename tls::accept to tls::server (linkerd/linkerd2-proxy#854)
* Annotate socket-level errors with a scope (linkerd/linkerd2-proxy#852)
* test: reduce repetition in metrics tests (linkerd/linkerd2-proxy#860)
* tls: Disambiguate client and server identities (linkerd/linkerd2-proxy#855)
* Update to tower v0.4.4 (linkerd/linkerd2-proxy#864)
* Update cargo dependencies (linkerd/linkerd2-proxy#865)
* metrics: add `target_addr` label for accepted transport metrics (linkerd/linkerd2-proxy#861)
* outbound: Strip endpoint identity when disabled (linkerd/linkerd2-proxy#862)

---

The opaque-ports test has been updated to reflect proxy metrics changes.


olix0r requested a review from a team January 19, 2021 06:34

Update default inbound timeout to 300ms

5257b3e

hawkw approved these changes Jan 19, 2021


olix0r merged commit a5c06f2 into main Jan 19, 2021

olix0r deleted the ver/failfast branch January 19, 2021 18:19

olix0r mentioned this pull request Jan 21, 2021

proxy: v2.129.0 linkerd/linkerd2#5581

Merged
