Ensure services in failfast can become ready #858
When a Service is in failfast, the inner service is only polled as new requests are processed. This means it's theoretically possible for certain service tasks to be starved.

This change ensures that these layers are paired with a `SpawnReady` layer to ensure that the inner service is always driven to readiness.

This could potentially explain behavior as described in linkerd/linkerd2#5183, though we don't have strong evidence to support that. This seems like a healthy defensive measure, in any case.

This change also improves stack commentary to favor larger descriptive comments over layer-level annotations.

While auditing services for readiness, an unnecessary buffer has been removed from the ingress HTTP stack.
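The starvation described above can be illustrated with a toy model. This is a minimal sketch, not the proxy's implementation: `Inner`, `FailFast`, and `drive_to_ready` are hypothetical stand-ins written against the standard library only, with no timeout and no async runtime. It assumes an inner service that needs several polls before it reports ready.

```rust
/// Toy inner service (hypothetical): it becomes ready only after it has
/// been polled `polls_left` times.
struct Inner {
    polls_left: u32,
}

impl Inner {
    /// Returns true once the service is ready. Each call makes progress.
    fn poll_ready(&mut self) -> bool {
        self.polls_left = self.polls_left.saturating_sub(1);
        self.polls_left == 0
    }
}

/// Toy failfast wrapper (hypothetical): the inner service is polled only
/// when a request arrives, and the request is rejected if it is not ready.
struct FailFast {
    inner: Inner,
}

impl FailFast {
    fn call(&mut self) -> Result<(), &'static str> {
        if self.inner.poll_ready() {
            Ok(())
        } else {
            Err("service in fail-fast")
        }
    }
}

/// Stand-in for a spawned readiness task: polls the inner service until it
/// is ready, independently of any request. Returns the number of polls.
fn drive_to_ready(inner: &mut Inner) -> u32 {
    let mut polls = 1;
    while !inner.poll_ready() {
        polls += 1;
    }
    polls
}
```

In this model, a `FailFast` wrapping an `Inner` that needs three polls rejects requests until enough of them have arrived to poll it to readiness, and makes no progress at all when no requests arrive. Calling `drive_to_ready` first plays the role the description assigns to `SpawnReady`: the next request succeeds regardless of how many requests came before it.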
this seems right to me, and +1 for the new commentary.
it would be nice to have more testing around failfast, but since we don't have a tight repro for the issues some people have been seeing, that can probably wait.
.push_on_response(
    svc::layers()
        .push(svc::layer::mk(svc::SpawnReady::new))
        .push(http::BoxRequest::layer())
just out of curiosity, why put the box above metrics now?
mostly I want the box as close to the inner service as possible, as it's responsible for satisfying the type signature of the inner service.
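The point about keeping the box next to the inner service can be sketched with a toy adapter. This is an illustration under assumptions, not the proxy's `http::BoxRequest`: the `Body` trait, `Inner`, and this `BoxRequest` are hypothetical and use only the standard library. The inner service accepts one concrete boxed body type, and the adapter directly above it erases whatever body type callers pass in.

```rust
/// Toy body abstraction (hypothetical; not the `http_body` trait).
trait Body {
    fn len(&self) -> usize;
}

impl Body for String {
    fn len(&self) -> usize {
        String::len(self)
    }
}

impl Body for Vec<u8> {
    fn len(&self) -> usize {
        Vec::len(self)
    }
}

/// Inner service whose signature requires a single, boxed body type.
struct Inner;

impl Inner {
    fn call(&self, body: Box<dyn Body>) -> usize {
        body.len()
    }
}

/// Adapter placed directly above `Inner`: it accepts any body type and
/// boxes it so that the inner service's signature is satisfied.
struct BoxRequest {
    inner: Inner,
}

impl BoxRequest {
    fn call<B: Body + 'static>(&self, body: B) -> usize {
        self.inner.call(Box::new(body))
    }
}
```

With the adapter adjacent to `Inner`, anything layered above it can stay generic over the body type, which is one reading of why the box sits as close to the inner service as possible.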
.push(svc::layer::mk(svc::SpawnReady::new))
.push(svc::FailFast::layer("HTTP Logical", dispatch_timeout))
TIOLI: might be worth having a "push spawn ready and failfast in one method" helper since we now want to pair them in most places. might also help avoid accidentally adding new failfasts without a spawn ready?
Agreed, but we don't have a nameable spawn ready layer at the moment.
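A rough sketch of what such a helper could look like once a nameable spawn-ready layer exists. Everything here is hypothetical and self-contained: the `Layer` and `Describe` traits, `SpawnReadyLayer`, `FailFastLayer`, and `spawn_ready_failfast` are toy stand-ins, not the proxy's or Tower's types. The sketch assumes that later pushes wrap earlier ones, so failfast ends up outside spawn-ready.

```rust
/// Hypothetical, simplified layer trait: wraps an inner service in an outer one.
trait Layer<S> {
    type Service;
    fn layer(&self, inner: S) -> Self::Service;
}

/// Toy services only report how they are nested.
trait Describe {
    fn describe(&self) -> String;
}

struct Endpoint;
impl Describe for Endpoint {
    fn describe(&self) -> String {
        "endpoint".to_string()
    }
}

struct SpawnReady<S>(S);
impl<S: Describe> Describe for SpawnReady<S> {
    fn describe(&self) -> String {
        format!("spawn_ready({})", self.0.describe())
    }
}

struct FailFast<S> {
    scope: &'static str,
    inner: S,
}
impl<S: Describe> Describe for FailFast<S> {
    fn describe(&self) -> String {
        format!("failfast[{}]({})", self.scope, self.inner.describe())
    }
}

/// A nameable layer type for the toy `SpawnReady` (an assumption).
struct SpawnReadyLayer;
impl<S> Layer<S> for SpawnReadyLayer {
    type Service = SpawnReady<S>;
    fn layer(&self, inner: S) -> SpawnReady<S> {
        SpawnReady(inner)
    }
}

struct FailFastLayer {
    scope: &'static str,
}
impl<S> Layer<S> for FailFastLayer {
    type Service = FailFast<S>;
    fn layer(&self, inner: S) -> FailFast<S> {
        FailFast { scope: self.scope, inner }
    }
}

/// Hypothetical helper: applies both layers in one call so that a failfast
/// is never added without a spawn-ready beneath it.
fn spawn_ready_failfast<S>(scope: &'static str, inner: S) -> FailFast<SpawnReady<S>> {
    FailFastLayer { scope }.layer(SpawnReadyLayer.layer(inner))
}
```

Because the pairing lives in one function, the return type itself records that the failfast wraps a spawn-ready service, which addresses the concern about adding a failfast without one by accident.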
.push(http::BoxRequest::layer()),
// Ensure individual endpoints are driven to readiness so that
// the balancer need not drive them all directly.
.push(svc::layer::mk(svc::SpawnReady::new)),
TIOLI: could be nice to add a `SpawnReady::layer`?
This release improves diagnostics about the proxy's failfast state:

* Warnings are now emitted when the failfast state is entered;
* The "max concurrency exhausted" gRPC message has been changed to more clearly indicate a failfast state error; and
* Failfast recovery has been made more robust, ensuring that a service can recover independently of new requests being received.

Furthermore, metric labeling has been improved:

* TCP server metrics are now annotated with the original `target_addr`;
* The `tls` label is now set to true for inbound TLS connections that lack a client ID. This is mostly helpful to clarify inbound metrics on the `identity` controller;
* Outbound `tls` metrics could be reported incorrectly when a proxy was configured to not use identity. This has been corrected.

Finally, socket-level errors now include a _client_ or _server_ prefix to indicate which side of the proxy encountered the error.

---

* stack: remove `map_response` (linkerd/linkerd2-proxy#835)
* replace `RequestFilter` with Tower's upstream impl (linkerd/linkerd2-proxy#842)
* tracing: fix incorrect field format when logging in JSON (linkerd/linkerd2-proxy#845)
* replace `FutureService` with Tower's upstream impl (linkerd/linkerd2-proxy#839)
* integration: improve tracing in tests (linkerd/linkerd2-proxy#846)
* service-profiles: Prevent Duration coercion panics (linkerd/linkerd2-proxy#844)
* inbound: Separate HTTP server logic from protocol detection (linkerd/linkerd2-proxy#843)
* Correct gRPC 'max-concurrency exhausted' error messages (linkerd/linkerd2-proxy#847)
* Update tonic to v0.4 (linkerd/linkerd2-proxy#849)
* failfast: Improve diagnostic logging (linkerd/linkerd2-proxy#848)
* Update the base docker image (linkerd/linkerd2-proxy#850)
* stack: Implement Clone for ResultService (linkerd/linkerd2-proxy#851)
* Ensure services in failfast can become ready (linkerd/linkerd2-proxy#858)
* tests: replace string matching on metrics with parsing (linkerd/linkerd2-proxy#859)
* Decouple tls::accept from TcpStream (linkerd/linkerd2-proxy#853)
* metrics: Handle NoPeerIdFromRemote properly (linkerd/linkerd2-proxy#857)
* metrics: Reorder metrics labels (linkerd/linkerd2-proxy#856)
* Rename tls::accept to tls::server (linkerd/linkerd2-proxy#854)
* Annotate socket-level errors with a scope (linkerd/linkerd2-proxy#852)
* test: reduce repetition in metrics tests (linkerd/linkerd2-proxy#860)
* tls: Disambiguate client and server identities (linkerd/linkerd2-proxy#855)
* Update to tower v0.4.4 (linkerd/linkerd2-proxy#864)
* Update cargo dependencies (linkerd/linkerd2-proxy#865)
* metrics: add `target_addr` label for accepted transport metrics (linkerd/linkerd2-proxy#861)
* outbound: Strip endpoint identity when disabled (linkerd/linkerd2-proxy#862)

---

The opaque-ports test has been updated to reflect proxy metrics changes.