Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics: Reorder metrics labels #856

Merged
merged 2 commits into from
Jan 19, 2021
Merged

metrics: Reorder metrics labels #856

merged 2 commits into from
Jan 19, 2021

Conversation

olix0r
Copy link
Member

@olix0r olix0r commented Jan 18, 2021

This is a superficial change that moves the direction label before the
authority label on HTTP endpoint metrics.

This is a superficial change that moves the `direction` label before the
`authority` label on HTTP endpoint metrics.
@olix0r olix0r requested a review from a team January 18, 2021 20:55
olix0r pushed a commit that referenced this pull request Jan 19, 2021
This branch changes the metrics integration tests so that we no longer
make assertions about metrics by searching the entire metrics scrape for
a specific string. Instead, we now have a simple builder API for
constructing a type that tries to parse the metrics scrape to find
individual metrics with certain label combinations and/or values.

Advantages of the new approach include:
- it's not ordering dependent --- changing the label ordering doesn't
  change whether or not the match succeeds. for example, changes
  like #856 won't break the tests.
- the presence of *additional* labels doesn't break the assertions
  (although we may want to add an "exact" matching mode in case we
  *do* want to assert that other labels are not present).
- metrics can easily be matched with and without values
- the failure reporting is *much* better. when the scrape doesn't
  contain the required metric, we can now report the metric we wanted
  in a nicely formatted way, and show only a list of the "similar"
  metrics that were found. with the previous code, panics included the
  entire metrics scrape, which was very hard to read and debug.

This should make it much easier to modify the tests to add new metric
labels or add existing ones, since we no longer depend on specific
hard-coded strings, but are instead making assertions about the
properties we actually want to find.

Example of the new errors:

```
thread 'tests::telemetry::response_classification::inbound_http' panicked at 'did not find a `response_total` metric
  with labels: ["authority=\"tele.test.svc.cluster.local\"", "direction=\"inbound\"", "tls=\"disabled\"", "status_code=\"200 OK\"", "classification=\"success\""]
found metrics: [
    "response_total{authority=\"tele.test.svc.cluster.local\",direction=\"inbound\",tls=\"disabled\",status_code=\"200\",classification=\"success\"} 1",
    "response_total{authority=\"tele.test.svc.cluster.local\",direction=\"inbound\",tls=\"disabled\",status_code=\"304\",classification=\"success\"} 1",
]
  retried 5 times (73.314652ms)', linkerd/app/integration/src/metrics.rs:128:51
stack backtrace:
    ... snip ...
```

These *could* probably be improved even more, for example, by
highlighting the specific label that was missing. However, this felt
like a pretty significant improvement for a first pass.

In a follow-up, I'll also reduce some of the code duplication between
the tests that are repeated for inbound and outbound. While working
on this change, I found that modifying the tests was much harder since
almost all of them were duplicated on the inbound and outbound halves
of the proxy. I'll merge that in a separate PR to reduce the size of
this diff, though.
@hawkw hawkw merged commit e6a1ee7 into main Jan 19, 2021
@hawkw hawkw deleted the ver/direction-labels branch January 19, 2021 22:00
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Jan 21, 2021
This release improves diagnostics about the proxy's failfast state:

* Warnings are now emitted when the failfast state is entered;
* The "max concurrency exhausted" gRPC message has been changed to
  more-clearly indicate a failfast state error; and
* Failfast recovery has been made more robust, ensuring that a service
  can recover indepenently of new requests being received.

Furthermore, metric labeling has been improved:

* TCP server metrics are now annotated with the original `target_addr`;
* The `tls` label is now set to true for inbound TLS connections that
  lack a client ID. This is mostly helpful to clarify inbound metrics on
  the `identity` controller;
* Outbound `tls` metrics could be reported incorrectly when a proxy was
  configured to not use identity. This has been corrected.

Finally, socket-level errors now include a _client_ or _server_ prefix
to indicate which side of the proxy encountered the error.

---

* stack: remove `map_response` (linkerd/linkerd2-proxy#835)
* replace `RequestFilter` with Tower's upstream impl (linkerd/linkerd2-proxy#842)
* tracing: fix incorrect field format when logging in JSON (linkerd/linkerd2-proxy#845)
* replace `FutureService` with Tower's upstream impl (linkerd/linkerd2-proxy#839)
* integration: improve tracing in tests (linkerd/linkerd2-proxy#846)
* service-profiles: Prevent Duration coercion panics (linkerd/linkerd2-proxy#844)
* inbound: Separate HTTP server logic from protocol detection (linkerd/linkerd2-proxy#843)
* Correct gRPC 'max-concurrency exhausted' error messages (linkerd/linkerd2-proxy#847)
* Update tonic to v0.4 (linkerd/linkerd2-proxy#849)
* failfast: Improve diagnostic logging (linkerd/linkerd2-proxy#848)
* Update the base docker image (linkerd/linkerd2-proxy#850)
* stack: Implement Clone for ResultService (linkerd/linkerd2-proxy#851)
* Ensure services in failfast can become ready (linkerd/linkerd2-proxy#858)
* tests: replace string matching on metrics with parsing (linkerd/linkerd2-proxy#859)
* Decouple tls::accept from TcpStream (linkerd/linkerd2-proxy#853)
* metrics: Handle NoPeerIdFromRemote properly (linkerd/linkerd2-proxy#857)
* metrics: Reorder metrics labels (linkerd/linkerd2-proxy#856)
* Rename tls::accept to tls::server (linkerd/linkerd2-proxy#854)
* Annotate socket-level errors with a scope (linkerd/linkerd2-proxy#852)
* test: reduce repetition in metrics tests (linkerd/linkerd2-proxy#860)
* tls: Disambiguate client and server identities (linkerd/linkerd2-proxy#855)
* Update to tower v0.4.4 (linkerd/linkerd2-proxy#864)
* Update cargo dependencies (linkerd/linkerd2-proxy#865)
* metrics: add `target_addr` label for accepted transport metrics (linkerd/linkerd2-proxy#861)
* outbound: Strip endpoint identity when disabled (linkerd/linkerd2-proxy#862)

---

The opaque-ports test has been updated to reflect proxy metrics changes.
olix0r added a commit to linkerd/linkerd2 that referenced this pull request Jan 21, 2021
This release improves diagnostics about the proxy's failfast state:

* Warnings are now emitted when the failfast state is entered;
* The "max concurrency exhausted" gRPC message has been changed to
  more-clearly indicate a failfast state error; and
* Failfast recovery has been made more robust, ensuring that a service
  can recover indepenently of new requests being received.

Furthermore, metric labeling has been improved:

* TCP server metrics are now annotated with the original `target_addr`;
* The `tls` label is now set to true for inbound TLS connections that
  lack a client ID. This is mostly helpful to clarify inbound metrics on
  the `identity` controller;
* Outbound `tls` metrics could be reported incorrectly when a proxy was
  configured to not use identity. This has been corrected.

Finally, socket-level errors now include a _client_ or _server_ prefix
to indicate which side of the proxy encountered the error.

---

* stack: remove `map_response` (linkerd/linkerd2-proxy#835)
* replace `RequestFilter` with Tower's upstream impl (linkerd/linkerd2-proxy#842)
* tracing: fix incorrect field format when logging in JSON (linkerd/linkerd2-proxy#845)
* replace `FutureService` with Tower's upstream impl (linkerd/linkerd2-proxy#839)
* integration: improve tracing in tests (linkerd/linkerd2-proxy#846)
* service-profiles: Prevent Duration coercion panics (linkerd/linkerd2-proxy#844)
* inbound: Separate HTTP server logic from protocol detection (linkerd/linkerd2-proxy#843)
* Correct gRPC 'max-concurrency exhausted' error messages (linkerd/linkerd2-proxy#847)
* Update tonic to v0.4 (linkerd/linkerd2-proxy#849)
* failfast: Improve diagnostic logging (linkerd/linkerd2-proxy#848)
* Update the base docker image (linkerd/linkerd2-proxy#850)
* stack: Implement Clone for ResultService (linkerd/linkerd2-proxy#851)
* Ensure services in failfast can become ready (linkerd/linkerd2-proxy#858)
* tests: replace string matching on metrics with parsing (linkerd/linkerd2-proxy#859)
* Decouple tls::accept from TcpStream (linkerd/linkerd2-proxy#853)
* metrics: Handle NoPeerIdFromRemote properly (linkerd/linkerd2-proxy#857)
* metrics: Reorder metrics labels (linkerd/linkerd2-proxy#856)
* Rename tls::accept to tls::server (linkerd/linkerd2-proxy#854)
* Annotate socket-level errors with a scope (linkerd/linkerd2-proxy#852)
* test: reduce repetition in metrics tests (linkerd/linkerd2-proxy#860)
* tls: Disambiguate client and server identities (linkerd/linkerd2-proxy#855)
* Update to tower v0.4.4 (linkerd/linkerd2-proxy#864)
* Update cargo dependencies (linkerd/linkerd2-proxy#865)
* metrics: add `target_addr` label for accepted transport metrics (linkerd/linkerd2-proxy#861)
* outbound: Strip endpoint identity when disabled (linkerd/linkerd2-proxy#862)

---

The opaque-ports test has been updated to reflect proxy metrics changes.
jijeesh pushed a commit to jijeesh/linkerd2 that referenced this pull request Mar 23, 2021
This release improves diagnostics about the proxy's failfast state:

* Warnings are now emitted when the failfast state is entered;
* The "max concurrency exhausted" gRPC message has been changed to
  more-clearly indicate a failfast state error; and
* Failfast recovery has been made more robust, ensuring that a service
  can recover indepenently of new requests being received.

Furthermore, metric labeling has been improved:

* TCP server metrics are now annotated with the original `target_addr`;
* The `tls` label is now set to true for inbound TLS connections that
  lack a client ID. This is mostly helpful to clarify inbound metrics on
  the `identity` controller;
* Outbound `tls` metrics could be reported incorrectly when a proxy was
  configured to not use identity. This has been corrected.

Finally, socket-level errors now include a _client_ or _server_ prefix
to indicate which side of the proxy encountered the error.

---

* stack: remove `map_response` (linkerd/linkerd2-proxy#835)
* replace `RequestFilter` with Tower's upstream impl (linkerd/linkerd2-proxy#842)
* tracing: fix incorrect field format when logging in JSON (linkerd/linkerd2-proxy#845)
* replace `FutureService` with Tower's upstream impl (linkerd/linkerd2-proxy#839)
* integration: improve tracing in tests (linkerd/linkerd2-proxy#846)
* service-profiles: Prevent Duration coercion panics (linkerd/linkerd2-proxy#844)
* inbound: Separate HTTP server logic from protocol detection (linkerd/linkerd2-proxy#843)
* Correct gRPC 'max-concurrency exhausted' error messages (linkerd/linkerd2-proxy#847)
* Update tonic to v0.4 (linkerd/linkerd2-proxy#849)
* failfast: Improve diagnostic logging (linkerd/linkerd2-proxy#848)
* Update the base docker image (linkerd/linkerd2-proxy#850)
* stack: Implement Clone for ResultService (linkerd/linkerd2-proxy#851)
* Ensure services in failfast can become ready (linkerd/linkerd2-proxy#858)
* tests: replace string matching on metrics with parsing (linkerd/linkerd2-proxy#859)
* Decouple tls::accept from TcpStream (linkerd/linkerd2-proxy#853)
* metrics: Handle NoPeerIdFromRemote properly (linkerd/linkerd2-proxy#857)
* metrics: Reorder metrics labels (linkerd/linkerd2-proxy#856)
* Rename tls::accept to tls::server (linkerd/linkerd2-proxy#854)
* Annotate socket-level errors with a scope (linkerd/linkerd2-proxy#852)
* test: reduce repetition in metrics tests (linkerd/linkerd2-proxy#860)
* tls: Disambiguate client and server identities (linkerd/linkerd2-proxy#855)
* Update to tower v0.4.4 (linkerd/linkerd2-proxy#864)
* Update cargo dependencies (linkerd/linkerd2-proxy#865)
* metrics: add `target_addr` label for accepted transport metrics (linkerd/linkerd2-proxy#861)
* outbound: Strip endpoint identity when disabled (linkerd/linkerd2-proxy#862)

---

The opaque-ports test has been updated to reflect proxy metrics changes.

Signed-off-by: Jijeesh <jijeesh.ka@gmail.com>
jijeesh pushed a commit to jijeesh/linkerd2 that referenced this pull request Apr 21, 2021
This release improves diagnostics about the proxy's failfast state:

* Warnings are now emitted when the failfast state is entered;
* The "max concurrency exhausted" gRPC message has been changed to
  more-clearly indicate a failfast state error; and
* Failfast recovery has been made more robust, ensuring that a service
  can recover indepenently of new requests being received.

Furthermore, metric labeling has been improved:

* TCP server metrics are now annotated with the original `target_addr`;
* The `tls` label is now set to true for inbound TLS connections that
  lack a client ID. This is mostly helpful to clarify inbound metrics on
  the `identity` controller;
* Outbound `tls` metrics could be reported incorrectly when a proxy was
  configured to not use identity. This has been corrected.

Finally, socket-level errors now include a _client_ or _server_ prefix
to indicate which side of the proxy encountered the error.

---

* stack: remove `map_response` (linkerd/linkerd2-proxy#835)
* replace `RequestFilter` with Tower's upstream impl (linkerd/linkerd2-proxy#842)
* tracing: fix incorrect field format when logging in JSON (linkerd/linkerd2-proxy#845)
* replace `FutureService` with Tower's upstream impl (linkerd/linkerd2-proxy#839)
* integration: improve tracing in tests (linkerd/linkerd2-proxy#846)
* service-profiles: Prevent Duration coercion panics (linkerd/linkerd2-proxy#844)
* inbound: Separate HTTP server logic from protocol detection (linkerd/linkerd2-proxy#843)
* Correct gRPC 'max-concurrency exhausted' error messages (linkerd/linkerd2-proxy#847)
* Update tonic to v0.4 (linkerd/linkerd2-proxy#849)
* failfast: Improve diagnostic logging (linkerd/linkerd2-proxy#848)
* Update the base docker image (linkerd/linkerd2-proxy#850)
* stack: Implement Clone for ResultService (linkerd/linkerd2-proxy#851)
* Ensure services in failfast can become ready (linkerd/linkerd2-proxy#858)
* tests: replace string matching on metrics with parsing (linkerd/linkerd2-proxy#859)
* Decouple tls::accept from TcpStream (linkerd/linkerd2-proxy#853)
* metrics: Handle NoPeerIdFromRemote properly (linkerd/linkerd2-proxy#857)
* metrics: Reorder metrics labels (linkerd/linkerd2-proxy#856)
* Rename tls::accept to tls::server (linkerd/linkerd2-proxy#854)
* Annotate socket-level errors with a scope (linkerd/linkerd2-proxy#852)
* test: reduce repetition in metrics tests (linkerd/linkerd2-proxy#860)
* tls: Disambiguate client and server identities (linkerd/linkerd2-proxy#855)
* Update to tower v0.4.4 (linkerd/linkerd2-proxy#864)
* Update cargo dependencies (linkerd/linkerd2-proxy#865)
* metrics: add `target_addr` label for accepted transport metrics (linkerd/linkerd2-proxy#861)
* outbound: Strip endpoint identity when disabled (linkerd/linkerd2-proxy#862)

---

The opaque-ports test has been updated to reflect proxy metrics changes.

Signed-off-by: Jijeesh <jijeesh.ka@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants