Backport of Add exporting of Envoy metrics to HCP for linked clusters into release/1.4.x #422

Merged


hc-github-team-consul-core
Collaborator

Backport

This PR is auto-generated from #370 to be assessed for backporting due to the inclusion of the label backport/1.4.

🚨 Warning: automatic cherry-pick of commits failed. If the first commit failed,
you will see a blank no-op commit below. If at least one commit succeeded, you
will see the cherry-picked commits up to, not including, the commit where
the merge conflict occurred.

The person who merged in the original PR is:
@loshz
This person should manually cherry-pick the original PR into a new backport PR,
and close this one when the manual backport PR is merged in.

merge conflict error: POST https://api.github.com/repos/hashicorp/consul-dataplane/merges: 409 Merge conflict []

The below text is copied from the body of the original PR.


Overview

This builds off the work of the Consul PR: hashicorp/consul#20257

We want to let users send Envoy metrics to HCP for use in service-level dashboards and the upcoming service topology map features, without requiring them to deploy a dedicated Consul Telemetry Collector.

The behavior added to Consul Dataplane in this PR is:

  1. Watch for changes to TelemetryState in Consul via the resource API
  2. On creation of a TelemetryState resource, create an HTTP client to push metrics to HCP's Consul metrics endpoint
  3. Periodically (once per minute) scrape Envoy's stats endpoint
  4. Convert those Prometheus metrics to OTLP metrics
  5. POST the metrics to HCP

[Image: Screenshot 2024-01-26 at 11 21 03 AM]

More details are available in the RFC: https://docs.google.com/document/d/1gh3_zUtdIDeDPIH7z3d0-kkx1scB2oZLi7yuYg88PK8
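
The scrape, convert, and push steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `exportOnce`, `run`, `httpScraper`, and `httpPusher` are hypothetical, the Prometheus-to-OTLP conversion is replaced by a pass-through placeholder, and two local test servers stand in for Envoy's admin endpoint (`/stats/prometheus`) and the HCP metrics endpoint so that the sketch runs on its own.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// Hypothetical function types for the three pipeline stages.
type (
	scrapeFunc  func() ([]byte, error)
	convertFunc func(prom []byte) ([]byte, error)
	pushFunc    func(payload []byte) error
)

// exportOnce runs a single scrape -> convert -> push cycle.
func exportOnce(scrape scrapeFunc, convert convertFunc, push pushFunc) error {
	prom, err := scrape()
	if err != nil {
		return fmt.Errorf("scrape: %w", err)
	}
	payload, err := convert(prom)
	if err != nil {
		return fmt.Errorf("convert: %w", err)
	}
	if err := push(payload); err != nil {
		return fmt.Errorf("push: %w", err)
	}
	return nil
}

// run repeats exportOnce on every tick until ctx is cancelled.
func run(ctx context.Context, interval time.Duration, s scrapeFunc, c convertFunc, p pushFunc) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := exportOnce(s, c, p); err != nil {
				fmt.Println("export failed:", err)
			}
		}
	}
}

// httpScraper GETs Prometheus-format stats text from statsURL.
func httpScraper(ctx context.Context, c *http.Client, statsURL string) scrapeFunc {
	return func() ([]byte, error) {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, statsURL, nil)
		if err != nil {
			return nil, err
		}
		resp, err := c.Do(req)
		if err != nil {
			return nil, err
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return nil, fmt.Errorf("unexpected status %s", resp.Status)
		}
		return io.ReadAll(resp.Body)
	}
}

// httpPusher POSTs an already-encoded payload to endpoint.
func httpPusher(ctx context.Context, c *http.Client, endpoint string) pushFunc {
	return func(payload []byte) error {
		req, err := http.NewRequestWithContext(ctx, http.MethodPost, endpoint, bytes.NewReader(payload))
		if err != nil {
			return err
		}
		resp, err := c.Do(req)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode/100 != 2 {
			return fmt.Errorf("unexpected status %s", resp.Status)
		}
		return nil
	}
}

func main() {
	// Stand-in servers (assumptions) so the sketch runs without Envoy or HCP.
	envoy := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "envoy_server_live 1")
	}))
	defer envoy.Close()
	hcp := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)
		fmt.Printf("received %d bytes\n", len(body))
	}))
	defer hcp.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 350*time.Millisecond)
	defer cancel()
	client := &http.Client{Timeout: 5 * time.Second}
	passThrough := func(b []byte) ([]byte, error) { return b, nil } // placeholder for Prometheus -> OTLP
	// A short interval is used here; the PR describes a one-minute cadence.
	run(ctx, 100*time.Millisecond,
		httpScraper(ctx, client, envoy.URL+"/stats/prometheus"),
		passThrough,
		httpPusher(ctx, client, hcp.URL+"/otlp/v1/metrics"))
}
```

Splitting the cycle into three function values keeps each stage testable in isolation, which is one plausible way to reach the kind of unit coverage reported under Testing.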

Testing

Unit/integration Testing

$ go test -timeout 30s -tags integration github.com/hashicorp/consul-dataplane/pkg/hcp/telemetry
ok  	github.com/hashicorp/consul-dataplane/pkg/hcp/telemetry	0.788s	coverage: 90.7% of statements

$ go test -timeout 30s -tags integration github.com/hashicorp/consul-dataplane/pkg/hcp/telemetry/otlphttp
ok  	github.com/hashicorp/consul-dataplane/pkg/hcp/telemetry/otlphttp	0.651s	coverage: 88.9% of statements

E2E Testing with HCP

To test this with real values, I:

  1. Created a resource in consul-core by calling the resource API with grpcurl (using a real endpoint, filters, labels, and service principal)
  2. Validated that metrics were pushed through the HCP metrics endpoint and were available for querying

# create a resource manually
grpcurl -d @ \
  -plaintext \
  -protoset pkg/consul.protoset \
  127.0.0.1:8502 \
  hashicorp.consul.resource.ResourceService.Write \
<<EOF
  {
    "resource": {
      "id": {
        "type": {
          "group": "hcp",
          "group_version": "v2",
          "kind": "TelemetryState"
        },
        "name": "global"
      },
      "metadata": {
        "consul.io/hcp/telemetry-state/debug/skip-deletion": "true"
      },
      "data": {
        "@type": "types.googleapis.com/hashicorp.consul.hcp.v2.TelemetryState",
        "resource_id": "organization/de9edc98-06f2-45b3-96ba-e4b0ced62717/project/a2c4c2d9-0087-4a1b-a9d6-b6030d55ea5c/hashicorp.consul.global-network-manager.cluster/josh-test-cdp-metrics",
        "client_id": "foo",
        "client_secret": "barr",
        "hcp_config": {
          "auth_url": "https://auth.idp.hcp.dev"
        },
        "proxy": {
          "http_proxy": "http://192.168.0.135:8080"
        },
        "metrics": {
          "endpoint": "http://consul-telemetry.hcp.dev/otlp/v1/metrics",
          "include_list": [".+"],
          "disabled": false,
          "labels": {
            "foo": "bar"
          }
        }
      }
    }
  }
EOF

[Image: Screenshot 2023-12-22 at 12 23 20 PM]
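
As a rough illustration of how the `include_list` and `labels` fields in the resource above might be applied, the sketch below treats each `include_list` entry as an unanchored regular expression matched against the metric name and copies the configured labels onto each metric without overwriting labels the metric already has. Both behaviors are assumptions made for this example, as are the names `filterNames` and `withLabels`; the PR body does not spell out the matching or merge semantics.

```go
package main

import (
	"fmt"
	"regexp"
)

// filterNames keeps the metric names that match at least one pattern.
// Assumption: patterns are regular expressions, matched unanchored.
func filterNames(names, patterns []string) ([]string, error) {
	filters := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("invalid include_list entry %q: %w", p, err)
		}
		filters = append(filters, re)
	}
	var kept []string
	for _, n := range names {
		for _, re := range filters {
			if re.MatchString(n) {
				kept = append(kept, n)
				break
			}
		}
	}
	return kept, nil
}

// withLabels returns a copy of metricLabels with the configured labels added.
// Assumption: labels already present on the metric take precedence.
func withLabels(metricLabels, configured map[string]string) map[string]string {
	out := make(map[string]string, len(metricLabels)+len(configured))
	for k, v := range configured {
		out[k] = v
	}
	for k, v := range metricLabels {
		out[k] = v
	}
	return out
}

func main() {
	names := []string{"envoy_cluster_upstream_rq_total", "envoy_server_live"}

	// ".+" (the value used in the resource above) keeps every non-empty name.
	all, err := filterNames(names, []string{".+"})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(all)) // prints 2

	labels := withLabels(map[string]string{"node.id": "abc"}, map[string]string{"foo": "bar"})
	fmt.Println(labels["foo"], labels["node.id"]) // prints bar abc
}
```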

E2E Testing with HCP and a Proxy Server

TelemetryState has an HTTP proxy setting for use in environments where egress to the internet (i.e., HCP) is restricted to a single point of egress. This is similar to the Datadog agent's proxy settings.

message TelemetryState {
  ProxyConfig proxy = 5;
}

message ProxyConfig {
  // HttpProxy configures the http proxy to use for HTTP (non-TLS) requests.
  string http_proxy = 1;

  // HttpsProxy configures the http proxy to use for HTTPS (TLS) requests.
  string https_proxy = 2;

  // NoProxy can be configured to include domains which should NOT be forwarded through the configured http proxy
  repeated string no_proxy = 3;
}
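
To make the three fields concrete, here is a minimal sketch of how a client could pick a proxy from such a config. It is not the implementation in this PR: the Go struct `ProxyConfig` and the helpers `proxyFor` and `transportFor` are hypothetical, and the `no_proxy` matching (exact host or subdomain suffix) is a simplifying assumption. Go's `golang.org/x/net/http/httpproxy` package provides comparable, more complete logic.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// ProxyConfig mirrors the protobuf message above (hypothetical Go shape).
type ProxyConfig struct {
	HTTPProxy  string
	HTTPSProxy string
	NoProxy    []string
}

// proxyFor returns the proxy URL to use for a request to scheme://host,
// or "" when the request should go direct.
func proxyFor(cfg ProxyConfig, scheme, host string) string {
	for _, d := range cfg.NoProxy {
		d = strings.TrimPrefix(d, ".")
		// Assumption: an entry matches the host itself and any subdomain.
		if d != "" && (host == d || strings.HasSuffix(host, "."+d)) {
			return ""
		}
	}
	if scheme == "https" {
		return cfg.HTTPSProxy
	}
	return cfg.HTTPProxy
}

// transportFor wires proxyFor into an http.Transport.
func transportFor(cfg ProxyConfig) *http.Transport {
	return &http.Transport{
		Proxy: func(req *http.Request) (*url.URL, error) {
			p := proxyFor(cfg, req.URL.Scheme, req.URL.Hostname())
			if p == "" {
				return nil, nil // nil URL means "no proxy"
			}
			return url.Parse(p)
		},
	}
}

func main() {
	cfg := ProxyConfig{
		HTTPProxy: "http://192.168.0.135:8080",
		NoProxy:   []string{"internal.example"}, // hypothetical domain
	}
	// The metrics endpoint in the test resource uses plain HTTP, so http_proxy applies.
	fmt.Println(proxyFor(cfg, "http", "consul-telemetry.hcp.dev")) // prints http://192.168.0.135:8080

	client := &http.Client{Transport: transportFor(cfg)}
	_ = client // a real exporter would use this client to POST metrics
}
```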

To test pushing of metrics through a proxy, I set up Nginx locally as a forward proxy:

http {
...
    server {
        listen       8080;
        server_name  192.168.0.135;

        location /otlp/v1/metrics {
            proxy_ssl_verify off;
            proxy_pass https://consul-telemetry.hcp.dev/otlp/v1/metrics;
        }
    }
}

And confirmed that metrics were pushed through it to CTGW (both in the logs and in metrics in Grafana):

192.168.0.135 - - [25/Jan/2024:19:56:51 -0500] "POST http://consul-telemetry.hcp.dev/otlp/v1/metrics HTTP/1.1" 200 23 "-" "consul-dataplane/1.4.0-dev (linux/arm64)" "-"

Overview of commits

@hashicorp-cla
hashicorp-cla commented Feb 12, 2024

CLA assistant check
All committers have signed the CLA.

Auto approved Consul Bot automated PR

* Bump consul/proto-public and create mocks

* Revert some dep changes

* Add changelog

* Init metrics pushing from consul-dataplane

* Add node.id label

* Remove unused field

* Rename worker vars to exporter

* Return a *state in modState

* Bump dependency

* Remove pointer to exporter

* Add mocks in internal/mocks

* Fix test

* Restore directive

* Revert directive

* Rename metrics > scraper, remove unused changelog

* Fix WatchList for new consul api

* Infer auth endpoint if not in state

* debug: expose pprof/runtime metrics

* Handle snapshot end more gracefully

* Fix x-channel header

* Fix path registration

* Try w/o setting handlers

* Use a larger example stats file in envoy_admin_stats_prometheus

* Bump proto-public

* Remove pprof

---------

Co-authored-by: Dan Bond <danbond@protonmail.com>
(cherry picked from commit 3bf86d3)
@loshz loshz force-pushed the backport/feat/hcp-metrics/obviously-desired-mouse branch from 09bc74c to 54d6463 Compare February 12, 2024 20:15
@loshz loshz marked this pull request as ready for review February 12, 2024 20:15
@loshz loshz requested a review from a team as a code owner February 12, 2024 20:15
@loshz loshz enabled auto-merge (squash) February 12, 2024 20:16
@loshz loshz merged commit 19609ae into release/1.4.x Feb 12, 2024
23 checks passed
@loshz loshz deleted the backport/feat/hcp-metrics/obviously-desired-mouse branch February 12, 2024 20:17
5 participants