Add exporting of Envoy metrics to HCP for linked clusters #370

Merged
merged 27 commits into from
Feb 12, 2024

Contributor

@jjti jjti commented Dec 19, 2023

Overview

This builds off the work of the Consul PR: hashicorp/consul#20257

We want to let users send Envoy metrics to HCP for use in service-level dashboards and the upcoming service topology map features, without requiring them to deploy a dedicated Consul Telemetry Collector.

The behavior added to Consul Dataplane in this PR is:

  1. Watch for changes to TelemetryState in Consul via the resource API
  2. On creation of a TelemetryState resource, create an HTTP client to push metrics to HCP's Consul metrics endpoint
  3. Start scraping Envoy's stats endpoint periodically (once per minute)
  4. Convert those Prometheus metrics to OTLP metrics
  5. POST the metrics to HCP
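
As a rough illustration of step 4, the sketch below parses Prometheus text exposition output (the format Envoy's admin interface serves at `/stats/prometheus`) into simple samples. It is not the PR's implementation: the `sample` type and `parseSamples` function are hypothetical names, the result is a simplified stand-in for OTLP data points, and label sets are kept as raw strings rather than being parsed. Scraping on a timer and the POST to HCP are omitted.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sample is a hypothetical, simplified stand-in for an OTLP data point.
type sample struct {
	Name   string
	Labels string // raw label set, e.g. `envoy_cluster_name="web"`
	Value  float64
}

// parseSamples reads Prometheus text exposition format and returns one sample
// per metric line. Comment lines (# HELP, # TYPE) and malformed lines are skipped.
func parseSamples(text string) []sample {
	var out []sample
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		name, labels, rest := line, "", ""
		if open := strings.Index(line, "{"); open >= 0 {
			end := strings.LastIndex(line, "}")
			if end < open {
				continue // unbalanced braces
			}
			name, labels, rest = line[:open], line[open+1:end], line[end+1:]
		} else if sp := strings.IndexAny(line, " \t"); sp >= 0 {
			name, rest = line[:sp], line[sp:]
		}
		// The first field after the name is the value; an optional
		// timestamp may follow and is ignored here.
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			continue
		}
		v, err := strconv.ParseFloat(fields[0], 64)
		if err != nil {
			continue
		}
		out = append(out, sample{Name: name, Labels: labels, Value: v})
	}
	return out
}

func main() {
	// Illustrative input only; real Envoy output is much larger.
	stats := "# TYPE envoy_cluster_upstream_rq_total counter\n" +
		"envoy_cluster_upstream_rq_total{envoy_cluster_name=\"web\"} 42\n" +
		"envoy_server_uptime 360\n"
	for _, s := range parseSamples(stats) {
		fmt.Printf("%s{%s} = %g\n", s.Name, s.Labels, s.Value)
	}
}
```

A real conversion would also need to carry the metric type from the `# TYPE` lines, since counters, gauges, and histograms map to different OTLP metric kinds.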

More details are available in the RFC: https://docs.google.com/document/d/1gh3_zUtdIDeDPIH7z3d0-kkx1scB2oZLi7yuYg88PK8

Testing

Unit/integration Testing

$ go test -timeout 30s -tags integration github.com/hashicorp/consul-dataplane/pkg/hcp/telemetry

ok  	github.com/hashicorp/consul-dataplane/pkg/hcp/telemetry	0.788s	coverage: 90.7% of statements
$ go test -timeout 30s -tags integration github.com/hashicorp/consul-dataplane/pkg/hcp/telemetry/otlphttp

ok  	github.com/hashicorp/consul-dataplane/pkg/hcp/telemetry/otlphttp	0.651s	coverage: 88.9% of statements

E2E Testing with HCP

To test this with real values, I:

  1. created a resource in consul-core by calling the resource API with grpcurl (using a real endpoint, filters, labels, and service principal)
  2. validated that metrics were pushed through the HCP metrics endpoint and available for querying
# create a resource manually
grpcurl -d @ \
  -plaintext \
  -protoset pkg/consul.protoset \
  127.0.0.1:8502 \
  hashicorp.consul.resource.ResourceService.Write \
<<EOF
  {
    "resource": {
      "id": {
        "type": {
          "group": "hcp",
          "group_version": "v2",
          "kind": "TelemetryState"
        },
        "name": "global"
      },
      "metadata": {
        "consul.io/hcp/telemetry-state/debug/skip-deletion": "true"
      },
      "data": {
        "@type": "types.googleapis.com/hashicorp.consul.hcp.v2.TelemetryState",
        "resource_id": "organization/de9edc98-06f2-45b3-96ba-e4b0ced62717/project/a2c4c2d9-0087-4a1b-a9d6-b6030d55ea5c/hashicorp.consul.global-network-manager.cluster/josh-test-cdp-metrics",
        "client_id": "foo",
        "client_secret": "barr",
        "hcp_config": {
          "auth_url": "https://auth.idp.hcp.dev"
        },
        "proxy": {
          "http_proxy": "http://192.168.0.135:8080"
        },
        "metrics": {
          "endpoint": "http://consul-telemetry.hcp.dev/otlp/v1/metrics",
          "include_list": [".+"],
          "disabled": false,
          "labels": {
            "foo": "bar"
          }
        }
      }
    }
  }
EOF
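
To make the `include_list` and `labels` fields in the payload above more concrete, here is a minimal sketch of how they might be applied to scraped metrics. It assumes that each `include_list` entry is a regular expression matched against the metric name and that the configured labels are merged into each metric's existing labels; neither behavior is spelled out in this description, and the function names (`compileIncludeList`, `included`, `withLabels`) are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// compileIncludeList compiles each include_list entry as a regular expression.
// Assumption: entries are regexes, as the ".+" value in the example suggests.
func compileIncludeList(patterns []string) ([]*regexp.Regexp, error) {
	out := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("invalid include_list entry %q: %w", p, err)
		}
		out = append(out, re)
	}
	return out, nil
}

// included reports whether name matches at least one compiled pattern.
func included(name string, res []*regexp.Regexp) bool {
	for _, re := range res {
		if re.MatchString(name) {
			return true
		}
	}
	return false
}

// withLabels returns a copy of base with extra merged in.
// Assumption: on a key conflict the configured (extra) label wins.
func withLabels(base, extra map[string]string) map[string]string {
	out := make(map[string]string, len(base)+len(extra))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range extra {
		out[k] = v
	}
	return out
}

func main() {
	res, err := compileIncludeList([]string{"^envoy_cluster_.+"})
	if err != nil {
		panic(err)
	}
	fmt.Println(included("envoy_cluster_upstream_rq_total", res))
	fmt.Println(included("envoy_server_uptime", res))
	fmt.Println(withLabels(map[string]string{"node.id": "n1"}, map[string]string{"foo": "bar"}))
}
```

With the `[".+"]` value used in the test resource, every non-empty metric name would pass the filter under this interpretation.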

E2E Testing with HCP and a Proxy Server

TelemetryState includes an HTTP proxy configuration for environments where egress to the internet (i.e., HCP) is restricted to a single egress point. This is similar to the Datadog agent's proxy settings.

message TelemetryState {
  ProxyConfig proxy = 5;
}

message ProxyConfig {
  // HttpProxy configures the http proxy to use for HTTP (non-TLS) requests.
  string http_proxy = 1;

  // HttpsProxy configures the http proxy to use for HTTPS (TLS) requests.
  string https_proxy = 2;

  // NoProxy lists domains whose requests should NOT be forwarded through the configured http proxy.
  repeated string no_proxy = 3;
}
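
One way these three fields could drive proxy selection in a Go HTTP client is sketched below. This is an illustration under assumptions, not the code in this PR: the `ProxyConfig` struct mirrors the proto message by hand, `proxyFor` and `newClient` are hypothetical helpers, and `no_proxy` entries are treated as exact hosts or domain suffixes, which is simpler than the matching rules real proxy libraries implement.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// ProxyConfig is a hand-written mirror of the proto message above.
type ProxyConfig struct {
	HTTPProxy  string
	HTTPSProxy string
	NoProxy    []string
}

// proxyFor returns the proxy URL to use for target, or "" for a direct
// connection. NoProxy entries match the host itself and its subdomains.
func proxyFor(cfg ProxyConfig, target string) (string, error) {
	u, err := url.Parse(target)
	if err != nil {
		return "", err
	}
	host := u.Hostname()
	for _, d := range cfg.NoProxy {
		d = strings.TrimPrefix(d, ".")
		if d != "" && (host == d || strings.HasSuffix(host, "."+d)) {
			return "", nil
		}
	}
	switch u.Scheme {
	case "http":
		return cfg.HTTPProxy, nil
	case "https":
		return cfg.HTTPSProxy, nil
	}
	return "", nil
}

// newClient builds an http.Client whose transport consults proxyFor for
// every outgoing request.
func newClient(cfg ProxyConfig) *http.Client {
	return &http.Client{Transport: &http.Transport{
		Proxy: func(r *http.Request) (*url.URL, error) {
			p, err := proxyFor(cfg, r.URL.String())
			if err != nil || p == "" {
				return nil, err // a nil URL means "connect directly"
			}
			return url.Parse(p)
		},
	}}
}

func main() {
	cfg := ProxyConfig{
		HTTPProxy: "http://192.168.0.135:8080",
		NoProxy:   []string{"internal.example"}, // hypothetical domain
	}
	p, err := proxyFor(cfg, "http://consul-telemetry.hcp.dev/otlp/v1/metrics")
	if err != nil {
		panic(err)
	}
	fmt.Println("proxy:", p)
	_ = newClient(cfg) // ready to be used for pushing metrics
}
```

Keeping the selection logic in a plain string-in, string-out function makes it easy to unit test without constructing requests.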

To test pushing of metrics through a proxy, I set up Nginx locally as a forward proxy:

http {
...
    server {
        listen       8080;
        server_name  192.168.0.135;

        location /otlp/v1/metrics {
            proxy_ssl_verify off;
            proxy_pass https://consul-telemetry.hcp.dev/otlp/v1/metrics;
        }
    }
}

I then confirmed that metrics were pushed through it to CTGW (visible both in the proxy logs and in metrics in Grafana):

192.168.0.135 - - [25/Jan/2024:19:56:51 -0500] "POST http://consul-telemetry.hcp.dev/otlp/v1/metrics HTTP/1.1" 200 23 "-" "consul-dataplane/1.4.0-dev (linux/arm64)" "-"
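
The same check can be approximated locally without Nginx. The sketch below is an assumption-laden illustration rather than part of this PR's test suite: it starts an `httptest` server as a stand-in proxy, POSTs through it with a client configured via `http.ProxyURL`, and returns the request line the proxy saw. The helper name `postViaProxy` and the `consul-telemetry.example` host are hypothetical; because the request goes to the proxy, that host is never resolved.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
)

// postViaProxy starts a local stand-in proxy, POSTs body to target through
// it, and returns the method and request URI the proxy received along with
// the response status code.
func postViaProxy(target, body string) (string, int, error) {
	seenCh := make(chan string, 1)
	proxy := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// For plain-HTTP targets a forward proxy receives the absolute URL.
		seenCh <- r.Method + " " + r.RequestURI
		io.Copy(io.Discard, r.Body)
		w.WriteHeader(http.StatusOK)
	}))
	defer proxy.Close()

	proxyURL, err := url.Parse(proxy.URL)
	if err != nil {
		return "", 0, err
	}
	client := &http.Client{Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)}}

	// OTLP over HTTP commonly uses the protobuf content type.
	resp, err := client.Post(target, "application/x-protobuf", strings.NewReader(body))
	if err != nil {
		return "", 0, err
	}
	defer resp.Body.Close()
	return <-seenCh, resp.StatusCode, nil
}

func main() {
	seen, status, err := postViaProxy("http://consul-telemetry.example/otlp/v1/metrics", "payload")
	if err != nil {
		panic(err)
	}
	fmt.Println(seen, status)
}
```

The request line captured by the stand-in proxy has the same absolute-URL shape as the one in the Nginx access log above.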

@jjti jjti force-pushed the feat/hcp-metrics branch 4 times, most recently from 132d863 to 02f8e13 December 20, 2023 21:08
@jjti jjti changed the title feat/hcp metrics wip / hcp telemetry Dec 22, 2023
@jjti jjti changed the title wip / hcp telemetry Support pushing of Envoy metrics to HCP Jan 26, 2024
@jjti jjti force-pushed the feat/hcp-metrics branch from 75ff9d2 to e9781f2 January 26, 2024 16:34
@jjti jjti changed the title Support pushing of Envoy metrics to HCP Add pushing of Envoy metrics to HCP Jan 26, 2024
@jjti jjti marked this pull request as ready for review January 26, 2024 16:48
@jjti jjti requested a review from a team as a code owner January 26, 2024 16:48
@johnbuonassisi johnbuonassisi left a comment

LGTM! Great work in here.

@jjti jjti force-pushed the feat/hcp-metrics branch from a38b851 to 559813e February 1, 2024 15:28
@jjti jjti changed the title Add pushing of Envoy metrics to HCP Add exporting of Envoy metrics to HCP for linked clusters Feb 1, 2024
@jjti jjti added the backport/1.4 Changes are backported to 1.4 label Feb 1, 2024
Contributor

@david-yu david-yu left a comment

Approved. We'd like to get this into 1.18 GA.

@loshz loshz merged commit 3bf86d3 into main Feb 12, 2024
34 checks passed
@loshz loshz deleted the feat/hcp-metrics branch February 12, 2024 20:10
loshz pushed a commit that referenced this pull request Feb 12, 2024
* Bump consul/proto-public and create mocks

* Revert some dep changes

* Add changelog

* Init metrics pushing from consul-dataplane

* Add node.id label

* Remove unused field

* Rename worker vars to exporter

* Return a *state in modState

* Bump dependency

* Remove pointer to exporter

* Add mocks in internal/mocks

* Fix test

* Restore directive

* Revert directive

* Rename metrics > scraper, remove unused changelog

* Fix WatchList for new consul api

* Infer auth endpoint if not in state

* debug: expose pprof/runtime metrics

* Handle snapshot end more gracefully

* Fix x-channel header

* Fix path registration

* Try w/o setting handlers

* Use a larger example stats file in envoy_admin_stats_prometheus

* Bump proto-public

* Remove pprof

---------

Co-authored-by: Dan Bond <danbond@protonmail.com>
(cherry picked from commit 3bf86d3)
loshz pushed a commit that referenced this pull request Feb 12, 2024
… into release/1.4.x (#422)

Add exporting of Envoy metrics to HCP for linked clusters (#370)

Co-authored-by: Dan Bond <danbond@protonmail.com>
(cherry picked from commit 3bf86d3)

Co-authored-by: Joshua Timmons <joshua.timmons1@gmail.com>
Labels
backport/1.4 Changes are backported to 1.4
5 participants