
[Stack Monitoring] PoC for kibana instrumentation using opentelemetry metrics sdk #128755

Closed
5 tasks done
matschaffer opened this issue Mar 29, 2022 · 14 comments

Comments

@matschaffer
Contributor

matschaffer commented Mar 29, 2022

We discussed a number of possible implementations for ongoing kibana instrumentation in (internal) https://github.com/elastic/observability-dev/issues/2054

In this issue we'll build a proof of concept for how that might work.

Here are the two options we'd like to PoC. They should both be very similar at the code level; the main difference is the collection mechanism (pull from metricbeat vs push to apm-server).

option 2: OpenTelemetry Metrics API prometheus endpoint with Elastic Agent prometheus input

Here we use the official otel metrics sdk and expose that via prometheus protocol for elastic-agent to poll via the underlying metricbeat prometheus module.

graph LR

subgraph ElasticDeployment["Elastic Deployment"]
  subgraph kibana
    OtelMetricsSDK["Otel Metrics SDK"]
    OtelMetricsPrometheusExporter["/metrics (prometheus-protocol)"]
    OtelMetricsSDK-->OtelMetricsPrometheusExporter

    click OtelMetricsSDK "https://opentelemetry.io/docs/instrumentation/js/getting-started/nodejs/#metrics"
  end

  subgraph elastic-agent
    Metricbeat["metrics/prometheus"]
  end

  Metricbeat-->|"poll (prometheus protocol)"|OtelMetricsPrometheusExporter
  Metricbeat-->|_bulk|elasticsearch
end
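For reference, a minimal sketch of what the Kibana-side setup could look like with the experimental JS metrics packages of that era (package names and reader APIs moved around between 0.2x releases, and the meter/metric names below are illustrative, so treat this as a sketch rather than the actual PoC code):

```ts
import { MeterProvider } from '@opentelemetry/sdk-metrics-base';
import { PrometheusExporter } from '@opentelemetry/exporter-prometheus';

// The Prometheus exporter doubles as a metric reader and serves /metrics
// (default port 9464) for elastic-agent / metricbeat to scrape.
const exporter = new PrometheusExporter({ port: 9464 }, () => {
  console.log('prometheus scrape endpoint listening on :9464/metrics');
});

const meterProvider = new MeterProvider();
meterProvider.addMetricReader(exporter);

// Instruments are created once and shared; each add() call accumulates.
const meter = meterProvider.getMeter('kibana.monitoring_collection');
const ruleExecutions = meter.createCounter('kibana_rule_executions', {
  description: 'Number of rule executions',
});

ruleExecutions.add(1, { rule: 'example.rule-type' });
```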

option 3: OpenTelemetry Metrics API exported as OpenTelemetry Protocol

Here we use the official otel metrics sdk and push that via OpenTelemetry Protocol. OpenTelemetry Protocol is natively supported by Elastic APM so we use that to receive the data. There are some caveats for otel collection, but none of them should hinder the collection of platform observability metrics today.

Ideally this apm-server is managed by elastic-agent, but that work is still TBD. See 2022-01 - Elastic Agent Pipeline Runtime Environment for latest info.

graph LR

subgraph ElasticDeployment["Elastic Deployment"]
  subgraph kibana
    OtelMetricsSDK["Otel Metrics SDK"]
  end

  subgraph elastic-agent
    APMServer["apm-server"]
  end

  OtelMetricsSDK-->|"push (OTLP)"|APMServer["apm-server"]
  APMServer-->|_bulk|elasticsearch
end
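Again as a rough sketch only (assuming the experimental OTLP-over-HTTP metrics exporter of that era; the endpoint URL, token, interval, and metric names are placeholders, and the exact exporter/reader wiring shifted between 0.2x releases):

```ts
import {
  MeterProvider,
  PeriodicExportingMetricReader,
} from '@opentelemetry/sdk-metrics-base';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';

// apm-server accepts OTLP natively; secret-token auth is passed as a header.
const exporter = new OTLPMetricExporter({
  url: 'https://MY-MONITORING-CLUSTER.apm.us-west2.gcp.elastic-cloud.com/v1/metrics',
  headers: { Authorization: 'Bearer REDACTED' },
});

const meterProvider = new MeterProvider();
meterProvider.addMetricReader(
  new PeriodicExportingMetricReader({ exporter, exportIntervalMillis: 10000 })
);

const meter = meterProvider.getMeter('kibana.monitoring_collection');
const ruleExecutions = meter.createCounter('kibana_rule_executions');
ruleExecutions.add(1, { rule: 'example.rule-type' });
```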

Some consumers to keep in mind (see internal companion issue):

  • Stack Monitoring
  • High Level Health API
  • APM instrumentation of stack
  • Telemetry (Event based telemetry) - could maybe leave this as its own entity; the above are more critical to align

Steps

AC: Recording of PoC as walkthrough

@matschaffer matschaffer added the Team:Infra Monitoring UI and Feature:Stack Monitoring labels Mar 29, 2022
@elasticmachine
Contributor

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@matschaffer
Contributor Author

This should go in @elastic/infra-monitoring-ui cycle 9 once it gets created.

@matschaffer
Contributor Author

Reposting an early diagram from @chrisronline on how the kibana internal API might look.

Screen Shot 2022-04-12 at 9 24 43 AM

@chrisronline
Contributor

chrisronline commented Apr 14, 2022

Love this effort!

I'll just add some thoughts about the parts in the diagram above.

I don't have a strong opinion on the options listed in this issue, but I do want to stress the desire to make writes and reads as easy as possible for Kibana plugin owners. In an ideal world, they can directly use some OpenTelemetry SDK to write metrics (in my example above, I abstracted this detail away by adding write APIs to the monitoring_collection plugin, while keeping the terminology consistent with how it is standardized), and they have some easy way to read the metrics back and show them inside their UIs. Keep in mind that the data could live on a separate cluster, and plugin owners do need to know this in order to read it back.

The other part of this that I think is important to mention is how the Stack Monitoring plugin evolves as a result of this. IMO, it should turn into a pure read plugin that subscribes to the same read APIs that other Kibana plugins do. It still has a significant purpose because it is the place where users see metrics from a bird's-eye view, which is very helpful for correlating problems.

I know these things are probably on everyone's mind around this effort, but I don't see them mentioned explicitly, so I want to ensure we have a plan for this too.
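As a rough illustration of the write/read split described above (the interface and method names here are hypothetical, not the actual monitoring_collection API):

```ts
// Hypothetical shape of the abstraction: plugin owners write metrics without
// caring where they land, and read them back without knowing whether the data
// lives on the local cluster or on a separate monitoring cluster.
import type { Counter } from '@opentelemetry/api-metrics';

interface MonitoringCollectionWriteApi {
  /** Register a counter once at plugin setup; returns a shared instrument. */
  registerCounter(name: string, description?: string): Counter;
}

interface MonitoringCollectionReadApi {
  /**
   * Query metrics back for display in a plugin UI. The implementation decides,
   * based on "Platform Observability" configuration, whether to query the
   * local cluster or a remote monitoring cluster, so callers never need to.
   */
  queryMetrics(name: string, range: { from: string; to: string }): Promise<unknown[]>;
}
```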

@cyrille-leclerc
Contributor

have some easy way to read the metrics back and show them inside of their UIs - keep in mind that the location of the data could be on a separate cluster and plugin owners do need to know this in order to read the data back.

Did you consider an abstraction so that plugin authors would just have a read API on observability data and the location of these data (local versus remote Elasticsearch) would be injected by the "Platform Observability" configuration?

@chrisronline
Contributor

Did you consider an abstraction so that plugin authors would just have a read API on observability data and the location of these data (local versus remote Elasticsearch) would be injected by the "Platform Observability" configuration?

Exactly what I think we should do - in my model above, that's the other purpose of monitoring_collection. It serves a write abstraction (we could remove this if folks aren't a fan) and a read abstraction, allowing for a single point of configuration.

Now, how that configuration gets there is another story. Following the stack monitoring path, we'd just need to document the need to configure it appropriately, but maybe there is something fancy that Elastic Agent can do here - I'm not well versed in that area.

@matschaffer matschaffer self-assigned this May 30, 2022
@matschaffer
Contributor Author

For approach, I'm planning to try to replicate #123726 for a good metric comparison. If there's anything in that new response ops work that we can't do with the otel metrics space, we should highlight it as early as possible.

@matschaffer
Contributor Author

Noting that open-telemetry/opentelemetry-js#2929 is merged, so we may be able to use a >0.27 version here. That was the PR blocking grpc support in the 0.28 release. Current as of writing is 0.29.
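Since that PR unblocks the gRPC transport, the OTLP push in option 3 could also go over gRPC instead of HTTP. A hedged sketch, assuming the experimental @opentelemetry/exporter-metrics-otlp-grpc package of that era (the endpoint and token are placeholders; gRPC auth is carried as grpc-js Metadata rather than plain headers):

```ts
import { Metadata, credentials } from '@grpc/grpc-js';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';

// The secret token rides along as gRPC metadata on each export call.
const metadata = new Metadata();
metadata.set('Authorization', 'Bearer REDACTED');

const exporter = new OTLPMetricExporter({
  url: 'https://MY-MONITORING-CLUSTER.apm.us-west2.gcp.elastic-cloud.com:443',
  metadata,
  credentials: credentials.createSsl(),
});
```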

@matschaffer
Contributor Author

matschaffer commented May 31, 2022

So I got some data coming from something placed alongside:

this.inMemoryMetrics.increment(IN_MEMORY_METRICS.RULE_EXECUTIONS);

Issues so far:

  • The counter isn't incrementing; I'm just getting a ton of "1"s reported. I think I need to move counter initialization higher up, into the plugin initialization.
  • I have this.metrics.ruleExecutions.add(1, { rule: this.ruleType.id }); set, but it's not coming through as a label. I'll try my demo app to make sure this isn't a bug in apm-server 8.2.2.

Screen Shot 2022-05-31 at 14 18 59

@matschaffer
Contributor Author

Yeah, definitely need to move metric creation up. I put it in the TaskRunner constructor, but it looks like that probably gets created once for each rule evaluation.
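For illustration, the kind of restructuring implied here (names are hypothetical, and a globally registered MeterProvider is assumed, as in the sketches above): create the instrument once when the plugin sets up its meter, then have each task run add to that shared counter.

```ts
import { metrics } from '@opentelemetry/api-metrics';

// Created once, e.g. during the plugin's setup/start lifecycle, not per TaskRunner.
// (Assumes a MeterProvider has been registered as the global provider.)
const meter = metrics.getMeter('alerting');
const ruleExecutions = meter.createCounter('kibana_alerting_rule_executions');

// Each rule evaluation records against the same counter instance, so the
// cumulative value keeps incrementing instead of restarting at 1 every run.
export function recordRuleExecution(ruleTypeId: string) {
  ruleExecutions.add(1, { rule: ruleTypeId });
}
```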

@matschaffer
Contributor Author

Winning! I'll open up an initial PR so folks can play a little.

Screen Shot 2022-05-31 at 15 28 30

@matschaffer
Contributor Author

matschaffer commented May 31, 2022

Doc counts still seem really high. Not sure what's up with that.

Update: apm-server delivers once a minute, with event.ingested reflecting the otel export interval. The above screenshots bucket by @timestamp, so there are at least 6 docs per counter per minute.

@matschaffer
Contributor Author

matschaffer commented Jun 14, 2022

Success!

This is option 3 running in ESS by adding this to the kibana configuration:

monitoring_collection.opentelemetry.metrics:
  otlp:
    url: "https://MY-MONITORING-CLUSTER.apm.us-west2.gcp.elastic-cloud.com"
    headers:
      Authorization: "Bearer REDACTED"
  prometheus.enabled: true

Screen Shot 2022-06-14 at 16 18 15

The prometheus endpoint is active too:

Screen Shot 2022-06-14 at 16 20 27

I'm trying to see if I can get the ESS-included agent polling it, but not sure if that's possible. Might have to attach a self-managed agent.

@matschaffer
Contributor Author

We have a demo & notes posted internally (https://drive.google.com/file/d/1uAOvX9IXi5Y3D2QhrMu2pMm8yplxXxbn/view?usp=sharing) which I think meets the acceptance criteria for this issue.

The PoC PR is still open and I'll open new issues to work toward merging it as the conversation evolves.

@matschaffer matschaffer added the Platform Observability label Jun 16, 2022