Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

POC: OpenTelemetry metrics rework #3448

Closed
wants to merge 19 commits into from
Closed

Conversation

timwoj
Copy link
Member

@timwoj timwoj commented Nov 9, 2023

This is a draft PR for a rework I’ve been working on of our telemetry subsystem. The intention for the PR is to discuss the current code structure and whether anything needs to change before finishing off the rest of the implementation.

The goal here is to move the telemetry code out of the Broker framework into a framework of its own, using OpenTelemetry as the underlying transport framework. The unit tests all pass but the btests are currently failing in a lot of places. The failures are mostly for common reasons (especially for histograms, the bucket comment below).

What’s implemented:

  • Basic family and instrument infrastructure. The existing instrument types are all there with minimal changes to the API for using them.
  • Script-level family and instrument API. This exists just as it does in the Broker version, with no changes.
  • Script-level collection API. This changes quite a bit, and will be described below.

What’s features are still missing:

  • A thread that reads process state (cpu, memory, etc) and reports it to prometheus. I would have liked to use OTel’s observable instruments for this but ran into some snags. The primary one is that each observable only ties to a single metric, so you’d need a separate observer thread for each of cpu, memory, and so on. This leads to having to read that system information multiple times, or having a separate tread to read it once on a timer and store it. At that point, there might as well just be a thread that reads it and sets non-observable instruments.
  • Aggregation of metric data to a single node. I could see this being implemented one of two ways. We could simply make each of the methods in the instrument classes (Counter, Gauge, and Histogram) write the current value to a topic whenever something changes. This could lead to a lot of extra event traffic though, and may not be terribly efficient. A second method would be to implement an OTel exporter, add it to a periodic metric reader, and have it export the values to the topic on a timer. We’d need to figure out how to set the values in the instruments directly, since the API OTel exposes only allows you to add to the value.
  • The endpoint attribute that is added to every metric in the Broker implementation is missing.

Some broken things in OTel:

  • Histogram buckets aren’t being set correctly. This is a known issue on OTel’s side, there’s an issue open, and we’re waiting for them to fix it.
  • OTel doesn’t handle the kSum aggregation type correctly when it comes to their Prometheus exporter, so it doesn’t set the _total suffix on the metric name before exporting. This is already fixed in an open PR that will hopefully land before the next OTel release.
  • OTel doesn’t handle blank unit values correctly in the Prometheus exporter and always appends the underscore. This is also fixed in the same PR as above. We’re handling both of these two on our side already, but the workaround could be removed once it’s fixed upstream.

API Changes and proposed changes:

  • All of the objects stored by the telemetry framework changed to std::shared_ptrs, including the data passed to OpaqueVals for script land. This required a few changes around the rest of the code base, but nothing really significant.
  • The script-level collection API changes from multiple C++ methods to a single method. Previously we would collect a list of metrics that match a filter, return them back to the BIF, then for each of those collect a RecordVal for the data and store them in a VectorVal in the BIF itself. This changes to a single method you can call in the manager that returns the VectorVal containing all of the matching records directly instead of the intermediate step.
  • I’m fairly certain that the placeholder “1” value for units can be removed and the default value for the methods that can take it can instead change to a blank string. OTel will handle the blank string correctly, so we’d have to change it on our side as it is.

Builds on Windows are currently failing because of Conan dependency issues.

@timwoj timwoj force-pushed the topic/timw/open-telemetry branch 2 times, most recently from 2da4d00 to 8c7ef63 Compare November 10, 2023 16:11
@timwoj timwoj added the fullci Run the full set of tasks in CI label Nov 10, 2023
@timwoj timwoj marked this pull request as draft November 10, 2023 16:54
@timwoj timwoj force-pushed the topic/timw/open-telemetry branch 7 times, most recently from 5b1e4bd to f1fbfef Compare November 13, 2023 22:22
@timwoj timwoj changed the title RFC: OpenTelemetry POC: OpenTelemetry Nov 14, 2023
@timwoj timwoj changed the title POC: OpenTelemetry POC: OpenTelemetry metrics rework Nov 14, 2023
@timwoj timwoj force-pushed the topic/timw/open-telemetry branch 7 times, most recently from f80d41a to 11ff00a Compare November 18, 2023 00:00
@timwoj timwoj removed the fullci Run the full set of tasks in CI label Nov 20, 2023
@timwoj timwoj force-pushed the topic/timw/open-telemetry branch 3 times, most recently from 25f564b to df5d2d2 Compare November 22, 2023 17:24
Copy link
Contributor

@awelzel awelzel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I skimmed through and tried some things, left some comments. I can't say I fully understand all the OTEL concepts and also have ignored the histogram situation, but otherwise not too intimidating :-)

I left a comment about performance of Inc(), probably need to figure that out. Otherwise no separate stats thread and making callback based metrics first-class would be interesting. I'm not overly fond of the "update something ever x seconds" which may lead to stale metrics.

src/telemetry/Counter.h Outdated Show resolved Hide resolved
src/telemetry/Gauge.h Outdated Show resolved Hide resolved
src/telemetry/ProcessStats.h Show resolved Hide resolved
src/telemetry/ProcessStats.cc Outdated Show resolved Hide resolved
src/telemetry/ProcessStats.cc Show resolved Hide resolved
src/telemetry/Manager.cc Outdated Show resolved Hide resolved
src/telemetry/ProcessStats.cc Outdated Show resolved Hide resolved
src/telemetry/Manager.cc Show resolved Hide resolved

/**
* Increments the value by @p amount.
* @pre `amount >= 0`
*/
void Inc(int64_t amount) noexcept { broker::telemetry::inc(hdl, amount); }
void Inc(BaseType amount) noexcept {
handle->Add(amount, attributes);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I smoke tested performance locally with a 200k SYN packet pcap and zeek -r time goes from ~10sec to ~11.3sec, Not crazy, but seems too much. There's opentelemetry stacks showing in flamegraphs for the EventHandler and log write counting and session handling significantly.

Is it possible to get access to the counter directly? Skimming the stacks it appears there's a lookup done in some hashtable for each Add() call. Or alternatively switch to a Callback based / Observable instrument exposing the value that's kept. Otherwise, this is introducing significant overhead even if metrics aren't used.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flamegraph with "telemetry" highlighted.

Screenshot-20231123-164219

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW: caf::telemetry matches 0.2% of samples on current master.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a lookup in the GetOrAdd methods that does a comparison across all of the existing counters for a matching set of labels. That's probably what it is, without having actually looked at the flamegraph. In some cases we do store the counters locally (session::Manager, for one) but we don't do that in every case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I took it as something within handle->Add(amount, attributes). opentelemetry needs to internally find the right counter of the given labels/attributes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm looking at a flamegraph of it now. I might have to open an issue on their side, as this dives really quickly down into their code. MetricAttributeIterable::ForEachKeyValue is our code. I might be able to optimize the loop itself by storing the values locally as opentelemetry::common::AttributeValue objects instead of strings. That'd avoid a conversion every time we do a lookup, but ForEachKeyValue is dominated mostly by the call to the callback method passed as an argument. That callback function comes from OTEL itself so we don't have much control over its performance.

Copy link
Member Author

@timwoj timwoj Dec 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ec5b2ab cuts down on it a bit by storing a secondary map that are views into the existing string map, but in the otel types so it doesn't have to do conversions. With the 200k syns pcap we use on os-perf-1, it cuts down the contribution of zeek::telemetry from 11.5% to 9.3%. There's still a performance hit from the function call though. I asked on their Slack channel what else we can do about it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I talked to one of the OTEL guys and shared a flamegraph with him. He agreed that it's a performance issue for it to be rehashing the map every time, even though the map hasn't changed at all since the last time. He said he'd look into it as part of some other performance issues he's investigating soon.

src/telemetry/Counter.cc Show resolved Hide resolved
@awelzel
Copy link
Contributor

awelzel commented Nov 23, 2023

The endpoint attribute that is added to every metric in the Broker implementation is missing.

But it does add {otel_scope_name} - would be nice if that was suppressed :-)

# HELP zeek_log_stream_writes_total_ Total number of log writes for the given stream.                                                                                                                                                                                                                                                                                                                                                                                                                                                                 │  CPUs which run at the same hardware frequency: 7
# TYPE zeek_log_stream_writes_total_ counter                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          │  CPUs which need to have their frequency coordinated by software: 7
zeek_log_stream_writes_total_{endpoint="",module="PacketFilter",stream="PacketFilter::LOG",otel_scope_name="zeek"} 1                                                                                                                                                                                                                                                                                                                                                                                                                                  │  maximum transition latency: 20.0 us.
# TYPE process_open_fds_ gauge                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        │  hardware limits: 400 MHz - 4.60 GHz
process_open_fds_{otel_scope_name="process"} 25       

@timwoj
Copy link
Member Author

timwoj commented Dec 2, 2023

The endpoint attribute that is added to every metric in the Broker implementation is missing.

But it does add {otel_scope_name} - would be nice if that was suppressed :-)

# HELP zeek_log_stream_writes_total_ Total number of log writes for the given stream.                                                                                                                                                                                                                                                                                                                                                                                                                                                                 │  CPUs which run at the same hardware frequency: 7
# TYPE zeek_log_stream_writes_total_ counter                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          │  CPUs which need to have their frequency coordinated by software: 7
zeek_log_stream_writes_total_{endpoint="",module="PacketFilter",stream="PacketFilter::LOG",otel_scope_name="zeek"} 1                                                                                                                                                                                                                                                                                                                                                                                                                                  │  maximum transition latency: 20.0 us.
# TYPE process_open_fds_ gauge                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        │  hardware limits: 400 MHz - 4.60 GHz
process_open_fds_{otel_scope_name="process"} 25       

I agree that I'd love for that to be suppressed but it's apparently part of the otel spec: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/compatibility/prometheus_and_openmetrics.md#instrumentation-scope-1

Prometheus exporters MUST add the scope name as the otel_scope_name label and the scope version as the otel_scope_version label on all metric points by default, based on the scope the original data point was nested in.

@awelzel
Copy link
Contributor

awelzel commented Dec 4, 2023

I agree that I'd love for that to be suppressed but it's apparently part of the otel spec: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/compatibility/prometheus_and_openmetrics.md#instrumentation-scope-1

Prometheus exporters MUST add the scope name as the otel_scope_name label and the scope version as the otel_scope_version label on all metric points by default, based on the scope the original data point was nested in.

Which is followed by this sentence:

Prometheus exporters SHOULD provide a configuration option to disable the otel_scope_info metric and otel_scope_ labels.

There's no such option? :-)

@timwoj timwoj force-pushed the topic/timw/open-telemetry branch 2 times, most recently from 50679fd to 3d3e30f Compare December 7, 2023 21:46
@timwoj timwoj force-pushed the topic/timw/open-telemetry branch from 3d3e30f to 6ec89cc Compare December 8, 2023 00:23
@timwoj
Copy link
Member Author

timwoj commented Dec 8, 2023

There's no such option? :-)

Not that I could find. I opened open-telemetry/opentelemetry-cpp#2442 after asking on Slack and getting no response for a few days.

EDIT: This is fixed by open-telemetry/opentelemetry-cpp#2451, once that gets merged and released.

timwoj added 8 commits January 2, 2024 15:12
With the broker version the copying was mostly ok because all of the
actual telemetry data was stored internally in broker and accessed
via handle objects. Copying an object wouldn't matter as long as the
handle was maintained. In the OTEL version, a copy of the count data
is stored local to the Counter/Gauge/Histogram objects, so copies
through the OpaqueVal objects will screw things up and the values
get "lost".

This is unfortunately a breaking change to the API, as GetOrAdd()
returns a shared_ptr now instead of a bare object.
- Reduce code duplication via a set of base classes for Counter/Gauge/Histogram
- Only create a single otel instrument for each prefix/name combination
@timwoj timwoj force-pushed the topic/timw/open-telemetry branch from 2722f68 to dc3dfe5 Compare January 2, 2024 22:12
@timwoj timwoj force-pushed the topic/timw/open-telemetry branch from 8c124c1 to 9e8cc24 Compare January 3, 2024 22:49
@timwoj timwoj force-pushed the topic/timw/open-telemetry branch from f69b93c to 6ff2cc3 Compare January 9, 2024 00:19
@timwoj
Copy link
Member Author

timwoj commented Jan 18, 2024

The otel work is on hold until they fix all of the issues that I've opened with them, since most of them are blockers to our usage. I'm going to close this and will open a new one when there's something to review.

@timwoj timwoj closed this Jan 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants