[docs] Clean up internal observability docs (#10454)
#### Description

Now that [4246](open-telemetry/opentelemetry.io#4246), [4322](open-telemetry/opentelemetry.io#4322), and [4529](open-telemetry/opentelemetry.io#4529) have been merged, and the new [Internal telemetry](https://opentelemetry.io/docs/collector/internal-telemetry/) and [Troubleshooting](https://opentelemetry.io/docs/collector/troubleshooting/) pages are live, it's time to clean up the underlying Collector repo docs so that the website is the single source of truth.

I've deleted any content that was moved to the website, and linked to the relevant sections where possible. I've consolidated what content remains in the observability.md file and left troubleshooting.md and monitoring.md as stubs that point to the website. I also searched the Collector repo for cross-references to these files and adjusted links where appropriate.

~~Note that this PR is blocked by [4731](open-telemetry/opentelemetry.io#4731)~~ EDIT: #4731 is merged and no longer a blocker.

#### Link to tracking issue

Fixes #8886
Showing 6 changed files with 115 additions and 507 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,70 +1,7 @@ | ||
# Monitoring | ||
|
||
The Collector provides many metrics for its own monitoring. Some key | ||
recommendations for alerting and monitoring are listed below. | ||
To learn how to monitor the Collector using its own telemetry, see the [Internal | ||
telemetry] page. | ||
|
||
## Critical Monitoring | ||
|
||
### Data Loss | ||
|
||
Use the rate of `otelcol_processor_dropped_spans > 0` and | ||
`otelcol_processor_dropped_metric_points > 0` to detect data loss. Depending on | ||
your requirements, set up a minimal time window before alerting to avoid | ||
notifications for small losses that are not considered outages or that fall | ||
within the desired reliability level. | ||
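
The data loss guidance above could be expressed as Prometheus alerting rules. The following is a minimal sketch, not part of the original file: the group name, alert names, the 5-minute rate window, the 10-minute `for` duration, and the severity label are all assumptions to adapt to your reliability targets. Depending on how the Collector's metrics are scraped, the metric names might also carry a suffix such as `_total`.

```yaml
groups:
  - name: otelcol-data-loss # hypothetical group name
    rules:
      - alert: OtelcolDroppedSpans # hypothetical alert name
        # Fires when processors drop spans at a non-zero rate.
        expr: rate(otelcol_processor_dropped_spans[5m]) > 0
        # Assumed minimal time window; short blips do not alert.
        for: 10m
        labels:
          severity: critical # assumed label
      - alert: OtelcolDroppedMetricPoints # hypothetical alert name
        # Same check for metric points.
        expr: rate(otelcol_processor_dropped_metric_points[5m]) > 0
        for: 10m
        labels:
          severity: critical
```

The `for` clause is what implements the "minimal time window": the expression must hold continuously for that long before the alert fires.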
|
||
### Low on CPU Resources | ||
|
||
This depends on the CPU metrics available on the deployment, for example: | ||
`kube_pod_container_resource_limits{resource="cpu", unit="core"}` for Kubernetes. Let's call it | ||
`available_cores` below. The idea here is to have an upper bound of the number | ||
of available cores, and the maximum expected ingestion rate considered safe, | ||
let's call it `safe_rate`, per core. This should trigger increase of resources/ | ||
instances (or raise an alert as appropriate) whenever | ||
`(actual_rate/available_cores) > safe_rate`. | ||
|
||
The `safe_rate` depends on the specific configuration being used. | ||
// TODO: Provide reference `safe_rate` for a few selected configurations. | ||
|
||
## Secondary Monitoring | ||
|
||
### Queue Length | ||
|
||
Most exporters offer a [queue/retry mechanism](../exporter/exporterhelper/README.md) | ||
that is recommended as the retry mechanism for the Collector and as such should | ||
be used in any production deployment. | ||
|
||
The `otelcol_exporter_queue_capacity` indicates the capacity of the retry queue (in batches). The `otelcol_exporter_queue_size` indicates the current size of retry queue. So you can use these two metrics to check if the queue capacity is enough for your workload. | ||
|
||
The `otelcol_exporter_enqueue_failed_spans`, `otelcol_exporter_enqueue_failed_metric_points`, and `otelcol_exporter_enqueue_failed_log_records` metrics indicate the number of spans, metric points, and log records that failed to be added to the sending queue. This may be caused by a queue full of unsettled elements, so you may need to decrease your sending rate or horizontally scale Collectors. | ||
|
||
The queue/retry mechanism also supports logging for monitoring. Check | ||
the logs for messages like `"Dropping data because sending_queue is full"`. | ||
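
The queue checks described above could be sketched as Prometheus alerting rules. This is an illustration under stated assumptions, not part of the original file: the group name, alert names, the 80% threshold, and the time windows are hypothetical, and metric names might carry a suffix such as `_total` depending on how they are scraped.

```yaml
groups:
  - name: otelcol-exporter-queue # hypothetical group name
    rules:
      - alert: OtelcolExporterQueueNearlyFull # hypothetical alert name
        # Ratio of current queue size to configured capacity.
        # The 0.8 threshold is an assumption, not a recommendation.
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.8
        for: 5m
      - alert: OtelcolExporterEnqueueFailures # hypothetical alert name
        # Spans rejected because the sending queue could not accept them.
        expr: rate(otelcol_exporter_enqueue_failed_spans[5m]) > 0
        for: 5m
```

A sustained high ratio suggests that the queue capacity is too small for the workload, while enqueue failures mean data is already being rejected.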
|
||
### Receive Failures | ||
|
||
Sustained rates of `otelcol_receiver_refused_spans` and | ||
`otelcol_receiver_refused_metric_points` indicate too many errors returned to | ||
clients. Depending on the deployment and the client’s resilience this may | ||
indicate data loss at the clients. | ||
|
||
Sustained rates of `otelcol_exporter_send_failed_spans` and | ||
`otelcol_exporter_send_failed_metric_points` indicate that the Collector is not | ||
able to export data as expected. | ||
It doesn't imply data loss per se since there could be retries but a high rate | ||
of failures could indicate issues with the network or backend receiving the | ||
data. | ||
|
||
## Data Flow | ||
|
||
### Data Ingress | ||
|
||
The `otelcol_receiver_accepted_spans` and | ||
`otelcol_receiver_accepted_metric_points` metrics provide information about | ||
the data ingested by the Collector. | ||
|
||
### Data Egress | ||
|
||
The `otelcol_exporter_sent_spans` and | ||
`otelcol_exporter_sent_metric_points` metrics provide information about | ||
the data exported by the Collector. | ||
[Internal telemetry]: | ||
https://opentelemetry.io/docs/collector/internal-telemetry/#use-internal-telemetry-to-monitor-the-collector |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,140 +1,134 @@ | ||
# OpenTelemetry Collector Observability | ||
# OpenTelemetry Collector internal observability | ||
|
||
## Goal | ||
The [Internal telemetry] page on OpenTelemetry's website contains the | ||
documentation for the Collector's internal observability, including: | ||
|
||
The goal of this document is to have a comprehensive description of observability of the Collector and changes needed to achieve observability part of our [vision](vision.md). | ||
- Which types of observability are emitted by the Collector. | ||
- How to enable and configure these signals. | ||
- How to use this telemetry to monitor your Collector instance. | ||
|
||
## What Needs Observation | ||
If you need to troubleshoot the Collector, see [Troubleshooting]. | ||
|
||
The following elements of the Collector need to be observable. | ||
Read on to learn about experimental features and the project's overall vision | ||
for internal telemetry. | ||
|
||
### Current Values | ||
## Experimental trace telemetry | ||
|
||
- Resource consumption: CPU, RAM (in the future also IO - if we implement persistent queues) and any other metrics that may be available to Go apps (e.g. garbage size, etc). | ||
The Collector does not expose traces by default, but an effort is underway to | ||
[change this][issue7532]. The work includes supporting configuration of the | ||
OpenTelemetry SDK used to produce the Collector's internal telemetry. This | ||
feature is behind two feature gates: | ||
|
||
- Receiving data rate, broken down by receivers and by data type (traces/metrics). | ||
|
||
- Exporting data rate, broken down by exporters and by data type (traces/metrics). | ||
|
||
- Data drop rate due to throttling, broken down by data type. | ||
|
||
- Data drop rate due to invalid data received, broken down by data type. | ||
|
||
- Current throttling state: Not Throttled/Throttled by Downstream/Internally Saturated. | ||
|
||
- Incoming connection count, broken down by receiver. | ||
|
||
- Incoming connection rate (new connections per second), broken down by receiver. | ||
|
||
- In-memory queue size (in bytes and in units). Note: measurements in bytes may be difficult / expensive to obtain and should be used cautiously. | ||
|
||
- Persistent queue size (when supported). | ||
|
||
- End-to-end latency (from receiver input to exporter output). Note that with multiple receivers/exporters we potentially have NxM data paths, each with different latency (plus different pipelines in the future), so realistically we should likely expose the average of all data paths (perhaps broken down by pipeline). | ||
|
||
- Latency broken down by pipeline elements (including exporter network roundtrip latency for request/response protocols). | ||
|
||
“Rate” values must reflect the average rate of the last 10 seconds. Rates must be exposed in bytes/sec and units/sec (e.g. spans/sec). | ||
|
||
Note: some of the current values and rates may be calculated as derivatives of cumulative values in the backend, so it is an open question whether we want to expose them separately or not. | ||
|
||
### Cumulative Values | ||
|
||
- Total received data, broken down by receivers and by data type (traces/metrics). | ||
|
||
- Total exported data, broken down by exporters and by data type (traces/metrics). | ||
|
||
- Total dropped data due to throttling, broken down by data type. | ||
|
||
- Total dropped data due to invalid data received, broken down by data type. | ||
|
||
- Total incoming connection count, broken down by receiver. | ||
|
||
- Uptime since start. | ||
|
||
### Trace or Log on Events | ||
|
||
We want to generate the following events (log and/or send as a trace with additional data): | ||
|
||
- Collector started/stopped. | ||
|
||
- Collector reconfigured (if we support on-the-fly reconfiguration). | ||
|
||
- Begin dropping due to throttling (include throttling reason, e.g. local saturation, downstream saturation, downstream unavailable, etc). | ||
|
||
- Stop dropping due to throttling. | ||
|
||
- Begin dropping due to invalid data (include sample/first invalid data). | ||
|
||
- Stop dropping due to invalid data. | ||
|
||
- Crash detected (differentiate clean stopping and crash, possibly include crash data if available). | ||
|
||
For begin/stop events we need to define an appropriate hysteresis to avoid generating too many events. Note that begin/stop events cannot be detected in the backend simply as derivatives of current rates, the events include additional data that is not present in the current value. | ||
```bash | ||
--feature-gates=telemetry.useOtelWithSDKConfigurationForInternalTelemetry | ||
``` | ||
|
||
### Host Metrics | ||
The gate `useOtelWithSDKConfigurationForInternalTelemetry` enables the Collector | ||
to parse any configuration that aligns with the [OpenTelemetry Configuration] | ||
schema. Support for this schema is experimental, but it does allow telemetry to | ||
be exported using OTLP. | ||
|
||
The service should collect host resource metrics in addition to service's own process metrics. This may help to understand that the problem that we observe in the service is induced by a different process on the same host. | ||
The following configuration can be used in combination with the aforementioned | ||
feature gates to emit internal metrics and traces from the Collector to an OTLP | ||
backend: | ||
|
||
## How We Expose Telemetry | ||
```yaml | ||
service: | ||
  telemetry: | ||
    metrics: | ||
      readers: | ||
        - periodic: | ||
            interval: 5000 | ||
            exporter: | ||
              otlp: | ||
                protocol: grpc/protobuf | ||
                endpoint: https://backend:4317 | ||
    traces: | ||
      processors: | ||
        - batch: | ||
            exporter: | ||
              otlp: | ||
                protocol: grpc/protobuf | ||
                endpoint: https://backend2:4317 | ||
``` | ||
By default, the Collector exposes service telemetry in two ways currently: | ||
See the [example configuration][kitchen-sink] for additional options. | ||
- internal metrics are exposed via a Prometheus interface which defaults to port `8888` | ||
- logs are emitted to stdout | ||
> This configuration does not support emitting logs as there is no support for | ||
> [logs] in the OpenTelemetry Go SDK at this time. | ||
Traces are not exposed by default. There is an effort underway to [change this][issue7532]. The work includes supporting | ||
configuration of the OpenTelemetry SDK used to produce the Collector's internal telemetry. This feature is | ||
currently behind two feature gates: | ||
You can also configure the Collector to send its own traces using the OTLP | ||
exporter. Send the traces to an OTLP server running on the same Collector, so it | ||
goes through configured pipelines. For example: | ||
```bash | ||
--feature-gates=telemetry.useOtelWithSDKConfigurationForInternalTelemetry | ||
```yaml | ||
service: | ||
  telemetry: | ||
    traces: | ||
      processors: | ||
        batch: | ||
          exporter: | ||
            otlp: | ||
              protocol: grpc/protobuf | ||
              endpoint: ${MY_POD_IP}:4317 | ||
``` | ||
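
For the Collector's own traces to pass through its configured pipelines, the same instance also needs an OTLP receiver listening on that endpoint and a traces pipeline that includes it. The following is a minimal sketch of that receiving side, not taken from the original file: the choice of the `debug` exporter and the pipeline layout are assumptions, and `MY_POD_IP` is assumed to be set in the Collector's environment as in the example above.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        # Assumed to match the endpoint used under service::telemetry::traces.
        endpoint: ${MY_POD_IP}:4317

exporters:
  # Assumption: print received traces to the Collector's log output.
  # Replace with the exporter for your backend.
  debug:

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
```

Routing internal traces back into the same instance means that a misbehaving pipeline can also delay or drop the telemetry that describes it, which might be a reason to send them to a separate Collector instead.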
The gate `useOtelWithSDKConfigurationForInternalTelemetry` enables the Collector to parse configuration | ||
that aligns with the [OpenTelemetry Configuration] schema. The support for this schema is still | ||
experimental, but it does allow telemetry to be exported via OTLP. | ||
## Goals of internal telemetry | ||
The following configuration can be used in combination with the feature gates aforementioned | ||
to emit internal metrics and traces from the Collector to an OTLP backend: | ||
The Collector's internal telemetry is an important part of fulfilling | ||
OpenTelemetry's [project vision](vision.md). The following section explains the | ||
priorities for making the Collector an observable service. | ||
```yaml | ||
service: | ||
  telemetry: | ||
    metrics: | ||
      readers: | ||
        - periodic: | ||
            interval: 5000 | ||
            exporter: | ||
              otlp: | ||
                protocol: grpc/protobuf | ||
                endpoint: https://backend:4317 | ||
    traces: | ||
      processors: | ||
        - batch: | ||
            exporter: | ||
              otlp: | ||
                protocol: grpc/protobuf | ||
                endpoint: https://backend2:4317 | ||
``` | ||
### Observable elements | ||
See the configuration's [example][kitchen-sink] for additional configuration options. | ||
The following aspects of the Collector need to be observable. | ||
Note that this configuration does not support emitting logs as there is no support for [logs] in | ||
OpenTelemetry Go SDK at this time. | ||
- [Current values] | ||
- Some of the current values and rates might be calculated as derivatives of | ||
cumulative values in the backend, so it's an open question whether to expose | ||
them separately or not. | ||
- [Cumulative values] | ||
- [Trace or log events] | ||
- For start or stop events, an appropriate hysteresis must be defined to avoid | ||
generating too many events. Note that start and stop events can't be | ||
detected in the backend simply as derivatives of current rates. The events | ||
include additional data that is not present in the current value. | ||
- [Host metrics] | ||
- Host metrics can help users determine if the observed problem in a service | ||
is caused by a different process on the same host. | ||
### Impact | ||
We need to be able to assess the impact of these observability improvements on the core performance of the Collector. | ||
The impact of these observability improvements on the core performance of the | ||
Collector must be assessed. | ||
### Configurable Level of Observability | ||
### Configurable level of observability | ||
Some of the metrics/traces can be high volume and may not be desirable to always observe. We should consider adding an observability verboseness “level” that allows configuring the Collector to send more or less observability data (or even finer granularity to allow turning on/off specific metrics). | ||
Some metrics and traces can be high volume and users might not always want to | ||
observe them. An observability verboseness “level” allows configuration of the | ||
Collector to send more or less observability data or with even finer | ||
granularity, to allow turning on or off specific metrics. | ||
The default level of observability must be defined in a way that has insignificant performance impact on the service. | ||
The default level of observability must be defined in a way that has | ||
insignificant performance impact on the service. | ||
[issue7532]: https://github.com/open-telemetry/opentelemetry-collector/issues/7532 | ||
[issue7454]: https://github.com/open-telemetry/opentelemetry-collector/issues/7454 | ||
[Internal telemetry]: | ||
https://opentelemetry.io/docs/collector/internal-telemetry/ | ||
[Troubleshooting]: https://opentelemetry.io/docs/collector/troubleshooting/ | ||
[issue7532]: | ||
https://github.com/open-telemetry/opentelemetry-collector/issues/7532 | ||
[issue7454]: | ||
https://github.com/open-telemetry/opentelemetry-collector/issues/7454 | ||
[logs]: https://github.com/open-telemetry/opentelemetry-go/issues/3827 | ||
[OpenTelemetry Configuration]: https://github.com/open-telemetry/opentelemetry-configuration | ||
[kitchen-sink]: https://github.com/open-telemetry/opentelemetry-configuration/blob/main/examples/kitchen-sink.yaml | ||
[OpenTelemetry Configuration]: | ||
https://github.com/open-telemetry/opentelemetry-configuration | ||
[kitchen-sink]: | ||
https://github.com/open-telemetry/opentelemetry-configuration/blob/main/examples/kitchen-sink.yaml | ||
[Current values]: | ||
https://opentelemetry.io/docs/collector/internal-telemetry/#values-observable-with-internal-metrics | ||
[Cumulative values]: | ||
https://opentelemetry.io/docs/collector/internal-telemetry/#values-observable-with-internal-metrics | ||
[Trace or log events]: | ||
https://opentelemetry.io/docs/collector/internal-telemetry/#events-observable-with-internal-logs | ||
[Host metrics]: | ||
https://opentelemetry.io/docs/collector/internal-telemetry/#lists-of-internal-metrics |