Incorrect Behavior in OpenTelemetry Collector Spanmetrics #27472

Closed
lucasoares opened this issue Oct 6, 2023 · 13 comments
@lucasoares

lucasoares commented Oct 6, 2023

Component(s)

connector/spanmetrics, exporter/prometheusremotewrite

What happened?

Subject: Issue Report: Incorrect Behavior in OpenTelemetry Collector Spanmetrics

Issue Description:

We're facing a peculiar issue with the OpenTelemetry Collector's Spanmetrics connector and could use some help sorting it out.

Here's a quick rundown:

Problem:

  • We've set up an architecture using Grafana Stack LGTM, with Grafana Loki, Tempo, and Mimir for logs, tracing, and metrics, respectively.
  • The goal is to sample traces efficiently but capture 100% of spanmetrics for a comprehensive APM dashboard.
  • Our setup involves the otel/opentelemetry-collector-contrib as a load balancer, handling trace metrics with the 'spanmetrics' connector and routing traces/metrics based on an attribute_source to apply our internal's tenant distribuition inside Grafana's services.
  • Traces are correctly routed and stored in Grafana Tempo, but the spanmetrics exhibit strange behavior on Grafana Mimir.

Spanmetrics Configuration:

connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [1ms, 2ms, ... , 10000s]
    namespace: traces.spanmetrics
    dimensions:
      - name: http.status_code
      - name: http.method
      - name: rpc.grpc.status_code
      - name: db.system
      - name: external.service
      - name: k8s.cluster.name

Issue Details:

  • Executing code that generates a specific span 10 times accumulates the counter timeseries correctly.
  • However, querying the metric using PromQL functions like increase or rate yields inaccurate results.
  • For example, increase(traces_spanmetrics_calls_total{service_name="my-service"}[5m]) shows a continuously increasing line, reaching 600 executions, and never returning to 0, even after a trace-free period (a quick cross-check against the raw counter is sketched below).
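
A way to see where the inflation appears is to compare the raw counter with the derived query over the same window (the selector below is just the one from our dashboards):

# Raw counter samples: should be monotonically non-decreasing and flat during a trace-free period.
traces_spanmetrics_calls_total{service_name="my-service"}

# Derived query: should drop back to 0 once the counter stops increasing, but in our case it never does.
increase(traces_spanmetrics_calls_total{service_name="my-service"}[5m])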

Observations:

  • The discrepancy is causing inflated values in application metrics, with rate showing over 100,000,000 spans/minute for an app generating 40,000 spans/minute.

  • We sought help on the Grafana Mimir Slack channel (link) without success, but since we haven't found issues with metrics generated by our own applications, it suggests the problem lies within the OpenTelemetry Collector.

Screenshots:

[screenshot]

[screenshot]

In this last example, the metric only stopped because we restarted the OpenTelemetry Collector that was serving these spanmetrics.

Another example of the metric being incorrect after the application no longer generates new spans:

[screenshot]

If you need more details or logs, just let us know!

Collector version

0.83.0

Environment information

Environment

Kubernetes using official helm-chart:

image:
  # If you want to use the core image `otel/opentelemetry-collector`, you also need to change `command.name` value to `otelcol`.
  repository: otel/opentelemetry-collector-contrib
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: "0.83.0"
  # When digest is set to a non-empty value, images will be pulled by digest (regardless of tag value).
  digest: ""

OpenTelemetry Collector configuration

There are two YAML Helm configurations in this section.

The loadbalancer:

# Default values for opentelemetry-collector.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

nameOverride: ""
fullnameOverride: ""

# Valid values are "daemonset", "deployment", and "statefulset".
mode: "deployment"

configMap:
  # Specifies whether a configMap should be created (true by default)
  create: true

# Base collector configuration.
# Supports templating. To escape existing instances of {{ }}, use {{` <original content> `}}.
# For example, {{ REDACTED_EMAIL }} becomes {{` {{ REDACTED_EMAIL }} `}}.
config:
  receivers:
    jaeger: null
    zipkin: null
    prometheus: null
    otlp:
      protocols:
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
          max_recv_msg_size_mib: 500
        http:
          endpoint: ${env:MY_POD_IP}:4318
  processors:
    batch:
      send_batch_max_size: 8192
    routing:
      from_attribute: k8s.cluster.name
      attribute_source: resource
      table:
      - value: a
        exporters:
          - prometheusremotewrite/mimir-a
      - value: b
        exporters:
          - prometheusremotewrite/mimir-b
      - value: c
        exporters:
          - prometheusremotewrite/mimir-c
      - value: d
        exporters:
          - prometheusremotewrite/mimir-d
      - value: e
        exporters:
          - prometheusremotewrite/mimir-e
      - value: e
        exporters:
          - prometheusremotewrite/mimir-f
      - value: f
        exporters:
          - prometheusremotewrite/mimir-g
      - value: g
        exporters:
          - prometheusremotewrite/mimir-h
      - value: h
        exporters:
          - prometheusremotewrite/mimir-i
      - value: J
        exporters:
          - prometheusremotewrite/mimir-j
    # If set to null, will be overridden with values based on k8s resource limits
    memory_limiter: null
  connectors:
    spanmetrics:
      histogram:
        explicit:
          buckets: [1ms, 2ms, 4ms, 6ms, 8ms, 10ms, 50ms, 100ms, 200ms, 400ms, 800ms, 1s, 1400ms, 2s, 5s, 10s, 15s, 20s, 40s, 100s, 500s, 1000s, 10000s]
      namespace: traces.spanmetrics
      dimensions:
        - name: http.status_code
        - name: http.method
        - name: rpc.grpc.status_code
        - name: db.system
        - name: external.service
        - name: k8s.cluster.name
  exporters:
    logging: null
    prometheusremotewrite/mimir-a:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanaaMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-b:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanabMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-c:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanacMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-d:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanadMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-e:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanaFirehoseMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-f:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanaeMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-g:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanafMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-h:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanagMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-i:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanahMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    prometheusremotewrite/mimir-j:
      endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
      resource_to_telemetry_conversion:
        enabled: true
      tls:
        insecure: true
      headers:
        X-Scope-OrgID: grafanajMimir
      remote_write_queue:
        enabled: true
        queue_size: 10000
        num_consumers: 5
    loadbalancing:
      protocol:
        otlp:
          tls:
            insecure: true
      resolver:
        dns:
          hostname: opentelemetry-collector-tail.tempo-system.svc.cluster.local
          port: 4317

  extensions:
    # The health_check extension is mandatory for this chart.
    # Without the health_check extension the collector will fail the readiness and liveliness probes.
    # The health_check extension can be modified, but should never be removed.
    health_check: {}
    memory_ballast:
      size_in_percentage: 33
  service:
    telemetry:
      metrics:
        address: 0.0.0.0:8888
      logs:
        encoding: json
    extensions:
      - health_check
      - memory_ballast
    pipelines:
      logs: null
      metrics:
        receivers:
          - spanmetrics
        processors:
          - memory_limiter
          - batch
          - routing
        exporters:
          - prometheusremotewrite/mimir-a
          - prometheusremotewrite/mimir-b
          - prometheusremotewrite/mimir-c
          - prometheusremotewrite/mimir-d
          - prometheusremotewrite/mimir-e
          - prometheusremotewrite/mimir-f
          - prometheusremotewrite/mimir-g
          - prometheusremotewrite/mimir-h
          - prometheusremotewrite/mimir-i
          - prometheusremotewrite/mimir-j
      traces:
        receivers:
          - otlp
        processors:
          - memory_limiter
          - batch
        exporters:
          - loadbalancing
          - spanmetrics

image:
  # If you want to use the core image `otel/opentelemetry-collector`, you also need to change `command.name` value to `otelcol`.
  repository: otel/opentelemetry-collector-contrib
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: "0.83.0"
  # When digest is set to a non-empty value, images will be pulled by digest (regardless of tag value).
  digest: ""
imagePullSecrets: []

# OpenTelemetry Collector executable
command:
  name: otelcol-contrib
  extraArgs:
    - --feature-gates=pkg.translator.prometheus.NormalizeName

nodeSelector:
  role: lgtm
tolerations:
- effect: NoSchedule
  key: grafana-stack
  operator: Exists

# Configuration for ports
# nodePort is also allowed
ports:
  otlp:
    enabled: true
    containerPort: 4317
    servicePort: 4317
    hostPort: 4317
    protocol: TCP
    # nodePort: 30317
    appProtocol: grpc
  otlp-http:
    enabled: true
    containerPort: 4318
    servicePort: 4318
    hostPort: 4318
    protocol: TCP
  jaeger-compact:
    enabled: false
    containerPort: 6831
    servicePort: 6831
    hostPort: 6831
    protocol: UDP
  jaeger-thrift:
    enabled: false
    containerPort: 14268
    servicePort: 14268
    hostPort: 14268
    protocol: TCP
  jaeger-grpc:
    enabled: false
    containerPort: 14250
    servicePort: 14250
    hostPort: 14250
    protocol: TCP
  zipkin:
    enabled: false
    containerPort: 9411
    servicePort: 9411
    hostPort: 9411
    protocol: TCP
  metrics:
    # The metrics port is disabled by default. However you need to enable the port
    # in order to use the ServiceMonitor (serviceMonitor.enabled) or PodMonitor (podMonitor.enabled).
    enabled: true
    containerPort: 8888
    servicePort: 8888
    protocol: TCP

# Resource limits & requests. Update according to your own use case as these values might be too low for a typical deployment.
resources:
  limits:
    cpu: 1
    memory: 1Gi
  requests:
    cpu: 100m
    memory: 100Mi

podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8888"

# only used with deployment mode
replicaCount: 4

# only used with deployment mode
revisionHistoryLimit: 10

service:
  type: ClusterIP
  # type: LoadBalancer
  # loadBalancerIP: 1.2.3.4
  # loadBalancerSourceRanges: []
  annotations: {}

# PodDisruptionBudget is used only if deployment enabled
podDisruptionBudget:
  enabled: true
#   minAvailable: 2
  maxUnavailable: 1

rollout:
  rollingUpdate: {}
  # When 'mode: daemonset', maxSurge cannot be used when hostPort is set for any of the ports
  # maxSurge: 25%
  # maxUnavailable: 0
  strategy: RollingUpdate

clusterRole:
  # Specifies whether a clusterRole should be created
  # Some presets also trigger the creation of a cluster role and cluster role binding.
  # If using one of those presets, this field is no-op.
  create: false
  # Annotations to add to the clusterRole
  # Can be used in combination with presets that create a cluster role.
  annotations: {}
  # The name of the clusterRole to use.
  # If not set a name is generated using the fullname template
  # Can be used in combination with presets that create a cluster role.
  name: ""
  # A set of rules as documented here : https://kubernetes.io/docs/reference/access-authn-authz/rbac/
  # Can be used in combination with presets that create a cluster role to add additional rules.
  rules:
  - apiGroups:
    - ''
    resources:
    - 'endpoints'
    verbs:
    - 'get'
    - 'list'
    - 'watch'

  clusterRoleBinding:
    # Annotations to add to the clusterRoleBinding
    # Can be used in combination with presets that create a cluster role binding.
    annotations: {}
    # The name of the clusterRoleBinding to use.
    # If not set a name is generated using the fullname template
    # Can be used in combination with presets that create a cluster role binding.
    name: ""

The tail sampler:

# Default values for opentelemetry-collector.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.

nameOverride: ""
fullnameOverride: ""

# Valid values are "daemonset", "deployment", and "statefulset".
mode: "deployment"

configMap:
  # Specifies whether a configMap should be created (true by default)
  create: true

# Base collector configuration.
# Supports templating. To escape existing instances of {{ }}, use {{` <original content> `}}.
# For example, {{ REDACTED_EMAIL }} becomes {{` {{ REDACTED_EMAIL }} `}}.
config:
  receivers:
    jaeger: null
    zipkin: null
    prometheus: null
    otlp:
      protocols:
        grpc:
          endpoint: ${env:MY_POD_IP}:4317
          max_recv_msg_size_mib: 500
        http: null
  processors:
    batch:
      send_batch_max_size: 8192
    # If set to null, will be overridden with values based on k8s resource limits
    memory_limiter: null
    tail_sampling:
      decision_wait: 60s
      policies:
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
    routing:
      from_attribute: k8s.cluster.name
      attribute_source: resource
      # default_exporters:
      # - otlp/default
      table:
      - value: a
        exporters:
          - otlp/tempo-a
      - value: b
        exporters:
          - otlp/tempo-b
      - value: c
        exporters:
          - otlp/tempo-c
      - value: d
        exporters:
          - otlp/tempo-d
      - value: e
        exporters:
          - otlp/tempo-e
      - value: f
        exporters:
          - otlp/tempo-f
      - value: g
        exporters:
          - otlp/tempo-g
      - value: h
        exporters:
          - otlp/tempo-h
      - value: i
        exporters:
          - otlp/tempo-i
      - value: j
        exporters:
          - otlp/tempo-j
  exporters:
    logging: null
    # otlp/default:
    #   endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
    #   tls:
    #     insecure: true
    #   headers:
    #     x-scope-orgid: aMimir
    otlp/tempo-a:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanaaTempo
    otlp/tempo-b:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanabTempo
    otlp/tempo-c:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanacTempo
    otlp/tempo-d:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanadTempo
    otlp/tempo-e:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanaeTempo
    otlp/tempo-f:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanafTempo
    otlp/tempo-g:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanagTempo
    otlp/tempo-h:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanahTempo
    otlp/tempo-i:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanaiTempo
    otlp/tempo-j:
      endpoint: tempo-distributor.tempo-system.svc.cluster.local:4317
      tls:
        insecure: true
      headers:
        x-scope-orgid: grafanajTempo
  extensions:
    # The health_check extension is mandatory for this chart.
    # Without the health_check extension the collector will fail the readiness and liveliness probes.
    # The health_check extension can be modified, but should never be removed.
    health_check: {}
    memory_ballast:
      size_in_percentage: 33
  service:
    telemetry:
      metrics:
        address: 0.0.0.0:8888
      logs:
        encoding: json
    extensions:
      - health_check
      - memory_ballast
    pipelines:
      logs: null
      metrics: null
      traces:
        receivers:
          - otlp
        processors:
          - memory_limiter
          - tail_sampling
          - batch
          - routing
        exporters:
          - otlp/tempo-a
          - otlp/tempo-b
          - otlp/tempo-c
          - otlp/tempo-d
          - otlp/tempo-e
          - otlp/tempo-f
          - otlp/tempo-g
          - otlp/tempo-h
          - otlp/tempo-i
          - otlp/tempo-j

image:
  # If you want to use the core image `otel/opentelemetry-collector`, you also need to change `command.name` value to `otelcol`.
  repository: otel/opentelemetry-collector-contrib
  pullPolicy: IfNotPresent
  # Overrides the image tag whose default is the chart appVersion.
  tag: "0.83.0"
  # When digest is set to a non-empty value, images will be pulled by digest (regardless of tag value).
  digest: ""
imagePullSecrets: []

# OpenTelemetry Collector executable
command:
  name: otelcol-contrib
  extraArgs: []

nodeSelector:
  role: lgtm
tolerations:
- effect: NoSchedule
  key: grafana-stack
  operator: Exists

# Configuration for ports
# nodePort is also allowed
ports:
  otlp:
    enabled: true
    containerPort: 4317
    servicePort: 4317
    hostPort: 4317
    protocol: TCP
    # nodePort: 30317
    appProtocol: grpc
  otlp-http:
    enabled: false
    containerPort: 4318
    servicePort: 4318
    hostPort: 4318
    protocol: TCP
  jaeger-compact:
    enabled: false
    containerPort: 6831
    servicePort: 6831
    hostPort: 6831
    protocol: UDP
  jaeger-thrift:
    enabled: false
    containerPort: 14268
    servicePort: 14268
    hostPort: 14268
    protocol: TCP
  jaeger-grpc:
    enabled: false
    containerPort: 14250
    servicePort: 14250
    hostPort: 14250
    protocol: TCP
  zipkin:
    enabled: false
    containerPort: 9411
    servicePort: 9411
    hostPort: 9411
    protocol: TCP
  metrics:
    # The metrics port is disabled by default. However you need to enable the port
    # in order to use the ServiceMonitor (serviceMonitor.enabled) or PodMonitor (podMonitor.enabled).
    enabled: true
    containerPort: 8888
    servicePort: 8888
    protocol: TCP

# Resource limits & requests. Update according to your own use case as these values might be too low for a typical deployment.
resources:
  limits:
    cpu: 1
    memory: 2Gi
  requests:
    cpu: 100m
    memory: 500Mi

podAnnotations:
  prometheus.io/scrape: "true"
  prometheus.io/port: "8888"

# only used with deployment mode
replicaCount: 4

# only used with deployment mode
revisionHistoryLimit: 10

service:
  type: ClusterIP
  # type: LoadBalancer
  # loadBalancerIP: 1.2.3.4
  # loadBalancerSourceRanges: []
  clusterIP: None
  annotations: {}

# PodDisruptionBudget is used only if deployment enabled
podDisruptionBudget:
  enabled: true
#   minAvailable: 2
  maxUnavailable: 1

rollout:
  rollingUpdate: {}
  # When 'mode: daemonset', maxSurge cannot be used when hostPort is set for any of the ports
  # maxSurge: 25%
  # maxUnavailable: 0
  strategy: RollingUpdate

Ignore the exporters' names; I've redacted them.



Log output

_No response_

Additional context

_No response_
@lucasoares added the labels bug (Something isn't working) and needs triage (New item requiring triage) on Oct 6, 2023
@github-actions
Contributor

github-actions bot commented Oct 6, 2023

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@albertteoh
Contributor

Thanks for those details @lucasoares. I agree, that increase of 600 doesn't seem right to me.

This seems relatively straightforward for us to reproduce locally with just the spanmetrics connector + a Prometheus server, with the objective of eliminating Mimir from the equation and confirming (or ruling out) that the problem relates to the spanmetrics connector.

You could use this working docker-compose setup with the spanmetrics connector + Prometheus (+ Jaeger) as a template: https://github.com/jaegertracing/jaeger/tree/main/docker-compose/monitor.
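
If docker-compose isn't convenient, a bare-bones collector config along these lines should also be enough to expose the generated span metrics to a local Prometheus scrape. This is just a sketch with placeholder endpoints, using the prometheus exporter instead of remote write:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
connectors:
  spanmetrics:
    namespace: traces.spanmetrics
exporters:
  prometheus:
    # Scrape this endpoint from a local Prometheus to inspect the raw counters.
    endpoint: 0.0.0.0:8889
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]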

@diogenesblip

diogenesblip commented Oct 19, 2023

Thank you for the suggestion to set up a local test environment with the spanmetrics connector and Prometheus. We followed your instructions and the configuration worked perfectly in our local test environment.

This helped us confirm that the spanmetrics connector and Prometheus configuration appear to be working as expected, and the initial issue we were facing may not be directly related to these components.

However, we have a Homologation (HMG) environment that is identical to the production environment, but we have not been able to observe the same erroneous behavior in it.

Below are the configuration files for the HMG environment:

[screenshots]

The loadbalancer:

nameOverride: ""
fullnameOverride: ""

mode: "deployment"

configMap:
create: true

config:
receivers:
jaeger: null
zipkin: null
prometheus: null
otlp:
protocols:
grpc:
endpoint: ${env:MY_POD_IP}:4317
max_recv_msg_size_mib: 500
http:
endpoint: ${env:MY_POD_IP}:4318
processors:
batch:
send_batch_max_size: 8192
routing:
from_attribute: k8s.cluster.name
attribute_source: resource
table:
- value: a
exporters:
- prometheusremotewrite/mimir-a
- value: b
exporters:
- prometheusremotewrite/mimir-b
- value: c
exporters:
- prometheusremotewrite/mimir-c
- value: d
exporters:
- prometheusremotewrite/mimir-d

memory_limiter: null

connectors:
spanmetrics:
histogram:
explicit:
buckets: [1ms, 2ms, 4ms, 6ms, 8ms, 10ms, 50ms, 100ms, 200ms, 400ms, 800ms, 1s, 1400ms, 2s, 5s, 10s, 15s, 20s, 40s, 100s, 500s, 1000s, 10000s]
namespace: traces.spanmetrics
dimensions:
- name: http.status_code
- name: http.method
- name: rpc.grpc.status_code
- name: db.system
- name: external.service
- name: k8s.cluster.name
exporters:
logging: null
prometheusremotewrite/mimir-a:
endpoint: http://mimir-distributor.mimir.svc.cluster.local:8080/api/v1/push
resource_to_telemetry_conversion:
enabled: true
tls:
insecure: true
headers:
X-Scope-OrgID: grafanaaMimir
remote_write_queue:
enabled: true
queue_size: 10000
num_consumers: 5
prometheusremotewrite/mimir-b:
endpoint: http://mimir-distributor.mimir.svc.cluster.local:8080/api/v1/push
resource_to_telemetry_conversion:
enabled: true
tls:
insecure: true
headers:
X-Scope-OrgID: grafanabMimir
remote_write_queue:
enabled: true
queue_size: 10000
num_consumers: 5
prometheusremotewrite/mimir-c:
endpoint: http://mimir-distributor.mimir.svc.cluster.local:8080/api/v1/push
resource_to_telemetry_conversion:
enabled: true
tls:
insecure: true
headers:
X-Scope-OrgID: grafanacMimir
remote_write_queue:
enabled: true
queue_size: 10000
num_consumers: 5
prometheusremotewrite/mimir-d:
endpoint: http://mimir-distributor.mimir.svc.cluster.local:8080/api/v1/push
resource_to_telemetry_conversion:
enabled: true
tls:
insecure: true
headers:
X-Scope-OrgID: grafanadMimir
remote_write_queue:
enabled: true
queue_size: 10000
num_consumers: 5
loadbalancing:
protocol:
otlp:
tls:
insecure: true
resolver:
dns:
hostname: opentelemetry-collector-tail.tempo.svc.cluster.local
port: 4317

extensions:
health_check: {}
memory_ballast:
size_in_percentage: 33
service:
telemetry:
metrics:
address: 0.0.0.0:8888
logs:
encoding: json
extensions:
- health_check
- memory_ballast
pipelines:
logs: null
metrics:
receivers:
- spanmetrics
processors:
- memory_limiter
- batch
- routing
exporters:
- prometheusremotewrite/mimir-a
- prometheusremotewrite/mimir-b
- prometheusremotewrite/mimir-c
- prometheusremotewrite/mimir-d
traces:
receivers:
- otlp
processors:
- memory_limiter
- batch
exporters:
- loadbalancing
- spanmetrics

image:
otelcol.
repository: otel/opentelemetry-collector-contrib
pullPolicy: IfNotPresent
tag: "0.83.0"
digest: ""
imagePullSecrets: []

command:
name: otelcol-contrib
extraArgs:
- --feature-gates=pkg.translator.prometheus.NormalizeName

nodeSelector:
component: prometheus
tolerations:

  • effect: NoSchedule
    key: kind
    operator: Equal
    value: prometheus
  • effect: NoSchedule
    key: "kubernetes.azure.com/scalesetpriority"
    operator: Equal
    value: spot

ports:
otlp:
enabled: true
containerPort: 4317
servicePort: 4317
hostPort: 4317
protocol: TCP
appProtocol: grpc
otlp-http:
enabled: true
containerPort: 4318
servicePort: 4318
hostPort: 4318
protocol: TCP
jaeger-compact:
enabled: false
containerPort: 6831
servicePort: 6831
hostPort: 6831
protocol: UDP
jaeger-thrift:
enabled: false
containerPort: 14268
servicePort: 14268
hostPort: 14268
protocol: TCP
jaeger-grpc:
enabled: false
containerPort: 14250
servicePort: 14250
hostPort: 14250
protocol: TCP
zipkin:
enabled: false
containerPort: 9411
servicePort: 9411
hostPort: 9411
protocol: TCP
metrics:
enabled: true
containerPort: 8888
servicePort: 8888
protocol: TCP

deployment.
resources:
limits:
cpu: 1
memory: 1Gi
requests:
cpu: 100m
memory: 100Mi

podAnnotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8888"

replicaCount: 2

revisionHistoryLimit: 10

service:
type: ClusterIP
annotations: {}

podDisruptionBudget:
enabled: true
maxUnavailable: 1

rollout:
rollingUpdate: {}
strategy: RollingUpdate

clusterRole:
create: false
annotations: {}
name: ""
rules:

  • apiGroups:
    • ''
      resources:
    • 'endpoints'
      verbs:
    • 'get'
    • 'list'
    • 'watch'

clusterRoleBinding:
annotations: {}
name: ""

The tail sampler:

nameOverride: ""
fullnameOverride: ""

mode: "deployment"

configMap:
create: true

config:
receivers:
jaeger: null
zipkin: null
prometheus: null
otlp:
protocols:
grpc:
endpoint: ${env:MY_POD_IP}:4317
max_recv_msg_size_mib: 500
http: null
processors:
batch:
send_batch_max_size: 8192
memory_limiter: null
tail_sampling:
decision_wait: 60s
policies:
- name: probabilistic
type: probabilistic
probabilistic:
sampling_percentage: 10
routing:
from_attribute: k8s.cluster.name
attribute_source: resource
table:
- value: a
exporters:
- otlp/tempo-a
- value: b
exporters:
- otlp/tempo-b
- value: c
exporters:
- otlp/tempo-c
- value: d
exporters:
- otlp/tempo-d
exporters:
logging: null
otlp/tempo-a:
endpoint: tempo-distributor.tempo.svc.cluster.local:4317
tls:
insecure: true
headers:
x-scope-orgid: grafanaaTempo
otlp/tempo-b:
endpoint: tempo-distributor.tempo.svc.cluster.local:4317
tls:
insecure: true
headers:
x-scope-orgid: grafanabTempo
otlp/tempo-c:
endpoint: tempo-distributor.tempo.svc.cluster.local:4317
tls:
insecure: true
headers:
x-scope-orgid: grafanacTempo
otlp/tempo-d:
endpoint: tempo-distributor.tempo.svc.cluster.local:4317
tls:
insecure: true
headers:
x-scope-orgid: grafanadTempo
extensions:
health_check: {}
memory_ballast:
size_in_percentage: 33
service:
telemetry:
metrics:
address: 0.0.0.0:8888
logs:
encoding: json
extensions:
- health_check
- memory_ballast
pipelines:
logs: null
metrics: null
traces:
receivers:
- otlp
processors:
- memory_limiter
- tail_sampling
- batch
- routing
exporters:
- otlp/tempo-a
- otlp/tempo-b
- otlp/tempo-c
- otlp/tempo-d

image:
otelcol.
repository: otel/opentelemetry-collector-contrib
pullPolicy: IfNotPresent
tag: "0.83.0"
digest: ""
imagePullSecrets: []

OpenTelemetry Collector executable

command:
name: otelcol-contrib
extraArgs: []

nodeSelector:
component: prometheus
tolerations:

  • effect: NoSchedule
    key: kind
    operator: Equal
    value: prometheus
  • effect: NoSchedule
    key: "kubernetes.azure.com/scalesetpriority"
    operator: Equal
    value: spot

ports:
otlp:
enabled: true
containerPort: 4317
servicePort: 4317
hostPort: 4317
protocol: TCP
appProtocol: grpc
otlp-http:
enabled: false
containerPort: 4318
servicePort: 4318
hostPort: 4318
protocol: TCP
jaeger-compact:
enabled: false
containerPort: 6831
servicePort: 6831
hostPort: 6831
protocol: UDP
jaeger-thrift:
enabled: false
containerPort: 14268
servicePort: 14268
hostPort: 14268
protocol: TCP
jaeger-grpc:
enabled: false
containerPort: 14250
servicePort: 14250
hostPort: 14250
protocol: TCP
zipkin:
enabled: false
containerPort: 9411
servicePort: 9411
hostPort: 9411
protocol: TCP
metrics:
enabled: true
containerPort: 8888
servicePort: 8888
protocol: TCP

deployment.
resources:
limits:
cpu: 1
memory: 1Gi
requests:
cpu: 100m
memory: 100Mi

podAnnotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8888"

replicaCount: 2

revisionHistoryLimit: 10

service:
type: ClusterIP
clusterIP: None
annotations: {}

podDisruptionBudget:
enabled: true
maxUnavailable: 1

rollout:
rollingUpdate: {}
strategy: RollingUpdate

@luistilingue

Could this issue be related to the following problem?

#27080

I'm using version 0.83.0 and have upgraded to 0.88.0 to check whether it was fixed. I'll come back here with the results :D

@crobert-1
Member

@luistilingue Have you been able to test this yet? (Or @lucasoares)

@luistilingue

@crobert-1 The issue still persists, even after updating to 0.91.0.

Could it be related to cache pruning, as discussed in grafana/agent#5271 and #17306?

This behavior is impacting our use of the OTel Collector :(

@luistilingue

@nijave Could you tell us whether you were able to fix this behavior?

@luistilingue

I got the problem solved. It's related to Mimir HA deduplication: after adding external_labels to the prometheusremotewrite exporter, the metric values returned to normal. I think we can close this issue.
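
Roughly the shape of the change on each remote-write exporter (just a sketch: the label names follow Mimir's default HA-tracker labels, the label values are illustrative, and MY_POD_NAME is assumed to be injected via the Kubernetes downward API):

exporters:
  prometheusremotewrite/mimir-a:
    endpoint: http://mimir-distributor.mimir-system.svc.cluster.local:8080/api/v1/push
    external_labels:
      # cluster identifies the group of collector replicas; __replica__ identifies the sender
      cluster: otel-collector-loadbalancer
      __replica__: ${env:MY_POD_NAME}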

@crobert-1
Member

@lucasoares Can you confirm what @luistilingue has suggested resolves your issue?

@lucasoares
Author

@lucasoares Can you confirm what @luistilingue has suggested resolves your issue?

Yes

@crobert-1
Member

I'm going to close the issue for now as it appears to be resolved, but let me know if there's anything else required here.

@chewrocca

I got the problem solved. It's related to Mimir HA deduplication: after adding external_labels to the prometheusremotewrite exporter, the metric values returned to normal. I think we can close this issue.

Can you elaborate a bit more? We experience similar issues.

@nijave
Contributor

nijave commented Jan 11, 2024

I got the problem solved. It's related to Mimir HA deduplication: after adding external_labels to the prometheusremotewrite exporter, the metric values returned to normal. I think we can close this issue.

Can you elaborate a bit more? We experience similar issues.

https://grafana.com/docs/mimir/latest/configure/configure-high-availability-deduplication/
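The relevant Mimir-side settings look roughly like this (a sketch based on that page; double-check the key names against your Mimir version, and note the HA tracker also needs a KV store such as consul or etcd configured):

limits:
  accept_ha_samples: true
  ha_cluster_label: cluster        # must match the cluster label the collectors send
  ha_replica_label: __replica__    # the replica label is dropped after deduplication
distributor:
  ha_tracker:
    enable_ha_tracker: true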
