
ECS container.cpu.utilized metric unit clarification #1368

Closed
tomiszili opened this issue Jul 22, 2022 · 14 comments

Comments

@tomiszili

Hello!

What is the unit of the ECS container.cpu.utilized metric?
Please help me understand the unit of the container.cpu.utilized metric, because it does not align with the CPUUtilization metric of the EC2 instance.
[screenshot]

@github-actions
Contributor

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the stale label Sep 25, 2022
@imishchuk-carbon

Hello, team.

We've been trying to figure out how the container.cpu.utilized metric is calculated, but with no luck.

Platform: AWS ECS (Fargate or EC2; observed on both capacity providers)
ADOT version: v0.22.0

We have a few observations, shared below:

Initially, we set the CPU value at the Task level only, leaving the container-level CPU at 0:

{
    ...
    "containerDefinitions": [
      {
        "cpu": 0,
        "name": "app"
      }
    ],
    
    "requiresCompatibilities": [
      "FARGATE"
    ],
    "cpu": "8192",
    ...
}

A container-level CPU value of 0 gets converted to 2 in the background before being passed to docker run --cpu-shares. Explanations of this behavior are here and here.
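
As an illustration only (this is not the ECS agent's actual code, and effectiveCPUShares is a made-up name), the substitution described above behaves roughly like this:

package main

import "fmt"

// effectiveCPUShares sketches the behavior described above: the Linux CFS
// cpu.shares value has a minimum of 2, so a container-level "cpu" of 0 (or 1)
// shows up as 2 by the time it reaches docker run --cpu-shares, and that is
// the value container.cpu.reserved ends up reporting.
func effectiveCPUShares(configured int64) int64 {
    if configured < 2 {
        return 2
    }
    return configured
}

func main() {
    fmt.Println(effectiveCPUShares(0))    // 2  (container-level cpu left at 0)
    fmt.Println(effectiveCPUShares(7168)) // 7168
}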

This can be confirmed by graphing the container.cpu.reserved metric for a Task that has container-level CPU set to 0:
[screenshot]

And in that case container.cpu.utilized looks sane
[screenshot]

Next, we set the container-level CPU to a non-zero value, like:

{
    ...
    "containerDefinitions": [
      {
        "cpu": 7168,
        "name": "app"
      }
    ],
    
    "requiresCompatibilities": [
      "FARGATE"
    ],
    "cpu": "8192",
    ...
}

This is correctly reflected in the container.cpu.reserved metric:
[screenshot]

But container.cpu.utilized gets messed up
[screenshot]

Questions:

  • How is container.cpu.utilized related to container.cpu.reserved?
  • Which CloudWatch metrics are used for container.cpu.utilized?

Thank you.

@bryan-aguilar
Contributor

@imishchuk-carbon can you share the collector config you are using?

@imishchuk-carbon

Hey @bryan-aguilar
Thanks for looking into this.
Config below

extensions:
  health_check:
  sigv4auth:
    region: "us-east-1"

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  awsxray:
    endpoint: 0.0.0.0:2000
    transport: udp
  awsecscontainermetrics:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-envoy-eg
          scrape_interval: 5s
          metrics_path: /stats/prometheus
          static_configs:
            - targets: ["localhost:9901"]
              labels:
                __ecs_container_metadata_uri: ${ECS_CONTAINER_METADATA_URI}
          relabel_configs:
            - source_labels: [__ecs_container_metadata_uri]
              target_label: ecs_task_id
              regex: '.*?/([a-z0-9]+)-\d+$'
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 20
    spike_limit_percentage: 15
  batch/traces:
    timeout: 5s
    send_batch_size: 50
  batch/metrics:
    timeout: 60s
  resourcedetection:
    detectors:
      - env
      - system
      - ecs
      - ec2

exporters:
  otlphttp:
    endpoint: "${OTEL_COLLECTOR_ENDPOINT}"
  awsemf:
    namespace: ECS/AWSOTel/Application
    log_group_name: '/aws/ecs/application/metrics'
    region: "${OTEL_EXPORT_AMP_REGION}"
  # AWS Managed Prometheus Collector configuration
  prometheusremotewrite:
    endpoint: "${OTEL_EXPORTER_AMP_ENDPOINT}"
    auth:
      authenticator: sigv4auth
    resource_to_telemetry_conversion:
      enabled: true
    remote_write_queue:
      enabled: true
      num_consumers: 1
      queue_size: 5000
  logging:
    loglevel: debug


service:
  pipelines:
    traces:
      receivers: [otlp, awsxray]
      processors: [memory_limiter, resourcedetection, batch/traces]
      exporters: [otlphttp]
    metrics/application:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, batch/metrics]
      exporters: [prometheusremotewrite]
    metrics/envoy:
      receivers: [prometheus]
      processors: [memory_limiter, batch/metrics]
      exporters: [prometheusremotewrite]
    metrics:
      receivers: [awsecscontainermetrics]
      processors: [memory_limiter, batch/metrics]
      exporters: [otlphttp]

  extensions: [sigv4auth,health_check]

@bryan-aguilar
Contributor

CPU utilization is calculated as Container CPU Usage / Container CPU Reserved. You can see that in the receiver code here.
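
To make the consequence of that division concrete, here is a minimal sketch that mirrors it. The containerMetrics stand-in struct, the normalize helper, and the starting value of 50 are hypothetical; the absolute units depend on how CPUUtilized is populated before the division, which this sketch does not reproduce.

package main

import "fmt"

// containerMetrics is a stand-in for the receiver's internal struct; only the
// two fields relevant to the division are modeled here.
type containerMetrics struct {
    CPUUtilized float64 // assumed to be pre-populated from container CPU usage
    CPUReserved float64 // container-level CPU limit from the task metadata
}

// normalize mirrors the quoted snippet below: divide the pre-populated
// utilization by whatever CPUReserved happens to be.
func normalize(m *containerMetrics) {
    if m.CPUReserved > 0 {
        m.CPUUtilized = m.CPUUtilized / m.CPUReserved
    }
}

func main() {
    const usage = 50.0 // hypothetical pre-division value

    unset := containerMetrics{CPUUtilized: usage, CPUReserved: 2}       // container-level cpu not set
    explicit := containerMetrics{CPUUtilized: usage, CPUReserved: 7168} // container-level cpu set

    normalize(&unset)
    normalize(&explicit)

    // The same usage lands on wildly different scales depending on the
    // container-level CPU setting, which is the behavior discussed below.
    fmt.Println(unset.CPUUtilized)    // 25
    fmt.Println(explicit.CPUUtilized) // ~0.00698
}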

@imishchuk-carbon

imishchuk-carbon commented Oct 28, 2022

if containerMetrics.CPUReserved > 0 {
    containerMetrics.CPUUtilized = (containerMetrics.CPUUtilized / containerMetrics.CPUReserved)
}

Okay, can you help me understand the logic behind this calculation, please? Why is containerMetrics.CPUUtilized redefined as its own value divided by containerMetrics.CPUReserved?

containerMetrics.CPUReserved is never 0; the minimum it gets is 2. And when it is set to 2, it means the container does not have a guaranteed CPU share in the Task.
E.g.

{
    ...
    "containerDefinitions": [
      {
        "cpu": 0,
        "name": "app"
      }
    ],
    
    "requiresCompatibilities": [
      "FARGATE"
    ],
    "cpu": "2048",
    ...
}

Container metadata

curl -s  $ECS_CONTAINER_METADATA_URI_V4/task | jq  -rc '.Containers[] | "\(.Name): \(.Limits)"'
firelens: {"CPU":2}
app: {"CPU":2}
otel-collector: {"CPU":2}
envoy: {"CPU":2}

[screenshot]

In light of the above, I think the comparison if containerMetrics.CPUReserved > 0 should be changed to if containerMetrics.CPUReserved > 2, because 2 (and null, 0, 1) is a special case.
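
A sketch of what that suggested change might look like, reusing the hypothetical containerMetrics struct from the earlier sketch (this is an illustration of the suggestion, not a patch against the receiver):

// Skip the division when CPUReserved is 2, i.e. the placeholder value that
// means "no container-level CPU reservation", rather than only when it is 0.
func normalizeCPUUtilized(m *containerMetrics) {
    if m.CPUReserved > 2 {
        m.CPUUtilized = m.CPUUtilized / m.CPUReserved
    }
}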


To clarify this comment: there are two types of capacity providers available for ECS, and they have different requirements for CPU configuration:

  1. FARGATE - Task level CPU is required, Container level CPU is optional
  2. Non-Fargate (EC2 or External) - Both settings are optional

So the following combinations are possible.
FARGATE:

  1. Task level CPU is set, Container level CPU is not set (defaults to 2)
  2. Task level CPU is set, Container level CPU is set

EC2:

  1. Task level CPU is not set (the Task can consume all CPU on the EC2 instance), Container level CPU is not set. What would the calculation use in this case, given Limits like the following? (see the sketch after this list)

{
    ...
    "Limits": {
        "Memory": 3584
    },
    ...
}

  2. Task level CPU is set, Container level CPU is not set
  3. Task level CPU is set, Container level CPU is set
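
For the first EC2 combination above, the task metadata only contains a Memory limit, so below is a hedged sketch of how one might at least detect the "no CPU limit anywhere" case. The limits struct and field layout are my assumption based on the metadata snippet above, not the receiver's actual types.

package main

import (
    "encoding/json"
    "fmt"
)

// limits models just the "Limits" object from the task metadata snippet above;
// CPU is a pointer so that an absent limit is distinguishable from 0.
type limits struct {
    CPU    *float64 `json:"CPU,omitempty"`
    Memory *int64   `json:"Memory,omitempty"`
}

func main() {
    raw := []byte(`{"Memory": 3584}`) // EC2 task with no CPU limit at either level

    var l limits
    if err := json.Unmarshal(raw, &l); err != nil {
        panic(err)
    }
    if l.CPU == nil {
        fmt.Println("no CPU limit set; there is no obvious denominator for cpu.utilized")
    }
}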

Thank you.

@bryan-aguilar
Contributor

We're going to take a deeper look at this and will get back to this issue when we have an update. Thanks for bringing our attention back to this!

@imishchuk-carbon

Looking forward to updates.
Thank you.

@github-actions
Contributor

github-actions bot commented Jan 1, 2023

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the stale label Jan 1, 2023
@github-actions
Contributor

github-actions bot commented Feb 5, 2023

This issue was closed because it has been marked as stale for 30 days with no activity.

@github-actions github-actions bot closed this as not planned Feb 5, 2023
@benabineri

What this metric represents is unclear and not explained in the docs. Could this issue be re-opened, please?

@bmbferreira

Can we please reopen this ticket? The docs are misleading. Here we can see that the unit is listed as None for both ecs.task.cpu.utilized and container.cpu.utilized: https://aws-otel.github.io/docs/components/ecs-metrics-receiver
[screenshot]

However, this is false, because as is pointed out here in this comment and here, these are PERCENTAGES!
[screenshot]

I just lost a couple of hours trying to make sense of these values... Please update the AWS Distro page with the right units!

@humivo
Contributor

humivo commented Aug 29, 2023

I have updated the AWS Distro page with the correct units for those metrics. Is there any more explanation needed here? If not, we can close the issue.

@humivo
Contributor

humivo commented Sep 15, 2023

Closing this issue now that the metric unit has been clarified and there are no other questions.

@humivo humivo closed this as completed Sep 15, 2023