
ECS container.cpu.utilized metric unit clarification #1368

Closed
tomiszili opened this issue Jul 22, 2022 · 14 comments

Comments

@tomiszili

Hello!

What is the unit of the ECS container.cpu.utilized metric?
Please help me understand the unit of the container.cpu.utilized metric, because it does not align with the CPUUtilization metric of the EC2 instance.
[screenshot]

@github-actions
Contributor

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the stale label Sep 25, 2022
@imishchuk-carbon

Hello, team.

We've been trying to figure out how the container.cpu.utilized metric is calculated, but with no luck.

Platform: AWS ECS (Fargate or EC2; observed on both capacity providers)
ADOT version: v0.22.0

We have a few observations, shared below:

Initially, we set the CPU value at the Task level only, leaving the container-level CPU at 0:

{
    ...
    "containerDefinitions": [
      {
        "cpu": 0,
        "name": "app"
      }
    ],
    
    "requiresCompatibilities": [
      "FARGATE"
    ],
    "cpu": "8192",
    ...
}

A container-level CPU value of 0 gets converted to 2 in the background before being passed to docker run --cpu-shares. Explanations of this behavior are here and here.
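
As an illustration only (this is not the ECS agent's actual code, and effectiveCPUShares is a made-up name), the substitution described above behaves roughly like this:

package main

import "fmt"

// effectiveCPUShares sketches the behavior described above: the Linux CFS
// cpu.shares value has a minimum of 2, so a container-level "cpu" of 0 (or 1)
// shows up as 2 by the time it reaches docker run --cpu-shares, and that is
// the value container.cpu.reserved ends up reporting.
func effectiveCPUShares(configured int64) int64 {
    if configured < 2 {
        return 2
    }
    return configured
}

func main() {
    fmt.Println(effectiveCPUShares(0))    // 2  (container-level cpu left at 0)
    fmt.Println(effectiveCPUShares(7168)) // 7168
}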

This can be confirmed by graphing the container.cpu.reserved metric for a Task that has container-level CPU set to 0:
[screenshot]

And in that case container.cpu.utilized looks sane
[screenshot]

Next, we set the container-level CPU to a non-zero value, like:

{
    ...
    "containerDefinitions": [
      {
        "cpu": 7168,
        "name": "app"
      }
    ],
    
    "requiresCompatibilities": [
      "FARGATE"
    ],
    "cpu": "8192",
    ...
}

This is correctly reflected in the container.cpu.reserved metric:
[screenshot]

But container.cpu.utilized gets messed up
[screenshot]

Questions:

  • How is container.cpu.utilized related to container.cpu.reserved?
  • Which CloudWatch metrics are used for container.cpu.utilized?

Thank you.

@bryan-aguilar
Contributor

@imishchuk-carbon can you share the collector config you are using?

@imishchuk-carbon

Hey @bryan-aguilar
Thanks for looking into this.
Config below

extensions:
  health_check:
  sigv4auth:
    region: "us-east-1"

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  awsxray:
    endpoint: 0.0.0.0:2000
    transport: udp
  awsecscontainermetrics:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-envoy-eg
          scrape_interval: 5s
          metrics_path: /stats/prometheus
          static_configs:
            - targets: ["localhost:9901"]
              labels:
                __ecs_container_metadata_uri: ${ECS_CONTAINER_METADATA_URI}
          relabel_configs:
            - source_labels: [__ecs_container_metadata_uri]
              target_label: ecs_task_id
              regex: '.*?/([a-z0-9]+)-\d+$'
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 20
    spike_limit_percentage: 15
  batch/traces:
    timeout: 5s
    send_batch_size: 50
  batch/metrics:
    timeout: 60s
  resourcedetection:
    detectors:
      - env
      - system
      - ecs
      - ec2

exporters:
  otlphttp:
    endpoint: "${OTEL_COLLECTOR_ENDPOINT}"
  awsemf:
    namespace: ECS/AWSOTel/Application
    log_group_name: '/aws/ecs/application/metrics'
    region: "${OTEL_EXPORT_AMP_REGION}"
  # AWS Managed Prometheus Collector configuration
  prometheusremotewrite:
    endpoint: "${OTEL_EXPORTER_AMP_ENDPOINT}"
    auth:
      authenticator: sigv4auth
    resource_to_telemetry_conversion:
      enabled: true
    remote_write_queue:
      enabled: true
      num_consumers: 1
      queue_size: 5000
  logging:
    loglevel: debug


service:
  pipelines:
    traces:
      receivers: [otlp, awsxray]
      processors: [memory_limiter, resourcedetection, batch/traces]
      exporters: [otlphttp]
    metrics/application:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, batch/metrics]
      exporters: [prometheusremotewrite]
    metrics/envoy:
      receivers: [prometheus]
      processors: [memory_limiter, batch/metrics]
      exporters: [prometheusremotewrite]
    metrics:
      receivers: [awsecscontainermetrics]
      processors: [memory_limiter, batch/metrics]
      exporters: [otlphttp]

  extensions: [sigv4auth,health_check]

@bryan-aguilar
Contributor

CPU utilization is calculated as Container CPU Usage / Container CPU Reserved. You can see that in the receiver code here.
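
To make the consequence of that division concrete, here is a minimal sketch that mirrors it. The containerMetrics stand-in struct, the normalize helper, and the starting value of 50 are hypothetical; the absolute units depend on how CPUUtilized is populated before the division, which this sketch does not reproduce.

package main

import "fmt"

// containerMetrics is a stand-in for the receiver's internal struct; only the
// two fields relevant to the division are modeled here.
type containerMetrics struct {
    CPUUtilized float64 // assumed to be pre-populated from container CPU usage
    CPUReserved float64 // container-level CPU limit from the task metadata
}

// normalize mirrors the quoted snippet below: divide the pre-populated
// utilization by whatever CPUReserved happens to be.
func normalize(m *containerMetrics) {
    if m.CPUReserved > 0 {
        m.CPUUtilized = m.CPUUtilized / m.CPUReserved
    }
}

func main() {
    const usage = 50.0 // hypothetical pre-division value

    unset := containerMetrics{CPUUtilized: usage, CPUReserved: 2}       // container-level cpu not set
    explicit := containerMetrics{CPUUtilized: usage, CPUReserved: 7168} // container-level cpu set

    normalize(&unset)
    normalize(&explicit)

    // The same usage lands on wildly different scales depending on the
    // container-level CPU setting, which is the behavior discussed below.
    fmt.Println(unset.CPUUtilized)    // 25
    fmt.Println(explicit.CPUUtilized) // ~0.00698
}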

@imishchuk-carbon

imishchuk-carbon commented Oct 28, 2022

if containerMetrics.CPUReserved > 0 {
    containerMetrics.CPUUtilized = (containerMetrics.CPUUtilized / containerMetrics.CPUReserved)
}

Okay, can you help me understand the logic behind this calculation, please? Why is containerMetrics.CPUUtilized redefined as its own value divided by containerMetrics.CPUReserved?

containerMetrics.CPUReserved is never 0; the minimum it gets is 2. And when it is set to 2, it means the container does not have a guaranteed CPU share in the Task.
E.g.

{
    ...
    "containerDefinitions": [
      {
        "cpu": 0,
        "name": "app"
      }
    ],
    
    "requiresCompatibilities": [
      "FARGATE"
    ],
    "cpu": "2048",
    ...
}

Container metadata

curl -s  $ECS_CONTAINER_METADATA_URI_V4/task | jq  -rc '.Containers[] | "\(.Name): \(.Limits)"'
firelens: {"CPU":2}
app: {"CPU":2}
otel-collector: {"CPU":2}
envoy: {"CPU":2}

[screenshot]

In light of the above, I think the comparison if containerMetrics.CPUReserved > 0 should be changed to if containerMetrics.CPUReserved > 2, because 2 (and null, 0, 1) is a special case.
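
A sketch of what that suggested change might look like, reusing the hypothetical containerMetrics struct from the earlier sketch (this is an illustration of the suggestion, not a patch against the receiver):

// Skip the division when CPUReserved is 2, i.e. the placeholder value that
// means "no container-level CPU reservation", rather than only when it is 0.
func normalizeCPUUtilized(m *containerMetrics) {
    if m.CPUReserved > 2 {
        m.CPUUtilized = m.CPUUtilized / m.CPUReserved
    }
}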


To clarify this comment: there are two types of capacity providers available for ECS, and they have different requirements for CPU configuration:

  1. FARGATE - Task level CPU is required, Container level CPU is optional
  2. Non-Fargate (EC2 or External) - Both settings are optional

So the following combinations are possible.
FARGATE:

  1. Task level CPU is set, Container level CPU is not set (defaults to 2)
  2. Task level CPU is set, Container level CPU is set

EC2:

  1. Task level CPU is not set (the Task can consume all CPU on the EC2 instance), Container level CPU is not set. What would the calculation use in this case, given Limits like the following? (see the sketch after this list)

{
    ...
    "Limits": {
        "Memory": 3584
    },
    ...
}

  2. Task level CPU is set, Container level CPU is not set
  3. Task level CPU is set, Container level CPU is set
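
For the first EC2 combination above, the task metadata only contains a Memory limit, so below is a hedged sketch of how one might at least detect the "no CPU limit anywhere" case. The limits struct and field layout are my assumption based on the metadata snippet above, not the receiver's actual types.

package main

import (
    "encoding/json"
    "fmt"
)

// limits models just the "Limits" object from the task metadata snippet above;
// CPU is a pointer so that an absent limit is distinguishable from 0.
type limits struct {
    CPU    *float64 `json:"CPU,omitempty"`
    Memory *int64   `json:"Memory,omitempty"`
}

func main() {
    raw := []byte(`{"Memory": 3584}`) // EC2 task with no CPU limit at either level

    var l limits
    if err := json.Unmarshal(raw, &l); err != nil {
        panic(err)
    }
    if l.CPU == nil {
        fmt.Println("no CPU limit set; there is no obvious denominator for cpu.utilized")
    }
}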

Thank you.

@bryan-aguilar
Contributor

We're going to take a deeper look at this and will get back to this issue when we have an update. Thanks for bringing our attention back to this!

@imishchuk-carbon

Looking forward to updates.
Thank you.

@github-actions
Contributor

github-actions bot commented Jan 1, 2023

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 30 days.

@github-actions github-actions bot added the stale label Jan 1, 2023
@github-actions
Contributor

github-actions bot commented Feb 5, 2023

This issue was closed because it has been marked as stale for 30 days with no activity.

@github-actions github-actions bot closed this as not planned Feb 5, 2023
@benabineri

What this metric represents is unclear and not explained in the docs. Could this issue be re-opened, please?

@bmbferreira

Can we please reopen this ticket? The docs are misleading. Here we can see that the unit is listed as None for both ecs.task.cpu.utilized and container.cpu.utilized: https://aws-otel.github.io/docs/components/ecs-metrics-receiver
[screenshot]

However, this is false, because as is pointed out here in this comment and here, these are PERCENTAGES!
[screenshot]

I just lost a couple of hours trying to make sense of these values... Please update the AWS Distro page with the right units!

@humivo
Contributor

humivo commented Aug 29, 2023

I have updated the AWS Distro page with the correct units for those metrics. Is there any more explanation needed here? If not, we can close the issue.

@humivo
Contributor

humivo commented Sep 15, 2023

Closing this issue now that the metric unit has been clarified and there are no other questions.

@humivo humivo closed this as completed Sep 15, 2023