
Support AWS cloudwatch container insight metrics for EKS/ECS clusters #2307

Closed
pxaws opened this issue Feb 9, 2021 · 4 comments
Labels
comp:aws AWS components comp:aws-cw AWS CW related issues

Comments

@pxaws
Contributor

pxaws commented Feb 9, 2021

Background
CloudWatch Container Insights is an AWS monitoring solution for EKS and ECS clusters. It can collect, aggregate, and summarize metrics and logs from containerized applications and microservices. Currently the metrics and logs are collected by the CloudWatch agent running as a DaemonSet. We want to migrate to the OpenTelemetry Collector instead and achieve feature parity.

Issues:
Container Insights generates a list of metrics for both EKS and ECS clusters: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-metrics.html. With the OpenTelemetry Collector, we want to generate the same set of metrics with the same dimensions. So we examined the metrics provided by the existing receivers in OpenTelemetry, kubeletstatsreceiver and k8sclusterreceiver, and found that they don't satisfy the needs of Container Insights.

The major issue is that some metrics required by Container Insights are not available from the existing receivers. The CloudWatch agent embeds cAdvisor and uses it to collect a rich set of metrics. The kubeletstatsreceiver currently gets metrics from the kubelet stats endpoint rather than from the kubelet cAdvisor endpoint /metrics/cadvisor. This leaves out some metrics required by Container Insights (e.g. node_cpu_usage_total, pod_network_rx_bytes, pod_network_tx_bytes, ...). Even if kubeletstatsreceiver began to support the kubelet cAdvisor endpoint, that would not cover our use case. As far as I know, the metrics that the kubelet exposes on the cAdvisor endpoint are taken from the cAdvisor Prometheus collector (see https://github.com/kubernetes/kubernetes/blob/release-1.18/pkg/kubelet/server/server.go#L334-L345 and https://github.com/kubernetes/kubernetes/blob/release-1.18/vendor/github.com/google/cadvisor/metrics/prometheus.go#L151-L1084). Unfortunately, Container Insights uses more metrics than those defined in the cAdvisor Prometheus collector (for example, container_memory_hierarchical_pgfault, container_memory_hierarchical_pgmajfault, node_diskio_io_service_bytes_read, ...).
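For illustration, here is a minimal sketch of scraping the kubelet cAdvisor endpoint directly (the NODE_IP environment variable, port 10250, and the service-account token path are assumptions for this example, not part of any existing receiver); even this approach only yields the series defined in the cAdvisor Prometheus collector:

```go
// Minimal sketch: fetch the kubelet cAdvisor endpoint for the local node.
// NODE_IP is assumed to be injected via the downward API.
package main

import (
	"crypto/tls"
	"fmt"
	"io"
	"net/http"
	"os"
)

func main() {
	nodeIP := os.Getenv("NODE_IP")
	token, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		panic(err)
	}

	// TLS verification is skipped only to keep the example short; a real
	// receiver would verify the kubelet certificate.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	req, err := http.NewRequest("GET", fmt.Sprintf("https://%s:10250/metrics/cadvisor", nodeIP), nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+string(token))

	resp, err := client.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// The response is Prometheus text exposition format, limited to the
	// series defined in cAdvisor's Prometheus collector.
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("%.300s\n", body)
}
```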

A second issue is about ECS. ECS clusters don't provide an endpoint like the kubelet API endpoint, so the kubeletstatsreceiver won't work there. Since we want to continue to support existing Container Insights users on ECS clusters, we have to develop our own receiver based on cAdvisor.

An additional concern with the existing receivers like kubeletstatsreceiver is that they all rely on kubelet endpoints, which could limit the extensibility of our Container Insights support. What if we want to use metrics that are not provided by the kubelet?

Proposal
We (the AWS CloudWatch agent team) therefore want to develop our own receiver, awscontainerinsightreceiver, which embeds the cAdvisor library (as we did for the CloudWatch agent), and contribute it to the OpenTelemetry project so that existing Container Insights users can migrate smoothly. This receiver needs to be deployed as a DaemonSet, and each receiver instance is responsible for collecting the relevant metrics for its node; a high-level sketch of this collection loop follows below. We might also need to develop a processor to decorate existing metrics and do some computation to generate new metrics (if that logic is not suitable for the k8sprocessor).
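The sketch below shows the collection loop we have in mind at a very high level; all names here (nodeCollector, metricDecorator, rawStat, decoratedMetric) are hypothetical placeholders for illustration only, not the actual awscontainerinsightreceiver API:

```go
// High-level sketch of the proposed per-node receiver. One instance runs per
// node because the receiver is deployed as a DaemonSet.
package containerinsight

import (
	"context"
	"log"
	"time"
)

// rawStat and decoratedMetric are illustrative stand-ins for cAdvisor stats
// and the metrics ultimately emitted to the pipeline.
type rawStat struct {
	Name  string
	Value float64
}

type decoratedMetric struct {
	Name       string
	Value      float64
	Dimensions map[string]string
}

// nodeCollector stands in for the embedded cAdvisor library: it returns raw
// per-container and per-node stats for the node this pod runs on.
type nodeCollector interface {
	Collect(ctx context.Context) ([]rawStat, error)
}

// metricDecorator stands in for the decoration/computation step: it adds
// dimensions (cluster, node, namespace, pod) and derives new metrics from the
// raw counters.
type metricDecorator interface {
	Decorate(stats []rawStat) []decoratedMetric
}

// run collects on a fixed interval and hands the decorated metrics to the
// next consumer in the pipeline via emit.
func run(ctx context.Context, c nodeCollector, d metricDecorator, emit func([]decoratedMetric)) {
	ticker := time.NewTicker(60 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			stats, err := c.Collect(ctx)
			if err != nil {
				log.Printf("collect failed: %v", err)
				continue
			}
			emit(d.Decorate(stats))
		}
	}
}
```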

Please comment if you have any suggestions. Thank you!

kisieland referenced this issue in kisieland/opentelemetry-collector-contrib Mar 16, 2021
@dashpole
Contributor

Some notes from the meeting earlier today:

  • A generic cAdvisor receiver would be useful to have, for example to enable metrics that Kubernetes has disabled.
  • It should be possible to have a "wrapper" around it to add the AWS-specific bits that you need.
  • Note: running an extra instance of cAdvisor (the one in the kubelet plus this one) is expensive and duplicates work.

@pxaws
Contributor Author

pxaws commented Apr 21, 2021

We have temporarily hosted the relevant code here: https://github.com/aws-observability/aws-otel-collector/tree/container-insight-backup/internal and will work on migrating it to opentelemetry-collector-contrib.

tigrannajaryan pushed a commit that referenced this issue Apr 30, 2021
Add constants and utils functions for aws container insights
* define constants for all metrics
* define units for the metrics
* add utils functions to convert metrics to OpenTelemetry metrics

This PR is part of our effort to migrate the [code for aws container insights](https://github.com/aws-observability/aws-otel-collector/tree/container-insight-backup/internal) upstream. More PRs will come along the way.

**Link to tracking Issue:**
#2307

**Testing:** 
Unit tests
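For a flavor of what that PR contains, here is a minimal sketch of metric-name constants, a unit lookup, and a helper function; the package name, identifiers, and units below are illustrative assumptions and may differ from the actual code:

```go
// Illustrative sketch of metric-name constants and unit lookup for the
// container insights components; names and units are assumptions.
package containerinsightscommon

// Metric names shared across the container insights receiver and exporter.
const (
	NodeCPUUsageTotal = "node_cpu_usage_total"
	PodNetworkRxBytes = "pod_network_rx_bytes"
	PodNetworkTxBytes = "pod_network_tx_bytes"
)

// metricToUnit maps each metric name to the unit attached to the emitted
// OpenTelemetry metric.
var metricToUnit = map[string]string{
	NodeCPUUsageTotal: "Millicore",
	PodNetworkRxBytes: "Bytes/Second",
	PodNetworkTxBytes: "Bytes/Second",
}

// MetricUnit returns the unit for a metric name, or an empty string when no
// unit is defined for it.
func MetricUnit(name string) string {
	return metricToUnit[name]
}
```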
tigrannajaryan pushed a commit that referenced this issue Jun 8, 2021
Add `k8sapiserver` component to collect cluster-level metrics from k8s api server:
* To guarantee that only one copy of cluster-level metrics is generated per cluster, we utilize the leader election API provided by `kubernetes/client-go`. A dedicated ConfigMap is used as the lock resource. Multiple receivers try to acquire the lock, and only the one that succeeds generates cluster-level metrics.

**Link to tracking Issue:** 
#2307
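For context, here is a minimal sketch of the client-go leader-election pattern the component relies on; the namespace, lock name, lock type, and timings below are assumptions for illustration and may not match the component's actual configuration:

```go
// Sketch: elect a single leader per cluster using a ConfigMap lock so that
// only one collector pod emits cluster-level metrics.
package main

import (
	"context"
	"log"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// One lock object per cluster; every collector pod competes for it.
	lock, err := resourcelock.New(
		resourcelock.ConfigMapsResourceLock, // newer client-go versions favor Lease locks
		"amazon-cloudwatch",                 // assumed namespace
		"otel-container-insight-clusterleader", // assumed lock name
		client.CoreV1(),
		client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: os.Getenv("HOSTNAME")},
	)
	if err != nil {
		log.Fatal(err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Only the current leader reaches this callback: collect
				// cluster-level metrics from the Kubernetes API server here.
				log.Println("became leader; collecting cluster-level metrics")
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				log.Println("lost leadership; stop emitting cluster-level metrics")
			},
		},
	})
}
```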
@alolita alolita added the comp:aws AWS components label Sep 2, 2021
@alolita alolita added the comp:aws-cw AWS CW related issues label Sep 30, 2021
@sethAmazon
Contributor

@pxaws is this completed?

@pxaws
Contributor Author

pxaws commented Jan 6, 2022

It is completed. Let's close it.
