There is a pre-built monitoring solution which you can deploy to your New Relic account via Terraform. Before deploying it, it is important to understand how the necessary telemetry data is collected! A detailed explanation of how this solution accomplishes that can be found here.
Moreover, the solution provides you with a cost analysis out of the box! You can see how much money your individual workloads are costing and how much money you are losing by not utilizing your resources well. Go check out the documentation!
Based on the explanation in the documentation above, the corresponding dashboards and alerts are implemented as a Terraform deployment. To deploy it, please refer to this documentation.
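If you want a rough idea of what such a deployment needs on the Terraform side, the sketch below shows a minimal provider setup. The variable names and the region value are illustrative assumptions and not necessarily the ones used by this repository; the deployment documentation linked above describes the actual inputs.

```hcl
terraform {
  required_providers {
    newrelic = {
      source  = "newrelic/newrelic"
      version = ">= 3.0"
    }
  }
}

# Hypothetical input variables; the actual deployment defines its own.
variable "new_relic_account_id" {
  type = number
}

variable "new_relic_api_key" {
  type      = string
  sensitive = true
}

provider "newrelic" {
  account_id = var.new_relic_account_id
  api_key    = var.new_relic_api_key # User API key (NRAK-...)
  region     = "US"                  # or "EU", depending on your account
}
```

With the provider configured, the usual `terraform init`, `terraform plan` and `terraform apply` cycle creates the resources listed below.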
The Terraform deployment will create the following New Relic resources for you (a minimal dashboard sketch in Terraform follows the list):
- Node capacities
- Node to pod mapping
- Namespaces & pods per node
- CPU, MEM & STO usage/utilization per node
- Deployments, statefulsets & daemonsets
- Pods with running, pending, failed & unknown statuses
- Namespaces & pods per node
- CPU & MEM usage/utilization per namespace
- Containers & their statuses
- Pods with running, pending, failed & unknown statuses
- CPU & MEM usage/utilization per pod/container
- Filesystem read/write per pod/container
- Network receive/transmit per pod/container
- Collector node capacities & statuses
- Pods with running, pending, failed & unknown statuses
- CPU & MEM usage/utilization per collector instance
- Ratio of queue size to capacity per collector instance
- Dropped telemetry data per collector instance
- Failed receive/enqueue/export per collector instance
- Collector node capacities & statuses
- Pods with running, pending, failed & unknown statuses
- CPU & MEM usage/utilization
- Response latency
- Throughput per status & request type
- Workqueue
- Collector node capacities & statuses
- Pods with running, pending, failed & unknown statuses
- CPU & MEM usage/utilization
- Response latency
- Throughput per IP type & rcode
- Rate of panics & cache hits
- Ingest per telemetry type
- Ingest of Prometheus scraping
  - per job
  - per collector type
- Node capacities
- Node costs
- Money lost due to underutilized nodes
- Namespace costs
- Money lost due to underutilized namespaces
- Pod costs
- Money lost due to underutilized pods
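To give a concrete feel for how these dashboards are expressed in Terraform, here is a minimal, hedged sketch of a single-page dashboard with one widget. The dashboard name, the layout and especially the metric and attribute names (`k8s.node.cpu.utilization`, `k8s.node.name`) are assumptions for illustration; the actual queries depend on how your collectors report node metrics.

```hcl
# Minimal sketch, not the full dashboard set created by this deployment.
# The metric and attribute names are assumptions and depend on your
# collector configuration.
resource "newrelic_one_dashboard" "k8s_nodes_sketch" {
  name = "Kubernetes Nodes (sketch)"

  page {
    name = "Nodes"

    widget_line {
      title  = "CPU utilization per node"
      row    = 1
      column = 1
      width  = 6
      height = 3

      nrql_query {
        query = "FROM Metric SELECT latest(k8s.node.cpu.utilization) FACET k8s.node.name TIMESERIES AUTO"
      }
    }
  }
}
```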
The alerts have predefined thresholds. If those are not applicable to your use cases, feel free to adapt them accordingly (see the sketch after the list below)!
- Status per instance remaining unhealthy for a certain amount of time
- CPU utilization per instance exceeding a certain limit for a certain amount of time
- Memory utilization per instance exceeding a certain limit for a certain amount of time
- Storage utilization per instance exceeding a certain limit for a certain amount of time
- Status per instance remaining unhealthy for a certain amount of time
- CPU utilization per instance exceeding a certain limit for a certain amount of time
- Memory utilization per instance exceeding a certain limit for a certain amount of time
- CPU utilization per instance exceeding a certain limit for a certain amount of time
- Memory utilization per instance exceeding a certain limit for a certain amount of time
- Queue utilization per instance exceeding a certain limit for a certain amount of time
- Dropped metrics/spans/logs per instance at least once
- Enqueue failures for metrics/spans/logs per instance at least once
- Receive failures for metrics/spans/logs per instance at least once
- Export failures for metrics/spans/logs per instance at least once
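As an example of how a threshold can be made adaptable, the sketch below wires one into a variable. The policy name, the NRQL query and the default values are illustrative assumptions and do not mirror the exact conditions shipped with the deployment.

```hcl
# Sketch of an adaptable alert condition; the query and thresholds are
# illustrative assumptions, not the shipped defaults.
variable "node_cpu_utilization_threshold" {
  type    = number
  default = 0.9 # assuming the metric is reported as a 0-1 ratio
}

resource "newrelic_alert_policy" "k8s_sketch" {
  name = "Kubernetes alerts (sketch)"
}

resource "newrelic_nrql_alert_condition" "node_cpu_utilization" {
  policy_id = newrelic_alert_policy.k8s_sketch.id
  name      = "Node CPU utilization too high"
  type      = "static"

  nrql {
    query = "FROM Metric SELECT average(k8s.node.cpu.utilization) FACET k8s.node.name"
  }

  critical {
    operator              = "above"
    threshold             = var.node_cpu_utilization_threshold
    threshold_duration    = 300 # seconds of sustained breach before opening an incident
    threshold_occurrences = "all"
  }

  violation_time_limit_seconds = 3600
}
```

Overriding the variable (for example with `-var node_cpu_utilization_threshold=0.8`) is then enough to tune the alert without touching the condition itself.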