
Monitoring

There is pre-built monitoring which you can deploy to your New Relic account via Terraform. Before you do, it is important to understand how the necessary telemetry data is collected! A detailed explanation of how this solution accomplishes that can be found here.

Moreover, the solution provides you with a cost analysis out of the box! You can see how much money your individual workloads are costing and how much money you are losing by not utilizing your resources well. Go check out the documentation!

Based on the explanation in the documentation above, the corresponding dashboards and alerts are implemented as a Terraform deployment. In order to deploy it, please refer to this documentation.
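For orientation, here is a minimal sketch of the provider wiring such a deployment needs. The variable names below are illustrative; the actual inputs and module structure are defined in the referenced Terraform documentation.

```hcl
# Minimal sketch of the New Relic provider setup (illustrative variable names).
terraform {
  required_providers {
    newrelic = {
      source = "newrelic/newrelic"
    }
  }
}

variable "new_relic_account_id" {
  type = number
}

variable "new_relic_api_key" {
  type      = string
  sensitive = true
}

provider "newrelic" {
  account_id = var.new_relic_account_id
  api_key    = var.new_relic_api_key # user API key
  region     = "US"                  # or "EU", depending on your account
}
```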

New Relic resources

The Terraform deployment will create the following New Relic resources for you:

Dashboards

Cluster Overview - Nodes

  • Node capacities
  • Node to pod mapping
  • Namespaces & pods per node
  • CPU, MEM & STO usage/utilization per node

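To give an idea of how these dashboards are defined, below is a heavily trimmed, hypothetical version of the nodes dashboard. The real dashboards contain many more widgets, and the metric name in the NRQL query is an assumption that depends on how your collectors are configured.

```hcl
# Hypothetical, trimmed-down dashboard with a single widget.
resource "newrelic_one_dashboard" "cluster_overview_nodes" {
  name = "Cluster Overview - Nodes"

  page {
    name = "Nodes"

    widget_line {
      title  = "CPU usage per node"
      row    = 1
      column = 1
      width  = 6
      height = 3

      nrql_query {
        # Metric name is an assumption (OTel kubeletstats-style naming).
        query = "FROM Metric SELECT average(k8s.node.cpu.usage) FACET k8s.node.name TIMESERIES"
      }
    }
  }
}
```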

Cluster Overview - Namespaces

  • Deployments, statefulsets & daemonsets
  • Pods with running, pending, failed & unknown statuses
  • Namespaces & pods per node
  • CPU & MEM usage/utilization per namespace


Cluster Overview - Pods

  • Containers & their statuses
  • Pods with running, pending, failed & unknown statuses
  • CPU & MEM usage/utilization per pod/container
  • Filesystem read/write per pod/container
  • Network receive/transmit per pod/container


OTel Collectors Overview

  • Collector node capacities & statuses
  • Pods with running, pending, failed & unknown statuses
  • CPU & MEM usage/utilization per collector instance
  • Ratio of queue size to capacity per collector instance
  • Dropped telemetry data per collector instance
  • Failed receive/enqueue/export per collector instance

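As a rough sketch of where these widgets get their numbers from: the collectors are assumed to export their own otelcol_* self-metrics to New Relic, and the queue and failure panels can then be built from NRQL along these lines (the exact metric names depend on how the self-telemetry is ingested).

```hcl
# Hypothetical NRQL behind the queue and enqueue-failure widgets.
locals {
  # Ratio of exporter queue size to queue capacity per collector pod.
  collector_queue_utilization_nrql = <<-EOT
    FROM Metric
    SELECT latest(otelcol_exporter_queue_size) / latest(otelcol_exporter_queue_capacity)
    FACET k8s.pod.name TIMESERIES
  EOT

  # Metric points that failed to be enqueued, per collector pod.
  collector_enqueue_failures_nrql = <<-EOT
    FROM Metric
    SELECT sum(otelcol_exporter_enqueue_failed_metric_points)
    FACET k8s.pod.name TIMESERIES
  EOT
}
```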

Kube API Server Overview

  • Collector node capacities & statuses
  • Pods with running, pending, failed & unknown statuses
  • CPU & MEM usage/utilization
  • Response latency
  • Throughput per status & request type
  • Workqueue


Core DNS Overview

  • Collector node capacities & statuses
  • Pods with running, pending, failed & unknown statuses
  • CPU & MEM usage/utilization
  • Response latency
  • Throughput per IP type & rcode
  • Rate of panics & cache hits


Data Ingest Overview

  • Ingest per telemetry type
  • Ingest of Prometheus scraping
    • per job
    • per collector type

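The ingest numbers can be estimated with NRQL's bytecountestimate() function. A hedged sketch (the shipped dashboard may slice the data differently):

```hcl
# Rough sketch of an ingest estimate per telemetry type.
locals {
  ingest_per_telemetry_type_nrql = <<-EOT
    FROM Metric, Log, Span
    SELECT bytecountestimate() / 10e8 AS 'Ingest (GB)'
    FACET eventType() TIMESERIES
  EOT
}
```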

Cost Analysis - Nodes

  • Node capacities
  • Node costs
  • Money lost due to nodes not being utilized to the fullest

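The idea behind the "lost money" figures can be illustrated with a small, made-up calculation; the prices, utilization values and the way the dashboard actually computes this are not taken from the module.

```hcl
# Illustrative arithmetic only: money lost on a node is roughly its price
# multiplied by the share of capacity that was not used.
locals {
  node_hourly_price    = 0.40 # assumed on-demand price in $/h
  avg_node_utilization = 0.35 # assumed average utilization (0..1)

  lost_money_per_hour = local.node_hourly_price * (1 - local.avg_node_utilization) # = 0.26 $/h
}
```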

Cost Analysis - Namespaces

  • Namespace costs
  • Money lost due to namespaces not being utilized to the fullest


Cost Analysis - Pods

  • Pod costs
  • Money lost due to pods not being utilized to the fullest


Alerts

The alerts have predefined thresholds. If those are not applicable to your use cases, feel free to adapt them accordingly!
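For example, a node CPU alert with an adaptable threshold could look roughly like this; the resource names, metric name and threshold values are illustrative, not the module's actual definitions.

```hcl
# Hypothetical sketch of one alert condition; adjust threshold and duration
# to match your own use cases.
resource "newrelic_alert_policy" "nodes" {
  name = "Kubernetes Nodes"
}

resource "newrelic_nrql_alert_condition" "node_cpu_utilization" {
  policy_id = newrelic_alert_policy.nodes.id
  name      = "Node CPU utilization too high"
  type      = "static"

  nrql {
    # Metric name is an assumption; use whatever your collectors report.
    query = "FROM Metric SELECT average(k8s.node.cpu.utilization) FACET k8s.node.name"
  }

  critical {
    operator              = "above"
    threshold             = 0.9 # 90 %, adapt as needed
    threshold_duration    = 300 # seconds
    threshold_occurrences = "ALL"
  }

  violation_time_limit_seconds = 3600
}
```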

Nodes

  • Status per instance remains unhealthy for a certain amount of time
  • CPU utilization per instance exceeds a certain limit for a certain amount of time
  • Memory utilization per instance exceeds a certain limit for a certain amount of time
  • Storage utilization per instance exceeds a certain limit for a certain amount of time

Pods

  • Status per instance remains unhealthy for a certain amount of time
  • CPU utilization per instance exceeds a certain limit for a certain amount of time
  • Memory utilization per instance exceeds a certain limit for a certain amount of time

OTel Collector

  • CPU utilization per instance exceeds a certain limit for a certain amount of time
  • Memory utilization per instance exceeds a certain limit for a certain amount of time
  • Queue utilization per instance exceeds a certain limit for a certain amount of time
  • Metrics/spans/logs dropped per instance at least once
  • Metrics/spans/logs failed to be enqueued per instance at least once
  • Metrics/spans/logs failed to be received per instance at least once
  • Metrics/spans/logs failed to be exported per instance at least once