There is a pre-built monitoring solution which you can deploy to your New Relic account via Terraform. Before deploying it, it is important to understand how the necessary telemetry data is collected! A detailed explanation of how this solution accomplishes that can be found here.
Moreover, the solution provides you with a cost analysis out of the box! You can see how much money your individual workloads are costing and how much money you are losing by not utilizing your resources well. Go check out the documentation!
Based on the explanation in the documentation above, the corresponding dashboards and alerts are implemented as a Terraform deployment. To deploy it, please refer to this documentation.
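If you want a rough idea of what such a deployment needs on the Terraform side, the sketch below shows a minimal provider setup. The variable names and the region value are illustrative assumptions and not necessarily the ones used by this repository; the deployment documentation linked above describes the actual inputs.

```hcl
terraform {
  required_providers {
    newrelic = {
      source  = "newrelic/newrelic"
      version = ">= 3.0"
    }
  }
}

# Hypothetical input variables; the actual deployment defines its own.
variable "new_relic_account_id" {
  type = number
}

variable "new_relic_api_key" {
  type      = string
  sensitive = true
}

provider "newrelic" {
  account_id = var.new_relic_account_id
  api_key    = var.new_relic_api_key # User API key (NRAK-...)
  region     = "US"                  # or "EU", depending on your account
}
```

With the provider configured, the usual `terraform init`, `terraform plan` and `terraform apply` cycle creates the resources listed below.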
The Terraform deployment will create the following New Relic resources for you (a minimal dashboard sketch in Terraform follows the list):
- Node capacities
- Node to pod mapping
- Namespaces & pods per node
- CPU, MEM & STO usage/utilization per node
- Deployments, statefulsets & daemonsets
- Pods with running, pending, failed & unknown statuses
- Namespaces & pods per node
- CPU & MEM usage/utilization per namespace
- Containers & their statuses
- Pods with running, pending, failed & unknown statuses
- CPU & MEM usage/utilization per pod/container
- Filesystem read/write per pod/container
- Network receive/transmit per pod/container
- Collector node capacities & statuses
- Pods with running, pending, failed & unknown statuses
- CPU & MEM usage/utilization per collector instance
- Ratio of queue size to capacity per collector instance
- Dropped telemetry data per collector instance
- Failed receive/enqueue/export per collector instance
- Collector node capacities & statuses
- Pods with running, pending, failed & unknown statuses
- CPU & MEM usage/utilization
- Response latency
- Throughput per status & request type
- Workqueue
- Collector node capacities & statuses
- Pods with running, pending, failed & unknown statuses
- CPU & MEM usage/utilization
- Response latency
- Throughput per IP type & rcode
- Rate of panics & cache hits
- Ingest per telemetry type
- Ingest of Prometheus scraping
  - per job
  - per collector type
- Node capacities
- Node costs
- Money lost due to underutilized nodes
- Namespace costs
- Money lost due to underutilized namespaces
- Pod costs
- Money lost due to underutilized pods
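To give a concrete feel for how these dashboards are expressed in Terraform, here is a minimal, hedged sketch of a single-page dashboard with one widget. The dashboard name, the layout and especially the metric and attribute names (`k8s.node.cpu.utilization`, `k8s.node.name`) are assumptions for illustration; the actual queries depend on how your collectors report node metrics.

```hcl
# Minimal sketch, not the full dashboard set created by this deployment.
# The metric and attribute names are assumptions and depend on your
# collector configuration.
resource "newrelic_one_dashboard" "k8s_nodes_sketch" {
  name = "Kubernetes Nodes (sketch)"

  page {
    name = "Nodes"

    widget_line {
      title  = "CPU utilization per node"
      row    = 1
      column = 1
      width  = 6
      height = 3

      nrql_query {
        query = "FROM Metric SELECT latest(k8s.node.cpu.utilization) FACET k8s.node.name TIMESERIES AUTO"
      }
    }
  }
}
```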
The alerts have predefined thresholds. If those are not applicable to your use cases, feel free to adapt them accordingly (see the sketch after the list below)!
- Status per instance remaining unhealthy for a certain amount of time
- CPU utilization per instance exceeding a certain limit for a certain amount of time
- Memory utilization per instance exceeding a certain limit for a certain amount of time
- Storage utilization per instance exceeding a certain limit for a certain amount of time
- Status per instance remaining unhealthy for a certain amount of time
- CPU utilization per instance exceeding a certain limit for a certain amount of time
- Memory utilization per instance exceeding a certain limit for a certain amount of time
- CPU utilization per instance exceeding a certain limit for a certain amount of time
- Memory utilization per instance exceeding a certain limit for a certain amount of time
- Queue utilization per instance exceeding a certain limit for a certain amount of time
- Dropped metrics/spans/logs per instance at least once
- Enqueue failures for metrics/spans/logs per instance at least once
- Receive failures for metrics/spans/logs per instance at least once
- Export failures for metrics/spans/logs per instance at least once
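As an example of how a threshold can be made adaptable, the sketch below wires one into a variable. The policy name, the NRQL query and the default values are illustrative assumptions and do not mirror the exact conditions shipped with the deployment.

```hcl
# Sketch of an adaptable alert condition; the query and thresholds are
# illustrative assumptions, not the shipped defaults.
variable "node_cpu_utilization_threshold" {
  type    = number
  default = 0.9 # assuming the metric is reported as a 0-1 ratio
}

resource "newrelic_alert_policy" "k8s_sketch" {
  name = "Kubernetes alerts (sketch)"
}

resource "newrelic_nrql_alert_condition" "node_cpu_utilization" {
  policy_id = newrelic_alert_policy.k8s_sketch.id
  name      = "Node CPU utilization too high"
  type      = "static"

  nrql {
    query = "FROM Metric SELECT average(k8s.node.cpu.utilization) FACET k8s.node.name"
  }

  critical {
    operator              = "above"
    threshold             = var.node_cpu_utilization_threshold
    threshold_duration    = 300 # seconds of sustained breach before opening an incident
    threshold_occurrences = "all"
  }

  violation_time_limit_seconds = 3600
}
```

Overriding the variable (for example with `-var node_cpu_utilization_threshold=0.8`) is then enough to tune the alert without touching the condition itself.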