feat: add new metrics
add info metrics about providers, enterprises, organizations,
repositories and pools.

Also expose most of the configurable pool information as metrics,
e.g. max runners as garm_pool_max_runners.

Signed-off-by: Mario Constanti <mario.constanti@mercedes-benz.com>
bavarianbidi committed Oct 6, 2023
1 parent a48ec0c commit 3cc6056
Showing 10 changed files with 579 additions and 96 deletions.
52 changes: 48 additions & 4 deletions doc/config_metrics.md
@@ -2,11 +2,55 @@

This is one of the features in GARM that I really love having. For one thing, it's community contributed and for another, it really adds value to the project. It allows us to create some pretty nice visualizations of what is happening with GARM.

At the moment there are only three meaningful metrics being collected, besides the default ones that the prometheus golang package enables by default. These are:
## Common metrics

* `garm_health` - This is a gauge that is set to 1 if GARM is healthy and 0 if it is not. This is useful for alerting.
* `garm_runner_status` - This is a gauge value that gives us details about the runners garm spawns
* `garm_webhooks_received` - This is a counter that increments every time GARM receives a webhook from GitHub.
| Metric name | Type | Labels | Description |
|--------------------------|---------|-------------------------------------------------------------------|------------------------------------------------------------------------------------------------------|
| `garm_health` | Gauge | `controller_id`=&lt;controller id&gt; <br>`name`=&lt;hostname&gt; | This is a gauge that is set to 1 if GARM is healthy and 0 if it is not. This is useful for alerting. |
| `garm_webhooks_received` | Counter | `controller_id`=&lt;controller id&gt; <br>`name`=&lt;hostname&gt; | This is a counter that increments every time GARM receives a webhook from GitHub. |

## Enterprise metrics

| Metric name | Type | Labels | Description |
|---------------------------------------|-------|-------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|
| `garm_enterprise_info` | Gauge | `id`=&lt;enterprise id&gt; <br>`name`=&lt;enterprise name&gt; | This is a gauge that is set to 1 and exposes enterprise information |
| `garm_enterprise_pool_manager_status` | Gauge | `id`=&lt;enterprise id&gt; <br>`name`=&lt;enterprise name&gt; <br>`running`=&lt;true\|false&gt; | This is a gauge that is set to 1 if the enterprise pool manager is running and set to 0 if not |

## Organization metrics

| Metric name | Type | Labels | Description |
|-----------------------------------------|-------|-----------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------|
| `garm_organization_info` | Gauge | `id`=&lt;organization id&gt; <br>`name`=&lt;organization name&gt; | This is a gauge that is set to 1 and exposes organization information |
| `garm_organization_pool_manager_status` | Gauge | `id`=&lt;organization id&gt; <br>`name`=&lt;organization name&gt; <br>`running`=&lt;true\|false&gt; | This is a gauge that is set to 1 if the organization pool manager is running and set to 0 if not |

## Repository metrics

| Metric name | Type | Labels | Description |
|---------------------------------------|-------|-------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|
| `garm_repository_info` | Gauge | `id`=&lt;repository id&gt; <br>`name`=&lt;repository name&gt; | This is a gauge that is set to 1 and exposes repository information |
| `garm_repository_pool_manager_status` | Gauge | `id`=&lt;repository id&gt; <br>`name`=&lt;repository name&gt; <br>`running`=&lt;true\|false&gt; | This is a gauge that is set to 1 if the repository pool manager is running and set to 0 if not |

## Provider metrics

| Metric name | Type | Labels | Description |
|----------------------|-------|-------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------|
| `garm_provider_info` | Gauge | `description`=&lt;provider description&gt; <br>`name`=&lt;provider name&gt; <br>`type`=&lt;internal\|external&gt; | This is a gauge that is set to 1 and exposes provider information |

## Pool metrics

| Metric name | Type | Labels | Description |
|-------------------------------|-------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------|
| `garm_pool_info` | Gauge | `flavor`=&lt;flavor&gt; <br>`id`=&lt;pool id&gt; <br>`image`=&lt;image name&gt; <br>`os_arch`=&lt;defined OS arch&gt; <br>`os_type`=&lt;defined OS name&gt; <br>`pool_owner`=&lt;owner name&gt; <br>`pool_type`=&lt;repository\|organization\|enterprise&gt; <br>`prefix`=&lt;prefix&gt; <br>`provider`=&lt;provider name&gt; <br>`tags`=&lt;concatenated list of pool tags&gt; <br> | This is a gauge that is set to 1 and exposes pool information |
| `garm_pool_status` | Gauge | `enabled`=&lt;true\|false&gt; <br>`id`=&lt;pool id&gt; | This is a gauge that is set to 1 if the pool is enabled and set to 0 if not |
| `garm_pool_bootstrap_timeout` | Gauge | `id`=&lt;pool id&gt; | This is a gauge that is set to the pool bootstrap timeout |
| `garm_pool_max_runners` | Gauge | `id`=&lt;pool id&gt; | This is a gauge that is set to the pool max runners |
| `garm_pool_min_idle_runners` | Gauge | `id`=&lt;pool id&gt; | This is a gauge that is set to the pool min idle runners |

## Runner metrics

| Metric name | Type | Labels | Description |
|----------------------|-------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------|
| `garm_runner_status` | Gauge | `controller_id`=&lt;controller id&gt; <br>`hostname`=&lt;hostname&gt; <br>`name`=&lt;runner name&gt; <br>`pool_id`=&lt;pool id&gt; <br>`pool_owner`=&lt;owner name&gt; <br>`pool_type`=&lt;repository\|organization\|enterprise&gt; <br>`provider`=&lt;provider name&gt; <br>`runner_status`=&lt;running\|stopped\|error\|pending_delete\|deleting\|pending_create\|creating\|unknown&gt; <br>`status`=&lt;idle\|pending\|terminated\|installing\|failed\|active&gt; <br> | This is a gauge value that gives us details about the runners GARM spawns |

More metrics will be added in the future.
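
To make the tables above more concrete, here is a minimal sketch of roughly how a single sample from one of these gauges would look in the Prometheus text exposition format when scraped. It uses only the Go standard library and does not call GARM or the Prometheus client. The `formatSample` helper and the `example-pool-id` label value are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// formatSample renders one sample as "name{label="value",...} value".
// Hypothetical helper for illustration only; GARM relies on the
// Prometheus client library to produce the real output.
func formatSample(name string, labels map[string]string, value float64) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // sort label names so the output is deterministic

	// Label values escape backslash, double quote and newline.
	escaper := strings.NewReplacer(`\`, `\\`, `"`, `\"`, "\n", `\n`)

	pairs := make([]string, 0, len(keys))
	for _, k := range keys {
		pairs = append(pairs, fmt.Sprintf(`%s="%s"`, k, escaper.Replace(labels[k])))
	}

	v := strconv.FormatFloat(value, 'g', -1, 64)
	if len(pairs) == 0 {
		return name + " " + v
	}
	return fmt.Sprintf("%s{%s} %s", name, strings.Join(pairs, ","), v)
}

func main() {
	// A value-carrying gauge: the sample value is the configured limit.
	fmt.Println(formatSample("garm_pool_max_runners",
		map[string]string{"id": "example-pool-id"}, 10))

	// A status gauge: the value mirrors the `enabled` label.
	fmt.Println(formatSample("garm_pool_status",
		map[string]string{"enabled": "true", "id": "example-pool-id"}, 1))
}
```

Info-style gauges such as `garm_pool_info` follow the same shape: the value is always 1 and the useful data sits in the labels, so they are typically joined with other series on the `id` label.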

50 changes: 50 additions & 0 deletions metrics/enterprise.go
@@ -0,0 +1,50 @@
package metrics

import (
	"log"
	"strconv"

	"github.com/cloudbase/garm/auth"
	"github.com/prometheus/client_golang/prometheus"
)

// CollectEnterpriseMetric collects the metrics for the enterprise objects
func (c *GarmCollector) CollectEnterpriseMetric(ch chan<- prometheus.Metric, hostname string, controllerID string) {
	ctx := auth.GetAdminContext()

	enterprises, err := c.runner.ListEnterprises(ctx)
	if err != nil {
		log.Printf("listing enterprises: %s", err)
		// continue anyway
	}

	for _, enterprise := range enterprises {
		enterpriseInfo, err := prometheus.NewConstMetric(
			c.enterpriseInfo,
			prometheus.GaugeValue,
			1,
			enterprise.Name, // label: name
			enterprise.ID,   // label: id
		)
		if err != nil {
			log.Printf("cannot collect enterpriseInfo metric: %s", err)
			continue
		}
		ch <- enterpriseInfo

		enterprisePoolManagerStatus, err := prometheus.NewConstMetric(
			c.enterprisePoolManagerStatus,
			prometheus.GaugeValue,
			bool2float64(enterprise.PoolManagerStatus.IsRunning),
			enterprise.Name, // label: name
			enterprise.ID,   // label: id
			strconv.FormatBool(enterprise.PoolManagerStatus.IsRunning), // label: running
		)
		if err != nil {
			log.Printf("cannot collect enterprisePoolManagerStatus metric: %s", err)
			continue
		}
		ch <- enterprisePoolManagerStatus
	}
}
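
The pool manager status gauge above relies on a `bool2float64` helper that is not defined in this file. A minimal sketch of what such a helper might look like follows, assuming it simply maps `true` to 1 and `false` to 0. The real implementation lives elsewhere in the repository and may differ.

```go
package main

import (
	"fmt"
	"strconv"
)

// bool2float64 is an assumed implementation of the helper used by
// CollectEnterpriseMetric: true becomes 1, false becomes 0.
func bool2float64(b bool) float64 {
	if b {
		return 1
	}
	return 0
}

func main() {
	// The gauge value and the `running` label carry the same information,
	// once as a number and once as a string.
	for _, running := range []bool{true, false} {
		fmt.Printf("value=%v running=%q\n", bool2float64(running), strconv.FormatBool(running))
	}
}
```

Because the `running` label changes together with the value, a pool manager that flips state produces a new series rather than updating the old one, which is worth keeping in mind when writing queries against this metric.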
22 changes: 22 additions & 0 deletions metrics/health.go
@@ -0,0 +1,22 @@
package metrics

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
)

// CollectHealthMetric emits the health gauge with a constant value of 1,
// labeled with the hostname and the controller ID.
func (c *GarmCollector) CollectHealthMetric(ch chan<- prometheus.Metric, hostname string, controllerID string) {
	m, err := prometheus.NewConstMetric(
		c.healthMetric,
		prometheus.GaugeValue,
		1,
		hostname,
		controllerID,
	)
	if err != nil {
		log.Printf("error on creating health metric: %s", err)
		return
	}
	ch <- m
}
79 changes: 79 additions & 0 deletions metrics/instance.go
@@ -0,0 +1,79 @@
package metrics

import (
	"log"

	"github.com/cloudbase/garm/auth"
	"github.com/prometheus/client_golang/prometheus"
)

// CollectInstanceMetric collects the metrics for the runner instances
// reflecting the statuses and the pool they belong to.
func (c *GarmCollector) CollectInstanceMetric(ch chan<- prometheus.Metric, hostname string, controllerID string) {
	ctx := auth.GetAdminContext()

	instances, err := c.runner.ListAllInstances(ctx)
	if err != nil {
		log.Printf("cannot collect metrics, listing instances: %s", err)
		return
	}

	pools, err := c.runner.ListAllPools(ctx)
	if err != nil {
		log.Printf("listing pools: %s", err)
		// continue anyway
	}

	type poolInfo struct {
		Name         string
		Type         string
		ProviderName string
	}

	poolNames := make(map[string]poolInfo)
	for _, pool := range pools {
		if pool.EnterpriseName != "" {
			poolNames[pool.ID] = poolInfo{
				Name:         pool.EnterpriseName,
				Type:         string(pool.PoolType()),
				ProviderName: pool.ProviderName,
			}
		} else if pool.OrgName != "" {
			poolNames[pool.ID] = poolInfo{
				Name:         pool.OrgName,
				Type:         string(pool.PoolType()),
				ProviderName: pool.ProviderName,
			}
		} else {
			poolNames[pool.ID] = poolInfo{
				Name:         pool.RepoName,
				Type:         string(pool.PoolType()),
				ProviderName: pool.ProviderName,
			}
		}
	}

	for _, instance := range instances {
		m, err := prometheus.NewConstMetric(
			c.instanceMetric,
			prometheus.GaugeValue,
			1,
			instance.Name,                           // label: name
			string(instance.Status),                 // label: status
			string(instance.RunnerStatus),           // label: runner_status
			poolNames[instance.PoolID].Name,         // label: pool_owner
			poolNames[instance.PoolID].Type,         // label: pool_type
			instance.PoolID,                         // label: pool_id
			hostname,                                // label: hostname
			controllerID,                            // label: controller_id
			poolNames[instance.PoolID].ProviderName, // label: provider
		)
		if err != nil {
			log.Printf("cannot collect runner metric: %s", err)
			continue
		}
		ch <- m
	}
}
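
The three branches that fill `poolNames` above differ only in which owner name they pick. One possible way to make that explicit is to resolve the owner name in a small helper first. The sketch below uses a simplified, hypothetical `pool` struct rather than GARM's real pool type, and `poolOwnerName` is not a function from the repository.

```go
package main

import "fmt"

// pool is a simplified stand-in for GARM's pool type; only the fields
// needed to pick an owner name are modelled here.
type pool struct {
	ID             string
	EnterpriseName string
	OrgName        string
	RepoName       string
}

// poolOwnerName returns the first non-empty owner name, checking the
// enterprise, then the organization, then the repository. This is the
// same precedence as the if/else chain in CollectInstanceMetric.
func poolOwnerName(p pool) string {
	switch {
	case p.EnterpriseName != "":
		return p.EnterpriseName
	case p.OrgName != "":
		return p.OrgName
	default:
		return p.RepoName
	}
}

func main() {
	pools := []pool{
		{ID: "a", EnterpriseName: "example-enterprise"},
		{ID: "b", OrgName: "example-org"},
		{ID: "c", RepoName: "example-owner/example-repo"},
	}

	// Build the pool ID to owner lookup used when labelling runners.
	owners := make(map[string]string, len(pools))
	for _, p := range pools {
		owners[p.ID] = poolOwnerName(p)
	}

	for _, id := range []string{"a", "b", "c"} {
		fmt.Printf("%s -> %s\n", id, owners[id])
	}
}
```

As in the original loop, a runner whose pool ID is missing from the map simply gets an empty owner label, since a Go map lookup returns the zero value for unknown keys.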