add status metrics on prometheus in order to improve plugin and bundle monitoring

Signed-off-by: rafael otero reinert <rafaelreinert@gmail.com>
rafaelreinert committed Feb 8, 2022
1 parent 56558df commit 9f4187d
Showing 14 changed files with 1,017 additions and 38 deletions.
1 change: 1 addition & 0 deletions docs/content/configuration.md
@@ -764,6 +764,7 @@ included in the actual bundle gzipped tarball.
| `status.service` | `string` | Yes | Name of service to use to contact remote server. |
| `status.partition_name` | `string` | No | Path segment to include in status updates. |
| `status.console` | `boolean` | No (default: `false`) | Log the status updates locally to the console. When enabled alongside a remote status update API, the `service` must be configured explicitly because the default `service` selection is disabled. |
| `status.prometheus` | `boolean` | No (default: `false`) | Export the status (bundle and plugin) metrics to Prometheus (see [the monitoring documentation](../monitoring/#prometheus)). When enabled alongside a remote status update API, the `service` must be configured explicitly because the default `service` selection is disabled. |
| `status.plugin` | `string` | No | Use the named plugin for status updates. If this field exists, the other configuration fields are not required. |
| `status.trigger` | `string` (default: `periodic`) | No | Controls how status updates are reported to the remote server. Allowed values are `periodic` and `manual`. |

12 changes: 12 additions & 0 deletions docs/content/management-status.md
@@ -238,3 +238,15 @@ This will dump all status updates to the console. See
> Warning: Status update messages are somewhat infrequent but can be very verbose! The
>`metrics.prometheus` portion of the status update in particular can create a considerable
> amount of log text at info level.

### Prometheus Status Metrics

Prometheus status metrics can be enabled via the `prometheus` config option (see [the configuration documentation](../configuration/#status)).
A minimal configuration that enables them:

```yaml
status:
  prometheus: true
```

When enabled, the OPA instance's Prometheus endpoint exposes the metrics described in [the monitoring documentation](../monitoring/#status-metrics).
79 changes: 48 additions & 31 deletions docs/content/monitoring.md
@@ -42,37 +42,54 @@ scrape_configs:
The Prometheus endpoint exports Go runtime metrics as well as HTTP request latency metrics for all handlers (e.g., `v1/data`).

| Metric name | Metric type | Description |
| --- | --- | --- |
| go_gc_duration_seconds | summary | A summary of the GC invocation durations. |
| go_goroutines | gauge | Number of goroutines that currently exist. |
| go_info | gauge | Information about the Go environment. |
| go_memstats_alloc_bytes | gauge | Number of bytes allocated and still in use. |
| go_memstats_alloc_bytes_total | counter | Total number of bytes allocated, even if freed. |
| go_memstats_buck_hash_sys_bytes | gauge | Number of bytes used by the profiling bucket hash table. |
| go_memstats_frees_total | counter | Total number of frees. |
| go_memstats_gc_cpu_fraction | gauge | The fraction of this program's available CPU time used by the GC since the program started. |
| go_memstats_gc_sys_bytes | gauge | Number of bytes used for garbage collection system metadata. |
| go_memstats_heap_alloc_bytes | gauge | Number of heap bytes allocated and still in use. |
| go_memstats_heap_idle_bytes | gauge | Number of heap bytes waiting to be used. |
| go_memstats_heap_inuse_bytes | gauge | Number of heap bytes that are in use. |
| go_memstats_heap_objects | gauge | Number of allocated objects. |
| go_memstats_heap_released_bytes | gauge | Number of heap bytes released to OS. |
| go_memstats_heap_sys_bytes | gauge | Number of heap bytes obtained from system. |
| go_memstats_last_gc_time_seconds | gauge | Number of seconds since 1970 of last garbage collection. |
| go_memstats_lookups_total | counter | Total number of pointer lookups. |
| go_memstats_mallocs_total | counter | Total number of mallocs. |
| go_memstats_mcache_inuse_bytes | gauge | Number of bytes in use by mcache structures. |
| go_memstats_mcache_sys_bytes | gauge | Number of bytes used for mcache structures obtained from system. |
| go_memstats_mspan_inuse_bytes | gauge | Number of bytes in use by mspan structures. |
| go_memstats_mspan_sys_bytes | gauge | Number of bytes used for mspan structures obtained from system. |
| go_memstats_next_gc_bytes | gauge | Number of heap bytes when next garbage collection will take place. |
| go_memstats_other_sys_bytes | gauge | Number of bytes used for other system allocations. |
| go_memstats_stack_inuse_bytes | gauge | Number of bytes in use by the stack allocator. |
| go_memstats_stack_sys_bytes | gauge | Number of bytes obtained from system for stack allocator. |
| go_memstats_sys_bytes | gauge | Number of bytes obtained from system. |
| go_threads | gauge | Number of OS threads created. |
| http_request_duration_seconds | histogram | A histogram of duration for requests. |
| Metric name | Metric type | Description | Status |
| --- | --- | --- | --- |
| go_gc_duration_seconds | summary | A summary of the GC invocation durations. | STABLE |
| go_goroutines | gauge | Number of goroutines that currently exist. | STABLE |
| go_info | gauge | Information about the Go environment. | STABLE |
| go_memstats_alloc_bytes | gauge | Number of bytes allocated and still in use. | STABLE |
| go_memstats_alloc_bytes_total | counter | Total number of bytes allocated, even if freed. | STABLE |
| go_memstats_buck_hash_sys_bytes | gauge | Number of bytes used by the profiling bucket hash table. | STABLE |
| go_memstats_frees_total | counter | Total number of frees. | STABLE |
| go_memstats_gc_cpu_fraction | gauge | The fraction of this program's available CPU time used by the GC since the program started. | STABLE |
| go_memstats_gc_sys_bytes | gauge | Number of bytes used for garbage collection system metadata. | STABLE |
| go_memstats_heap_alloc_bytes | gauge | Number of heap bytes allocated and still in use. | STABLE |
| go_memstats_heap_idle_bytes | gauge | Number of heap bytes waiting to be used. | STABLE |
| go_memstats_heap_inuse_bytes | gauge | Number of heap bytes that are in use. | STABLE |
| go_memstats_heap_objects | gauge | Number of allocated objects. | STABLE |
| go_memstats_heap_released_bytes | gauge | Number of heap bytes released to OS. | STABLE |
| go_memstats_heap_sys_bytes | gauge | Number of heap bytes obtained from system. | STABLE |
| go_memstats_last_gc_time_seconds | gauge | Number of seconds since 1970 of last garbage collection. | STABLE |
| go_memstats_lookups_total | counter | Total number of pointer lookups. | STABLE |
| go_memstats_mallocs_total | counter | Total number of mallocs. | STABLE |
| go_memstats_mcache_inuse_bytes | gauge | Number of bytes in use by mcache structures. | STABLE |
| go_memstats_mcache_sys_bytes | gauge | Number of bytes used for mcache structures obtained from system. | STABLE |
| go_memstats_mspan_inuse_bytes | gauge | Number of bytes in use by mspan structures. | STABLE |
| go_memstats_mspan_sys_bytes | gauge | Number of bytes used for mspan structures obtained from system. | STABLE |
| go_memstats_next_gc_bytes | gauge | Number of heap bytes when next garbage collection will take place. | STABLE |
| go_memstats_other_sys_bytes | gauge | Number of bytes used for other system allocations. | STABLE |
| go_memstats_stack_inuse_bytes | gauge | Number of bytes in use by the stack allocator. | STABLE |
| go_memstats_stack_sys_bytes | gauge | Number of bytes obtained from system for stack allocator. | STABLE |
| go_memstats_sys_bytes | gauge | Number of bytes obtained from system. | STABLE |
| go_threads | gauge | Number of OS threads created. | STABLE |
| http_request_duration_seconds | histogram | A histogram of duration for requests. | STABLE |


### Status Metrics

When Prometheus is enabled in the status plugin (see [Configuration](../configuration/#status)), the OPA instance's Prometheus endpoint also exposes these metrics:

| Metric name | Metric type | Description | Status |
| --- | --- | --- | --- |
| plugin_status_gauge | gauge | Number of plugins by name and status. | EXPERIMENTAL |
| bundle_loaded_counter | counter | Number of bundles loaded successfully. | EXPERIMENTAL |
| bundle_failed_load_counter | counter | Number of bundles that failed to load. | EXPERIMENTAL |
| last_bundle_request | gauge | Time of the last bundle request, in UNIX nanoseconds. | EXPERIMENTAL |
| last_success_bundle_activation | gauge | Time of the last successful bundle activation, in UNIX nanoseconds. | EXPERIMENTAL |
| last_success_bundle_download | gauge | Time of the last successful bundle download, in UNIX nanoseconds. | EXPERIMENTAL |
| last_success_bundle_request | gauge | Time of the last successful bundle request, in UNIX nanoseconds. | EXPERIMENTAL |
| bundle_loading_duration_ns | histogram | A histogram of duration for bundle loading. | EXPERIMENTAL |


## Health Checks

15 changes: 15 additions & 0 deletions internal/prometheus/prometheus.go
@@ -149,6 +149,21 @@ func (p *Provider) Clear() {
	p.inner.Clear()
}

// Register registers the collector with OPA's Prometheus registry.
func (p *Provider) Register(c prometheus.Collector) error {
	return p.registry.Register(c)
}

// MustRegister registers the collectors with OPA's Prometheus registry and panics if an error occurs.
func (p *Provider) MustRegister(cs ...prometheus.Collector) {
	p.registry.MustRegister(cs...)
}

// Unregister removes the collector from OPA's Prometheus registry.
func (p *Provider) Unregister(c prometheus.Collector) bool {
	return p.registry.Unregister(c)
}

type captureStatusResponseWriter struct {
http.ResponseWriter
status int
15 changes: 15 additions & 0 deletions plugins/plugins.go
@@ -11,6 +11,8 @@ import (
"sync"
"time"

"github.com/prometheus/client_golang/prometheus"

"github.com/gorilla/mux"
"github.com/open-policy-agent/opa/ast"
"github.com/open-policy-agent/opa/bundle"
@@ -187,6 +189,7 @@ type Manager struct {
	printHook print.Hook
	enablePrintStatements bool
	router *mux.Router
	prometheusRegister prometheus.Registerer
}

type managerContextKey string
@@ -344,6 +347,13 @@ func WithRouter(r *mux.Router) func(*Manager) {
}
}

// WithPrometheusRegister sets the passed prometheus.Registerer to be used by plugins
func WithPrometheusRegister(prometheusRegister prometheus.Registerer) func(*Manager) {
	return func(m *Manager) {
		m.prometheusRegister = prometheusRegister
	}
}

// New creates a new Manager using config.
func New(raw []byte, id string, store storage.Store, opts ...func(*Manager)) (*Manager, error) {

@@ -893,3 +903,8 @@ func (m *Manager) RegisterCacheTrigger(trigger func(*cache.Config)) {
	defer m.mtx.Unlock()
	m.registeredCacheTriggers = append(m.registeredCacheTriggers, trigger)
}

// PrometheusRegister gets the prometheus.Registerer for this plugin manager.
func (m *Manager) PrometheusRegister() prometheus.Registerer {
	return m.prometheusRegister
}
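
`WithPrometheusRegister` follows the functional-option pattern used by the other `With...` helpers on `Manager`. The stdlib-only sketch below shows that pattern in isolation. The names `registerer`, `memRegisterer`, `manager`, `withRegisterer`, and `newManager` are hypothetical stand-ins rather than OPA's types. It also illustrates why callers check the getter's result for nil: when the option is not passed, the field keeps its zero value.

```go
package main

import "fmt"

// registerer is a stand-in for prometheus.Registerer (illustrative only).
type registerer interface {
	Register(name string) error
}

// memRegisterer records registered names in memory.
type memRegisterer struct{ names []string }

func (m *memRegisterer) Register(name string) error {
	m.names = append(m.names, name)
	return nil
}

// manager is a stand-in for plugins.Manager.
type manager struct {
	reg registerer
}

// withRegisterer returns an option that stores r on the manager.
func withRegisterer(r registerer) func(*manager) {
	return func(m *manager) { m.reg = r }
}

// newManager applies each option, in order, to a fresh manager.
func newManager(opts ...func(*manager)) *manager {
	m := &manager{}
	for _, opt := range opts {
		opt(m)
	}
	return m
}

func main() {
	// Without the option the registerer stays nil, so callers must check.
	fmt.Println(newManager().reg == nil)

	reg := &memRegisterer{}
	m := newManager(withRegisterer(reg))
	if m.reg != nil {
		_ = m.reg.Register("plugin_status_gauge")
	}
	fmt.Println(reg.names)
}
```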
38 changes: 38 additions & 0 deletions plugins/plugins_test.go
@@ -11,6 +11,8 @@ import (
"reflect"
"testing"

prom "github.com/prometheus/client_golang/prometheus"

"github.com/open-policy-agent/opa/internal/storage/mock"
"github.com/open-policy-agent/opa/logging"
"github.com/open-policy-agent/opa/logging/test"
@@ -340,6 +342,22 @@ func TestPluginManagerConsoleLogger(t *testing.T) {
}
}

func TestPluginManagerPrometheusRegister(t *testing.T) {
	register := prometheusRegisterMock{Collectors: map[prom.Collector]bool{}}
	mgr, err := New([]byte(`{}`), "", inmem.New(), WithPrometheusRegister(register))
	if err != nil {
		t.Fatal(err)
	}

	counter := prom.NewCounter(prom.CounterOpts{})
	if err := mgr.PrometheusRegister().Register(counter); err != nil {
		t.Fatal(err)
	}
	if register.Collectors[counter] != true {
		t.Fatalf("Counter metric was not registered on prometheus")
	}
}

func TestPluginManagerServerInitialized(t *testing.T) {
// Verify that ServerInitializedChannel is closed when
// ServerInitialized is called.
@@ -395,3 +413,23 @@ func (*myAuthPluginMock) Stop(context.Context) {
}
func (*myAuthPluginMock) Reconfigure(context.Context, interface{}) {
}

type prometheusRegisterMock struct {
	Collectors map[prom.Collector]bool
}

func (p prometheusRegisterMock) Register(collector prom.Collector) error {
	p.Collectors[collector] = true
	return nil
}

func (p prometheusRegisterMock) MustRegister(collector ...prom.Collector) {
	for _, c := range collector {
		p.Collectors[c] = true
	}
}

func (p prometheusRegisterMock) Unregister(collector prom.Collector) bool {
	delete(p.Collectors, collector)
	return true
}
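
The mock above uses value receivers yet still records registrations that the test can observe. That works because copying a struct copies its map field as a reference to the same underlying storage, not the entries themselves. A small stdlib-only sketch of that behaviour, using a hypothetical `recorder` type keyed by string instead of `prom.Collector`:

```go
package main

import "fmt"

// recorder mimics the shape of prometheusRegisterMock with string keys.
type recorder struct {
	seen map[string]bool
}

// add has a value receiver; the write still reaches the caller's map
// because the copied struct shares the same underlying map.
func (r recorder) add(name string) {
	r.seen[name] = true
}

// remove deletes through the same shared map.
func (r recorder) remove(name string) {
	delete(r.seen, name)
}

func main() {
	r := recorder{seen: map[string]bool{}}
	r.add("bundle_loaded_counter")
	fmt.Println(r.seen["bundle_loaded_counter"])
	r.remove("bundle_loaded_counter")
	fmt.Println(len(r.seen))
}
```

The same would not hold for a slice that grows or for scalar fields, where a pointer receiver is needed for changes to persist.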
55 changes: 55 additions & 0 deletions plugins/status/metrics.go
@@ -0,0 +1,55 @@
package status

import (
	"github.com/prometheus/client_golang/prometheus"
)

var (
	pluginStatus = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "plugin_status_gauge",
			Help: "Gauge for the plugin by status."},
		[]string{"name", "status"},
	)
	loaded = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "bundle_loaded_counter",
			Help: "Counter for the bundle loaded."},
		[]string{"name", "active_revision"},
	)
	failLoad = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "bundle_failed_load_counter",
			Help: "Counter for the failed bundle load."},
		[]string{"name", "active_revision", "code", "message"},
	)
	lastRequest = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "last_bundle_request",
			Help: "Gauge for the last bundle request."},
		[]string{"name", "active_revision"},
	)
	lastSuccessfulActivation = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "last_success_bundle_activation",
			Help: "Gauge for the last success bundle activation."},
		[]string{"name", "active_revision"},
	)
	lastSuccessfulDownload = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "last_success_bundle_download",
			Help: "Gauge for the last success bundle download."},
		[]string{"name", "active_revision"},
	)
	lastSuccessfulRequest = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "last_success_bundle_request",
			Help: "Gauge for the last success bundle request."},
		[]string{"name", "active_revision"},
	)
	bundleLoadDuration = prometheus.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "bundle_loading_duration_ns",
		Help:    "Histogram for the bundle loading duration by stage.",
		Buckets: prometheus.ExponentialBuckets(1000, 2, 20),
	}, []string{"name", "active_revision", "stage"})
)
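
`prometheus.ExponentialBuckets(1000, 2, 20)` yields 20 upper bounds that start at 1000 and double at each step. Since the observed values are nanoseconds, the finite buckets span roughly 1 µs to 524 ms, and anything slower lands in the implicit `+Inf` bucket. The sketch below recomputes those bounds with a hypothetical `exponentialBuckets` helper written against the standard library only. It mirrors the documented behaviour of the client library function but is not its implementation.

```go
package main

import (
	"fmt"
	"time"
)

// exponentialBuckets returns count upper bounds, the first equal to start
// and each subsequent one multiplied by factor. Illustrative reimplementation.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	b := exponentialBuckets(1000, 2, 20)
	// Interpret the bounds as nanoseconds to make the range readable.
	first := time.Duration(b[0])
	last := time.Duration(b[len(b)-1])
	fmt.Println(len(b), first, last)
}
```

Bundle loads that routinely take longer than about half a second would all fall into `+Inf` with these bounds, which is worth keeping in mind when reading quantile estimates.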
50 changes: 48 additions & 2 deletions plugins/status/plugin.go
@@ -13,6 +13,7 @@ import (
"reflect"

"github.com/pkg/errors"
prom "github.com/prometheus/client_golang/prometheus"

"github.com/open-policy-agent/opa/logging"
"github.com/open-policy-agent/opa/metrics"
@@ -65,6 +66,7 @@ type Config struct {
	Service string `json:"service"`
	PartitionName string `json:"partition_name,omitempty"`
	ConsoleLogs bool `json:"console"`
	Prometheus bool `json:"prometheus"`
	Trigger *plugins.TriggerMode `json:"trigger,omitempty"` // trigger mode
}

@@ -86,7 +88,7 @@ func (c *Config) validateAndInjectDefaults(services []string, pluginsList []stri
if !found {
return fmt.Errorf("invalid plugin name %q in status", *c.Plugin)
}
-} else if c.Service == "" && len(services) != 0 && !c.ConsoleLogs {
+} else if c.Service == "" && len(services) != 0 && !(c.ConsoleLogs || c.Prometheus) {
// For backwards compatibility allow defaulting to the first
// service listed, but only if console logging is disabled. If enabled
// we can't tell if the deployer wanted to use only console logs or
@@ -171,7 +173,7 @@ func (b *ConfigBuilder) Parse() (*Config, error) {
return nil, err
}

-if parsedConfig.Plugin == nil && parsedConfig.Service == "" && len(b.services) == 0 && !parsedConfig.ConsoleLogs {
+if parsedConfig.Plugin == nil && parsedConfig.Service == "" && len(b.services) == 0 && !parsedConfig.ConsoleLogs && !parsedConfig.Prometheus {
// Nothing to validate or inject
return nil, nil
}
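
Taken together, the two conditions above make `prometheus: true` behave like `console: true` for service selection: the status plugin is configured even when no services exist, and the first listed service is no longer picked implicitly. A simplified sketch of that decision follows. It assumes a pared-down `config` struct, a hypothetical `defaultService` helper, and a made-up service name `acmecorp`. The real logic lives in `validateAndInjectDefaults` and covers more cases, such as the `plugin` field.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// config keeps only the fields relevant to service defaulting.
type config struct {
	Service     string `json:"service"`
	ConsoleLogs bool   `json:"console"`
	Prometheus  bool   `json:"prometheus"`
}

// defaultService returns the service the status plugin would report to.
// It falls back to the first known service only when neither console
// logging nor Prometheus export is enabled.
func defaultService(c config, services []string) string {
	if c.Service == "" && len(services) != 0 && !(c.ConsoleLogs || c.Prometheus) {
		return services[0]
	}
	return c.Service
}

func main() {
	services := []string{"acmecorp"} // hypothetical service name

	inputs := []string{
		`{}`,
		`{"prometheus": true}`,
		`{"prometheus": true, "service": "acmecorp"}`,
	}
	for _, raw := range inputs {
		var c config
		if err := json.Unmarshal([]byte(raw), &c); err != nil {
			panic(err)
		}
		fmt.Printf("%s -> %q\n", raw, defaultService(c, services))
	}
}
```

The middle case is the one to watch: with Prometheus export enabled and no explicit `service`, nothing is sent to a remote status API.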
@@ -231,13 +233,27 @@ func (p *Plugin) Start(ctx context.Context) error {
	// to prevent blocking threads pushing the plugin updates.
	p.manager.RegisterPluginStatusListener(Name, p.UpdatePluginStatus)

	if p.config.Prometheus && p.manager.PrometheusRegister() != nil {
		p.register(p.manager.PrometheusRegister(), pluginStatus, loaded, failLoad,
			lastRequest, lastSuccessfulActivation, lastSuccessfulDownload,
			lastSuccessfulRequest, bundleLoadDuration)
	}

	// Set the status plugin's status to OK now that everything is registered and
	// the loop is running. This will trigger an update on the listener with the
	// current status of all the other plugins too.
	p.manager.UpdatePluginStatus(Name, &plugins.Status{State: plugins.StateOK})
	return nil
}

// register registers each collector and logs, rather than returns, any registration error.
func (p *Plugin) register(r prom.Registerer, cs ...prom.Collector) {
	for _, c := range cs {
		if err := r.Register(c); err != nil {
			p.logger.Error("Status metric failed to register on prometheus: %v.", err)
		}
	}
}

// Stop stops the plugin.
func (p *Plugin) Stop(ctx context.Context) {
p.logger.Info("Stopping status reporter.")
@@ -377,6 +393,10 @@ func (p *Plugin) oneShot(ctx context.Context) error {
}
}

	if p.config.Prometheus {
		updatePrometheusMetrics(req)
	}

	if p.config.Plugin != nil {
		proxy, ok := p.manager.Plugin(*p.config.Plugin).(Logger)
		if !ok {
@@ -454,3 +474,29 @@ func (p *Plugin) logUpdate(update *UpdateRequestV1) error {
	}).Info("Status Log")
	return nil
}

func updatePrometheusMetrics(u *UpdateRequestV1) {
	pluginStatus.Reset()
	for name, plugin := range u.Plugins {
		pluginStatus.WithLabelValues(name, string(plugin.State)).Set(1)
	}
	for _, bundle := range u.Bundles {
		if bundle.Code == "" && bundle.ActiveRevision != "" {
			loaded.WithLabelValues(bundle.Name, bundle.ActiveRevision).Inc()
		} else {
			failLoad.WithLabelValues(bundle.Name, bundle.ActiveRevision, bundle.Code, bundle.Message).Inc()
		}
		lastSuccessfulActivation.WithLabelValues(bundle.Name, bundle.ActiveRevision).Set(float64(bundle.LastSuccessfulActivation.UnixNano()))
		lastSuccessfulDownload.WithLabelValues(bundle.Name, bundle.ActiveRevision).Set(float64(bundle.LastSuccessfulDownload.UnixNano()))
		lastSuccessfulRequest.WithLabelValues(bundle.Name, bundle.ActiveRevision).Set(float64(bundle.LastSuccessfulRequest.UnixNano()))
		lastRequest.WithLabelValues(bundle.Name, bundle.ActiveRevision).Set(float64(bundle.LastRequest.UnixNano()))
		if bundle.Metrics != nil {
			for stage, metric := range bundle.Metrics.All() {
				switch stage {
				case "timer_bundle_request_ns", "timer_rego_data_parse_ns", "timer_rego_module_parse_ns", "timer_rego_module_compile_ns", "timer_rego_load_bundles_ns":
					bundleLoadDuration.WithLabelValues(bundle.Name, bundle.ActiveRevision, stage).Observe(float64(metric.(int64)))
				}
			}
		}
	}
}
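
`updatePrometheusMetrics` uses an unchecked type assertion, `metric.(int64)`, which would panic if a selected entry ever held another type. A more defensive variant could use the comma-ok form. The sketch below shows that idea on a plain `map[string]interface{}`. The names `bundleStages` and `observeTimers` and the sample values are hypothetical, standing in for the histogram and for `bundle.Metrics.All()`.

```go
package main

import "fmt"

// bundleStages lists the timer names forwarded to the histogram in the diff.
var bundleStages = map[string]bool{
	"timer_bundle_request_ns":      true,
	"timer_rego_data_parse_ns":     true,
	"timer_rego_module_parse_ns":   true,
	"timer_rego_module_compile_ns": true,
	"timer_rego_load_bundles_ns":   true,
}

// observeTimers calls observe for every selected stage whose value is an
// int64 and returns the names of selected stages that were skipped.
func observeTimers(all map[string]interface{}, observe func(stage string, ns float64)) []string {
	var skipped []string
	for stage, metric := range all {
		if !bundleStages[stage] {
			continue // not a bundle loading stage
		}
		ns, ok := metric.(int64)
		if !ok {
			// Unexpected type: skip instead of panicking.
			skipped = append(skipped, stage)
			continue
		}
		observe(stage, float64(ns))
	}
	return skipped
}

func main() {
	// Made-up values for illustration only.
	all := map[string]interface{}{
		"timer_bundle_request_ns":  int64(1500000),
		"timer_rego_data_parse_ns": "not a number",
		"counter_unrelated":        int64(7),
	}
	skipped := observeTimers(all, func(stage string, ns float64) {
		fmt.Println("observe", stage, ns)
	})
	fmt.Println("skipped", skipped)
}
```

Returning the skipped names keeps the decision about logging with the caller, which matches how the plugin already logs registration failures rather than aborting.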