Skip to content

Commit

Permalink
Add support for Prometheus metric categories
Browse files Browse the repository at this point in the history
See updated README.md file for description.
  • Loading branch information
ulrfa committed Apr 12, 2022
1 parent de48add commit f4735e1
Show file tree
Hide file tree
Showing 9 changed files with 250 additions and 9 deletions.
103 changes: 103 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -433,6 +433,26 @@ http_address: localhost:8080

# If supplied, controls the verbosity of the access logger ("none" or "all"):
#access_log_level: none

# If supplied, declares prometheus metric category names, with allowed value set
# for each category.
#metric_categories:
# branch:
# - main
# user:
# - ci
# product:
# - aaa
# - bbb
# pipeline:
# - unit-test
# - integration-test
# - system-test
# os:
# - rhel8
# - rhel9
# - ubuntu21-04
# - ubuntu22-04
```

## Docker
Expand Down Expand Up @@ -610,3 +630,86 @@ To avoid per-prefix rate limiting with Amazon S3, you may want to try using
`--s3.key_format=2`, which stores blobs across a larger number of prefixes.
Reference:
[Optimizing Amazon S3 Performance](https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html).

## Prometheus Categories

bazel-remote supports declaring categories for the prometheus metrics.
The categories can be used to calculate cache hit ratio separate per
type of build, see where most traffic comes from, etc.

Clients can set categories via HTTP and gRPC headers. With bazel
that is done via the bazel option --remote_header.

Allowed category names are declared in the bazel-remote configuration file.
The allowed value set for each category also have to be declared as an
attempt to avoid polluting Prometheus with too many different time series.
https://prometheus.io/docs/practices/naming/ warns:

> "CAUTION: Remember that every unique combination of key-value
> label pairs represent a new time series, which can dramatically
> increase the amount of data stored. Do not use labels to store
> dimensions with high cardinality (many different label values),
> such as user IDs, email addresses, or other unbounded sets of
> values."
Received headers that match a declared category name, but with a value outside
the declared allowed value set, is reported with the value "other" to
Prometheus. This is convenient for categories such as "branch", where it from a
cache hit ratio perspective often make sense to distinguish between "main"
branch and "other" branches.

### Example

Example from a bazel-remote configuration file that declares categories and
their allowed value sets:
```
metric_categories:
branch:
- main
user:
- ci
product:
- aaa
- bbb
pipeline:
- unit-test
- integration-test
- system-test
os:
- rhel8
- rhel9
- ubuntu21-04
- ubuntu22-04
```

Bazel clients can be configured to always add flags like:
```
--remote_header=user=$USER
--remote_header=host=$HOST
--remote_header=os=\`get_os_name.sh\`
```

And bazel clients invoked via CI can in addition add headers like:
```
--remote_header=branch=$BRANCH_NAME
--remote_header=product=$PRODUCT_NAME
--remote_header=pipeline=$CI_PIPELINE_NAME
```

The value set for user and branch is not bounded and could therefore not be
stored as is in Prometheus. But in many cases, it is good enough to distinguish
between if it was ci user or non-ci user, or if it was main or non-main branch.
That can be achieved with bazel-remote configuration above which limits the
value ranges to only "ci" and "main".

Example of prometheus query calculating cache hit ratio specifically for the
user "ci":
```
(sum (rate(bazel_remote_incoming_requests_total{kind="ac",method="get",user="ci",status="hit"}[$__rate_interval]))) / (sum (rate(bazel_remote_incoming_requests_total{kind="ac",method="get",user="ci"}[$__rate_interval])))
```

Example of prometheus query calculating incoming cache request rate, grouped
by product:
```
sum by(product) (rate(bazel_remote_incoming_requests_total[$__rate_interval]))
```
1 change: 1 addition & 0 deletions cache/disk/BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@ go_library(
"@com_github_djherbis_atime//:go_default_library",
"@com_github_prometheus_client_golang//prometheus:go_default_library",
"@org_golang_google_grpc//codes:go_default_library",
"@org_golang_google_grpc//metadata:go_default_library",
"@org_golang_google_grpc//status:go_default_library",
"@org_golang_google_protobuf//proto:go_default_library",
"@org_golang_x_sync//semaphore:go_default_library",
Expand Down
5 changes: 2 additions & 3 deletions cache/disk/disk_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -1229,7 +1229,7 @@ func TestMetricsUnvalidatedAC(t *testing.T) {

testCacheI, err := New(cacheDir, cacheSize,
WithAccessLogger(testutils.NewSilentLogger()),
WithEndpointMetrics())
WithEndpointMetrics(make(map[string][]string)))
if err != nil {
t.Fatal(err)
}
Expand Down Expand Up @@ -1304,7 +1304,6 @@ func TestMetricsUnvalidatedAC(t *testing.T) {
if acHits != 2 {
t.Fatalf("Expected acHit counter to be 2, found %f", acHits)
}

acMiss = count(testCache.counter, acKind, missStatus)
if acMiss != 0 {
t.Fatalf("Expected acMiss counter to be 0, found %f", acMiss)
Expand Down Expand Up @@ -1339,7 +1338,7 @@ func TestMetricsValidatedAC(t *testing.T) {

testCacheI, err := New(cacheDir, cacheSize,
WithAccessLogger(testutils.NewSilentLogger()),
WithEndpointMetrics())
WithEndpointMetrics(make(map[string][]string)))
if err != nil {
t.Fatal(err)
}
Expand Down
94 changes: 91 additions & 3 deletions cache/disk/metrics.go
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,8 @@ import (
"io"

"github.com/buchgr/bazel-remote/cache"
"google.golang.org/grpc/metadata"
"net/http"

pb "github.com/buchgr/bazel-remote/genproto/build/bazel/remote/execution/v2"

Expand All @@ -14,15 +16,17 @@ import (
type metricsDecorator struct {
counter *prometheus.CounterVec
*diskCache
categories map[string][]string
}

const (
hitStatus = "hit"
missStatus = "miss"
hitStatus = "hit"
missStatus = "miss"
emptyStatus = ""

containsMethod = "contains"
getMethod = "get"
//putMethod = "put"
putMethod = "put"

acKind = "ac" // This must be lowercase to match cache.EntryKind.String()
casKind = "cas"
Expand All @@ -46,6 +50,7 @@ func (m *metricsDecorator) Get(ctx context.Context, kind cache.EntryKind, hash s
} else {
lbls["status"] = missStatus
}
m.addCategoryLabels(ctx, lbls)
m.counter.With(lbls).Inc()

return rc, size, nil
Expand All @@ -63,6 +68,7 @@ func (m *metricsDecorator) GetValidatedActionResult(ctx context.Context, hash st
} else {
lbls["status"] = missStatus
}
m.addCategoryLabels(ctx, lbls)
m.counter.With(lbls).Inc()

return ar, data, err
Expand All @@ -83,6 +89,7 @@ func (m *metricsDecorator) GetZstd(ctx context.Context, hash string, size int64,
} else {
lbls["status"] = missStatus
}
m.addCategoryLabels(ctx, lbls)
m.counter.With(lbls).Inc()

return rc, size, nil
Expand All @@ -97,6 +104,7 @@ func (m *metricsDecorator) Contains(ctx context.Context, kind cache.EntryKind, h
} else {
lbls["status"] = missStatus
}
m.addCategoryLabels(ctx, lbls)
m.counter.With(lbls).Inc()

return ok, size
Expand All @@ -118,17 +126,97 @@ func (m *metricsDecorator) FindMissingCasBlobs(ctx context.Context, blobs []*pb.
"kind": "cas",
"status": hitStatus,
}
m.addCategoryLabels(ctx, hitLabels)
hits := m.counter.With(hitLabels)

missLabels := prometheus.Labels{
"method": containsMethod,
"kind": "cas",
"status": missStatus,
}
m.addCategoryLabels(ctx, missLabels)
misses := m.counter.With(missLabels)

hits.Add(float64(numFound))
misses.Add(float64(numMissing))

return digests, nil
}

func (m *metricsDecorator) Put(ctx context.Context, kind cache.EntryKind, hash string, size int64, r io.Reader) error {
err := m.diskCache.Put(ctx, kind, hash, size, r)
if err != nil {
return err
}

lbls := prometheus.Labels{"method": putMethod, "kind": kind.String(), "status": emptyStatus}
m.addCategoryLabels(ctx, lbls)
m.counter.With(lbls).Inc()

return nil
}

// Update prometheus labels based on HTTP and gRPC headers available via the context.
func (m *metricsDecorator) addCategoryLabels(ctx context.Context, labels prometheus.Labels) {

if len(m.categories) == 0 {
return
}

httpHeaders := getHttpHeaders(ctx)
var grpcHeaders metadata.MD
if httpHeaders == nil {
grpcHeaders = getGrpcHeaders(ctx)
}

for categoryNameLowerCase, allowedValues := range m.categories {
// Lower case is canonical for gRPC headers and convention for prometheus.
var headerValue string = ""
if grpcHeaders != nil {
grpcHeaderValues := grpcHeaders[categoryNameLowerCase]
if len(grpcHeaderValues) > 0 {
// Pick the first header with matching name if multiple headers with same name
headerValue = grpcHeaderValues[0]
}
} else if httpHeaders != nil {
headerValue = httpHeaders.Get(categoryNameLowerCase)
}
if len(headerValue) == 0 {
labels[categoryNameLowerCase] = ""
} else if contains(allowedValues, headerValue) {
labels[categoryNameLowerCase] = headerValue
} else {
labels[categoryNameLowerCase] = "other"
}
}
}

func contains(s []string, e string) bool {
for _, a := range s {
if a == e {
return true
}
}
return false
}

type httpHeadersContextKey struct{}

// Creates a context copy with HTTP headers attached.
func ContextWithHttpHeaders(ctx context.Context, headers *http.Header) context.Context {
return context.WithValue(ctx, httpHeadersContextKey{}, headers)
}

// Retrieves HTTP headers from context. Minimizes type safety issues.
func getHttpHeaders(ctx context.Context) *http.Header {
headers, ok := ctx.Value(httpHeadersContextKey{}).(*http.Header)
if !ok {
return nil
}
return headers
}

func getGrpcHeaders(ctx context.Context) metadata.MD {
grpcHeaders, _ := metadata.FromIncomingContext(ctx)
return grpcHeaders
}
9 changes: 7 additions & 2 deletions cache/disk/options.go
Original file line number Diff line number Diff line change
Expand Up @@ -75,18 +75,23 @@ func WithAccessLogger(logger *log.Logger) Option {
}
}

func WithEndpointMetrics() Option {
func WithEndpointMetrics(categories map[string][]string) Option {
return func(c *CacheConfig) error {
if c.metrics != nil {
return fmt.Errorf("WithEndpointMetrics specified multiple times")
}

labels := []string{"method", "status", "kind"}
for categoryNameLowerCase := range categories {
labels = append(labels, categoryNameLowerCase)
}
c.metrics = &metricsDecorator{
counter: prometheus.NewCounterVec(prometheus.CounterOpts{
Name: "bazel_remote_incoming_requests_total",
Help: "The number of incoming cache requests",
},
[]string{"method", "kind", "status"}),
labels),
categories: categories,
}

return nil
Expand Down
7 changes: 7 additions & 0 deletions config/config.go
Original file line number Diff line number Diff line change
Expand Up @@ -56,6 +56,7 @@ type Config struct {
DisableGRPCACDepsCheck bool `yaml:"disable_grpc_ac_deps_check"`
EnableACKeyInstanceMangling bool `yaml:"enable_ac_key_instance_mangling"`
EnableEndpointMetrics bool `yaml:"enable_endpoint_metrics"`
MetricCategories map[string][]string `yaml:"metric_categories"`
MetricsDurationBuckets []float64 `yaml:"endpoint_metrics_duration_buckets"`
ExperimentalRemoteAssetAPI bool `yaml:"experimental_remote_asset_api"`
HTTPReadTimeout time.Duration `yaml:"http_read_timeout"`
Expand Down Expand Up @@ -351,6 +352,12 @@ func validateConfig(c *Config) error {
return errors.New("'access_log_level' must be set to either \"none\" or \"all\"")
}

for categoryName := range c.MetricCategories {
if categoryName != strings.ToLower(categoryName) {
return fmt.Errorf("Names in 'metric_categories' must be in lower case: %s", categoryName)
}
}

return nil
}

Expand Down
35 changes: 35 additions & 0 deletions config/config_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -461,3 +461,38 @@ func TestSocketPathMissing(t *testing.T) {
t.Fatal("Expected the error message to mention the missing 'http_address' key/flag")
}
}

func TestMetricCategories(t *testing.T) {
yaml := `
metric_categories:
os:
- rhel8
- rhel9
- ubuntu21-04
branch:
- main
dir: /opt/cache-dir
max_size: 42
storage_mode: zstd
`
config, err := newFromYaml([]byte(yaml))
if err != nil {
t.Fatal(err)
}
values, ok := config.MetricCategories["os"]
if !ok {
t.Fatalf("Missing os in config")
}
if !contains(values, "rhel9") {
t.Fatalf("Missing rhel9 in config")
}
}

func contains(s []string, e string) bool {
for _, a := range s {
if a == e {
return true
}
}
return false
}
Loading

0 comments on commit f4735e1

Please sign in to comment.