feat: Per-app usage metrics #3429

bryanhuhta · 2024-07-16T20:29:26Z

related https://github.com/grafana/pyroscope-squad/issues/162

Warning

This PR body is outdated, but I want to leave it here for posterity's sake. Please refer to this comment to get the most up-to-date and accurate summary of the changes in the PR.

This implements per-app usage for bytes ingested/dropped (both metrics required to calculate billable bytes). I took the following metrics

distributor_received_decompressed_bytes
discarded_bytes_total

and added the service_name label to them. This label value is typically blank, but when the tenant overrides have the following property set

overrides:
  "1234":
    distributor_usage_groups:
      - service_a
      - service_b

any series which has a service_name label that matches one of distributor_usage_groups will have service_name set on the ingest/dropped metrics. This should allow us to create recording rules to send to the billing cortex and let customers view per-app usage breakdowns.
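As a rough illustration of the allowlist behaviour described above (this is the original approach, which later comments replace), the label value could be chosen like this. The `usageLabel` helper and its map-based allowlist are hypothetical names for this sketch, not code from the PR:

```go
package main

import "fmt"

// usageLabel returns the value to put in the service_name metric label:
// the service name itself when it is on the tenant's allowlist, otherwise
// an empty string. Hypothetical helper, not taken from the PR.
func usageLabel(allowlist map[string]struct{}, serviceName string) string {
	if _, ok := allowlist[serviceName]; ok {
		return serviceName
	}
	return ""
}

func main() {
	allow := map[string]struct{}{"service_a": {}, "service_b": {}}
	fmt.Printf("%q\n", usageLabel(allow, "service_a")) // "service_a"
	fmt.Printf("%q\n", usageLabel(allow, "service_z")) // ""
}
```

Every allowlisted service adds one more label value per metric, which is where the cardinality cost comes from.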

Note that there is a cost to adding an app to the allowlist. We should be prudent when adding services, especially for large customers.

bryanhuhta · 2024-07-16T20:32:10Z

pkg/distributor/distributor.go

@@ -746,7 +754,7 @@ func (g *groupsWithFingerprints) add(stringTable []string, lbls phlaremodel.Labe
 	})
 }

-func extractSampleSeries(req *distributormodel.PushRequest, relabelRules []*relabel.Config) (result []*distributormodel.ProfileSeries, bytesRelabelDropped, profilesRelabelDropped float64) {
+func extractSampleSeries(req *distributormodel.PushRequest, usageGroups *validation.TenantUsageGroups, relabelRules []*relabel.Config) (result []*distributormodel.ProfileSeries, bytesRelabelDropped, profilesRelabelDropped float64) {


It's unfortunate this function signature had to change. This function previously returned all dropped bytes, so I couldn't use the return value, as there may be dropped series that need to be specially labeled.

bryanhuhta · 2024-07-16T20:33:28Z

pkg/distributor/distributor.go

+		// note(bryanhuhta): We don't need to label this metric with service
+		// name as rate limited requests don't count towards billable bytes.
 		validation.DiscardedBytes.WithLabelValues(string(validation.RateLimited), tenantID).Add(float64(req.TotalBytesUncompressed))


I think this is a valid assumption. When calculating billable bytes, the recording rules seem to subtract out any bytes that were rate limited.

https://github.com/grafana/deployment_tools/blob/6b6e9832e7abb0fa2b8aeafc3aa6ef75d69880f6/ksonnet/lib/billing-mixin/recording_rules/profiles.libsonnet#L48-L50

bryanhuhta · 2024-07-16T20:34:01Z

pkg/distributor/distributor_test.go

+		ug := &validation.TenantUsageGroups{
+			TenantID: "",
+		}


This is only here because the extractSampleSeries signature changed.

bryanhuhta · 2024-07-16T20:34:38Z

pkg/validation/usage_groups.go

The implementation of the service name allowlist.

petethepig · 2024-07-16T20:59:49Z

Looks good. Few random thoughts:

1. We could copy the config format from Cloud metrics. Theirs is more flexible because you can set different selectors (not just service_name), but it is more complex.
2. I might be overly paranoid, but I would maybe create a new metric instead of using the existing one. This way if something goes wrong (e.g. a cardinality explosion) it would be easier to fix.
3. I would avoid empty strings in label values. I'd call it "unknown" or "other" or something like that. This also has the benefit of showing up in the dashboard later, so that it's clear to users what's going on.

I don't feel very strongly about 1 and 2, I do feel strongly on the third point.

bryanhuhta · 2024-07-18T22:54:35Z

After @petethepig's comments, I made the following changes:

Created new metrics to track per-app usage
- pyroscope_usage_group_received_decompressed_total => pyroscope_distributor_received_decompressed_bytes
- pyroscope_usage_group_discarded_bytes_total => pyroscope_discarded_bytes_total
Implemented a more flexible config structure (if not exactly the same as metrics, it's highly similar)
Grouped all other values that don't correspond to a usage group into an "other" bucket

Here's it working in fire-dev-001: https://ops.grafana-ops.net/goto/yu16UVXIg?orgId=1

Using the following usage group definitions:

distributor_usage_groups:
  - pyroscope: '{service_name=~"fire-dev-001/.*"}'
  - cortex-dev-01/ingester: '{service_name="cortex-dev-01/ingester"}'

This groups all the fire-dev-001/* apps under the pyroscope usage group and puts cortex-dev-01/ingester into its own usage group.

simonswine

I have a few concerns, which I outlined in the line-by-line comments. Happy to dive into them a bit more if anything is unclear.

It is also worth taking a look at how Mimir does it, because their implementation is obviously a bit more tested: https://github.com/grafana/mimir/blob/main/pkg/ingester/activeseries/custom_trackers_config.go

simonswine · 2024-07-19T12:18:38Z

pkg/validation/usage_groups.go

+				return nil, fmt.Errorf("no matchers for usage group %q and tenant %q", name, tenantID)
+			}
+
+			amMatchers, err := amlabels.ParseMatchers(matchersString)


Maybe I am missing something, but we probably should parse those the same way we do in the query path: https://github.com/simonswine/pyroscope/blob/ed0b5643f48792ac988d975e17e285b807d5c1f9/pkg/phlaredb/head.go#L440

Good point. A lot of this file was copied from how Mimir implemented this, so no, you aren't missing anything: I just happened to use Mimir's approach to parsing matchers instead of our own. I'll switch this.

simonswine · 2024-07-19T12:34:34Z

pkg/validation/usage_groups.go

+
+// DistributorUsageGroups returns the usage groups that are enabled for this
+// tenant.
+func (o *Overrides) DistributorUsageGroups(tenantID string) (*UsageGroupConfig, error) {


Any returned error in this method will lead to profiling traffic being dropped. This is not ideal.
This seems quite a severe reaction to "I have wrong accounting usage groups".

I do suggest we should move all those errors into the parsing part of the overrides.

Separately this method does a lot and is in the hot path of ingestion. We should also move as much as we can into the "loading the overrides" part of the code.

E.g. the parsing of label matchers should happen only when the override is parsed, and not for every profile ingested.
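A minimal sketch of that split, assuming plain regular expressions on the service name as a stand-in for real label matchers. `rawGroup`, `compiledGroup`, `loadGroups`, and `matchServiceName` are hypothetical names for illustration only:

```go
package main

import (
	"fmt"
	"regexp"
)

// rawGroup is a usage group as written in the overrides (hypothetical shape).
type rawGroup struct{ name, pattern string }

// compiledGroup is the pre-parsed form kept in memory after loading.
type compiledGroup struct {
	name string
	re   *regexp.Regexp
}

// loadGroups runs once, when the overrides are loaded. Invalid patterns are
// reported here instead of on the ingest path.
func loadGroups(raw []rawGroup) ([]compiledGroup, error) {
	out := make([]compiledGroup, 0, len(raw))
	for _, g := range raw {
		re, err := regexp.Compile("^(?:" + g.pattern + ")$")
		if err != nil {
			return nil, fmt.Errorf("usage group %q: %w", g.name, err)
		}
		out = append(out, compiledGroup{name: g.name, re: re})
	}
	return out, nil
}

// matchServiceName is the hot path: no parsing and no error to propagate,
// so a bad usage group config cannot cause profiles to be dropped here.
func matchServiceName(groups []compiledGroup, serviceName string) []string {
	var names []string
	for _, g := range groups {
		if g.re.MatchString(serviceName) {
			names = append(names, g.name)
		}
	}
	return names
}

func main() {
	groups, err := loadGroups([]rawGroup{{"pyroscope", "fire-dev-001/.*"}})
	if err != nil {
		panic(err)
	}
	fmt.Println(matchServiceName(groups, "fire-dev-001/ingester")) // [pyroscope]

	_, err = loadGroups([]rawGroup{{"broken", "("}})
	fmt.Println(err != nil) // true
}
```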

simonswine · 2024-07-19T12:36:48Z

pkg/validation/limits.go

@@ -50,6 +50,9 @@ type Limits struct {
 	MaxProfileStacktraceDepth        int `yaml:"max_profile_stacktrace_depth" json:"max_profile_stacktrace_depth"`
 	MaxProfileSymbolValueLength      int `yaml:"max_profile_symbol_value_length" json:"max_profile_symbol_value_length"`

+	// Distributor per-app usage breakdown.
+	DistributorUsageGroups []map[string]string `yaml:"distributor_usage_groups" json:"distributor_usage_groups"`


Do you mind explaining why this is a slice of maps, rather than just a map, which I would expect?

Here is the YAML "explanation" of my question:

current_usage_groups:
  - my_team: '{namespace="cool-stuff"}'
  - other_team: '{namespace="boring-stuff"}'
why_not_usage_groups:
  my_team: '{namespace="cool-stuff"}'
  other_team: '{namespace="boring-stuff"}'

I wanted the labels to be applied deterministically (the last usage group matched will be used). I originally used the why_not_usage_groups approach, but if two usage groups matched a label set, we wouldn't know which one would get chosen.

distributor_usage_groups:
  specific_cool_stuff: '{namespace="cool-stuff", service_name="thing"}'
  general_cool_stuff: '{namespace="cool-stuff"}'

Depending on how each key got hashed into the map, we'd get one or the other of the two usage groups. We could simplify the config and matching logic if we relaxed the determinism constraint, but I worry this might skew results oddly.

Totally get the ordering and deterministic result problem (now that you mention it 😆), but I think the map inside the slice will still be non-deterministic in the current implementation.

I think a slice of structs with two string fields might be a better solution:

[]struct {
	name     string
	selector string
}

or just sorting by the map key (name of the rule group).

Probably it is best to aim for consistency with other Grafana Products and do what Mimir does.
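To make the ordering concern concrete: Go deliberately randomizes map iteration order, so any "last match wins" logic over a map can pick a different group from run to run. A small sketch of the "sort by map key" option follows; `sortedGroupNames` is a hypothetical helper, not code from this PR:

```go
package main

import (
	"fmt"
	"sort"
)

// sortedGroupNames returns the usage group names in lexical order so that
// evaluation order does not depend on Go's randomized map iteration.
func sortedGroupNames(groups map[string]string) []string {
	names := make([]string, 0, len(groups))
	for name := range groups {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	groups := map[string]string{
		"specific_cool_stuff": `{namespace="cool-stuff", service_name="thing"}`,
		"general_cool_stuff":  `{namespace="cool-stuff"}`,
	}
	// Always visits general_cool_stuff first, then specific_cool_stuff.
	for _, name := range sortedGroupNames(groups) {
		fmt.Println(name, groups[name])
	}
}
```

Sorting gives a stable order, but not necessarily the order the operator wrote in the config; a slice of name/selector pairs preserves the written order instead.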

bryanhuhta

This implementation has had a few iterations and has changed significantly since I opened this PR. This is a recap of the implementation to-date.

Design

The idea of the usage group at this point is identical (or nearly so) to Mimir's approach. A usage group has a name and matcher(s). When a profile is ingested, its labels are parsed and matched against each usage group. Every usage group it matches is collected, and the profile size is counted by the pyroscope_usage_group_received_decompressed_total metric. Later, if that profile is dropped for any reason, the drop is counted by the pyroscope_usage_group_discarded_bytes_total metric.

As noted before, a profile can match 0 or more usage groups.

If 0, then it's bucketed into a default usage group called "other"
If 1 or more, the pyroscope_usage_group_received_decompressed_total will be counted with each matching usage group

As an example, consider the following labels attached to a profile:

[
  { "service_name": "foo" },
  { "namespace": "barbaz" }
]

along with the following usage groups:

distributor_usage_groups:
  app/foo: '{service_name="foo"}'
  namespace/bar: '{namespace=~"bar.*"}'

the pyroscope_usage_group_received_decompressed_total metric will be counted once for app/foo and once for namespace/bar:

pyroscope_usage_group_received_decompressed_total{usage_group="app/foo"}
pyroscope_usage_group_received_decompressed_total{usage_group="namespace/bar"}
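The matching rule in this example can be sketched as follows. This is an illustration only: it uses anchored regular expressions as a simplified stand-in for parsed label matchers and a plain map in place of the Prometheus counter, and the names `matcher`, `usageGroup`, `equal`, `regex`, and `matchGroups` are hypothetical:

```go
package main

import (
	"fmt"
	"regexp"
)

// matcher is a simplified stand-in for a parsed label matcher.
type matcher struct {
	label string
	re    *regexp.Regexp // anchored pattern the label value must match
}

type usageGroup struct {
	name     string
	matchers []matcher
}

// equal builds a matcher for label="value".
func equal(label, value string) matcher {
	return matcher{label, regexp.MustCompile("^" + regexp.QuoteMeta(value) + "$")}
}

// regex builds a matcher for label=~"pattern".
func regex(label, pattern string) matcher {
	return matcher{label, regexp.MustCompile("^(?:" + pattern + ")$")}
}

// matchGroups returns every group whose matchers all match the label set,
// or the default "other" group when none do.
func matchGroups(groups []usageGroup, labels map[string]string) []string {
	var names []string
	for _, g := range groups {
		ok := true
		for _, m := range g.matchers {
			// A missing label is treated as an empty value.
			if !m.re.MatchString(labels[m.label]) {
				ok = false
				break
			}
		}
		if ok {
			names = append(names, g.name)
		}
	}
	if len(names) == 0 {
		return []string{"other"}
	}
	return names
}

func main() {
	groups := []usageGroup{
		{"app/foo", []matcher{equal("service_name", "foo")}},
		{"namespace/bar", []matcher{regex("namespace", "bar.*")}},
	}
	received := map[string]float64{} // stand-in for the received-bytes counter
	profile := map[string]string{"service_name": "foo", "namespace": "barbaz"}
	for _, name := range matchGroups(groups, profile) {
		received[name] += 1024 // hypothetical profile size in bytes
	}
	fmt.Println(received) // map[app/foo:1024 namespace/bar:1024]
}
```

Because one profile can be counted under several groups, summing the metric across usage groups can exceed the tenant's total ingested bytes.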

Implementation

I created a struct UsageGroupConfig which is responsible for

unmarshalling the usage group config (from both yaml and json)
validating the group count (limit 50 for now)
normalizing usage group names
parsing and validating the matchers

This struct is built once at start up and any errors it generates will cause the app to fail to run. I believe this is reasonable.

UsageGroupConfig has a method GetUsageGroups(phlaremodel.Labels) which accepts a label set. It will match the label set against all the usage groups and return a list of all that match via the proxy object UsageGroupMatch. UsageGroupMatch has methods that can report a number as "bytes received" or "bytes dropped". This is where the logic of emitting a metric label for each usage group occurs.

Remarks

@simonswine identified one of the former approaches as being detrimental to the hot path both in aggressive failure cases and lots of new computation. Both these concerns are addressed by doing the bulk of the error-prone parsing and validation at startup, leaving simple accounting to the hot path.

Additionally, there was an ask to make this look and feel more like Mimir's approach. Whilst it isn't exactly the same, I'm hoping with the most recent changes we move much, much closer to their implementation. I took a lot of inspiration from how they implemented this feature, so the user-facing look-and-feel and the code of this feature should feel very similar to Mimir's.

Lastly, it's important to note that this approach isn't designed to be accurate down to the byte and penny. @petethepig and I agreed to take this approach because it should be approximately accurate and provide proportionally accurate results. So even if this feature doesn't report penny-exact numbers, the error should be proportionally the same across all usage groups. Big apps will produce big numbers, small apps will produce small numbers.

bryanhuhta · 2024-07-19T22:19:21Z

pkg/validation/usage_groups.go

+		// TODO(bryanhuhta): We should probably validate the usage group name
+		// is a valid label value.


I couldn't figure out where prometheus does its validation of label values, but wherever it is, we should copy it.

It is any valid utf8 string: https://github.com/prometheus/client_golang/blob/9f203a098ec630d179ccef4efbaa8c3341291d00/prometheus/labels.go#L158

I suggest we also disallow "other" to avoid undefined behaviour.
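A sketch of what those two checks could look like, assuming the only rules are "valid UTF-8" and "not the reserved default name". `validateGroupName` and `reservedGroupName` are hypothetical names, not the PR's actual code:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// reservedGroupName is the bucket used for profiles that match no usage group.
const reservedGroupName = "other"

// validateGroupName rejects names that would collide with the default bucket
// or that are not valid UTF-8 (and so not valid Prometheus label values).
func validateGroupName(name string) error {
	switch {
	case name == reservedGroupName:
		return fmt.Errorf("usage group name %q is reserved", name)
	case !utf8.ValidString(name):
		return fmt.Errorf("usage group name %q is not valid UTF-8", name)
	}
	return nil
}

func main() {
	for _, name := range []string{"app/foo", "other", "bad\xff"} {
		fmt.Printf("%q valid: %v\n", name, validateGroupName(name) == nil)
	}
}
```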

bryanhuhta · 2024-07-19T22:20:57Z

pkg/validation/usage_groups.go

+	// It should never be nil, but check just in case!
+	if config == nil {
+		config = &UsageGroupConfig{}
+	}


With the current code, the config will never be nil. However, I think we should leave this check here so we treat a nil config like an empty config, in case the config ever can be nil at this point.

simonswine

Thanks for looking into this, for providing such detailed comments, and for ensuring this is properly tested. ❤️

I would like you to take a look at my two suggestions regarding validation and the tenant_id flow. But I think this is already good for an LGTM.

simonswine · 2024-07-22T13:14:47Z

pkg/validation/usage_groups.go

+		// TODO(bryanhuhta): We should probably validate the usage group name
+		// is a valid label value.


It is any valid utf8 string: https://github.com/prometheus/client_golang/blob/9f203a098ec630d179ccef4efbaa8c3341291d00/prometheus/labels.go#L158

I suggest we also disallow "other" to avoid undefined behaviour.

simonswine · 2024-07-22T13:25:10Z

pkg/validation/usage_groups.go

+	if config.TenantID == "" {
+		config.TenantID = tenantID
+	}


This feels a bit sketchy to me. Despite us not registering it through "RegisterFlag", in theory you could still set default usage groups for all tenants in the pyroscope.yaml config.

This bit would then expose the same *UsageGroupConfig to all tenants that don't override it, with a race on the bit that sets the tenant.

I think it would be better to put the tenant field in a wrapper type or, even simpler, make it a parameter to the Count*() methods:

type UsageGroupConfig // without TenantID

type UsageGroupInstance struct {
	*UsageGroupConfig
	tenantID string
}
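The "parameter" variant might look roughly like the sketch below. It assumes a config that is read-only after loading and uses a mutex-guarded map as a stand-in for the Prometheus counter; the `counters` type, the `groupNames` field, and this `CountReceivedBytes` signature are hypothetical:

```go
package main

import (
	"fmt"
	"sync"
)

// UsageGroupConfig may be shared by every tenant without an override, so it
// is never mutated after loading and carries no tenant ID.
type UsageGroupConfig struct {
	groupNames []string
}

// counters is an in-memory stand-in for the per-tenant Prometheus counter.
type counters struct {
	mu    sync.Mutex
	bytes map[string]int64
}

func (c *counters) add(tenantID, group string, n int64) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.bytes[tenantID+"/"+group] += n
}

// CountReceivedBytes takes the tenant ID as an argument instead of reading a
// field on the shared config, so concurrent tenants cannot race on it.
func (cfg *UsageGroupConfig) CountReceivedBytes(c *counters, tenantID string, n int64) {
	for _, g := range cfg.groupNames {
		c.add(tenantID, g, n)
	}
}

func main() {
	shared := &UsageGroupConfig{groupNames: []string{"app/foo"}}
	c := &counters{bytes: map[string]int64{}}

	var wg sync.WaitGroup
	for _, tenant := range []string{"tenant-a", "tenant-b"} {
		wg.Add(1)
		go func(t string) {
			defer wg.Done()
			shared.CountReceivedBytes(c, t, 100)
		}(tenant)
	}
	wg.Wait()
	fmt.Println(c.bytes) // map[tenant-a/app/foo:100 tenant-b/app/foo:100]
}
```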

bryanhuhta added 2 commits July 16, 2024 15:16

Implement tenant usage groups

17c32d6

Count bytes ingested/dropped per service name

f32faec

bryanhuhta self-assigned this Jul 16, 2024

bryanhuhta changed the title ~~Per-app usage metrics~~ feat: Per-app usage metrics Jul 16, 2024

bryanhuhta commented Jul 16, 2024

View reviewed changes

bryanhuhta requested review from simonswine and petethepig July 16, 2024 20:34

petethepig approved these changes Jul 16, 2024

View reviewed changes

bryanhuhta added 4 commits July 16, 2024 17:04

Fix ingester limiter test

219522d

Implement metrics config format of usage groups

4ecaeb1

Use new metrics for app usage metrics and improve internal API

b70ea52

Adopt more flexible config structure

go mod tidy

c369e79

Merge branch 'main' into app-usage-metrics

71c90f6

bryanhuhta marked this pull request as ready for review July 18, 2024 22:58

bryanhuhta requested a review from a team as a code owner July 18, 2024 22:58

simonswine reviewed Jul 19, 2024

View reviewed changes

bryanhuhta added 6 commits July 19, 2024 14:37

Implement usage group constructor

886b9a7

Implement usage group constructor

c34511a

Implement usage groups

1896f2d

Add tests method to fetch usage groups config

7b02461

go mod tidy

b9354da

Add some more validation and tests

be80e1b

bryanhuhta commented Jul 19, 2024

View reviewed changes

simonswine approved these changes Jul 22, 2024

View reviewed changes

bryanhuhta added 4 commits July 22, 2024 09:11

Disallow "other" to be a valid usage group name

7b97d7b

Disallow invalid utf8 strings as usage group names

5a3dc3f

Pass tenant id to GetUsageGroups to avoid race conditions

76dd6b4

Merge branch 'main' of github.com:grafana/pyroscope into app-usage-me…

e80bd20

…trics

bryanhuhta merged commit e3e2777 into main Jul 22, 2024
18 checks passed

bryanhuhta deleted the app-usage-metrics branch July 22, 2024 21:09
