Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Per-app usage metrics #3429

Merged
merged 17 commits into from
Jul 22, 2024
Merged

feat: Per-app usage metrics #3429

merged 17 commits into from
Jul 22, 2024

Conversation

bryanhuhta
Copy link
Contributor

@bryanhuhta bryanhuhta commented Jul 16, 2024

related https://github.com/grafana/pyroscope-squad/issues/162

Warning

This PR body is outdated, but I want to leave it here for posterity's sake. Please refer to this comment to get the most up-to-date and accurate summary of the changes in the PR.

This implements per-app usage for bytes ingested/dropped (both metrics required to calculate billable bytes). I took the following metrics

  • distributor_received_decompressed_bytes
  • discarded_bytes_total

and added the service_name label to them. This label value is typically blank, but when the tenant overrides have the following property set

overrides:
  "1234":
    distributor_usage_groups:
      - service_a
      - service_b

any series which has a service_name label that matches one of distributor_usage_groups will have service_name set on the ingest/dropped metrics. This should allow us to create recording rules to send to the billing cortex and let customers view per-app usage breakdowns.

Note that there is a cost to adding an app to the allowlist. We should be prudent when adding services, especially for large customers.

@bryanhuhta bryanhuhta self-assigned this Jul 16, 2024
@bryanhuhta bryanhuhta changed the title Per-app usage metrics feat: Per-app usage metrics Jul 16, 2024
@@ -746,7 +754,7 @@ func (g *groupsWithFingerprints) add(stringTable []string, lbls phlaremodel.Labe
})
}

func extractSampleSeries(req *distributormodel.PushRequest, relabelRules []*relabel.Config) (result []*distributormodel.ProfileSeries, bytesRelabelDropped, profilesRelabelDropped float64) {
func extractSampleSeries(req *distributormodel.PushRequest, usageGroups *validation.TenantUsageGroups, relabelRules []*relabel.Config) (result []*distributormodel.ProfileSeries, bytesRelabelDropped, profilesRelabelDropped float64) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's unfortunate this function signature had to change. This function previously returned all dropped bytes, so I couldn't use the return value as there may be series present that got dropped which need to specially labeled.

Comment on lines 884 to 886
// note(bryanhuhta): We don't need to label this metric with service
// name as rate limited requests don't count towards billable bytes.
validation.DiscardedBytes.WithLabelValues(string(validation.RateLimited), tenantID).Add(float64(req.TotalBytesUncompressed))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a valid assumption. When calculating billable bytes, the recording rules seem to subtract out any bytes that were rate limited.

https://github.com/grafana/deployment_tools/blob/6b6e9832e7abb0fa2b8aeafc3aa6ef75d69880f6/ksonnet/lib/billing-mixin/recording_rules/profiles.libsonnet#L48-L50

Comment on lines 1000 to 1002
ug := &validation.TenantUsageGroups{
TenantID: "",
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only here because the extractSampleSeries signature changed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation of the service name allowlist.

@petethepig
Copy link
Member

Looks good. Few random thoughts:

  1. We could copy the config format from Cloud metrics. Theirs is more flexible because you can set different selectors (not just service_name), but more complex
  2. I might be overly paranoid, but I would maybe create a new metric instead of using the existing one. This way if something goes wrong (e.g cardinality explosion) it would be easier to fix.
  3. I would avoid empty strings in label values. I'd call it "unknown" or "other" or something like that. This also has a benefit of showing up in the dashboard later so that it's clear for users what's going on.

I don't feel very strongly about 1 and 2, I do feel strongly on the third point.

@bryanhuhta
Copy link
Contributor Author

bryanhuhta commented Jul 18, 2024

After @petethepig's comments, I made the following changes:

  • Created new metrics to track per-app usage
    • pyroscope_usage_group_received_decompressed_total => pyroscope_distributor_received_compressed_bytes
    • pyroscope_usage_group_discarded_bytes_total => pyroscope_discarded_bytes_total
  • Implemented a more flexible config structure (if not exactly the same as metrics, it's highly similar)
  • Grouped all other values that don't correspond to a usage group into an "other" bucket

Here's it working in fire-dev-001: https://ops.grafana-ops.net/goto/yu16UVXIg?orgId=1

Screenshot 2024-07-18 at 5 52 27 PM

Using the following usage group definitions:

distributor_usage_groups:
  - pyroscope: '{service_name=~"fire-dev-001/.*"}'
  - cortex-dev-01/ingester: '{service_name="cortex-dev-01/ingester"}'

This groups all the fire-dev-001/* apps under the pyroscope usage group and puts cortex-dev-01/ingester into its own usage group.

@bryanhuhta bryanhuhta marked this pull request as ready for review July 18, 2024 22:58
@bryanhuhta bryanhuhta requested a review from a team as a code owner July 18, 2024 22:58
Copy link
Contributor

@simonswine simonswine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have a few concerns which I outlined in the line by line comments. Happy to dive into them a bit more if stuff is unclear.

It is also worth taking a look at how Mimir does it, because their implementaiton is obviously a bit more tested: https://github.com/grafana/mimir/blob/main/pkg/ingester/activeseries/custom_trackers_config.go

return nil, fmt.Errorf("no matchers for usage group %q and tenant %q", name, tenantID)
}

amMatchers, err := amlabels.ParseMatchers(matchersString)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am missing something, but we probably should parse those the same way we do in the query path: https://github.com/simonswine/pyroscope/blob/ed0b5643f48792ac988d975e17e285b807d5c1f9/pkg/phlaredb/head.go#L440

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, a lot of this file was copied from how Mimir implemented this. So no, you aren't missing anything, I just happened to use Mimir's parse metrics approach instead of our own. I'll switch this.


// DistributorUsageGroups returns the usage groups that are enabled for this
// tenant.
func (o *Overrides) DistributorUsageGroups(tenantID string) (*UsageGroupConfig, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any returned error in this method will lead to profiling traffic being dropped. This is not ideal.
This seems quite a severe reaction to "I have wrong accounting usage groups".

I do suggest we should move all those errors into the parsing part of the overrides.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Separately this method does a lot and is in the hot path of ingestion. We should also move as much as we can into the "loading the overrides" part of the code.

E.g. the parsing of label matchers should only happen only when the override is parsed and not for every profile ingested.

@@ -50,6 +50,9 @@ type Limits struct {
MaxProfileStacktraceDepth int `yaml:"max_profile_stacktrace_depth" json:"max_profile_stacktrace_depth"`
MaxProfileSymbolValueLength int `yaml:"max_profile_symbol_value_length" json:"max_profile_symbol_value_length"`

// Distributor per-app usage breakdown.
DistributorUsageGroups []map[string]string `yaml:"distributor_usage_groups" json:"distributor_usage_groups"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind explaining why this is a slice of maps, rather than just a map, which I would expect?

Here the yaml "explanation of my question":

current_usage_groups:
- my_team: '{namespace="cool-stuff"'
- other_team: '{namespace="boring-stuff"}'

why_not_usage_groups:
 my_team: '{namespace="cool-stuff"}'
 other_team: '{namespace="boring-stuff"}'	

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted the labels to be applied deterministically (the last usage group matched will be used). I originally used the why_not_usage_groups approach, but if two usages groups matched a label set, we wouldn't know which one would get chosen.

distributor_usage_groups:
  specific_cool_stuff: '{namespace="cool-stuff", service_name="thing"}'
  general_cool_stuff: '{namespace="cool-stuff"}'

Depending on how each key got hashed into the map, we'd get one or the other of the two usage groups. We could simplify the config and matching logic if we relaxed the determinism constraint, but I worry this might skew results oddly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Totally get the order and deterministic result problem (now that you mentioned it😆 ), but I think the map inside the slice will still be non-deterministic in the current implementation.

I think maybe a slice of two string might be a better solution:

name string
selector string
}

or just sorting by the map key (name of the rule group).

Probably it is best to aim for consistency with other Grafana Products and do what Mimir does.

Copy link
Contributor Author

@bryanhuhta bryanhuhta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation has had a few iterations and has changed significantly since I opened this PR. This is a recap of the implementation to-date.

Design

The idea of the usage group at this point is identical (or nearly so) to Mimir's approach. A usage group has a name and matcher(s). When a profile is ingested, it's labels are parsed and matched against each usage group. Any usage group it matches will be collected and the profile size will be counted by the pyroscope_usage_group_received_decompressed_total metric. Later, if that profile is dropped for any reason, it the drop will be counted by the pyroscope_usage_group_discarded_bytes_total metric.

As noted before, a profile can match 0 or more usage groups.

  • If 0, then it's bucketed into a default usage group called "other"
  • If 1 or more, the pyroscope_usage_group_received_decompressed_total will be counted with each matching usage group

As an example, consider the following labels attached to a profile:

[
  { "service_name": "foo" }
  { "namespace": "barbaz" }
]

along with the following usage groups:

distributor_usage_groups:
  app/foo: '{service_name="foo"}'
  namespace/bar: '{namespace=~"bar.*"}'

the pyroscope_usage_group_received_decompressed_total metric will be counted once for app/foo and once for namespace/bar:

pyroscope_usage_group_received_decompressed_total{usage_group="app/foo"}
pyroscope_usage_group_received_decompressed_total{usage_group="namespace/bar"}

Implementation

I created a struct UsageGroupConfig which is responsible for

  • unmarshalling the usage group config (from both yaml and json)
  • validating the group count (limit 50 for now)
  • normalizing usage group names
  • parsing and validating the matchers

This struct is built once at start up and any errors it generates will cause the app to fail to run. I believe this is reasonable.

UsageGroupConfig has a method GetUsageGroups(phlaremodel.Labels) which accepts a label set. It will match the label set against all the usage groups and return a list of all that match via the proxy object UsageGroupMatch. UsageGroupMatch has methods that can report a number as "bytes received" or "bytes dropped". This is where the logic of emitting a metric label for each usage group occurs.

Remarks

@simonswine identified one of the former approaches as being detrimental to the hot path both in aggressive failure cases and lots of new computation. Both these concerns are addressed by doing the bulk of the error-prone parsing and validation at startup, leaving simple accounting to the hot path.

Additionally, there was an ask to make this look and feel more like Mimir's approach. Whilst it isn't exactly the same, I'm hoping with the most recent changes we move much, much closer to their implementation. I took a lot of inspiration from how they implemented this feature, so the user-facing look-and-feel and the code of this feature should feel very similar to Mimir's.

Lastly, It's important to note that this approach isn't designed to be accurate down to the byte and penny. @petethepig and I agreed to take this approach because it should be approximately accurate and provide proportionally accurate results. So even if this feature doesn't report penny-exact numbers, the error should be proportionally the same across all usage groups. Big apps will produce big numbers, small apps will produce small numbers.

Comment on lines 144 to 145
// TODO(bryanhuhta): We should probably validate the usage group name
// is a valid label value.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't figure out where prometheus does its validation of label values, but wherever it is, we should copy it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is any valid utf8 string: https://github.com/prometheus/client_golang/blob/9f203a098ec630d179ccef4efbaa8c3341291d00/prometheus/labels.go#L158

I suggest we also disallow "other" to avoid undefined behaviour.

Comment on lines +161 to +164
// It should never be nil, but check just in case!
if config == nil {
config = &UsageGroupConfig{}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the current code, the config will never be nil. However, I think we should leave this check here so we treat a nil config like a empty config, in case the config could be nil at this point.

Copy link
Contributor

@simonswine simonswine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this and also for providing so much detailed comments and ensure this is properly tested. ❤️

I would like you to take a look at my two suggestions regarding validation and tenant_id flow. But I think this is already for a LGTM.

Comment on lines 144 to 145
// TODO(bryanhuhta): We should probably validate the usage group name
// is a valid label value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is any valid utf8 string: https://github.com/prometheus/client_golang/blob/9f203a098ec630d179ccef4efbaa8c3341291d00/prometheus/labels.go#L158

I suggest we also disallow "other" to avoid undefined behaviour.

Comment on lines 168 to 170
if config.TenantID == "" {
config.TenantID = tenantID
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels a bit sketchy to me, we despite not registering it as through "RegisterFlag" in theory you could still set in the pyroscope.yaml config a default usage groups for all tenants.

This bit would then expose the same *UsageGroupConfig to all tenant that don't override it, with a race to that bit which sets the tenant.

I think it would be better to do the tenant field in a wrapped type or even simpler put it to a parameter to the Count*() methods

type UsageGroupConfig // without TenantID

type UsageGroupInstance{
*UsageGroupConfig
tenantID string
}

@bryanhuhta bryanhuhta merged commit e3e2777 into main Jul 22, 2024
18 checks passed
@bryanhuhta bryanhuhta deleted the app-usage-metrics branch July 22, 2024 21:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants